[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16290857#comment-16290857 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

DariuszAniszewski opened a new pull request #4260: [BEAM-3060] temporary reshuffle for AvroIOIT and TFRecordIOIT
URL: https://github.com/apache/beam/pull/4260

This is an extension of #4210, where reshuffling was added only to `TextIOIT`.

Follow this checklist to help us incorporate your contribution quickly and easily:

 - [x] Make sure there is a [JIRA issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes.
 - [x] Each commit in the pull request should have a meaningful subject line and body.
 - [x] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue.
 - [x] Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
 - [x] Run `mvn clean verify` to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
 - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

---
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Add performance tests for commonly used file-based I/O PTransforms
> ------------------------------------------------------------------
>
> Key: BEAM-3060
> URL: https://issues.apache.org/jira/browse/BEAM-3060
> Project: Beam
> Issue Type: Test
> Components: sdk-java-core
> Reporter: Chamikara Jayalath
> Assignee: Szymon Nieradka
>
> We recently added a performance testing framework [1] that can be used to do the following:
> (1) Execute Beam tests using PerfkitBenchmarker.
> (2) Manage Kubernetes-based deployments of data stores.
> (3) Easily publish benchmark results.
> I think it will be useful to add performance tests for commonly used
> file-based I/O PTransforms using this framework. I suggest looking into the
> following formats initially:
> (1) AvroIO
> (2) TextIO
> (3) Compressed text using TextIO
> (4) TFRecordIO
> It should be possible to run these tests easily for various Beam runners (Direct,
> Dataflow, Flink, Spark, etc.) and file systems (GCS, local, HDFS, etc.).
> In the initial version, tests can be made manually triggerable for PRs
> through Jenkins. Later, we could make some of these tests run periodically
> and publish benchmark results (to BigQuery) through PerfkitBenchmarker.
> [1] https://beam.apache.org/documentation/io/testing/

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16290896#comment-16290896 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

szewi opened a new pull request #4261: [BEAM-3060] HDFS cluster configuration, kubernetes scripts, filebased io support …
URL: https://github.com/apache/beam/pull/4261

…for hdfs tests.

Follow this checklist to help us incorporate your contribution quickly and easily:

 - [ ] Make sure there is a [JIRA issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes.
 - [ ] Each commit in the pull request should have a meaningful subject line and body.
 - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue.
 - [ ] Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
 - [ ] Run `mvn clean verify` to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
 - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16292714#comment-16292714 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

DariuszAniszewski opened a new pull request #4267: [BEAM-3060] job for performance tests of file-based IOs
URL: https://github.com/apache/beam/pull/4267

This PR adds a Jenkins job that runs the currently available performance tests of file-based IOs on Dataflow via PerfKit.
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16294500#comment-16294500 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

chamikaramj closed pull request #4260: [BEAM-3060] temporary reshuffle for AvroIOIT and TFRecordIOIT
URL: https://github.com/apache/beam/pull/4260

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/avro/AvroIOIT.java b/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/avro/AvroIOIT.java
index ce8da3357c9..be0d6df2eb7 100644
--- a/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/avro/AvroIOIT.java
+++ b/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/avro/AvroIOIT.java
@@ -35,6 +35,7 @@
 import org.apache.beam.sdk.transforms.Combine;
 import org.apache.beam.sdk.transforms.DoFn;
 import org.apache.beam.sdk.transforms.ParDo;
+import org.apache.beam.sdk.transforms.Reshuffle;
 import org.apache.beam.sdk.transforms.Values;
 import org.apache.beam.sdk.transforms.View;
 import org.apache.beam.sdk.values.PCollection;
@@ -102,7 +103,8 @@ public void writeThenReadAll() {
             "Write Avro records to files",
             AvroIO.writeGenericRecords(AVRO_SCHEMA).to(filenamePrefix)
                 .withOutputFilenames().withSuffix(".avro"))
-        .getPerDestinationOutputFilenames().apply(Values.create());
+        .getPerDestinationOutputFilenames().apply(Values.create())
+        .apply(Reshuffle.viaRandomKey());
 
     PCollection consolidatedHashcode = testFilenames
         .apply("Read all files", AvroIO.readAllGenericRecords(AVRO_SCHEMA))
diff --git a/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/tfrecord/TFRecordIOIT.java b/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/tfrecord/TFRecordIOIT.java
index b887316b187..3f08d76750c 100644
--- a/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/tfrecord/TFRecordIOIT.java
+++ b/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/tfrecord/TFRecordIOIT.java
@@ -36,6 +36,7 @@
 import org.apache.beam.sdk.transforms.Create;
 import org.apache.beam.sdk.transforms.MapElements;
 import org.apache.beam.sdk.transforms.ParDo;
+import org.apache.beam.sdk.transforms.Reshuffle;
 import org.apache.beam.sdk.transforms.SimpleFunction;
 import org.apache.beam.sdk.transforms.View;
 import org.apache.beam.sdk.values.PCollection;
@@ -110,7 +111,8 @@ public void writeThenReadAll() {
     PCollection consolidatedHashcode = readPipeline
         .apply(TFRecordIO.read().from(filenamePattern).withCompression(AUTO))
         .apply("Transform bytes to strings", MapElements.via(new ByteArrayToString()))
-        .apply("Calculate hashcode", Combine.globally(new HashingFn()));
+        .apply("Calculate hashcode", Combine.globally(new HashingFn()))
+        .apply(Reshuffle.viaRandomKey());
 
     String expectedHash = getExpectedHashForLineCount(numberOfTextLines);
     PAssert.thatSingleton(consolidatedHashcode).isEqualTo(expectedHash);
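For readers unfamiliar with the transform added in the #4260 diff: Beam's `Reshuffle.viaRandomKey()` is documented as pairing each input element with a random key, reshuffling, and then dropping the key, a sequence commonly used to break fusion between pipeline steps. The PR text does not say why the reshuffle is needed here beyond calling it temporary. The plain-Java sketch below only illustrates the "random key, group, drop key" idea on an in-memory list. It is not Beam code, and the class name, method name, `numKeys` and `seed` parameters are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

/** Plain-Java illustration only; hypothetical names, not part of Beam. */
public class RandomKeyReshuffleSketch {

  /**
   * Pairs each element with a random key, groups elements by key, then drops the keys.
   * The set of elements is unchanged; only their grouping and order may differ.
   */
  public static <T> List<T> reshuffleViaRandomKey(List<T> input, int numKeys, long seed) {
    Random random = new Random(seed);
    Map<Integer, List<T>> groups = new TreeMap<>();
    for (T element : input) {
      int key = random.nextInt(numKeys); // assumption: a small fixed key space
      groups.computeIfAbsent(key, k -> new ArrayList<>()).add(element);
    }
    List<T> output = new ArrayList<>();
    for (List<T> group : groups.values()) {
      output.addAll(group); // keys are discarded here
    }
    return output;
  }

  public static void main(String[] args) {
    List<String> files = Arrays.asList("out-0.avro", "out-1.avro", "out-2.avro", "out-3.avro");
    System.out.println(reshuffleViaRandomKey(files, 2, 42L));
  }
}
```

In a real pipeline the grouping step is what lets a runner redistribute elements across workers; the sketch only shows that no element is lost or duplicated along the way.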
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16297689#comment-16297689 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

chamikaramj closed pull request #4267: [BEAM-3060] job for performance tests of file-based IOs
URL: https://github.com/apache/beam/pull/4267

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy b/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
new file mode 100644
index 000..0cee3d88527
--- /dev/null
+++ b/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
@@ -0,0 +1,75 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+import common_job_properties
+
+// This job runs the file-based IOs performance tests on PerfKit Benchmarker.
+job('beam_PerformanceTests_FileBasedIO_IT') {
+    description('Runs PerfKit tests for file-based IOs.')
+
+    // Set default Beam job properties.
+    common_job_properties.setTopLevelMainJobProperties(delegate)
+
+    // Allows triggering this build against pull requests.
+    common_job_properties.enablePhraseTriggeringFromPullRequest(
+            delegate,
+            'Java FileBasedIOs Performance Test',
+            'Run Java FileBasedIOs Performance Test')
+
+    // Run job in postcommit every 6 hours, don't trigger every push, and
+    // don't email individual committers.
+    common_job_properties.setPostCommit(
+            delegate,
+            '0 */6 * * *',
+            false,
+            'commits@beam.apache.org',
+            false)
+
+    def pipelineArgs = [
+            tempRoot: 'gs://temp-storage-for-perf-tests',
+            project: 'apache-beam-io-testing',
+            numberOfRecords: '100',
+            filenamePrefix: 'gs://temp-storage-for-perf-tests/filebased/${BUILD_ID}/TESTIOIT',
+    ]
+    def pipelineArgList = []
+    pipelineArgs.each({
+        key, value -> pipelineArgList.add("\"--$key=$value\"")
+    })
+    def pipelineArgsJoined = "[" + pipelineArgList.join(',') + "]"
+
+
+    def itClasses = [
+            "org.apache.beam.sdk.io.text.TextIOIT",
+            "org.apache.beam.sdk.io.avro.AvroIOIT",
+            "org.apache.beam.sdk.io.tfrecord.TFRecordIOIT",
+    ]
+
+    itClasses.each {
+        def argMap = [
+                benchmarks: 'beam_integration_benchmark',
+                beam_it_profile: 'io-it',
+                beam_prebuilt: 'true',
+                beam_sdk: 'java',
+                beam_it_module: 'sdks/java/io/file-based-io-tests',
+                beam_it_class: "${it}",
+                beam_it_options: pipelineArgsJoined,
+                beam_extra_mvn_properties: '["filesystem=gcs"]',
+        ]
+        common_job_properties.buildPerformanceTest(delegate, argMap)
+    }
+}
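The Groovy job in the #4267 diff turns its `pipelineArgs` map into the single string passed as `beam_it_options`: each entry becomes a double-quoted `--key=value` token, the tokens are joined with commas, and the result is wrapped in square brackets. A minimal Java sketch of that same string-building step follows, for readers who want to see the resulting format. The class and method names are hypothetical and are not part of Beam or the Jenkins job.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper mirroring the string format built in the Groovy job. */
public class PipelineArgsSketch {

  /** Renders each entry as a quoted "--key=value" token and wraps the list in brackets. */
  public static String joinPipelineArgs(Map<String, String> pipelineArgs) {
    return pipelineArgs.entrySet().stream()
        .map(e -> "\"--" + e.getKey() + "=" + e.getValue() + "\"")
        .collect(Collectors.joining(",", "[", "]"));
  }

  public static void main(String[] args) {
    // LinkedHashMap keeps insertion order, like a Groovy map literal.
    Map<String, String> pipelineArgs = new LinkedHashMap<>();
    pipelineArgs.put("tempRoot", "gs://temp-storage-for-perf-tests");
    pipelineArgs.put("numberOfRecords", "100");
    System.out.println(joinPipelineArgs(pipelineArgs));
    // prints ["--tempRoot=gs://temp-storage-for-perf-tests","--numberOfRecords=100"]
  }
}
```

Seen this way, the later follow-up PRs (#4296, #4304) only change which `--project=...` token, if any, ends up in that list.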
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298142#comment-16298142 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

DariuszAniszewski opened a new pull request #4296: [BEAM-3060] FIX: remove overriding Google project in file-based IOs performance tests
URL: https://github.com/apache/beam/pull/4296

In #4267 I accidentally committed my own Google project name, so the tests are failing on Jenkins. This change removes it so that the tests rely on the default project.
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16298717#comment-16298717 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

chamikaramj closed pull request #4296: [BEAM-3060] FIX: remove overriding Google project in file-based IOs performance tests
URL: https://github.com/apache/beam/pull/4296

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy b/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
index 0cee3d88527..fc07e2e11e5 100644
--- a/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
+++ b/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
@@ -42,7 +42,6 @@ job('beam_PerformanceTests_FileBasedIO_IT') {
 
     def pipelineArgs = [
             tempRoot: 'gs://temp-storage-for-perf-tests',
-            project: 'apache-beam-io-testing',
             numberOfRecords: '100',
             filenamePrefix: 'gs://temp-storage-for-perf-tests/filebased/${BUILD_ID}/TESTIOIT',
     ]
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16299694#comment-16299694 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

DariuszAniszewski opened a new pull request #4304: [BEAM-3060] explicitly use Apache's Google project for file-based performance tests
URL: https://github.com/apache/beam/pull/4304

It turns out the project must also be defined in the pipeline options. I hope this is the last PR in the series configuring this Jenkins job ;)
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16300226#comment-16300226 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

DariuszAniszewski opened a new pull request #4305: [BEAM-3060] Increase timeout for FileBasedIOIT...
URL: https://github.com/apache/beam/pull/4305

...to 20 mins by default, with an option to override it. PerfKit's default timeout is 10 mins, and large-scale tests run via PerfKit are failing. This PR resolves that.
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16300248#comment-16300248 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

chamikaramj closed pull request #4304: [BEAM-3060] explicitly use Apache's Google project for file-based performance tests
URL: https://github.com/apache/beam/pull/4304

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy b/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
index fc07e2e11e5..f24b93238e4 100644
--- a/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
+++ b/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
@@ -41,6 +41,7 @@ job('beam_PerformanceTests_FileBasedIO_IT') {
             false)
 
     def pipelineArgs = [
+            project: 'apache-beam-testing',
             tempRoot: 'gs://temp-storage-for-perf-tests',
             numberOfRecords: '100',
             filenamePrefix: 'gs://temp-storage-for-perf-tests/filebased/${BUILD_ID}/TESTIOIT',
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16301316#comment-16301316 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

DariuszAniszewski opened a new pull request #4318: [BEAM-3060] Enable large-scale test for FileBasedIOIT
URL: https://github.com/apache/beam/pull/4318

Follow this checklist to help us incorporate your contribution quickly and easily:

 - [x] Make sure there is a [JIRA issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes.
 - [x] Each commit in the pull request should have a meaningful subject line and body.
 - [x] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue.
 - [x] Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
 - [x] Run `mvn clean verify` to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
 - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16301314#comment-16301314 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

DariuszAniszewski opened a new pull request #4317: [BEAM-3060] Use dedicated BigQuery table for performance tests of FileBasedIOIT
URL: https://github.com/apache/beam/pull/4317
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307368#comment-16307368 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

chamikaramj closed pull request #4318: [BEAM-3060] Enable large-scale test for FileBasedIOIT
URL: https://github.com/apache/beam/pull/4318

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy b/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
index f24b93238e4..99bab5e10d7 100644
--- a/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
+++ b/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
@@ -43,7 +43,7 @@ job('beam_PerformanceTests_FileBasedIO_IT') {
     def pipelineArgs = [
             project: 'apache-beam-testing',
             tempRoot: 'gs://temp-storage-for-perf-tests',
-            numberOfRecords: '100',
+            numberOfRecords: '1',
             filenamePrefix: 'gs://temp-storage-for-perf-tests/filebased/${BUILD_ID}/TESTIOIT',
     ]
     def pipelineArgList = []
@@ -62,6 +62,7 @@ job('beam_PerformanceTests_FileBasedIO_IT') {
     itClasses.each {
         def argMap = [
                 benchmarks: 'beam_integration_benchmark',
+                beam_it_timeout: '1200',
                 beam_it_profile: 'io-it',
                 beam_prebuilt: 'true',
                 beam_sdk: 'java',
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16307507#comment-16307507 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

chamikaramj closed pull request #4317: [BEAM-3060] Use dedicated BigQuery table for performance tests of FileBasedIOIT
URL: https://github.com/apache/beam/pull/4317

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy b/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
index f24b93238e4..228e75b2c98 100644
--- a/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
+++ b/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
@@ -69,6 +69,7 @@ job('beam_PerformanceTests_FileBasedIO_IT') {
                 beam_it_class: "${it}",
                 beam_it_options: pipelineArgsJoined,
                 beam_extra_mvn_properties: '["filesystem=gcs"]',
+                bigquery_table: 'beam_performance.filebasedioit_pkb_results',
         ]
         common_job_properties.buildPerformanceTest(delegate, argMap)
     }
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16318193#comment-16318193 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

DariuszAniszewski closed pull request #4305: [BEAM-3060] Allow to specify timeout for FileBasedIOIT ran via PerfKit
URL: https://github.com/apache/beam/pull/4305

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/sdks/java/io/file-based-io-tests/pom.xml b/sdks/java/io/file-based-io-tests/pom.xml
index 44119ec79ff..4de2e70615f 100644
--- a/sdks/java/io/file-based-io-tests/pom.xml
+++ b/sdks/java/io/file-based-io-tests/pom.xml
@@ -113,6 +113,7 @@
   ${pkbLocation}
   -benchmarks=beam_integration_benchmark
   -beam_it_profile=io-it
+  -beam_it_timeout=${pkbTimeout}
   -beam_location=${beamRootProjectDir}
   -beam_prebuilt=true
   -beam_sdk=java
diff --git a/sdks/java/io/pom.xml b/sdks/java/io/pom.xml
index 07e1b5cb9ff..0710df05d89 100644
--- a/sdks/java/io/pom.xml
+++ b/sdks/java/io/pom.xml
@@ -38,6 +38,7 @@
+600
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16318194#comment-16318194 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

DariuszAniszewski opened a new pull request #4305: [BEAM-3060] Allow to specify timeout for FileBasedIOIT ran via PerfKit
URL: https://github.com/apache/beam/pull/4305

The timeout defaults to 10 minutes (which is PerfKit's own timeout).

Background: large-scale tests run via PerfKit were failing. This PR allows a timeout to be specified so that the tests pass.
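As an illustration of the timeout handling described in PR #4305, the following is a minimal Python sketch (not the project's actual Maven or Jenkins configuration) of how the PerfKit argument list could be assembled. The flag names and the 600-second default come from the diffs quoted in this thread; the function name `build_pkb_args` and its parameters are hypothetical.

```python
def build_pkb_args(pkb_location, beam_root, timeout_seconds=600):
    """Assemble a PerfKit Benchmarker command line for the IO integration tests.

    Hypothetical helper: only the flag names are taken from the quoted pom.xml
    diff. The default of 600 seconds mirrors the 10-minute default in the PR.
    """
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be positive")
    return [
        pkb_location,
        "-benchmarks=beam_integration_benchmark",
        "-beam_it_profile=io-it",
        "-beam_it_timeout={}".format(timeout_seconds),
        "-beam_location={}".format(beam_root),
        "-beam_prebuilt=true",
        "-beam_sdk=java",
    ]
```

A large-scale run would pass a longer value, for example `build_pkb_args("pkb.py", "/path/to/beam", 1200)`, matching the `beam_it_timeout: '1200'` setting in the Jenkins job from PR #4318.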
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16320055#comment-16320055 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

DariuszAniszewski opened a new pull request #4378: [BEAM-3060] split one job into several jobs, one for each IO.
URL: https://github.com/apache/beam/pull/4378
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16322778#comment-16322778 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

chamikaramj closed pull request #4378: [BEAM-3060] split one job into several jobs, one for each IO.
URL: https://github.com/apache/beam/pull/4378

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy b/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
index b41af717168..667b11d2072 100644
--- a/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
+++ b/.test-infra/jenkins/job_beam_PerformanceTests_FileBasedIO_IT.groovy
@@ -18,60 +18,102 @@
 import common_job_properties

-// This job runs the file-based IOs performance tests on PerfKit Benchmarker.
-job('beam_PerformanceTests_FileBasedIO_IT') {
-    description('Runs PerfKit tests for file-based IOs.')
+def testsConfigurations = [
+        [
+                jobName : 'beam_PerformanceTests_TextIOIT',
+                jobDescription: 'Runs PerfKit tests for TextIOIT',
+                itClass : 'org.apache.beam.sdk.io.text.TextIOIT',
+                bqTable : 'beam_performance.textioit_pkb_results',
+                prCommitStatusName: 'Java TextIO Performance Test',
+                prTriggerPhase: 'Run Java TextIO Performance Test',

-    // Set default Beam job properties.
-    common_job_properties.setTopLevelMainJobProperties(delegate)
+        ],
+        [
+                jobName: 'beam_PerformanceTests_Compressed_TextIOIT',
+                jobDescription : 'Runs PerfKit tests for TextIOIT with GZIP compression',
+                itClass: 'org.apache.beam.sdk.io.text.TextIOIT',
+                bqTable: 'beam_performance.compressed_textioit_pkb_results',
+                prCommitStatusName : 'Java CompressedTextIO Performance Test',
+                prTriggerPhase : 'Run Java CompressedTextIO Performance Test',
+                extraPipelineArgs: [
+                        compressionType: 'GZIP'
+                ]
+        ],
+        [
+                jobName : 'beam_PerformanceTests_AvroIOIT',
+                jobDescription: 'Runs PerfKit tests for AvroIOIT',
+                itClass : 'org.apache.beam.sdk.io.avro.AvroIOIT',
+                bqTable : 'beam_performance.avroioit_pkb_results',
+                prCommitStatusName: 'Java AvroIO Performance Test',
+                prTriggerPhase: 'Run Java AvroIO Performance Test',
+        ],
+        [
+                jobName : 'beam_PerformanceTests_TFRecordIOIT',
+                jobDescription: 'Runs PerfKit tests for beam_PerformanceTests_TFRecordIOIT',
+                itClass : 'org.apache.beam.sdk.io.tfrecord.TFRecordIOIT',
+                bqTable : 'beam_performance.tfrecordioit_pkb_results',
+                prCommitStatusName: 'Java TFRecordIO Performance Test',
+                prTriggerPhase: 'Run Java TFRecordIO Performance Test',
+        ],
+]
+
+for (testConfiguration in testsConfigurations) {
+    create_filebasedio_performance_test_job(testConfiguration)
+}

-    // Allows triggering this build against pull requests.
-    common_job_properties.enablePhraseTriggeringFromPullRequest(
-            delegate,
-            'Java FileBasedIOs Performance Test',
-            'Run Java FileBasedIOs Performance Test')

-    // Run job in postcommit every 6 hours, don't trigger every push, and
-    // don't email individual committers.
-    common_job_properties.setPostCommit(
-            delegate,
-            '0 */6 * * *',
-            false,
-            'commits@beam.apache.org',
-            false)
+private void create_filebasedio_performance_test_job(testConfiguration) {

-    def pipelineArgs = [
-            project: 'apache-beam-testing',
-            tempRoot: 'gs://temp-storage-for-perf-tests',
-            numberOfRecords: '100',
-            filenamePrefix: 'gs://temp-storage-for-perf-tests/filebased/${BUILD_ID}/TESTIOIT',
-    ]
-    def pipelineArgList = []
-    pipelineArgs.each({
-        key, value -> pipelineArgList.add("\"--$key=$value\"")
-    })
-    def pipelineArgsJoined = "[" + pipelineArgList.join(',') + "]"
+    // This job runs the file-based IOs performance tests on PerfKit Benchmarker.
+    job(testConfiguration.jobName) {
+        description(testConfiguration.jobDescription)

+        // Set default Beam job properties.
+        common_job_properties.setTopLevelMainJobProperties(delegate)

-
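The quoted Groovy builds a single string of quoted `--key=value` pipeline options from a map, and PR #4378 adds per-job `extraPipelineArgs`. The diff is truncated before the point where those extra arguments are consumed, so the following Python sketch only illustrates one plausible approach: the joining logic mirrors the removed Groovy lines, while merging `extraPipelineArgs` into the base map is an assumption, and both function names are hypothetical.

```python
def join_pipeline_args(pipeline_args):
    """Render a dict as '["--k1=v1","--k2=v2"]', as the quoted Groovy does."""
    quoted = ['"--{}={}"'.format(key, value) for key, value in pipeline_args.items()]
    return "[" + ",".join(quoted) + "]"


def pipeline_args_for(test_configuration, base_args):
    """Combine shared arguments with a job's optional extraPipelineArgs.

    Assumption: extra arguments are appended to (or override) the shared ones.
    The truncated diff does not show how the real job script does this.
    """
    merged = dict(base_args)
    merged.update(test_configuration.get("extraPipelineArgs", {}))
    return join_pipeline_args(merged)
```

Under this assumption, the compressed TextIO job would receive the same options as the plain TextIO job plus `"--compressionType=GZIP"`.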
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16323337#comment-16323337 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

chamikaramj closed pull request #4305: [BEAM-3060] Allow to specify timeout for FileBasedIOIT ran via PerfKit
URL: https://github.com/apache/beam/pull/4305

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/sdks/java/io/file-based-io-tests/pom.xml b/sdks/java/io/file-based-io-tests/pom.xml
index 44119ec79ff..4de2e70615f 100644
--- a/sdks/java/io/file-based-io-tests/pom.xml
+++ b/sdks/java/io/file-based-io-tests/pom.xml
@@ -113,6 +113,7 @@
   ${pkbLocation}
   -benchmarks=beam_integration_benchmark
   -beam_it_profile=io-it
+  -beam_it_timeout=${pkbTimeout}
   -beam_location=${beamRootProjectDir}
   -beam_prebuilt=true
   -beam_sdk=java
diff --git a/sdks/java/io/pom.xml b/sdks/java/io/pom.xml
index 07e1b5cb9ff..0710df05d89 100644
--- a/sdks/java/io/pom.xml
+++ b/sdks/java/io/pom.xml
@@ -38,6 +38,7 @@
+600
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16323377#comment-16323377 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

chamikaramj closed pull request #4261: [BEAM-3060] HDFS cluster configuration, kubernetes scripts, filebased io support …
URL: https://github.com/apache/beam/pull/4261

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/.test-infra/kubernetes/hadoop/SmallITCluster/hdfs-single-datanode-cluster-for-local-dev.yml b/.test-infra/kubernetes/hadoop/SmallITCluster/hdfs-single-datanode-cluster-for-local-dev.yml
new file mode 100644
index 000..b761137f35b
--- /dev/null
+++ b/.test-infra/kubernetes/hadoop/SmallITCluster/hdfs-single-datanode-cluster-for-local-dev.yml
@@ -0,0 +1,46 @@
+#Licensed to the Apache Software Foundation (ASF) under one or more
+#contributor license agreements. See the NOTICE file distributed with
+#this work for additional information regarding copyright ownership.
+#The ASF licenses this file to You under the Apache License, Version 2.0
+#(the "License"); you may not use this file except in compliance with
+#the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+#Unless required by applicable law or agreed to in writing, software
+#distributed under the License is distributed on an "AS IS" BASIS,
+#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#See the License for the specific language governing permissions and
+#limitations under the License.
+#
+# This script creates hadoop-external service that allows to connect to hdfs cluster from
+# outside world. Running:
+#
+# kubectl get svc hadoop-external
+#
+# allows to read LoadBalancer EXTERNAL-IP which should be used to interact with the hdfs cluster.
+#
+
+apiVersion: v1
+kind: Service
+metadata:
+  name: hadoop-external
+  labels:
+    name: hadoop-external
+spec:
+  ports:
+    - name: sshd
+      port: 2122
+    - name: hdfs
+      port: 9000
+    - name: web
+      port: 50070
+    - name: datanode
+      port: 50010
+    - name: datanode-icp
+      port: 50020
+    - name: datanode-http
+      port: 50075
+  selector:
+    name: hadoop
+  type: LoadBalancer
diff --git a/.test-infra/kubernetes/hadoop/SmallITCluster/hdfs-single-datanode-cluster.yml b/.test-infra/kubernetes/hadoop/SmallITCluster/hdfs-single-datanode-cluster.yml
new file mode 100644
index 000..483c2961f4f
--- /dev/null
+++ b/.test-infra/kubernetes/hadoop/SmallITCluster/hdfs-single-datanode-cluster.yml
@@ -0,0 +1,83 @@
+#Licensed to the Apache Software Foundation (ASF) under one or more
+#contributor license agreements. See the NOTICE file distributed with
+#this work for additional information regarding copyright ownership.
+#The ASF licenses this file to You under the Apache License, Version 2.0
+#(the "License"); you may not use this file except in compliance with
+#the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+#Unless required by applicable law or agreed to in writing, software
+#distributed under the License is distributed on an "AS IS" BASIS,
+#WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+#See the License for the specific language governing permissions and
+#limitations under the License.
+#
+# This script contains definition of hdfs single node cluster. In this configuration hdfs datanode
+# and namenode are running on the same pod. Service hadoop allows to connect to pods labeled as
+# hadoop, this service also provides connectivity from outside of the cluster.
+# Replication controller creates pods using docker image sequenceiq/hadoop-docker:2.7.1.
+# Each pod created will expose hdfs standard ports.
+#
+
+apiVersion: v1
+kind: Service
+metadata:
+  name: hadoop
+  labels:
+    name: hadoop
+spec:
+  ports:
+    - name: sshd
+      port: 2122
+    - name: hdfs
+      port: 9000
+    - name: web
+      port: 50070
+    - name: datanode
+      port: 50010
+    - name: datanode-icp
+      port: 50020
+    - name: datanode-http
+      port: 50075
+  selector:
+    name: hadoop
+  type: NodePort
+
+---
+
+apiVersion: v1
+kind: ReplicationController
+metadata:
+  name: hadoop
+  labels:
+    name: hadoop
+spec:
+  replicas: 1
+  selector:
+    name: hadoop
+  template:
+    metadata:
+      labels:
+        name: hadoop
+    spec:
+      containers:
+        - name: hadoop
+          image: sequenceiq/hadoop-docker:2.7.1
+          ports:
+            - name: sshd
+              containerPort: 2
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16338612#comment-16338612 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

chamikaramj closed pull request #4401: [BEAM-3060] Support for Perfkit execution of file-based-io-tests on HDFS cluster.
URL: https://github.com/apache/beam/pull/4401

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic):

diff --git a/.test-infra/kubernetes/hadoop/SmallITCluster/pkb-config.yml b/.test-infra/kubernetes/hadoop/SmallITCluster/pkb-config.yml
new file mode 100644
index 000..72f458a9bc8
--- /dev/null
+++ b/.test-infra/kubernetes/hadoop/SmallITCluster/pkb-config.yml
@@ -0,0 +1,40 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+#http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# This file is a pkb benchmark configuration file, used when running the IO ITs
+# that use this data store. It allows users to run tests when they are on a
+# separate network from the kubernetes cluster by reading the hadoop namenode IP
+# address from the LoadBalancer service.
+#
+# When running Perfkit with DirectRunner - format pattern must additionally contain
+# dfs.client.use.datanode.hostname set to true:
+# format: '[{\"fs.defaultFS\":\"hdfs://{{LoadBalancerIp}}:9000\",\"dfs.replication\":1,\"dfs.client.use.datanode.hostname\":\"true\" }]'
+# and /etc/hosts should be modified with an entry containing:
+# LoadBalancerIp HadoopMasterPodName
+# otherwise hdfs client won't be able to reach datanode.
+# FilenamePrefix is used in file-based-io-tests.
+
+static_pipeline_options:
+dynamic_pipeline_options:
+  - name: hdfsConfiguration
+    format: '[{\"fs.defaultFS\":\"hdfs://{{LoadBalancerIp}}:9000\",\"dfs.replication\":1}]'
+    type: LoadBalancerIp
+    serviceName: hadoop-external
+  - name: filenamePrefix
+    format: 'hdfs://{{LoadBalancerIp}}:9000/TEXTIO_IT_'
+    type: LoadBalancerIp
+    serviceName: hadoop-external
diff --git a/sdks/java/io/file-based-io-tests/pom.xml b/sdks/java/io/file-based-io-tests/pom.xml
index bd041040bc4..23c1b31c563 100644
--- a/sdks/java/io/file-based-io-tests/pom.xml
+++ b/sdks/java/io/file-based-io-tests/pom.xml
@@ -133,6 +133,110 @@
+
+org.apache.maven.plugins
+maven-surefire-plugin
+${surefire-plugin.version}
+
+true
+
+
+
+
+
+
+
+
+io-it-hdfs-small
+
+io-it-suite-hdfs-small
+
+
+
+ ${project.parent.parent.parent.parent.basedir}
+
+
+
+
+org.codehaus.gmaven
+groovy-maven-plugin
+${groovy-maven-plugin.version}
+
+
+find-supported-python-for-compile
+initialize
+
+execute
+
+
+ ${beamRootProjectDir}/sdks/python/findSupportedPython.groovy
+
+
+
+
+
+
+org.codehaus.mojo
+exec-maven-plugin
+${maven-exec-plugin.version}
+
+
+verify
+
+exec
+
+
+
+
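A note on the `dynamic_pipeline_options` entries in pkb-config.yml above: each `format` string carries a `{{LoadBalancerIp}}` placeholder that is filled in with the address of the `hadoop-external` service before the value is passed on as a pipeline option. The sketch below only illustrates that substitution in plain Java. It is not PerfKit Benchmarker's own code; the class name `PipelineOptionTemplate`, the helper `render`, and the example address 203.0.113.10 are assumptions made for illustration.

```java
import java.util.Collections;
import java.util.Map;

// Hypothetical helper, not part of Beam or PerfKit Benchmarker.
final class PipelineOptionTemplate {

    /** Replaces every {{key}} placeholder in the format string with its value. */
    public static String render(String format, Map<String, String> values) {
        String result = format;
        for (Map.Entry<String, String> entry : values.entrySet()) {
            // String.replace substitutes all literal occurrences (no regex involved).
            result = result.replace("{{" + entry.getKey() + "}}", entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        // Format string copied from the filenamePrefix entry in pkb-config.yml.
        String format = "hdfs://{{LoadBalancerIp}}:9000/TEXTIO_IT_";
        // 203.0.113.10 is a documentation-only address standing in for the EXTERNAL-IP.
        Map<String, String> values =
            Collections.singletonMap("LoadBalancerIp", "203.0.113.10");
        System.out.println(render(format, values));
    }
}
```

With the assumed address, the `filenamePrefix` option would become `hdfs://203.0.113.10:9000/TEXTIO_IT_`; the real value depends on what `kubectl get svc hadoop-external` reports for the cluster.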
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16240377#comment-16240377 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

GitHub user lgajowy opened a pull request:

    https://github.com/apache/beam/pull/4081

    [BEAM-3060] Adds TextIOIT for DirectRunner and local filesystem

    This is one of multiple commits to resolve the 3060 issue. Currently only the local filesystem, relatively small datasets and DirectRunner are supported. More runners, more filesystems and the ability to test larger (gigabyte-sized) datasets will be added in further commits.

    See: https://docs.google.com/document/d/1dA-5s6OHiP_cz-NRAbwapoKF5MEC1wKps4A5tFbIPKE/edit#

    Follow this checklist to help us incorporate your contribution quickly and easily:

     - [ ] Make sure there is a [JIRA issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes.
     - [ ] Each commit in the pull request should have a meaningful subject line and body.
     - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue.
     - [ ] Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
     - [ ] Run `mvn clean verify` to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
     - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

    ---

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lgajowy/beam text-io-it

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/beam/pull/4081.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4081

commit c6c7070ad92424707d3720d3a4dc2c0fb6961440
Author: Łukasz Gajowy
Date: 2017-10-31T09:25:22Z

    [BEAM-3060] Adds TextIOIT for DirectRunner and local filesystem

    This is one of multiple commits to resolve the 3060 issue. Currently only the local filesystem, relatively small datasets and DirectRunner are supported. More runners, more filesystems and the ability to test larger (gigabyte-sized) datasets will be added in further commits.

    See: https://docs.google.com/document/d/1dA-5s6OHiP_cz-NRAbwapoKF5MEC1wKps4A5tFbIPKE/edit#

> Add performance tests for commonly used file-based I/O PTransforms
> ------------------------------------------------------------------
>
> Key: BEAM-3060
> URL: https://issues.apache.org/jira/browse/BEAM-3060
> Project: Beam
> Issue Type: Test
> Components: sdk-java-core
> Reporter: Chamikara Jayalath
> Assignee: Szymon Nieradka
>
> We recently added a performance testing framework [1] that can be used to do
> the following.
> (1) Execute Beam tests using PerfkitBenchmarker
> (2) Manage Kubernetes-based deployments of data stores.
> (3) Easily publish benchmark results.
> I think it will be useful to add performance tests for commonly used
> file-based I/O PTransforms using this framework. I suggest looking into
> the following formats initially.
> (1) AvroIO
> (2) TextIO
> (3) Compressed text using TextIO
> (4) TFRecordIO
> It should be possible to run these tests for various Beam runners (Direct,
> Dataflow, Flink, Spark, etc.) and file-systems (GCS, local, HDFS, etc.)
> easily.
> In the initial version, tests can be made manually triggerable for PRs
> through Jenkins. Later, we could make some of these tests run periodically
> and publish benchmark results (to BigQuery) through PerfkitBenchmarker.
> [1] https://beam.apache.org/documentation/io/testing/

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16248250#comment-16248250 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/beam/pull/4081
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16249368#comment-16249368 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

GitHub user DariuszAniszewski opened a pull request:

    https://github.com/apache/beam/pull/4120

    [BEAM-3060] TextIOIT: DataFlow and PerfKit profiles + big hash

    This PR adds Maven profiles for DataFlow runner and PerfKit to `file-based-io-tests`
    Additionally hash for large dataset is added and doc for `TextIOIT` is fixed.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/DariuszAniszewski/beam textioit-dataflow-perfkit

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/beam/pull/4120.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4120

commit c787e317cf42b21e41cccdf4f2abfeb28f5ab7e3
Author: Dariusz Aniszewski
Date: 2017-11-07T16:25:55Z

    Dataflow and PerfKit profiles; hash for 100.000.000 lines
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16259421#comment-16259421 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

GitHub user lgajowy opened a pull request:

    https://github.com/apache/beam/pull/4149

    [BEAM-3060] Add Compressed TextIOIT

    This is a parametrized test for Compressed TextIO. Only the Java code - @DariuszAniszewski is working on Perfkit support and Dataflow runner support on his separate branch.

    As ZIP compression type is unsupported, I skipped it in the test.

    @chamikaramj could you take a look?

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/lgajowy/beam compressed-text-io-test

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/beam/pull/4149.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4149

commit df472abc6ee1b3c2ea021f6069beabd6a4439907
Author: Łukasz Gajowy
Date: 2017-11-20T16:00:54Z

    [BEAM-3060] Add Compressed TextIOIT
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16260003#comment-16260003 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

Github user asfgit closed the pull request at:

    https://github.com/apache/beam/pull/4120
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16264332#comment-16264332 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

GitHub user szewi opened a pull request:

    https://github.com/apache/beam/pull/4169

    [BEAM-3060] Added support for multiple filesystems in TextIO

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/szewi/beam filesystems-io-it

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/beam/pull/4169.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #4169

commit 8d3d2b5e966ffb820b46007afad0244fd0c384bc
Author: Kamil Szewczyk
Date: 2017-11-21T19:50:04Z

    Added support for multiple filesystems in TextIO
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16264331#comment-16264331 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

szewi opened a new pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO
URL: https://github.com/apache/beam/pull/4169

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16264333#comment-16264333 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

szewi commented on issue #4169: [BEAM-3060] Added support for multiple filesystems in TextIO
URL: https://github.com/apache/beam/pull/4169#issuecomment-346620194

Hi @chamikaramj, can you please take a look? This allows switching between filesystems by setting the system property -Dfilesystem and providing filesystem-specific pipeline options.
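A minimal sketch of how a test could read a `-Dfilesystem` switch like the one mentioned in the comment above. This thread does not show how PR #4169 actually wires the property (it may, for instance, only activate Maven profiles), so the class `FilesystemSwitch`, the enum `Filesystem`, the accepted values `local`, `gcs` and `hdfs`, and the fallback to `LOCAL` are all assumptions made for illustration.

```java
import java.util.Locale;

// Hypothetical helper, not taken from the Beam code base.
final class FilesystemSwitch {

    /** Filesystems assumed for this sketch. */
    enum Filesystem { LOCAL, GCS, HDFS }

    /** Maps the raw property value to a Filesystem; a missing value means LOCAL. */
    static Filesystem parse(String value) {
        if (value == null || value.trim().isEmpty()) {
            return Filesystem.LOCAL;
        }
        try {
            return Filesystem.valueOf(value.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            throw new IllegalArgumentException("Unsupported filesystem: " + value, e);
        }
    }

    public static void main(String[] args) {
        // Run with, for example: java -Dfilesystem=hdfs FilesystemSwitch
        Filesystem selected = parse(System.getProperty("filesystem"));
        System.out.println("Selected filesystem: " + selected);
    }
}
```

Filesystem-specific pipeline options (for HDFS, something like the `hdfsConfiguration` option) would still have to be supplied separately; the sketch covers only the selection step.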
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16264501#comment-16264501 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

lgajowy commented on issue #4149: [BEAM-3060] Add Compressed TextIOIT
URL: https://github.com/apache/beam/pull/4149#issuecomment-346653319

@chamikaramj (this message is a kind reminder) :)
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16264954#comment-16264954 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

chamikaramj commented on a change in pull request #4149: [BEAM-3060] Add Compressed TextIOIT
URL: https://github.com/apache/beam/pull/4149#discussion_r152900435

 ##
 File path: sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java
 ##
 @@ -83,25 +90,82 @@ private static String appendTimestamp(String filenamePrefix) {
     return String.format("%s_%s", filenamePrefix, new Date().getTime());
   }
-  @Test
-  public void writeThenReadAll() {
-    PCollection<String> testFilenames = pipeline
-        .apply("Generate sequence", GenerateSequence.from(0).to(numberOfTextLines))
-        .apply("Produce text lines", ParDo.of(new DeterministicallyConstructTestTextLineFn()))
-        .apply("Write content to files", TextIO.write().to(filenamePrefix).withOutputFilenames())
-        .getPerDestinationOutputFilenames().apply(Values.create());
+  /** IO IT with no compression. */
+  @RunWith(JUnit4.class)

Review comment:
This means that this test will be picked up by all test suites (including the Java pre-commit), doesn't it? I am not sure we want that, given the size of this test. Adding it to the post-commit tests should be fine.
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16264952#comment-16264952 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

chamikaramj commented on a change in pull request #4149: [BEAM-3060] Add Compressed TextIOIT
URL: https://github.com/apache/beam/pull/4149#discussion_r152900262

## File path: sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java ##

@@ -83,25 +90,82 @@ private static String appendTimestamp(String filenamePrefix) {
     return String.format("%s_%s", filenamePrefix, new Date().getTime());
   }

-  @Test
-  public void writeThenReadAll() {
-    PCollection<String> testFilenames = pipeline
-        .apply("Generate sequence", GenerateSequence.from(0).to(numberOfTextLines))
-        .apply("Produce text lines", ParDo.of(new DeterministicallyConstructTestTextLineFn()))
-        .apply("Write content to files", TextIO.write().to(filenamePrefix).withOutputFilenames())
-        .getPerDestinationOutputFilenames().apply(Values.create());
+  /** IO IT with no compression. */
+  @RunWith(JUnit4.class)
+  public static class UncompressedTextIOIT {

Review comment:
Does perfkitbenchmarker-based execution (https://github.com/apache/beam/pull/4120) still work with these changes?
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16264953#comment-16264953 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

chamikaramj commented on a change in pull request #4149: [BEAM-3060] Add Compressed TextIOIT
URL: https://github.com/apache/beam/pull/4149#discussion_r152900718

## File path: sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java ##

@@ -83,25 +90,82 @@ private static String appendTimestamp(String filenamePrefix) {
     return String.format("%s_%s", filenamePrefix, new Date().getTime());
   }

-  @Test
-  public void writeThenReadAll() {
-    PCollection<String> testFilenames = pipeline
-        .apply("Generate sequence", GenerateSequence.from(0).to(numberOfTextLines))
-        .apply("Produce text lines", ParDo.of(new DeterministicallyConstructTestTextLineFn()))
-        .apply("Write content to files", TextIO.write().to(filenamePrefix).withOutputFilenames())
-        .getPerDestinationOutputFilenames().apply(Values.create());
+  /** IO IT with no compression. */
+  @RunWith(JUnit4.class)
+  public static class UncompressedTextIOIT {
+
+    @Rule
+    public TestPipeline pipeline = TestPipeline.create();
+
+    @Test
+    public void writeThenReadAll() {
+      PCollection<String> testFilenames = pipeline
+          .apply("Generate sequence", GenerateSequence.from(0).to(numberOfTextLines))
+          .apply("Produce text lines", ParDo.of(new DeterministicallyConstructTestTextLineFn()))
+          .apply("Write content to files", TextIO.write().to(filenamePrefix).withOutputFilenames())
+          .getPerDestinationOutputFilenames().apply(Values.create());
+
+      PCollection<String> consolidatedHashcode = testFilenames
+          .apply("Read all files", TextIO.readAll())
+          .apply("Calculate hashcode", Combine.globally(new HashingFn()));
+
+      String expectedHash = getExpectedHashForLineCount(numberOfTextLines);
+      PAssert.thatSingleton(consolidatedHashcode).isEqualTo(expectedHash);
+
+      testFilenames.apply("Delete test files", ParDo.of(new DeleteFileFn())
+          .withSideInputs(consolidatedHashcode.apply(View.asSingleton())));
+
+      pipeline.run().waitUntilFinish();
+    }
+  }
+
+  /** IO IT with various compression types. */
+  @RunWith(Parameterized.class)
+  public static class CompressedTextIOIT {
+
+    @Rule
+    public TestPipeline pipeline = TestPipeline.create();
+
+    @Parameterized.Parameters()
+    public static Iterable data() {
+      return ImmutableList.builder()
+          .add(GZIP)
+          .add(DEFLATE)
+          .add(BZIP2)
+          .build();
+    }
+
+    @Parameterized.Parameter()
+    public Compression compression;
+
+    @Test
+    public void writeThenReadAllWithCompression() {
+      TextIO.TypedWrite write = TextIO
+          .write()
+          .to(filenamePrefix)
+          .withOutputFilenames()
+          .withCompression(compression);
+
+      TextIO.ReadAll read = TextIO.readAll().withCompression(AUTO);

-    PCollection<String> consolidatedHashcode = testFilenames
-        .apply("Read all files", TextIO.readAll())
-        .apply("Calculate hashcode", Combine.globally(new HashingFn()));
+      PCollection<String> testFilenames = pipeline

Review comment:
This and the uncompressed version have the same pipeline. Can't we share the code between tests (and keep the same test class, TextIOIT) and add "compression type" as a parameter to the test (a Maven -D parameter for the perfkit-based runs)?
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16265384#comment-16265384 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

lgajowy commented on a change in pull request #4149: [BEAM-3060] Add Compressed TextIOIT
URL: https://github.com/apache/beam/pull/4149#discussion_r152998002

## File path: sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java ##

@@ -83,25 +90,82 @@ private static String appendTimestamp(String filenamePrefix) {
     return String.format("%s_%s", filenamePrefix, new Date().getTime());
   }

-  @Test
-  public void writeThenReadAll() {
-    PCollection<String> testFilenames = pipeline
-        .apply("Generate sequence", GenerateSequence.from(0).to(numberOfTextLines))
-        .apply("Produce text lines", ParDo.of(new DeterministicallyConstructTestTextLineFn()))
-        .apply("Write content to files", TextIO.write().to(filenamePrefix).withOutputFilenames())
-        .getPerDestinationOutputFilenames().apply(Values.create());
+  /** IO IT with no compression. */
+  @RunWith(JUnit4.class)

Review comment:
I double-checked this by running the PreCommit job on my machine: these tests are not fired in the PreCommit phase.

Also, out of curiosity 😄 I looked into the project's mvn structure a little. Besides the `@RunWith(JUnit4.class)` annotation that is required by JUnit, we have two mvn plugins that scan for tests:
- surefire (looks for unit tests and searches for classes with the *Test suffix)
- failsafe (looks for integration tests and searches for classes with the *IT suffix)

As failsafe is not fired in the PreCommit phase, the tests are not invoked. Please look at the [io parent pom](https://github.com/apache/beam/blob/master/sdks/java/io/pom.xml#L77), where the failsafe plugin is activated only when the io-it profile is active.
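The suffix-based split described in the comment above can be sketched in plain Java. This is only an illustration, not code from the pull request: the class and method names are hypothetical, and the patterns are the default includes of the Surefire and Failsafe plugins as I recall them from the plugin documentation (a project's pom.xml can override them).

```java
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Beam: mimics the plugins' default class-name includes. */
final class TestNameClassifier {

  // Assumed Surefire defaults: Test*, *Test, *Tests, *TestCase.
  private static final Pattern SUREFIRE_DEFAULTS =
      Pattern.compile("Test.*|.*Test|.*Tests|.*TestCase");

  // Assumed Failsafe defaults: IT*, *IT, *ITCase.
  private static final Pattern FAILSAFE_DEFAULTS =
      Pattern.compile("IT.*|.*IT|.*ITCase");

  /** True if a class with this simple name would match the assumed Surefire defaults. */
  static boolean matchesSurefireDefaults(String simpleClassName) {
    return SUREFIRE_DEFAULTS.matcher(simpleClassName).matches();
  }

  /** True if a class with this simple name would match the assumed Failsafe defaults. */
  static boolean matchesFailsafeDefaults(String simpleClassName) {
    return FAILSAFE_DEFAULTS.matcher(simpleClassName).matches();
  }

  public static void main(String[] args) {
    for (String name : new String[] {"TextIOIT", "TextIOTest"}) {
      System.out.println(name
          + " surefire=" + matchesSurefireDefaults(name)
          + " failsafe=" + matchesFailsafeDefaults(name));
    }
  }
}
```

Under these assumed defaults, a class named TextIOIT is only visible to failsafe, so it stays out of any build phase in which failsafe is not activated.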
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16265386#comment-16265386 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

lgajowy commented on a change in pull request #4149: [BEAM-3060] Add Compressed TextIOIT
URL: https://github.com/apache/beam/pull/4149#discussion_r152998058

## File path: sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java ##

@@ -83,25 +90,82 @@ private static String appendTimestamp(String filenamePrefix) {
     return String.format("%s_%s", filenamePrefix, new Date().getTime());
   }

-  @Test
-  public void writeThenReadAll() {
-    PCollection<String> testFilenames = pipeline
-        .apply("Generate sequence", GenerateSequence.from(0).to(numberOfTextLines))
-        .apply("Produce text lines", ParDo.of(new DeterministicallyConstructTestTextLineFn()))
-        .apply("Write content to files", TextIO.write().to(filenamePrefix).withOutputFilenames())
-        .getPerDestinationOutputFilenames().apply(Values.create());
+  /** IO IT with no compression. */
+  @RunWith(JUnit4.class)
+  public static class UncompressedTextIOIT {

Review comment:
Yes, it works, but it runs all 4 tests that are in the file. Now I think this is probably not what we want. It won't be a problem, though, as you suggested an even better solution in the comment below.
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16265393#comment-16265393 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

lgajowy commented on a change in pull request #4149: [BEAM-3060] Add Compressed TextIOIT
URL: https://github.com/apache/beam/pull/4149#discussion_r152999363

## File path: sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java ##

@@ -83,25 +90,82 @@ private static String appendTimestamp(String filenamePrefix) {
     return String.format("%s_%s", filenamePrefix, new Date().getTime());
   }

-  @Test
-  public void writeThenReadAll() {
-    PCollection<String> testFilenames = pipeline
-        .apply("Generate sequence", GenerateSequence.from(0).to(numberOfTextLines))
-        .apply("Produce text lines", ParDo.of(new DeterministicallyConstructTestTextLineFn()))
-        .apply("Write content to files", TextIO.write().to(filenamePrefix).withOutputFilenames())
-        .getPerDestinationOutputFilenames().apply(Values.create());
+  /** IO IT with no compression. */
+  @RunWith(JUnit4.class)
+  public static class UncompressedTextIOIT {
+
+    @Rule
+    public TestPipeline pipeline = TestPipeline.create();
+
+    @Test
+    public void writeThenReadAll() {
+      PCollection<String> testFilenames = pipeline
+          .apply("Generate sequence", GenerateSequence.from(0).to(numberOfTextLines))
+          .apply("Produce text lines", ParDo.of(new DeterministicallyConstructTestTextLineFn()))
+          .apply("Write content to files", TextIO.write().to(filenamePrefix).withOutputFilenames())
+          .getPerDestinationOutputFilenames().apply(Values.create());
+
+      PCollection<String> consolidatedHashcode = testFilenames
+          .apply("Read all files", TextIO.readAll())
+          .apply("Calculate hashcode", Combine.globally(new HashingFn()));
+
+      String expectedHash = getExpectedHashForLineCount(numberOfTextLines);
+      PAssert.thatSingleton(consolidatedHashcode).isEqualTo(expectedHash);
+
+      testFilenames.apply("Delete test files", ParDo.of(new DeleteFileFn())
+          .withSideInputs(consolidatedHashcode.apply(View.asSingleton())));
+
+      pipeline.run().waitUntilFinish();
+    }
+  }
+
+  /** IO IT with various compression types. */
+  @RunWith(Parameterized.class)
+  public static class CompressedTextIOIT {
+
+    @Rule
+    public TestPipeline pipeline = TestPipeline.create();
+
+    @Parameterized.Parameters()
+    public static Iterable data() {
+      return ImmutableList.builder()
+          .add(GZIP)
+          .add(DEFLATE)
+          .add(BZIP2)
+          .build();
+    }
+
+    @Parameterized.Parameter()
+    public Compression compression;
+
+    @Test
+    public void writeThenReadAllWithCompression() {
+      TextIO.TypedWrite write = TextIO
+          .write()
+          .to(filenamePrefix)
+          .withOutputFilenames()
+          .withCompression(compression);
+
+      TextIO.ReadAll read = TextIO.readAll().withCompression(AUTO);

-    PCollection<String> consolidatedHashcode = testFilenames
-        .apply("Read all files", TextIO.readAll())
-        .apply("Calculate hashcode", Combine.globally(new HashingFn()));
+      PCollection<String> testFilenames = pipeline

Review comment:
I think it's hard to do right now without modifying perfkit's code. As we checked, perfkit ignores -D parameters because it builds the mvn verify command by itself from the parameters passed to it. I think this could be done in some future contribution. We will file a bug report in perfkit soon.

I think the best solution (at least for now) is to leave the compression type in pipeline options. We pass them to perfkit either way (through `beam_it_options`) and, what imo is more important, compressionType is very test-specific (same as numberOfRecords). WDYT?
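A minimal sketch of the "compression type as a pipeline option" idea, written without Beam on the classpath so that it runs on its own. The names here are assumptions made for illustration: `CompressionType` is a local stand-in for Beam's `Compression` enum, and the hand-rolled parsing of a `--compressionType=...` argument only imitates what a getter on the test's options interface would provide.

```java
import java.util.Locale;

/** Hypothetical sketch, not the code from the pull request. */
final class CompressionOptionSketch {

  /** Local stand-in for Beam's Compression enum (assumption, trimmed to the values discussed). */
  enum CompressionType { UNCOMPRESSED, GZIP, DEFLATE, BZIP2 }

  /**
   * Picks the compression type out of "--name=value" style arguments.
   * Falls back to UNCOMPRESSED when the option is absent; an unknown
   * value makes valueOf throw IllegalArgumentException.
   */
  static CompressionType parseCompressionType(String[] args) {
    final String flag = "--compressionType=";
    for (String arg : args) {
      if (arg.startsWith(flag)) {
        String value = arg.substring(flag.length()).trim();
        return CompressionType.valueOf(value.toUpperCase(Locale.ROOT));
      }
    }
    return CompressionType.UNCOMPRESSED;
  }

  public static void main(String[] args) {
    String[] itOptions = {"--numberOfRecords=1000", "--compressionType=gzip"};
    System.out.println(parseCompressionType(itOptions));
  }
}
```

With the value carried in the options, a single test method can serve both the compressed and uncompressed runs, which is the code sharing asked for earlier in the review.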
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16265420#comment-16265420 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

lgajowy commented on issue #4149: [BEAM-3060] Add Compressed TextIOIT
URL: https://github.com/apache/beam/pull/4149#issuecomment-346866400

@chamikaramj thanks for the review! Here's another batch of changes, as commented above.
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16265643#comment-16265643 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

chamikaramj commented on issue #4149: [BEAM-3060] Add Compressed TextIOIT
URL: https://github.com/apache/beam/pull/4149#issuecomment-346929621

LGTM
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16265649#comment-16265649 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

chamikaramj commented on a change in pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO
URL: https://github.com/apache/beam/pull/4169#discussion_r153042101

## File path: sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java ##

@@ -81,13 +81,38 @@ public static void setup() throws ParseException {
         .as(IOTestPipelineOptions.class);

     numberOfTextLines = options.getNumberOfRecords();
-    filenamePrefix = appendTimestamp(options.getFilenamePrefix());
+    filenamePrefix = resolveProtocolAndPath(options);
   }

   private static String appendTimestamp(String filenamePrefix) {
     return String.format("%s_%s", filenamePrefix, new Date().getTime());
   }

+  private static String resolveProtocolAndPath(IOTestPipelineOptions options) {

Review comment:
I'm not sure why we need to parse and reassemble the protocol here. We shouldn't have to do this if we ask the user to give the full prefix, including the protocol.
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16265648#comment-16265648 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

chamikaramj commented on a change in pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO
URL: https://github.com/apache/beam/pull/4169#discussion_r153042000

## File path: sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/IOTestPipelineOptions.java ##

@@ -100,4 +100,14 @@
   String getFilenamePrefix();

   void setFilenamePrefix(String prefix);
+
+  @Description("Google cloud storage - bucket_name/path")
+  String getGcsLocation();

Review comment:
Why don't we use filenamePrefix for all file-systems instead of introducing a property per file-system?
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16265650#comment-16265650 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

chamikaramj commented on a change in pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO
URL: https://github.com/apache/beam/pull/4169#discussion_r153042053

## File path: sdks/java/io/file-based-io-tests/pom.xml ##

@@ -139,6 +139,24 @@
+
+    <profile>
+      <id>google-cloud-storage</id>
+      <activation>
+        <property>
+          <name>filesystem</name>
+          <value>GCS</value>

Review comment:
Does this require GCS to be all caps? If so, is there a way to not require that?
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267286#comment-16267286 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

szewi commented on a change in pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO
URL: https://github.com/apache/beam/pull/4169#discussion_r153293511

## File path: sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/IOTestPipelineOptions.java ##

@@ -100,4 +100,14 @@
   String getFilenamePrefix();

   void setFilenamePrefix(String prefix);
+
+  @Description("Google cloud storage - bucket_name/path")
+  String getGcsLocation();

Review comment:
We can use `--filenamePrefix`, but then we need to provide the full communication scheme there for GCS or HDFS, for instance `gs://bucket/path/file` or `hdfs://hadoop-master:port/dfs-path/file`. If we assume that the user running the tests will know it, then those two options, gcsLocation and hdfsLocation, could be omitted. This is basically an implementation of our proposal https://docs.google.com/document/d/1dA-5s6OHiP_cz-NRAbwapoKF5MEC1wKps4A5tFbIPKE/edit#heading=h.29mfbxd6kc64 .

Do you think it would be better to remove those two pipeline options and just depend on pipelinePrefix? Should I also remove the protocol-resolving part then?
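To illustrate the "full prefix including the scheme" option discussed above, here is a small sketch using only the JDK. `appendTimestamp` is copied from the diff quoted in this thread; the class name, the `schemeOf` helper, and the port number in the example are hypothetical. The point is that nothing has to be parsed and reassembled: the scheme travels inside the prefix, and the sketch reads it back only to show that.

```java
import java.net.URI;
import java.util.Date;

/** Hypothetical sketch, not the code from the pull request. */
final class FilenamePrefixSketch {

  /** Same as the helper shown in the TextIOIT diff: adds a timestamp suffix to the prefix. */
  static String appendTimestamp(String filenamePrefix) {
    return String.format("%s_%s", filenamePrefix, new Date().getTime());
  }

  /** Returns the scheme of a prefix, or "file" for a plain local path (illustrative helper). */
  static String schemeOf(String filenamePrefix) {
    String scheme = URI.create(filenamePrefix).getScheme();
    return scheme == null ? "file" : scheme;
  }

  public static void main(String[] args) {
    String[] prefixes = {
      "gs://bucket/path/file",
      "hdfs://hadoop-master:9000/dfs-path/file", // port 9000 is a made-up example
      "/tmp/textioit"
    };
    for (String prefix : prefixes) {
      System.out.println(schemeOf(prefix) + " -> " + appendTimestamp(prefix));
    }
  }
}
```

With this shape the test needs a single `--filenamePrefix` option for every file-system, at the cost of requiring the person running it to know the scheme.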
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267285#comment-16267285 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

kamszPolidea commented on a change in pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO
URL: https://github.com/apache/beam/pull/4169#discussion_r153293304

## File path: sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/IOTestPipelineOptions.java ##
@@ -100,4 +100,14 @@
   String getFilenamePrefix();
 
   void setFilenamePrefix(String prefix);
+
+  @Description("Google cloud storage - bucket_name/path")
+  String getGcsLocation();

Review comment:
We can use `--filenamePrefix`, but then we need to provide the full URI scheme there for GCS or HDFS, for instance `gs://bucket/path/file` or `hdfs://hadoop-master:port/dfs-path/file`. If we assume that the user running the tests will know this, then the two options gcsLocation and hdfsLocation could be omitted. This is basically an implementation of our proposal https://docs.google.com/document/d/1dA-5s6OHiP_cz-NRAbwapoKF5MEC1wKps4A5tFbIPKE/edit#heading=h.29mfbxd6kc64 . Do you think it would be better to remove those two pipeline options and just depend on filenamePrefix? Should I also remove the protocol-resolving part then?
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267291#comment-16267291 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

szewi commented on a change in pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO
URL: https://github.com/apache/beam/pull/4169#discussion_r153293998

## File path: sdks/java/io/file-based-io-tests/pom.xml ##
@@ -139,6 +139,24 @@
+
+    <profile>
+      <id>google-cloud-storage</id>
+      <activation>
+        <property>
+          <name>filesystem</name>
+          <value>GCS</value>

Review comment:
When `-Dfilesystem=gcs` is provided, it won't activate this profile. We should decide whether an uppercase or a lowercase property value is better.
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267298#comment-16267298 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

szewi commented on a change in pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO
URL: https://github.com/apache/beam/pull/4169#discussion_r153293511

## File path: sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/IOTestPipelineOptions.java ##
@@ -100,4 +100,14 @@
   String getFilenamePrefix();
 
   void setFilenamePrefix(String prefix);
+
+  @Description("Google cloud storage - bucket_name/path")
+  String getGcsLocation();

Review comment:
We can use `--filenamePrefix`, but then we need to provide the full URI scheme there for GCS or HDFS, for instance `gs://bucket/path/file` or `hdfs://hadoop-master:port/dfs-path/file`. If we assume that the user running the tests will know this, then the two options gcsLocation and hdfsLocation could be omitted. This is basically an implementation of our proposal https://docs.google.com/document/d/1dA-5s6OHiP_cz-NRAbwapoKF5MEC1wKps4A5tFbIPKE/edit#heading=h.29mfbxd6kc64 . Do you think it would be better to remove those two pipeline options and just depend on filenamePrefix? Should I also remove the protocol-resolving part then?
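To make the trade-off in this comment concrete, here is a minimal sketch of how a test could tell which filesystem a full `--filenamePrefix` points at, which is what would make separate `gcsLocation` and `hdfsLocation` options unnecessary. The class and method names (`FilenamePrefixScheme`, `schemeOf`) are hypothetical and not part of Beam; treating a prefix without `://` as a local path is also an assumption.

```java
import java.util.Locale;

/** Hypothetical helper, not part of Beam: reads the scheme from a full filename prefix. */
class FilenamePrefixScheme {

  /** Scheme reported for prefixes without "://" (assumption: these are local paths). */
  static final String LOCAL_SCHEME = "file";

  /** Returns the lowercased scheme of the prefix, or LOCAL_SCHEME if none is present. */
  static String schemeOf(String filenamePrefix) {
    int separator = filenamePrefix.indexOf("://");
    if (separator <= 0) {
      return LOCAL_SCHEME;
    }
    return filenamePrefix.substring(0, separator).toLowerCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    // The two remote examples are the ones quoted in the review comment.
    System.out.println(schemeOf("gs://bucket/path/file"));
    System.out.println(schemeOf("hdfs://hadoop-master:port/dfs-path/file"));
    System.out.println(schemeOf("/tmp/TEXTIOIT"));
  }
}
```

Plain string matching is used here rather than `java.net.URI` so that the placeholder `port` in the HDFS example does not cause a parse failure.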
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267301#comment-16267301 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

szewi commented on a change in pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO
URL: https://github.com/apache/beam/pull/4169#discussion_r153296166

## File path: sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java ##
@@ -81,13 +81,38 @@ public static void setup() throws ParseException {
         .as(IOTestPipelineOptions.class);
 
     numberOfTextLines = options.getNumberOfRecords();
-    filenamePrefix = appendTimestamp(options.getFilenamePrefix());
+    filenamePrefix = resolveProtocolAndPath(options);
   }
 
   private static String appendTimestamp(String filenamePrefix) {
     return String.format("%s_%s", filenamePrefix, new Date().getTime());
   }
 
+  private static String resolveProtocolAndPath(IOTestPipelineOptions options) {

Review comment:
Sure. However, if in the future we have some kind of "validator" that validates test input parameters, then I guess it would reassemble `--filenamePrefix`.
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267683#comment-16267683 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

chamikaramj commented on issue #4149: [BEAM-3060] Add Compressed TextIOIT
URL: https://github.com/apache/beam/pull/4149#issuecomment-347345709

Merged. Closing.
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267684#comment-16267684 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

chamikaramj closed pull request #4149: [BEAM-3060] Add Compressed TextIOIT
URL: https://github.com/apache/beam/pull/4149

This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic):

diff --git a/sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/IOTestPipelineOptions.java b/sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/IOTestPipelineOptions.java
index 91b3aa6d344..5a29d4f8126 100644
--- a/sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/IOTestPipelineOptions.java
+++ b/sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/IOTestPipelineOptions.java
@@ -100,4 +100,10 @@
   String getFilenamePrefix();
 
   void setFilenamePrefix(String prefix);
+
+  @Description("File compression type for writing and reading test files")
+  @Default.String("UNCOMPRESSED")
+  String getCompressionType();
+
+  void setCompressionType(String compressionType);
 }
diff --git a/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java b/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java
index ecab1d86497..1b9b385a1ff 100644
--- a/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java
+++ b/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java
@@ -15,18 +15,24 @@
  * See the License for the specific language governing permissions and
  * limitations under the License.
  */
+
 package org.apache.beam.sdk.io.text;
 
+import static org.apache.beam.sdk.io.Compression.AUTO;
+
 import com.google.common.base.Function;
 import com.google.common.collect.FluentIterable;
 import com.google.common.collect.ImmutableMap;
 import com.google.common.collect.Iterables;
+
 import java.io.IOException;
 import java.text.ParseException;
 import java.util.Collection;
 import java.util.Collections;
 import java.util.Date;
 import java.util.Map;
+
+import org.apache.beam.sdk.io.Compression;
 import org.apache.beam.sdk.io.FileSystems;
 import org.apache.beam.sdk.io.GenerateSequence;
 import org.apache.beam.sdk.io.TextIO;
@@ -50,13 +56,16 @@ import org.junit.runners.JUnit4;
 
 /**
- * An integration test for {@link org.apache.beam.sdk.io.TextIO}.
+ * Integration tests for {@link org.apache.beam.sdk.io.TextIO}.
  *
  * Run this test using the command below. Pass in connection information via PipelineOptions:
  *
- *  mvn -e -Pio-it verify -pl sdks/java/io/text -DintegrationTestPipelineOptions='[
+ *  mvn -e -Pio-it verify -pl sdks/java/io/file-based-io-tests
+ *  -Dit.test=org.apache.beam.sdk.io.text.TextIOIT
+ *  -DintegrationTestPipelineOptions='[
  *  "--numberOfRecords=10",
  *  "--filenamePrefix=TEXTIOIT"
+ *  "--compressionType=GZIP"
  *  ]'
  *
  *
  */
@@ -65,6 +74,7 @@
 
   private static String filenamePrefix;
   private static Long numberOfTextLines;
+  private static Compression compressionType;
 
   @Rule
   public TestPipeline pipeline = TestPipeline.create();
@@ -77,6 +87,16 @@ public static void setup() throws ParseException {
 
     numberOfTextLines = options.getNumberOfRecords();
     filenamePrefix = appendTimestamp(options.getFilenamePrefix());
+    compressionType = parseCompressionType(options.getCompressionType());
+  }
+
+  private static Compression parseCompressionType(String compressionType) {
+    try {
+      return Compression.valueOf(compressionType.toUpperCase());
+    } catch (IllegalArgumentException ex) {
+      throw new IllegalArgumentException(
+          String.format("Unsupported compression type: %s", compressionType));
+    }
   }
 
   private static String appendTimestamp(String filenamePrefix) {
@@ -85,14 +105,20 @@ private static String appendTimestamp(String filenamePrefix) {
 
   @Test
   public void writeThenReadAll() {
+    TextIO.TypedWrite<String, Object> write = TextIO
+        .write()
+        .to(filenamePrefix)
+        .withOutputFilenames()
+        .withCompression(compressionType);
+
     PCollection<String> testFilenames = pipeline
         .apply("Generate sequence", GenerateSequence.from(0).to(numberOfTextLines))
         .apply("Produce text lines", ParDo.of(new DeterministicallyConstructTestTextLineFn()))
-        .apply("Write content to files", TextIO.write().to(filenamePrefix).withOutputFilenames())
+        .apply("Write content to files", write)
         .getPerDestinationOutputFilenames().apply(Values.create());
 
     PCollection<String> consolidatedHashcode = testFilenames
-        .apply("Read all files", TextIO.readAll())
+        .apply("Read a
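The `parseCompressionType` helper added in this diff maps the string pipeline option onto an enum constant. Below is a self-contained sketch of that pattern that runs without Beam on the classpath. `SketchCompression` is a stand-in enum invented for this example (the merged code uses `org.apache.beam.sdk.io.Compression`), and both the use of `Locale.ROOT` and the chaining of the original exception as the cause are variations of this sketch, not something the merged code does.

```java
import java.util.Locale;

/** Hypothetical, Beam-free illustration of parsing a compression option string. */
class CompressionOptionParser {

  /** Stand-in for org.apache.beam.sdk.io.Compression; the constants are illustrative. */
  enum SketchCompression { UNCOMPRESSED, GZIP, BZIP2 }

  /** Parses the option case-insensitively and fails with a descriptive message. */
  static SketchCompression parse(String compressionType) {
    try {
      // Locale.ROOT keeps upper-casing independent of the JVM's default locale.
      return SketchCompression.valueOf(compressionType.toUpperCase(Locale.ROOT));
    } catch (IllegalArgumentException ex) {
      // Keep the original exception as the cause so the stack trace is not lost.
      throw new IllegalArgumentException(
          String.format("Unsupported compression type: %s", compressionType), ex);
    }
  }

  public static void main(String[] args) {
    System.out.println(parse("gzip"));
    System.out.println(parse("UNCOMPRESSED"));
  }
}
```

With a parser like this, `--compressionType=gzip` and `--compressionType=GZIP` would be treated the same, while an unknown value fails before any pipeline is built.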
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267956#comment-16267956 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

chamikaramj commented on a change in pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO
URL: https://github.com/apache/beam/pull/4169#discussion_r153375847

## File path: sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/IOTestPipelineOptions.java ##
@@ -100,4 +100,14 @@
   String getFilenamePrefix();
 
   void setFilenamePrefix(String prefix);
+
+  @Description("Google cloud storage - bucket_name/path")
+  String getGcsLocation();

Review comment:
I think it makes sense to simplify and make `filenamePrefix` the full prefix. Later, if we hit a case where this is inadequate, we can revisit.
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267957#comment-16267957 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

chamikaramj commented on a change in pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO
URL: https://github.com/apache/beam/pull/4169#discussion_r153376819

## File path: sdks/java/io/file-based-io-tests/pom.xml ##
@@ -139,6 +139,24 @@
+
+    <profile>
+      <id>google-cloud-storage</id>
+      <activation>
+        <property>
+          <name>filesystem</name>
+          <value>GCS</value>

Review comment:
Would one of the solutions mentioned here work: https://stackoverflow.com/questions/10521860/property-autocapitalization-in-maven ?
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16267958#comment-16267958 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

chamikaramj commented on a change in pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO
URL: https://github.com/apache/beam/pull/4169#discussion_r153376493

## File path: sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java ##
@@ -81,13 +81,38 @@ public static void setup() throws ParseException {
         .as(IOTestPipelineOptions.class);
 
     numberOfTextLines = options.getNumberOfRecords();
-    filenamePrefix = appendTimestamp(options.getFilenamePrefix());
+    filenamePrefix = resolveProtocolAndPath(options);
   }
 
   private static String appendTimestamp(String filenamePrefix) {
     return String.format("%s_%s", filenamePrefix, new Date().getTime());
   }
 
+  private static String resolveProtocolAndPath(IOTestPipelineOptions options) {

Review comment:
Yeah, we can have a validator that makes sure that the filesystem property and `filenamePrefix` match.
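A validator of the kind mentioned in this comment could be sketched roughly as follows. Everything in it is hypothetical: the class name, the accepted `filesystem` values (`gcs`, `hdfs`, `local`), and the mapping from those values to URI schemes are assumptions made for illustration, since the thread does not define them. The comparison ignores case, so on the Java side `GCS` and `gcs` would be treated alike; it does not change how Maven matches the profile activation value.

```java
import java.util.Locale;

/** Hypothetical validator, not part of Beam: checks a filesystem name against a prefix. */
class FilesystemPrefixValidator {

  /** Returns the scheme a prefix is expected to start with; "" means no scheme (local). */
  static String expectedScheme(String filesystem) {
    switch (filesystem.toLowerCase(Locale.ROOT)) {
      case "gcs":
        return "gs://";
      case "hdfs":
        return "hdfs://";
      case "local":
        return "";
      default:
        throw new IllegalArgumentException("Unknown filesystem: " + filesystem);
    }
  }

  /** True when the prefix is consistent with the declared filesystem. */
  static boolean matches(String filesystem, String filenamePrefix) {
    String scheme = expectedScheme(filesystem);
    if (scheme.isEmpty()) {
      // Assumption: a local prefix is a plain path without any "scheme://" part.
      return !filenamePrefix.contains("://");
    }
    return filenamePrefix.startsWith(scheme);
  }

  public static void main(String[] args) {
    System.out.println(matches("GCS", "gs://bucket/path/file"));
    System.out.println(matches("hdfs", "gs://bucket/path/file"));
  }
}
```

A test setup method could call `matches` once and fail fast with a clear message, instead of letting a mismatched prefix surface later as a filesystem error.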
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16269717#comment-16269717 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

lgajowy opened a new pull request #4189: [BEAM-3060] add TFRecordIOIT
URL: https://github.com/apache/beam/pull/4189

Follow this checklist to help us incorporate your contribution quickly and easily:

- [ ] Make sure there is a [JIRA issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes.
- [ ] Each commit in the pull request should have a meaningful subject line and body.
- [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue.
- [ ] Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
- [ ] Run `mvn clean verify` to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
- [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

---

Another test for the BEAM-3060 task. This one uses two pipelines (there seems to be no other way yet). I filed a JIRA issue regarding that: https://issues.apache.org/jira/browse/BEAM-3267

@chamikaramj could you take a look?
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16269723#comment-16269723 ]

ASF GitHub Bot commented on BEAM-3060:

chamikaramj commented on issue #4189: [BEAM-3060] add TFRecordIOIT
URL: https://github.com/apache/beam/pull/4189#issuecomment-347701609

R: @jkff can you take this?
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16271614#comment-16271614 ]

ASF GitHub Bot commented on BEAM-3060:

jkff commented on a change in pull request #4189: [BEAM-3060] add TFRecordIOIT
URL: https://github.com/apache/beam/pull/4189#discussion_r153927827

## File path: sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/common/AbstractFileBasedIOIT.java ##
@@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.common;
+
+import com.google.common.base.Function;
+import com.google.common.collect.FluentIterable;
+import com.google.common.collect.ImmutableMap;
+import com.google.common.collect.Iterables;
+import java.io.IOException;
+import java.util.Collection;
+import java.util.Collections;
+import java.util.Date;
+import java.util.Map;
+import org.apache.beam.sdk.io.Compression;
+import org.apache.beam.sdk.io.FileSystems;
+import org.apache.beam.sdk.io.fs.MatchResult;
+import org.apache.beam.sdk.io.fs.ResourceId;
+import org.apache.beam.sdk.options.PipelineOptionsFactory;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.transforms.DoFn;
+
+/**
+ * Abstract class for file based IO Integration tests.
+ */
+public abstract class AbstractFileBasedIOIT {

Review comment:
There are no abstract methods in this class; it's just a collection of utility methods. In general, inheritance is harder to deal with than composition. I suggest changing this class to be non-abstract but with a private constructor (non-instantiable), and having callers call its static methods directly rather than inheriting from the class.
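The shape the reviewer describes can be sketched as follows. This is only an illustration under assumptions: the class names `FileBasedIOITHelper` and `ExampleIOIT` are hypothetical, and only one of the helpers from the hunk is reproduced so the sketch stays free of Beam dependencies.

```java
import java.util.Date;

/** Hypothetical non-abstract replacement for the utility methods (name is an assumption). */
final class FileBasedIOITHelper {

  // Private constructor: the class only groups static helpers and is never instantiated.
  private FileBasedIOITHelper() {}

  /** Same logic as appendTimestampToPrefix in the reviewed hunk. */
  static String appendTimestampToPrefix(String filenamePrefix) {
    return String.format("%s_%s", filenamePrefix, new Date().getTime());
  }
}

/** Hypothetical test class: it calls the helper statically instead of extending a base class. */
final class ExampleIOIT {

  static String uniquePrefix() {
    return FileBasedIOITHelper.appendTimestampToPrefix("output");
  }

  public static void main(String[] args) {
    // Prints the prefix followed by an underscore and the current epoch milliseconds.
    System.out.println(uniquePrefix());
  }
}
```

With this layout a test class keeps its single superclass slot free and depends only on the helpers it actually calls.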
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16271617#comment-16271617 ]

ASF GitHub Bot commented on BEAM-3060:

jkff commented on a change in pull request #4189: [BEAM-3060] add TFRecordIOIT
URL: https://github.com/apache/beam/pull/4189#discussion_r153927333

## File path: sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/common/AbstractFileBasedIOIT.java ##
@@ -0,0 +1,111 @@
[...]
+/**
+ * Abstract class for file based IO Integration tests.
+ */
+public abstract class AbstractFileBasedIOIT {
+
+  protected static IOTestPipelineOptions readTestPipelineOptions() {
+    PipelineOptionsFactory.register(IOTestPipelineOptions.class);
+    return TestPipeline.testingPipelineOptions().as(IOTestPipelineOptions.class);
+  }
+
+  protected static String appendTimestampToPrefix(String filenamePrefix) {
+    return String.format("%s_%s", filenamePrefix, new Date().getTime());
+  }
+
+  protected static Compression parseCompressionType(String compressionType) {
+    try {
+      return Compression.valueOf(compressionType.toUpperCase());
+    } catch (IllegalArgumentException ex) {
+      throw new IllegalArgumentException(
+          String.format("Unsupported compression type: %s", compressionType));
+    }
+  }
+
+  protected String getExpectedHashForLineCount(Long lineCount) {
+    Map<Long, String> expectedHashes = ImmutableMap.of(
+        100_000L, "4c8bb3b99dcc59459b20fefba400d446",
+        1_000_000L, "9796db06e7a7960f974d5a91164afff1",
+        100_000_000L, "6ce05f456e2fdc846ded2abd0ec1de95"
+    );
+
+    String hash = expectedHashes.get(lineCount);
+    if (hash == null) {
+      throw new UnsupportedOperationException(
+          String.format("No hash for that line count: %s", lineCount)
+      );
+    }
+    return hash;
+  }
+
+  /**
+   * Constructs text lines in files used for testing.
+   */
+  public static class DeterministicallyConstructTestTextLineFn extends DoFn<Long, String> {
+    @ProcessElement
+    public void processElement(ProcessContext c) {
+      c.output(String.format("IO IT Test line of text. Line seed: %s", c.element()));
+    }
+  }
+
+  /**
+   * Deletes matching files using the FileSystems API.
+   */
+  public static class DeleteFileFn extends DoFn<String, Void> {
+
+    @ProcessElement
+    public void processElement(ProcessContext c) throws IOException {
+      MatchResult match = Iterables
+          .getOnlyElement(FileSystems.match(Collections.singletonList(c.element())));
+
+      Collection<ResourceId> resourceIds = toResourceIds(match);
+
+      FileSystems.delete(resourceIds);
+    }
+    private Collection<ResourceId> toResourceIds(MatchResult match) throws IOException {

Review comment:
This function probably occupies about 2x the amount of code that a simple loop would :) I suggest inlining it and replacing it with a loop.
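The hunk stops before the body of `toResourceIds`, so what it contained is not shown here. As a rough sketch of the loop form the reviewer asks for, the example below uses stand-in types instead of Beam's `MatchResult` and `ResourceId`; `FileMetadata`, `resourceIdsOf`, and the string ids are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

final class InlineLoopSketch {

  /** Stand-in for one matched file's metadata; hypothetical, not a Beam type. */
  static final class FileMetadata {
    private final String resourceId;

    FileMetadata(String resourceId) {
      this.resourceId = resourceId;
    }

    String resourceId() {
      return resourceId;
    }
  }

  /** Collects the resource id of every matched file with a plain loop. */
  static Collection<String> resourceIdsOf(List<FileMetadata> matched) {
    List<String> resourceIds = new ArrayList<>();
    for (FileMetadata metadata : matched) {
      resourceIds.add(metadata.resourceId());
    }
    return resourceIds;
  }

  public static void main(String[] args) {
    List<FileMetadata> matched =
        Arrays.asList(new FileMetadata("output-00000"), new FileMetadata("output-00001"));
    System.out.println(resourceIdsOf(matched));
  }
}
```

In the reviewed class the loop would sit directly inside `processElement`, between the match call and the delete call, so no separate helper method is needed.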
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16271613#comment-16271613 ]

ASF GitHub Bot commented on BEAM-3060:

jkff commented on a change in pull request #4189: [BEAM-3060] add TFRecordIOIT
URL: https://github.com/apache/beam/pull/4189#discussion_r153927538

## File path: sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/common/AbstractFileBasedIOIT.java ##
@@ -0,0 +1,111 @@
[...]
+  /**
+   * Deletes matching files using the FileSystems API.
+   */
+  public static class DeleteFileFn extends DoFn<String, Void> {

Review comment:
I wonder if it would make sense for this to be in FileIO: PCollection.apply(FileIO.delete()) or something like that. Might be outside the scope of this PR though.
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16271615#comment-16271615 ]

ASF GitHub Bot commented on BEAM-3060:

jkff commented on a change in pull request #4189: [BEAM-3060] add TFRecordIOIT
URL: https://github.com/apache/beam/pull/4189#discussion_r153927154

## File path: sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/common/AbstractFileBasedIOIT.java ##
@@ -0,0 +1,111 @@
[...]
+  protected static Compression parseCompressionType(String compressionType) {
+    try {
+      return Compression.valueOf(compressionType.toUpperCase());

Review comment:
I'm not sure this function is giving much benefit. I think it's not too much to ask a user to specify the compression type in uppercase, and we're also catching an IllegalArgumentException only to throw the same exception. I suggest just using valueOf() instead of this whole function.
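The point about `valueOf()` can be illustrated with a plain Java enum. The sketch below uses a stand-in enum with a hypothetical subset of constants rather than Beam's `Compression`; it relies only on the standard behaviour that `valueOf` is case-sensitive and throws `IllegalArgumentException` for an unknown name.

```java
final class CompressionValueOfSketch {

  /** Stand-in enum; a hypothetical subset of constants, not Beam's Compression. */
  enum Compression { AUTO, UNCOMPRESSED, GZIP }

  public static void main(String[] args) {
    // An exact, uppercase constant name resolves directly.
    System.out.println(Compression.valueOf("GZIP"));

    // Any other spelling already raises IllegalArgumentException,
    // so a wrapper that rethrows the same exception type adds little.
    try {
      Compression.valueOf("gzip");
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```

The only behaviour lost by dropping the wrapper is the case-insensitive lookup and the custom message text.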
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16271616#comment-16271616 ]

ASF GitHub Bot commented on BEAM-3060:

jkff commented on a change in pull request #4189: [BEAM-3060] add TFRecordIOIT
URL: https://github.com/apache/beam/pull/4189#discussion_r153928262

## File path: sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/tfrecord/TFRecordIOIT.java ##
@@ -0,0 +1,133 @@
[...]
+package org.apache.beam.sdk.io.tfrecord;
+
+import static org.apache.beam.sdk.io.Compression.AUTO;
+
+import java.text.ParseException;
+import org.apache.beam.sdk.io.Compression;
+import org.apache.beam.sdk.io.GenerateSequence;
+import org.apache.beam.sdk.io.TFRecordIO;
+import org.apache.beam.sdk.io.common.AbstractFileBasedIOIT;
+import org.apache.beam.sdk.io.common.HashingFn;
+import org.apache.beam.sdk.io.common.IOTestPipelineOptions;
+import org.apache.beam.sdk.testing.PAssert;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.transforms.Combine;
+import org.apache.beam.sdk.transforms.Create;
+import org.apache.beam.sdk.transforms.MapElements;
+import org.apache.beam.sdk.transforms.ParDo;
+import org.apache.beam.sdk.transforms.SimpleFunction;
+import org.apache.beam.sdk.transforms.View;
+import org.apache.beam.sdk.values.PCollection;
+import org.junit.BeforeClass;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.junit.runners.JUnit4;
+
+/**
+ * Integration tests for {@link org.apache.beam.sdk.io.TFRecordIO}.
+ *
+ * Run this test using the command below. Pass in connection information via PipelineOptions:
+ *
+ *  mvn -e -Pio-it verify -pl sdks/java/io/file-based-io-tests
+ *  -Dit.test=org.apache.beam.sdk.io.tfrecord.TFRecordIOIT
+ *  -DintegrationTestPipelineOptions='[
+ *  "--numberOfRecords=10",
+ *  "--filenamePrefix=FILEBASEDIOIT"

Review comment:
This would not actually be a valid prefix, right? It should be a real path, e.g. `gs://some-bucket/output`?
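The difference between the two prefixes in the comment above is that one carries a filesystem scheme and the other is a bare name. The sketch below only inspects the scheme with `java.net.URI`; the class and method names are hypothetical, and whether a bare name is accepted depends on the filesystem in use, which this example does not check.

```java
import java.net.URI;

final class FilenamePrefixSketch {

  /** Returns the URI scheme of a prefix, or "(none)" when the prefix is a bare name. */
  static String schemeOf(String filenamePrefix) {
    String scheme = URI.create(filenamePrefix).getScheme();
    return scheme == null ? "(none)" : scheme;
  }

  public static void main(String[] args) {
    // The javadoc example value: no scheme, so it does not name a bucket or filesystem.
    System.out.println(schemeOf("FILEBASEDIOIT"));
    // The reviewer's example: the scheme identifies the target filesystem.
    System.out.println(schemeOf("gs://some-bucket/output"));
  }
}
```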
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273472#comment-16273472 ]

ASF GitHub Bot commented on BEAM-3060:

chamikaramj commented on issue #4169: [BEAM-3060] Added support for multiple filesystems in TextIO
URL: https://github.com/apache/beam/pull/4169#issuecomment-348325070

LGTM
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273473#comment-16273473 ]

ASF GitHub Bot commented on BEAM-3060:

chamikaramj commented on issue #4169: [BEAM-3060] Added support for multiple filesystems in TextIO
URL: https://github.com/apache/beam/pull/4169#issuecomment-348325116

Run Java PreCommit
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273630#comment-16273630 ]

ASF GitHub Bot commented on BEAM-3060:

chamikaramj closed pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO
URL: https://github.com/apache/beam/pull/4169

This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic):

diff --git a/sdks/java/io/file-based-io-tests/pom.xml b/sdks/java/io/file-based-io-tests/pom.xml
index 6c3a7e3718b..812bfea363a 100644
--- a/sdks/java/io/file-based-io-tests/pom.xml
+++ b/sdks/java/io/file-based-io-tests/pom.xml
@@ -139,6 +139,24 @@
+
+        <profile>
+            <id>google-cloud-storage</id>
+            <activation>
+                <property>
+                    <name>filesystem</name>
+                    <value>gcs</value>
+                </property>
+            </activation>
+            <dependencies>
+                <dependency>
+                    <groupId>org.apache.beam</groupId>
+                    <artifactId>beam-sdks-java-io-google-cloud-platform</artifactId>
+                    <scope>runtime</scope>
+                </dependency>
+            </dependencies>
+        </profile>
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273673#comment-16273673 ]

ASF GitHub Bot commented on BEAM-3060:

lgajowy commented on a change in pull request #4189: [BEAM-3060] add TFRecordIOIT
URL: https://github.com/apache/beam/pull/4189#discussion_r154234845

## File path: sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/common/AbstractFileBasedIOIT.java ##
@@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.common;
+
+import com.google.common.base.Function;
+import com.google.common.collect.FluentIterable;
+import com.google.common.collect.ImmutableMap;
+import com.google.common.collect.Iterables;
+import java.io.IOException;
+import java.util.Collection;
+import java.util.Collections;
+import java.util.Date;
+import java.util.Map;
+import org.apache.beam.sdk.io.Compression;
+import org.apache.beam.sdk.io.FileSystems;
+import org.apache.beam.sdk.io.fs.MatchResult;
+import org.apache.beam.sdk.io.fs.ResourceId;
+import org.apache.beam.sdk.options.PipelineOptionsFactory;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.transforms.DoFn;
+
+/**
+ * Abstract class for file based IO Integration tests.
+ */
+public abstract class AbstractFileBasedIOIT {
+
+  protected static IOTestPipelineOptions readTestPipelineOptions() {
+    PipelineOptionsFactory.register(IOTestPipelineOptions.class);
+    return TestPipeline.testingPipelineOptions().as(IOTestPipelineOptions.class);
+  }
+
+  protected static String appendTimestampToPrefix(String filenamePrefix) {
+    return String.format("%s_%s", filenamePrefix, new Date().getTime());
+  }
+
+  protected static Compression parseCompressionType(String compressionType) {
+    try {
+      return Compression.valueOf(compressionType.toUpperCase());
+    } catch (IllegalArgumentException ex) {
+      throw new IllegalArgumentException(
+          String.format("Unsupported compression type: %s", compressionType));
+    }
+  }
+
+  protected String getExpectedHashForLineCount(Long lineCount) {
+    Map<Long, String> expectedHashes = ImmutableMap.of(
+        100_000L, "4c8bb3b99dcc59459b20fefba400d446",
+        1_000_000L, "9796db06e7a7960f974d5a91164afff1",
+        100_000_000L, "6ce05f456e2fdc846ded2abd0ec1de95"
+    );
+
+    String hash = expectedHashes.get(lineCount);
+    if (hash == null) {
+      throw new UnsupportedOperationException(
+          String.format("No hash for that line count: %s", lineCount)
+      );
+    }
+    return hash;
+  }
+
+  /**
+   * Constructs text lines in files used for testing.
+   */
+  public static class DeterministicallyConstructTestTextLineFn extends DoFn<Long, String> {
+    @ProcessElement
+    public void processElement(ProcessContext c) {
+      c.output(String.format("IO IT Test line of text. Line seed: %s", c.element()));
+    }
+  }
+
+  /**
+   * Deletes matching files using the FileSystems API.
+   */
+  public static class DeleteFileFn extends DoFn<String, Void> {

Review comment:
Do you suggest creating a separate JIRA for that?
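
For readers who want to try the quoted helpers outside Beam, the expected-hash lookup and the deterministic line format can be sketched in plain Java. This is only a minimal sketch, not the PR's code: the class name `ExpectedHashes` and its method names are hypothetical, `java.util.HashMap` stands in for Guava's `ImmutableMap`, and the three hash values are copied from the quoted diff.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-alone version of the quoted helpers (no Beam or Guava needed).
final class ExpectedHashes {

  // Hash values copied from the quoted diff; the keys are line counts.
  private static final Map<Long, String> EXPECTED;

  static {
    Map<Long, String> m = new HashMap<>();
    m.put(100_000L, "4c8bb3b99dcc59459b20fefba400d446");
    m.put(1_000_000L, "9796db06e7a7960f974d5a91164afff1");
    m.put(100_000_000L, "6ce05f456e2fdc846ded2abd0ec1de95");
    EXPECTED = Collections.unmodifiableMap(m);
  }

  private ExpectedHashes() {}

  // Same line format as DeterministicallyConstructTestTextLineFn in the diff.
  static String lineFor(long seed) {
    return String.format("IO IT Test line of text. Line seed: %s", seed);
  }

  // Fails fast for line counts that have no precomputed hash.
  static String forLineCount(long lineCount) {
    String hash = EXPECTED.get(lineCount);
    if (hash == null) {
      throw new UnsupportedOperationException(
          String.format("No hash for that line count: %s", lineCount));
    }
    return hash;
  }
}
```

Because every line depends only on its seed, a given line count always produces the same content, which is what makes a precomputed hash per line count usable as a check.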
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273671#comment-16273671 ]

ASF GitHub Bot commented on BEAM-3060:

lgajowy commented on a change in pull request #4189: [BEAM-3060] add TFRecordIOIT
URL: https://github.com/apache/beam/pull/4189#discussion_r154234810

## File path: sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/tfrecord/TFRecordIOIT.java ##
@@ -0,0 +1,133 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.tfrecord;
+
+import static org.apache.beam.sdk.io.Compression.AUTO;
+
+import java.text.ParseException;
+import org.apache.beam.sdk.io.Compression;
+import org.apache.beam.sdk.io.GenerateSequence;
+import org.apache.beam.sdk.io.TFRecordIO;
+import org.apache.beam.sdk.io.common.AbstractFileBasedIOIT;
+import org.apache.beam.sdk.io.common.HashingFn;
+import org.apache.beam.sdk.io.common.IOTestPipelineOptions;
+import org.apache.beam.sdk.testing.PAssert;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.transforms.Combine;
+import org.apache.beam.sdk.transforms.Create;
+import org.apache.beam.sdk.transforms.MapElements;
+import org.apache.beam.sdk.transforms.ParDo;
+import org.apache.beam.sdk.transforms.SimpleFunction;
+import org.apache.beam.sdk.transforms.View;
+import org.apache.beam.sdk.values.PCollection;
+import org.junit.BeforeClass;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.junit.runners.JUnit4;
+
+/**
+ * Integration tests for {@link org.apache.beam.sdk.io.TFRecordIO}.
+ *
+ * Run this test using the command below. Pass in connection information via PipelineOptions:
+ *
+ * mvn -e -Pio-it verify -pl sdks/java/io/file-based-io-tests
+ * -Dit.test=org.apache.beam.sdk.io.tfrecord.TFRecordIOIT
+ * -DintegrationTestPipelineOptions='[
+ * "--numberOfRecords=10",
+ * "--filenamePrefix=FILEBASEDIOIT"

Review comment:
Actually, because the `to` method resolved the path before submitting the pipeline to Google Cloud, we had a path created for us with the FILEBASEDIOIT name at the end. It looked like `/Users/lukasz/.../FILEBASEDIOIT` and was valid even for Google Cloud Dataflow. I guess this would be an issue on Windows machines: the path would be resolved to something like `c:\lukasz\...\FILEBASEDIOIT`. This would cause an error on GCP, right?

Because of the above, I'll make the filenamePrefix a `@Validation.Required` option and change the comment to suggest giving a custom path.
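
The path behaviour described in this comment can be illustrated with plain `java.nio`. This is a minimal sketch under stated assumptions: `FilenamePrefixCheck` and its two methods are hypothetical names, not Beam APIs, and the scheme check is a simplification that does not claim to mirror how Beam's `FileSystems` picks a filesystem.

```java
import java.nio.file.Paths;

// Hypothetical helper, for illustration only; not part of Beam.
final class FilenamePrefixCheck {
  private FilenamePrefixCheck() {}

  // True when the prefix names a filesystem explicitly, e.g. "gs://bucket/dir/PREFIX".
  static boolean hasScheme(String prefix) {
    return prefix.matches("^[a-zA-Z][a-zA-Z0-9+.-]*://.*");
  }

  // What a bare prefix turns into when it is resolved on the submitting machine:
  // an absolute path in that machine's own syntax.
  static String resolveLocally(String prefix) {
    return Paths.get(prefix).toAbsolutePath().toString();
  }
}
```

A bare prefix such as `FILEBASEDIOIT` is resolved against the working directory of whichever machine builds the pipeline, so the result depends on the local OS. A prefix with an explicit scheme avoids that ambiguity, which is the motivation for asking the user to supply the path.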
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273674#comment-16273674 ]

ASF GitHub Bot commented on BEAM-3060:

lgajowy commented on a change in pull request #4189: [BEAM-3060] add TFRecordIOIT
URL: https://github.com/apache/beam/pull/4189#discussion_r154234867

## File path: sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/common/AbstractFileBasedIOIT.java ##
@@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.common;
+
+import com.google.common.base.Function;
+import com.google.common.collect.FluentIterable;
+import com.google.common.collect.ImmutableMap;
+import com.google.common.collect.Iterables;
+import java.io.IOException;
+import java.util.Collection;
+import java.util.Collections;
+import java.util.Date;
+import java.util.Map;
+import org.apache.beam.sdk.io.Compression;
+import org.apache.beam.sdk.io.FileSystems;
+import org.apache.beam.sdk.io.fs.MatchResult;
+import org.apache.beam.sdk.io.fs.ResourceId;
+import org.apache.beam.sdk.options.PipelineOptionsFactory;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.transforms.DoFn;
+
+/**
+ * Abstract class for file based IO Integration tests.
+ */
+public abstract class AbstractFileBasedIOIT {
+
+  protected static IOTestPipelineOptions readTestPipelineOptions() {
+    PipelineOptionsFactory.register(IOTestPipelineOptions.class);
+    return TestPipeline.testingPipelineOptions().as(IOTestPipelineOptions.class);
+  }
+
+  protected static String appendTimestampToPrefix(String filenamePrefix) {
+    return String.format("%s_%s", filenamePrefix, new Date().getTime());
+  }
+
+  protected static Compression parseCompressionType(String compressionType) {
+    try {
+      return Compression.valueOf(compressionType.toUpperCase());

Review comment:
OK. Also, on second thought, I think it's not the FileBasedIOIT class's responsibility to check this stuff.
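
The `parseCompressionType` pattern quoted in this message (upper-case the option, call `valueOf`, and rethrow with a readable message) can be tried without Beam. The sketch below is only an illustration: `CompressionParsing` and its nested `Kind` enum are hypothetical stand-ins for `org.apache.beam.sdk.io.Compression`, and the constant list is illustrative rather than authoritative. Using `Locale.ROOT` and passing the cause along are additions that are not in the quoted diff.

```java
import java.util.Locale;

// Hypothetical stand-in for the quoted helper; not the PR's code.
final class CompressionParsing {

  // Illustrative constants only; the real Beam enum may differ.
  enum Kind { AUTO, UNCOMPRESSED, GZIP, BZIP2, ZIP, DEFLATE }

  private CompressionParsing() {}

  static Kind parse(String compressionType) {
    try {
      // Locale.ROOT keeps upper-casing stable regardless of the JVM's default locale.
      return Kind.valueOf(compressionType.toUpperCase(Locale.ROOT));
    } catch (IllegalArgumentException ex) {
      // Replace the generic enum error with one that names the offending option value.
      throw new IllegalArgumentException(
          String.format("Unsupported compression type: %s", compressionType), ex);
    }
  }
}
```

For example, `CompressionParsing.parse("gzip")` yields `Kind.GZIP`, while an unknown value fails with "Unsupported compression type: ..." instead of the default `No enum constant` message.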
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273672#comment-16273672 ]

ASF GitHub Bot commented on BEAM-3060:

lgajowy commented on a change in pull request #4189: [BEAM-3060] add TFRecordIOIT
URL: https://github.com/apache/beam/pull/4189#discussion_r154234823

## File path: sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/common/AbstractFileBasedIOIT.java ##
@@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.common;
+
+import com.google.common.base.Function;
+import com.google.common.collect.FluentIterable;
+import com.google.common.collect.ImmutableMap;
+import com.google.common.collect.Iterables;
+import java.io.IOException;
+import java.util.Collection;
+import java.util.Collections;
+import java.util.Date;
+import java.util.Map;
+import org.apache.beam.sdk.io.Compression;
+import org.apache.beam.sdk.io.FileSystems;
+import org.apache.beam.sdk.io.fs.MatchResult;
+import org.apache.beam.sdk.io.fs.ResourceId;
+import org.apache.beam.sdk.options.PipelineOptionsFactory;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.transforms.DoFn;
+
+/**
+ * Abstract class for file based IO Integration tests.
+ */
+public abstract class AbstractFileBasedIOIT {
+
+  protected static IOTestPipelineOptions readTestPipelineOptions() {
+    PipelineOptionsFactory.register(IOTestPipelineOptions.class);
+    return TestPipeline.testingPipelineOptions().as(IOTestPipelineOptions.class);
+  }
+
+  protected static String appendTimestampToPrefix(String filenamePrefix) {
+    return String.format("%s_%s", filenamePrefix, new Date().getTime());
+  }
+
+  protected static Compression parseCompressionType(String compressionType) {
+    try {
+      return Compression.valueOf(compressionType.toUpperCase());
+    } catch (IllegalArgumentException ex) {
+      throw new IllegalArgumentException(
+          String.format("Unsupported compression type: %s", compressionType));
+    }
+  }
+
+  protected String getExpectedHashForLineCount(Long lineCount) {
+    Map<Long, String> expectedHashes = ImmutableMap.of(
+        100_000L, "4c8bb3b99dcc59459b20fefba400d446",
+        1_000_000L, "9796db06e7a7960f974d5a91164afff1",
+        100_000_000L, "6ce05f456e2fdc846ded2abd0ec1de95"
+    );
+
+    String hash = expectedHashes.get(lineCount);
+    if (hash == null) {
+      throw new UnsupportedOperationException(
+          String.format("No hash for that line count: %s", lineCount)
+      );
+    }
+    return hash;
+  }
+
+  /**
+   * Constructs text lines in files used for testing.
+   */
+  public static class DeterministicallyConstructTestTextLineFn extends DoFn<Long, String> {
+    @ProcessElement
+    public void processElement(ProcessContext c) {
+      c.output(String.format("IO IT Test line of text. Line seed: %s", c.element()));
+    }
+  }
+
+  /**
+   * Deletes matching files using the FileSystems API.
+   */
+  public static class DeleteFileFn extends DoFn<String, Void> {
+
+    @ProcessElement
+    public void processElement(ProcessContext c) throws IOException {
+      MatchResult match = Iterables
+          .getOnlyElement(FileSystems.match(Collections.singletonList(c.element())));
+
+      Collection<ResourceId> resourceIds = toResourceIds(match);
+
+      FileSystems.delete(resourceIds);
+    }
+    private Collection<ResourceId> toResourceIds(MatchResult match) throws IOException {

Review comment:
OK, I guess I was too inspired by the way it's done in Java 8+ :)
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273675#comment-16273675 ]

ASF GitHub Bot commented on BEAM-3060:

lgajowy commented on a change in pull request #4189: [BEAM-3060] add TFRecordIOIT
URL: https://github.com/apache/beam/pull/4189#discussion_r154234884

## File path: sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/common/AbstractFileBasedIOIT.java ##
@@ -0,0 +1,111 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.common;
+
+import com.google.common.base.Function;
+import com.google.common.collect.FluentIterable;
+import com.google.common.collect.ImmutableMap;
+import com.google.common.collect.Iterables;
+import java.io.IOException;
+import java.util.Collection;
+import java.util.Collections;
+import java.util.Date;
+import java.util.Map;
+import org.apache.beam.sdk.io.Compression;
+import org.apache.beam.sdk.io.FileSystems;
+import org.apache.beam.sdk.io.fs.MatchResult;
+import org.apache.beam.sdk.io.fs.ResourceId;
+import org.apache.beam.sdk.options.PipelineOptionsFactory;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.transforms.DoFn;
+
+/**
+ * Abstract class for file based IO Integration tests.
+ */
+public abstract class AbstractFileBasedIOIT {

Review comment:
OK, you're totally right about that. I didn't think it through well before.
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16273695#comment-16273695 ]

ASF GitHub Bot commented on BEAM-3060:

lgajowy commented on issue #4189: [BEAM-3060] add TFRecordIOIT
URL: https://github.com/apache/beam/pull/4189#issuecomment-348361581

@jkff Thanks again. Posted new changes. PTAL.
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16277033#comment-16277033 ]

ASF GitHub Bot commented on BEAM-3060:

DariuszAniszewski opened a new pull request #4209: [BEAM-3060] AvroIOIT
URL: https://github.com/apache/beam/pull/4209

Added integration test for AvroIO.

**Note:** This branch is based on structural changes introduced in TFRecordIOIT (#4189).

Follow this checklist to help us incorporate your contribution quickly and easily:

 - [ ] Make sure there is a [JIRA issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes.
 - [ ] Each commit in the pull request should have a meaningful subject line and body.
 - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue.
 - [ ] Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
 - [ ] Run `mvn clean verify` to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
 - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16277061#comment-16277061 ]

ASF GitHub Bot commented on BEAM-3060:

szewi opened a new pull request #4210: [BEAM-3060] Temporary fix for failing tests on dataflow runner.
URL: https://github.com/apache/beam/pull/4210

Bug is described in https://issues.apache.org/jira/projects/BEAM/issues/BEAM-3268

Follow this checklist to help us incorporate your contribution quickly and easily:

 - [ ] Make sure there is a [JIRA issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes.
 - [ ] Each commit in the pull request should have a meaningful subject line and body.
 - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue.
 - [ ] Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
 - [ ] Run `mvn clean verify` to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
 - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16277065#comment-16277065 ] ASF GitHub Bot commented on BEAM-3060: -- szewi closed pull request #4210: [BEAM-3060] Temporary fix for failing tests on dataflow runner. URL: https://github.com/apache/beam/pull/4210 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java b/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java index e9aac8001b1..5f3f5406d61 100644 --- a/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java +++ b/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java @@ -46,6 +46,7 @@ import org.apache.beam.sdk.transforms.Combine; import org.apache.beam.sdk.transforms.DoFn; import org.apache.beam.sdk.transforms.ParDo; +import org.apache.beam.sdk.transforms.Reshuffle; import org.apache.beam.sdk.transforms.Values; import org.apache.beam.sdk.transforms.View; import org.apache.beam.sdk.values.PCollection; @@ -118,7 +119,8 @@ public void writeThenReadAll() { .apply("Generate sequence", GenerateSequence.from(0).to(numberOfTextLines)) .apply("Produce text lines", ParDo.of(new DeterministicallyConstructTestTextLineFn())) .apply("Write content to files", write) -.getPerDestinationOutputFilenames().apply(Values.create()); +.getPerDestinationOutputFilenames().apply(Values.create()) +.apply(Reshuffle.viaRandomKey()); PCollection consolidatedHashcode = testFilenames .apply("Read all files", TextIO.readAll().withCompression(AUTO)) This is an automated message from the Apache Git Service. 
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Add performance tests for commonly used file-based I/O PTransforms
> ------------------------------------------------------------------
>
> Key: BEAM-3060
> URL: https://issues.apache.org/jira/browse/BEAM-3060
> Project: Beam
> Issue Type: Test
> Components: sdk-java-core
> Reporter: Chamikara Jayalath
> Assignee: Szymon Nieradka
>
> We recently added a performance testing framework [1] that can be used to do
> the following.
> (1) Execute Beam tests using PerfkitBenchmarker
> (2) Manage Kubernetes-based deployments of data stores.
> (3) Easily publish benchmark results.
> I think it will be useful to add performance tests for commonly used
> file-based I/O PTransforms using this framework. I suggest looking into
> the following formats initially.
> (1) AvroIO
> (2) TextIO
> (3) Compressed text using TextIO
> (4) TFRecordIO
> It should be possible to run these tests for various Beam runners (Direct,
> Dataflow, Flink, Spark, etc.) and file-systems (GCS, local, HDFS, etc.)
> easily.
> In the initial version, tests can be made manually triggerable for PRs
> through Jenkins. Later, we could make some of these tests run periodically
> and publish benchmark results (to BigQuery) through PerfkitBenchmarker.
> [1] https://beam.apache.org/documentation/io/testing/

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
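
As a rough mental model of the `Reshuffle.viaRandomKey()` step added in the diff above: the transform pairs each element with a random key, groups by that key, and then drops the keys, so the elements come out unchanged but redistributed. The sketch below imitates that idea with JDK collections only. It is not Beam code; the class name `ReshuffleSketch`, the `List`-based `viaRandomKey` method, and the sample filenames are hypothetical stand-ins for the real `PCollection` transform.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.TreeMap;

class ReshuffleSketch {

  /** Pairs each element with a random key, groups by key, then drops the keys. */
  static <T> List<T> viaRandomKey(List<T> input, Random random) {
    // Assign a random integer key to every element and group elements sharing a key.
    Map<Integer, List<T>> groupedByKey = new TreeMap<>();
    for (T element : input) {
      groupedByKey.computeIfAbsent(random.nextInt(), key -> new ArrayList<>()).add(element);
    }
    // Drop the keys and emit only the values; the order now follows the random keys.
    List<T> output = new ArrayList<>(input.size());
    for (List<T> group : groupedByKey.values()) {
      output.addAll(group);
    }
    return output;
  }

  public static void main(String[] args) {
    // Hypothetical output filenames, standing in for the files written by the test.
    List<String> filenames = Arrays.asList("out-0", "out-1", "out-2", "out-3");
    System.out.println(viaRandomKey(filenames, new Random()));
  }
}
```

The result is a permutation of the input: nothing is added or dropped, only the grouping and order change.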
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16277099#comment-16277099 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

szewi opened a new pull request #4210: [BEAM-3060] Temporary fix for failing tests on dataflow runner.
URL: https://github.com/apache/beam/pull/4210

Bug is described in https://issues.apache.org/jira/projects/BEAM/issues/BEAM-3268

Follow this checklist to help us incorporate your contribution quickly and easily:

 - [x] Make sure there is a [JIRA issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes.
 - [x] Each commit in the pull request should have a meaningful subject line and body.
 - [x] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue.
 - [x] Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
 - [x] Run `mvn clean verify` to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
 - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

---

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16281008#comment-16281008 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

jkff closed pull request #4189: [BEAM-3060] add TFRecordIOIT
URL: https://github.com/apache/beam/pull/4189

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/IOTestPipelineOptions.java b/sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/IOTestPipelineOptions.java
index 5a29d4f8126..e7b475d4caa 100644
--- a/sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/IOTestPipelineOptions.java
+++ b/sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/IOTestPipelineOptions.java
@@ -19,6 +19,7 @@
 
 import org.apache.beam.sdk.options.Default;
 import org.apache.beam.sdk.options.Description;
+import org.apache.beam.sdk.options.Validation;
 import org.apache.beam.sdk.testing.TestPipelineOptions;
 
 /**
@@ -96,7 +97,7 @@
   void setNumberOfRecords(Long count);
 
   @Description("Destination prefix for files generated by the test")
-  @Default.String("TEXTIOIT")
+  @Validation.Required
   String getFilenamePrefix();
 
   void setFilenamePrefix(String prefix);
diff --git a/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/common/FileBasedIOITHelper.java b/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/common/FileBasedIOITHelper.java
new file mode 100644
index 000..cf20d8e5954
--- /dev/null
+++ b/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/common/FileBasedIOITHelper.java
@@ -0,0 +1,103 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.beam.sdk.io.common;
+
+import com.google.common.collect.ImmutableMap;
+import com.google.common.collect.Iterables;
+import java.io.IOException;
+import java.util.Collections;
+import java.util.Date;
+import java.util.HashSet;
+import java.util.Map;
+import java.util.Set;
+import org.apache.beam.sdk.io.FileSystems;
+import org.apache.beam.sdk.io.fs.MatchResult;
+import org.apache.beam.sdk.io.fs.ResourceId;
+import org.apache.beam.sdk.options.PipelineOptionsFactory;
+import org.apache.beam.sdk.options.PipelineOptionsValidator;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.transforms.DoFn;
+
+/**
+ * Contains helper methods for file based IO Integration tests.
+ */
+public class FileBasedIOITHelper {
+
+  private FileBasedIOITHelper() {
+  }
+
+  public static IOTestPipelineOptions readTestPipelineOptions() {
+    PipelineOptionsFactory.register(IOTestPipelineOptions.class);
+    IOTestPipelineOptions options = TestPipeline
+        .testingPipelineOptions()
+        .as(IOTestPipelineOptions.class);
+
+    return PipelineOptionsValidator.validate(IOTestPipelineOptions.class, options);
+  }
+
+  public static String appendTimestampToPrefix(String filenamePrefix) {
+    return String.format("%s_%s", filenamePrefix, new Date().getTime());
+  }
+
+  public static String getExpectedHashForLineCount(Long lineCount) {
+    Map<Long, String> expectedHashes = ImmutableMap.of(
+        100_000L, "4c8bb3b99dcc59459b20fefba400d446",
+        1_000_000L, "9796db06e7a7960f974d5a91164afff1",
+        100_000_000L, "6ce05f456e2fdc846ded2abd0ec1de95"
+    );
+
+    String hash = expectedHashes.get(lineCount);
+    if (hash == null) {
+      throw new UnsupportedOperationException(
+          String.format("No hash for that line count: %s", lineCount)
+      );
+    }
+    return hash;
+  }
+
+  /**
+   * Constructs text lines in files used for testing.
+   */
+  public static class DeterministicallyConstructTestTextLineFn extends DoFn<Long, String> {
+
+    @ProcessElement
+    public void processElement(ProcessContext c) {
+      c.output(String.format("IO IT Test line of text. Line seed: %s", c.element()));
+    }
+  }
+
+  /**
+   * Deletes matching fil
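
For readers who want to try the helper logic from the diff above without a Beam checkout, the pure functions can be approximated with the JDK alone. This is a minimal sketch, not the code from the PR: the class name `HelperSketch` and the method `constructTestTextLine` are hypothetical, `Map.of` (Java 9+) stands in for Guava's `ImmutableMap.of`, and `System.currentTimeMillis()` stands in for `new Date().getTime()`. The hash values and format strings are copied from the diff.

```java
import java.util.Map;

class HelperSketch {

  // Expected hashes copied from the diff, keyed by the number of generated lines.
  private static final Map<Long, String> EXPECTED_HASHES = Map.of(
      100_000L, "4c8bb3b99dcc59459b20fefba400d446",
      1_000_000L, "9796db06e7a7960f974d5a91164afff1",
      100_000_000L, "6ce05f456e2fdc846ded2abd0ec1de95");

  // Makes the prefix unique per run by appending the current time in milliseconds.
  static String appendTimestampToPrefix(String filenamePrefix) {
    return String.format("%s_%s", filenamePrefix, System.currentTimeMillis());
  }

  // Looks up the known hash for a supported line count; any other count is rejected.
  // Note: unlike Guava's ImmutableMap, Map.of throws on a null lookup key.
  static String getExpectedHashForLineCount(Long lineCount) {
    String hash = EXPECTED_HASHES.get(lineCount);
    if (hash == null) {
      throw new UnsupportedOperationException(
          String.format("No hash for that line count: %s", lineCount));
    }
    return hash;
  }

  // Same line format as DeterministicallyConstructTestTextLineFn, minus the DoFn wrapper.
  static String constructTestTextLine(long seed) {
    return String.format("IO IT Test line of text. Line seed: %s", seed);
  }

  public static void main(String[] args) {
    System.out.println(appendTimestampToPrefix("TEXTIOIT"));
    System.out.println(getExpectedHashForLineCount(100_000L));
    System.out.println(constructTestTextLine(0L));
  }
}
```

Because each line depends only on its seed, the generated content is the same on every run, which is what makes a fixed table of expected hashes per line count workable.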
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16282749#comment-16282749 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

jkff closed pull request #4209: [BEAM-3060] AvroIOIT
URL: https://github.com/apache/beam/pull/4209

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/sdks/java/io/file-based-io-tests/pom.xml b/sdks/java/io/file-based-io-tests/pom.xml
index 812bfea363a..fc523f614fd 100644
--- a/sdks/java/io/file-based-io-tests/pom.xml
+++ b/sdks/java/io/file-based-io-tests/pom.xml
@@ -196,5 +196,11 @@
 beam-sdks-java-io-common
 test
+
+org.apache.avro
+avro
+test
+
+
diff --git a/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/avro/AvroIOIT.java b/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/avro/AvroIOIT.java
new file mode 100644
index 000..ce8da3357c9
--- /dev/null
+++ b/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/avro/AvroIOIT.java
@@ -0,0 +1,137 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.beam.sdk.io.avro;
+
+import static org.apache.beam.sdk.io.common.FileBasedIOITHelper.appendTimestampToPrefix;
+import static org.apache.beam.sdk.io.common.FileBasedIOITHelper.getExpectedHashForLineCount;
+import static org.apache.beam.sdk.io.common.FileBasedIOITHelper.readTestPipelineOptions;
+
+import org.apache.avro.Schema;
+import org.apache.avro.generic.GenericRecord;
+import org.apache.avro.generic.GenericRecordBuilder;
+import org.apache.beam.sdk.coders.AvroCoder;
+import org.apache.beam.sdk.io.AvroIO;
+import org.apache.beam.sdk.io.GenerateSequence;
+import org.apache.beam.sdk.io.common.FileBasedIOITHelper;
+import org.apache.beam.sdk.io.common.HashingFn;
+import org.apache.beam.sdk.io.common.IOTestPipelineOptions;
+import org.apache.beam.sdk.testing.PAssert;
+import org.apache.beam.sdk.testing.TestPipeline;
+import org.apache.beam.sdk.transforms.Combine;
+import org.apache.beam.sdk.transforms.DoFn;
+import org.apache.beam.sdk.transforms.ParDo;
+import org.apache.beam.sdk.transforms.Values;
+import org.apache.beam.sdk.transforms.View;
+import org.apache.beam.sdk.values.PCollection;
+import org.junit.BeforeClass;
+import org.junit.Rule;
+import org.junit.Test;
+import org.junit.runner.RunWith;
+import org.junit.runners.JUnit4;
+
+/**
+ * An integration test for {@link AvroIO}.
+ *
+ * Run this test using the command below. Pass in connection information via PipelineOptions:
+ *
+ *  mvn -e -Pio-it verify -pl sdks/java/io/file-based-io-tests
+ *  -Dit.test=org.apache.beam.sdk.io.avro.AvroIOIT
+ *  -DintegrationTestPipelineOptions='[
+ *  "--numberOfRecords=10",
+ *  "--filenamePrefix=output_file_path"
+ *  ]'
+ *
+ *
+ * Please see 'sdks/java/io/file-based-io-tests/pom.xml' for instructions regarding
+ * running this test using Beam performance testing framework.
+ */
+@RunWith(JUnit4.class)
+public class AvroIOIT {
+
+
+  private static final Schema AVRO_SCHEMA = new Schema.Parser().parse("{\n"
+      + " \"namespace\": \"ioitavro\",\n"
+      + " \"type\": \"record\",\n"
+      + " \"name\": \"TestAvroLine\",\n"
+      + " \"fields\": [\n"
+      + " {\"name\": \"row\", \"type\": \"string\"}\n"
+      + " ]\n"
+      + "}");
+
+  private static String filenamePrefix;
+  private static Long numberOfTextLines;
+
+  @Rule
+  public TestPipeline pipeline = TestPipeline.create();
+
+  @BeforeClass
+  public static void setup() {
+    IOTestPipelineOptions options = readTestPipelineOptions();
+
+    numberOfTextLines = options.getNumberOfRecords();
+    filenamePrefix = appendTimestampToPrefix(options.getFilenamePrefix());
+  }
+
+  @Test
+  public void writeThenReadAll() {
+
+    PCollection testFilenames = pipeline
+        .apply("Generate sequence", GenerateSequence.fr
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16283054#comment-16283054 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

chamikaramj closed pull request #4210: [BEAM-3060] Temporary fix for failing tests on dataflow runner.
URL: https://github.com/apache/beam/pull/4210

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java b/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java
index e9aac8001b1..5f3f5406d61 100644
--- a/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java
+++ b/sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java
@@ -46,6 +46,7 @@
 import org.apache.beam.sdk.transforms.Combine;
 import org.apache.beam.sdk.transforms.DoFn;
 import org.apache.beam.sdk.transforms.ParDo;
+import org.apache.beam.sdk.transforms.Reshuffle;
 import org.apache.beam.sdk.transforms.Values;
 import org.apache.beam.sdk.transforms.View;
 import org.apache.beam.sdk.values.PCollection;
@@ -118,7 +119,8 @@ public void writeThenReadAll() {
         .apply("Generate sequence", GenerateSequence.from(0).to(numberOfTextLines))
         .apply("Produce text lines", ParDo.of(new DeterministicallyConstructTestTextLineFn()))
         .apply("Write content to files", write)
-        .getPerDestinationOutputFilenames().apply(Values.create());
+        .getPerDestinationOutputFilenames().apply(Values.create())
+        .apply(Reshuffle.viaRandomKey());
 
     PCollection consolidatedHashcode = testFilenames
         .apply("Read all files", TextIO.readAll().withCompression(AUTO))

This is an automated message from the Apache Git Service.
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16284139#comment-16284139 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

DariuszAniszewski opened a new pull request #4238: [BEAM-3060] added support for passing extra mvn properties to pkb
URL: https://github.com/apache/beam/pull/4238

Since [this PR in PerfKit](https://github.com/GoogleCloudPlatform/PerfKitBenchmarker/pull/1544) was merged, it's now possible to pass extra properties to be included in the target mvn command when running tests with PerfKit.

Follow this checklist to help us incorporate your contribution quickly and easily:

 - [ ] Make sure there is a [JIRA issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes.
 - [ ] Each commit in the pull request should have a meaningful subject line and body.
 - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue.
 - [ ] Write a pull request description that is detailed enough to understand what the pull request does, how, and why.
 - [ ] Run `mvn clean verify` to make sure basic checks pass. A more thorough check will be performed on your pull request automatically.
 - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf).

---

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16286682#comment-16286682 ]

ASF GitHub Bot commented on BEAM-3060:
--------------------------------------

chamikaramj closed pull request #4238: [BEAM-3060] added support for passing extra mvn properties to pkb
URL: https://github.com/apache/beam/pull/4238

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/sdks/java/io/file-based-io-tests/pom.xml b/sdks/java/io/file-based-io-tests/pom.xml
index fc523f614fd..44119ec79ff 100644
--- a/sdks/java/io/file-based-io-tests/pom.xml
+++ b/sdks/java/io/file-based-io-tests/pom.xml
@@ -124,6 +124,11 @@
 -beam_it_class=${fileBasedIoItClass}
 -beam_it_options=${integrationTestPipelineOptions}
+
+ -beam_extra_mvn_properties=${pkbExtraProperties}
diff --git a/sdks/java/io/pom.xml b/sdks/java/io/pom.xml
index 0f8bc78fbe1..07e1b5cb9ff 100644
--- a/sdks/java/io/pom.xml
+++ b/sdks/java/io/pom.xml
@@ -37,6 +37,7 @@
+

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217232#comment-16217232 ]

Szymon Nieradka commented on BEAM-3060:
---------------------------------------

Please find the proposed implementation description in:
https://docs.google.com/document/d/1dA-5s6OHiP_cz-NRAbwapoKF5MEC1wKps4A5tFbIPKE/edit

> Add performance tests for commonly used file-based I/O PTransforms
> ------------------------------------------------------------------
>
> Key: BEAM-3060
> URL: https://issues.apache.org/jira/browse/BEAM-3060
> Project: Beam
> Issue Type: Test
> Components: sdk-java-core
> Reporter: Chamikara Jayalath
> Assignee: Chamikara Jayalath
>
> We recently added a performance testing framework [1] that can be used to do
> the following.
> (1) Execute Beam tests using PerfkitBenchmarker
> (2) Manage Kubernetes-based deployments of data stores.
> (3) Easily publish benchmark results.
> I think it will be useful to add performance tests for commonly used
> file-based I/O PTransforms using this framework. I suggest looking into
> the following formats initially.
> (1) AvroIO
> (2) TextIO
> (3) Compressed text using TextIO
> (4) TFRecordIO
> It should be possible to run these tests for various Beam runners (Direct,
> Dataflow, Flink, Spark, etc.) and file-systems (GCS, local, HDFS, etc.)
> easily.
> In the initial version, tests can be made manually triggerable for PRs
> through Jenkins. Later, we could make some of these tests run periodically
> and publish benchmark results (to BigQuery) through PerfkitBenchmarker.
> [1] https://beam.apache.org/documentation/io/testing/

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Commented] (BEAM-3060) Add performance tests for commonly used file-based I/O PTransforms
[ https://issues.apache.org/jira/browse/BEAM-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16217762#comment-16217762 ]

Chamikara Jayalath commented on BEAM-3060:
------------------------------------------

Thanks for the proposal. Added some comments and assigned the JIRA to you.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)