[GitHub] echauchot commented on a change in pull request #4143: Faster implementation of nexmark.Generator.nextExactString().
echauchot commented on a change in pull request #4143: Faster implementation of nexmark.Generator.nextExactString(). URL: https://github.com/apache/beam/pull/4143#discussion_r152771825 ## File path: sdks/java/nexmark/src/main/java/org/apache/beam/sdk/nexmark/sources/Generator.java ## @@ -353,8 +353,16 @@ private static String nextString(Random random, int maxLength) { /** Return a random string of exactly {@code length}. */ private static String nextExactString(Random random, int length) { StringBuilder sb = new StringBuilder(); +int rnd = 0; +int n = 0; // number of random characters left in rnd while (length-- > 0) { - sb.append((char) ('a' + random.nextInt(26))); + if (n == 0) { +rnd = random.nextInt(); +n = 6; // log_26(2^31) + } + sb.append((char) ('a' + rnd % 26)); + rnd /= 26; Review comment: We will concatenate ints between 0 and 25 as before. But if `rnd` happens to be < 26 then it will be set in this line to 0 for the next `n` iterations (6 at worse) leading to appending 0 to `sb`. So there will be a more probable 0 than previous implementation. Consider using `Random.intervalNextInt(length*26, Integer.MAX_VALUE)`. But it might be not a big deal This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] echauchot commented on a change in pull request #4128: [BEAM-3181][Nexmark][SQL] Implement query0
echauchot commented on a change in pull request #4128: [BEAM-3181][Nexmark][SQL] Implement query0 URL: https://github.com/apache/beam/pull/4128#discussion_r152798660 ## File path: sdks/java/nexmark/src/main/java/org/apache/beam/sdk/nexmark/model/sql/adapter/ModelFieldsAdapters.java ## @@ -0,0 +1,125 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.beam.sdk.nexmark.model.sql.adapter; + +import com.google.common.collect.ImmutableMap; + +import java.util.Arrays; +import java.util.Collections; +import java.util.List; +import java.util.Map; + +import org.apache.beam.sdk.nexmark.model.Auction; +import org.apache.beam.sdk.nexmark.model.Bid; +import org.apache.beam.sdk.nexmark.model.Person; + +/** + * Maps Java model classes to Beam SQL record types. + */ +public class ModelFieldsAdapters { + + public static final Map ADAPTERS = + ImmutableMap.builder() + .put(Auction.class, auctionAdapter()) + .put(Bid.class, bidAdapter()) + .put(Person.class, personAdapter()) + .build(); + + private static ModelFieldsAdapter personAdapter() { +return new ModelFieldsAdapter( +BeamRecordSqlTypeBuilder.of() +.withLongField("id") +.withStringField("name") Review comment: You're right it is out of the scope of this PR. all my point was to make the fields mapping generic enough to be included in sql package (like you extracted the other feature) rather than coding it for each pojo. Like you said above implementing `getFieldValue(pojo, fieldName)` using reflection might be a good thing if it is feasible. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] szewi opened a new pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO
szewi opened a new pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO URL: https://github.com/apache/beam/pull/4169 Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] Make sure there is a [JIRA issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes. - [ ] Each commit in the pull request should have a meaningful subject line and body. - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue. - [ ] Write a pull request description that is detailed enough to understand what the pull request does, how, and why. - [ ] Run `mvn clean verify` to make sure basic checks pass. A more thorough check will be performed on your pull request automatically. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). --- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] szewi commented on issue #4169: [BEAM-3060] Added support for multiple filesystems in TextIO
szewi commented on issue #4169: [BEAM-3060] Added support for multiple filesystems in TextIO URL: https://github.com/apache/beam/pull/4169#issuecomment-346620194 Hi @chamikaramj , can you please take a look? This allow to switch between filesystems by adding system property -Dfilesystem and provide filesystem specific pipeline options. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] lgajowy commented on issue #4149: [BEAM-3060] Add Compressed TextIOIT
lgajowy commented on issue #4149: [BEAM-3060] Add Compressed TextIOIT URL: https://github.com/apache/beam/pull/4149#issuecomment-346653319 @chamikaramj (this message is a kind reminder) :) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] iemejia opened a new pull request #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set
iemejia opened a new pull request #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set URL: https://github.com/apache/beam/pull/4170 Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] Make sure there is a [JIRA issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes. - [ ] Each commit in the pull request should have a meaningful subject line and body. - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue. - [ ] Write a pull request description that is detailed enough to understand what the pull request does, how, and why. - [x] Run `mvn clean verify` to make sure basic checks pass. A more thorough check will be performed on your pull request automatically. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). --- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] iemejia commented on issue #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set
iemejia commented on issue #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set URL: https://github.com/apache/beam/pull/4170#issuecomment-346703242 As part of the solution for the issue on the Spark runner (see BEAM-3187) I consider that enabling these tests for all runners makes sense. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] iemejia commented on issue #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set
iemejia commented on issue #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set URL: https://github.com/apache/beam/pull/4170#issuecomment-346704040 Run Gearpump ValidatesRunner This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] iemejia commented on issue #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set
iemejia commented on issue #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set URL: https://github.com/apache/beam/pull/4170#issuecomment-346704033 Run Apex ValidatesRunner This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] chamikaramj commented on a change in pull request #4149: [BEAM-3060] Add Compressed TextIOIT
chamikaramj commented on a change in pull request #4149: [BEAM-3060] Add Compressed TextIOIT URL: https://github.com/apache/beam/pull/4149#discussion_r152900718 ## File path: sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java ## @@ -83,25 +90,82 @@ private static String appendTimestamp(String filenamePrefix) { return String.format("%s_%s", filenamePrefix, new Date().getTime()); } - @Test - public void writeThenReadAll() { -PCollection testFilenames = pipeline -.apply("Generate sequence", GenerateSequence.from(0).to(numberOfTextLines)) -.apply("Produce text lines", ParDo.of(new DeterministicallyConstructTestTextLineFn())) -.apply("Write content to files", TextIO.write().to(filenamePrefix).withOutputFilenames()) -.getPerDestinationOutputFilenames().apply(Values.create()); + /** IO IT with no compression. */ + @RunWith(JUnit4.class) + public static class UncompressedTextIOIT { + +@Rule +public TestPipeline pipeline = TestPipeline.create(); + +@Test +public void writeThenReadAll() { + PCollection testFilenames = pipeline + .apply("Generate sequence", GenerateSequence.from(0).to(numberOfTextLines)) + .apply("Produce text lines", ParDo.of(new DeterministicallyConstructTestTextLineFn())) + .apply("Write content to files", TextIO.write().to(filenamePrefix).withOutputFilenames()) + .getPerDestinationOutputFilenames().apply(Values.create()); + + PCollection consolidatedHashcode = testFilenames + .apply("Read all files", TextIO.readAll()) + .apply("Calculate hashcode", Combine.globally(new HashingFn())); + + String expectedHash = getExpectedHashForLineCount(numberOfTextLines); + PAssert.thatSingleton(consolidatedHashcode).isEqualTo(expectedHash); + + testFilenames.apply("Delete test files", ParDo.of(new DeleteFileFn()) + .withSideInputs(consolidatedHashcode.apply(View.asSingleton(; + + pipeline.run().waitUntilFinish(); +} + } + + /** IO IT with various compression types. */ + @RunWith(Parameterized.class) + public static class CompressedTextIOIT { + +@Rule +public TestPipeline pipeline = TestPipeline.create(); + +@Parameterized.Parameters() +public static Iterable data() { + return ImmutableList.builder() + .add(GZIP) + .add(DEFLATE) + .add(BZIP2) + .build(); +} + +@Parameterized.Parameter() +public Compression compression; + +@Test +public void writeThenReadAllWithCompression() { + TextIO.TypedWrite write = TextIO + .write() + .to(filenamePrefix) + .withOutputFilenames() + .withCompression(compression); + + TextIO.ReadAll read = TextIO.readAll().withCompression(AUTO); -PCollection consolidatedHashcode = testFilenames -.apply("Read all files", TextIO.readAll()) -.apply("Calculate hashcode", Combine.globally(new HashingFn())); + PCollection testFilenames = pipeline Review comment: This and uncompressed version have the same pipeline. Can't we share to code between tests (and keep the same test class TextIOIT) and add "compression type" as a parameter to the test (a Maven -D parameter for the perfkit based runs) ? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] chamikaramj commented on a change in pull request #4149: [BEAM-3060] Add Compressed TextIOIT
chamikaramj commented on a change in pull request #4149: [BEAM-3060] Add Compressed TextIOIT URL: https://github.com/apache/beam/pull/4149#discussion_r152900262 ## File path: sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java ## @@ -83,25 +90,82 @@ private static String appendTimestamp(String filenamePrefix) { return String.format("%s_%s", filenamePrefix, new Date().getTime()); } - @Test - public void writeThenReadAll() { -PCollection testFilenames = pipeline -.apply("Generate sequence", GenerateSequence.from(0).to(numberOfTextLines)) -.apply("Produce text lines", ParDo.of(new DeterministicallyConstructTestTextLineFn())) -.apply("Write content to files", TextIO.write().to(filenamePrefix).withOutputFilenames()) -.getPerDestinationOutputFilenames().apply(Values.create()); + /** IO IT with no compression. */ + @RunWith(JUnit4.class) + public static class UncompressedTextIOIT { Review comment: Does perfkitbenchmarker-based execution (https://github.com/apache/beam/pull/4120) still work with these changes ? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] chamikaramj commented on a change in pull request #4149: [BEAM-3060] Add Compressed TextIOIT
chamikaramj commented on a change in pull request #4149: [BEAM-3060] Add Compressed TextIOIT URL: https://github.com/apache/beam/pull/4149#discussion_r152900435 ## File path: sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java ## @@ -83,25 +90,82 @@ private static String appendTimestamp(String filenamePrefix) { return String.format("%s_%s", filenamePrefix, new Date().getTime()); } - @Test - public void writeThenReadAll() { -PCollection testFilenames = pipeline -.apply("Generate sequence", GenerateSequence.from(0).to(numberOfTextLines)) -.apply("Produce text lines", ParDo.of(new DeterministicallyConstructTestTextLineFn())) -.apply("Write content to files", TextIO.write().to(filenamePrefix).withOutputFilenames()) -.getPerDestinationOutputFilenames().apply(Values.create()); + /** IO IT with no compression. */ + @RunWith(JUnit4.class) Review comment: This means that this test will be picked up by all test suites (including Java pre-commit), isn't it ? Not sure if we want to do that due to the size of this test. Adding to post-commit tests should be fine. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] dmytroivanov4206 opened a new pull request #4171: [BEAM-3008] Extends API for BigtableIO Read and Write by adding withInstanceId and withProjectId
dmytroivanov4206 opened a new pull request #4171: [BEAM-3008] Extends API for BigtableIO Read and Write by adding withInstanceId and withProjectId URL: https://github.com/apache/beam/pull/4171 Adds withInstanceId and withProjectId to the BigtableIO Read and Write classes, first out of four steps to fix [BEAM-3008] bug. Follow this checklist to help us incorporate your contribution quickly and easily: - [x] Make sure there is a [JIRA issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes. - [x] Each commit in the pull request should have a meaningful subject line and body. - [x] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue. - [x] Write a pull request description that is detailed enough to understand what the pull request does, how, and why. - [x] Run `mvn clean verify` to make sure basic checks pass. A more thorough check will be performed on your pull request automatically. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). --- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] rmannibucau opened a new pull request #4172: BEAM-3243 support multiple anonymous classes from the same enclosing class in a pipeline
rmannibucau opened a new pull request #4172: BEAM-3243 support multiple anonymous classes from the same enclosing class in a pipeline URL: https://github.com/apache/beam/pull/4172 Idea is to keep the "number" (suffix) of the anonymous class for anonymous dofn to support multiple anonymous dofn in the same pipeline. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] rmannibucau opened a new pull request #4173: [DISCUSS] add a java profile to be able to skip python/go when not relevant for current work
rmannibucau opened a new pull request #4173: [DISCUSS] add a java profile to be able to skip python/go when not relevant for current work URL: https://github.com/apache/beam/pull/4173 Often working on a feature or even more on a fix you only care about a language - which is probably most of the time java? When building the project, the python execution time is very important (like half of it on my machine). However you are sure you didn't affect it since the code is quite parallel and almost unrelated in term of dependency. This PR adds a java profile which skips python/go sdk when building. It is designed to be activable through a property you can put in your settings.xml if you always only work with java part of beam. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] xumingming commented on issue #4168: [BEAM-3238][SQL] Add BeamRecordSqlTypeBuilder
xumingming commented on issue #4168: [BEAM-3238][SQL] Add BeamRecordSqlTypeBuilder URL: https://github.com/apache/beam/pull/4168#issuecomment-346831152 I like this idea! One minor comment: Can we put the `BeamRecordSqlTypeBuilder` inside `BeamRecordSqlType`? it will keep the surface api of `org.apache.beam.sdk.extensions.sql` cleaner. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] sduskis commented on a change in pull request #4171: [BEAM-3008] Extends API for BigtableIO Read and Write by adding withInstanceId and withProjectId
sduskis commented on a change in pull request #4171: [BEAM-3008] Extends API for BigtableIO Read and Write by adding withInstanceId and withProjectId URL: https://github.com/apache/beam/pull/4171#discussion_r152978149 ## File path: sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/bigtable/BigtableIO.java ## @@ -78,38 +78,38 @@ * The Bigtable source returns a set of rows from a single table, returning a * {@code PCollection}. * - * To configure a Cloud Bigtable source, you must supply a table id and a {@link BigtableOptions} - * or builder configured with the project and other information necessary to identify the - * Bigtable instance. By default, {@link BigtableIO.Read} will read all rows in the table. The row - * range to be read can optionally be restricted using {@link BigtableIO.Read#withKeyRange}, and - * a {@link RowFilter} can be specified using {@link BigtableIO.Read#withRowFilter}. For example: + * To configure a Cloud Bigtable source, you must supply a table id, a project id, an instance + * id and optionally a {@link BigtableOptions} to provide more specific connection configuration. + * By default, {@link BigtableIO.Read} will read all rows in the table. The row range to be read + * can optionally be restricted using {@link BigtableIO.Read#withKeyRange}, and a {@link RowFilter} + * can be specified using {@link BigtableIO.Read#withRowFilter}. For example: * * {@code - * BigtableOptions.Builder optionsBuilder = - * new BigtableOptions.Builder() - * .setProjectId("project") - * .setInstanceId("instance"); * * Pipeline p = ...; * * // Scan the entire table. * p.apply("read", * BigtableIO.read() * .withBigtableOptions(optionsBuilder) Review comment: Can you please remove `.withBigtableOptions(optionsBuilder)` for this example? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] lgajowy commented on a change in pull request #4149: [BEAM-3060] Add Compressed TextIOIT
lgajowy commented on a change in pull request #4149: [BEAM-3060] Add Compressed TextIOIT URL: https://github.com/apache/beam/pull/4149#discussion_r152998002 ## File path: sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java ## @@ -83,25 +90,82 @@ private static String appendTimestamp(String filenamePrefix) { return String.format("%s_%s", filenamePrefix, new Date().getTime()); } - @Test - public void writeThenReadAll() { -PCollection testFilenames = pipeline -.apply("Generate sequence", GenerateSequence.from(0).to(numberOfTextLines)) -.apply("Produce text lines", ParDo.of(new DeterministicallyConstructTestTextLineFn())) -.apply("Write content to files", TextIO.write().to(filenamePrefix).withOutputFilenames()) -.getPerDestinationOutputFilenames().apply(Values.create()); + /** IO IT with no compression. */ + @RunWith(JUnit4.class) Review comment: I double-checked that by running the preCommit job on my machine - those are not fired in PreCommit phase. Also, out of curiosity ? I investigated a little bit the project's mvn structure: besides the `@RunWith(JUnit.class)` annotation that is required by JUnit, we have two mvn plugins that look (scan) for tests: - surefire (looks for unit tests and searches for classes with *Test suffix) - failsafe (looks for integration tests and searches for classes with *IT suffix) As failsafe is not fired in the PreCommit phase, the tests are not invoked. Please look at [io parent pom](https://github.com/apache/beam/blob/master/sdks/java/io/pom.xml#L77), where failsafe plugin is activated only when io-it profile is active. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] lgajowy commented on a change in pull request #4149: [BEAM-3060] Add Compressed TextIOIT
lgajowy commented on a change in pull request #4149: [BEAM-3060] Add Compressed TextIOIT URL: https://github.com/apache/beam/pull/4149#discussion_r152998058 ## File path: sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java ## @@ -83,25 +90,82 @@ private static String appendTimestamp(String filenamePrefix) { return String.format("%s_%s", filenamePrefix, new Date().getTime()); } - @Test - public void writeThenReadAll() { -PCollection testFilenames = pipeline -.apply("Generate sequence", GenerateSequence.from(0).to(numberOfTextLines)) -.apply("Produce text lines", ParDo.of(new DeterministicallyConstructTestTextLineFn())) -.apply("Write content to files", TextIO.write().to(filenamePrefix).withOutputFilenames()) -.getPerDestinationOutputFilenames().apply(Values.create()); + /** IO IT with no compression. */ + @RunWith(JUnit4.class) + public static class UncompressedTextIOIT { Review comment: Yes, it works but runs all the 4 tests that are there in the file. But now I think this is probably not what we want. This won't be a problem as you suggested an even better solution in the comment below. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] lgajowy commented on a change in pull request #4149: [BEAM-3060] Add Compressed TextIOIT
lgajowy commented on a change in pull request #4149: [BEAM-3060] Add Compressed TextIOIT URL: https://github.com/apache/beam/pull/4149#discussion_r152999363 ## File path: sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java ## @@ -83,25 +90,82 @@ private static String appendTimestamp(String filenamePrefix) { return String.format("%s_%s", filenamePrefix, new Date().getTime()); } - @Test - public void writeThenReadAll() { -PCollection testFilenames = pipeline -.apply("Generate sequence", GenerateSequence.from(0).to(numberOfTextLines)) -.apply("Produce text lines", ParDo.of(new DeterministicallyConstructTestTextLineFn())) -.apply("Write content to files", TextIO.write().to(filenamePrefix).withOutputFilenames()) -.getPerDestinationOutputFilenames().apply(Values.create()); + /** IO IT with no compression. */ + @RunWith(JUnit4.class) + public static class UncompressedTextIOIT { + +@Rule +public TestPipeline pipeline = TestPipeline.create(); + +@Test +public void writeThenReadAll() { + PCollection testFilenames = pipeline + .apply("Generate sequence", GenerateSequence.from(0).to(numberOfTextLines)) + .apply("Produce text lines", ParDo.of(new DeterministicallyConstructTestTextLineFn())) + .apply("Write content to files", TextIO.write().to(filenamePrefix).withOutputFilenames()) + .getPerDestinationOutputFilenames().apply(Values.create()); + + PCollection consolidatedHashcode = testFilenames + .apply("Read all files", TextIO.readAll()) + .apply("Calculate hashcode", Combine.globally(new HashingFn())); + + String expectedHash = getExpectedHashForLineCount(numberOfTextLines); + PAssert.thatSingleton(consolidatedHashcode).isEqualTo(expectedHash); + + testFilenames.apply("Delete test files", ParDo.of(new DeleteFileFn()) + .withSideInputs(consolidatedHashcode.apply(View.asSingleton(; + + pipeline.run().waitUntilFinish(); +} + } + + /** IO IT with various compression types. */ + @RunWith(Parameterized.class) + public static class CompressedTextIOIT { + +@Rule +public TestPipeline pipeline = TestPipeline.create(); + +@Parameterized.Parameters() +public static Iterable data() { + return ImmutableList.builder() + .add(GZIP) + .add(DEFLATE) + .add(BZIP2) + .build(); +} + +@Parameterized.Parameter() +public Compression compression; + +@Test +public void writeThenReadAllWithCompression() { + TextIO.TypedWrite write = TextIO + .write() + .to(filenamePrefix) + .withOutputFilenames() + .withCompression(compression); + + TextIO.ReadAll read = TextIO.readAll().withCompression(AUTO); -PCollection consolidatedHashcode = testFilenames -.apply("Read all files", TextIO.readAll()) -.apply("Calculate hashcode", Combine.globally(new HashingFn())); + PCollection testFilenames = pipeline Review comment: I think it's hard to do right now without modifying perfkit's code. As we checked, perfkit ignores -D parameters because builds the mvn verify command by itself from the parameters passed . I think this could be done in some future contribution. We will file a bug report in perfkit soon. I think the best solution (at least for now) is to leave the compression type in pipeline options. We pass them to perfkit either way (through `beam_it_options`) and, what imo is more important, compressionType is very test specific (same as numberOfRecords). WDYT? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] xumingming commented on issue #4173: [DISCUSS] add a java profile to be able to skip python/go when not relevant for current work
xumingming commented on issue #4173: [DISCUSS] add a java profile to be able to skip python/go when not relevant for current work URL: https://github.com/apache/beam/pull/4173#issuecomment-346862839 I like this idea. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] lgajowy commented on issue #4149: [BEAM-3060] Add Compressed TextIOIT
lgajowy commented on issue #4149: [BEAM-3060] Add Compressed TextIOIT URL: https://github.com/apache/beam/pull/4149#issuecomment-346866400 @chamikaramj thanks for the review! Here's another batch of changes, as commented above. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] iemejia commented on issue #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set
iemejia commented on issue #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set URL: https://github.com/apache/beam/pull/4170#issuecomment-346871884 Run Flink ValidatesRunner Run Spark ValidatesRunner Run Dataflow ValidatesRunner Run Apex ValidatesRunner Run Gearpump ValidatesRunner This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] iemejia commented on issue #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set
iemejia commented on issue #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set URL: https://github.com/apache/beam/pull/4170#issuecomment-346703983 Run Flink ValidatesRunner This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] iemejia commented on issue #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set
iemejia commented on issue #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set URL: https://github.com/apache/beam/pull/4170#issuecomment-346872487 Run Flink ValidatesRunner This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] iemejia commented on issue #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set
iemejia commented on issue #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set URL: https://github.com/apache/beam/pull/4170#issuecomment-346704033 Run Apex ValidatesRunner This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] iemejia commented on issue #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set
iemejia commented on issue #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set URL: https://github.com/apache/beam/pull/4170#issuecomment-346704040 Run Gearpump ValidatesRunner This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] iemejia commented on issue #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set
iemejia commented on issue #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set URL: https://github.com/apache/beam/pull/4170#issuecomment-346704003 Run Dataflow ValidatesRunner This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] iemejia commented on issue #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set
iemejia commented on issue #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set URL: https://github.com/apache/beam/pull/4170#issuecomment-346703990 Run Spark ValidatesRunner This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] iemejia commented on issue #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set
iemejia commented on issue #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set URL: https://github.com/apache/beam/pull/4170#issuecomment-346872593 Run Apex ValidatesRunner This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] iemejia commented on issue #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set
iemejia commented on issue #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set URL: https://github.com/apache/beam/pull/4170#issuecomment-346872574 Run Spark ValidatesRunner This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] iemejia commented on issue #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set
iemejia commented on issue #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set URL: https://github.com/apache/beam/pull/4170#issuecomment-346871884 Run Flink ValidatesRunner Run Spark ValidatesRunner Run Dataflow ValidatesRunner Run Apex ValidatesRunner Run Gearpump ValidatesRunner This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] iemejia commented on issue #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set
iemejia commented on issue #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set URL: https://github.com/apache/beam/pull/4170#issuecomment-346872611 Run Dataflow ValidatesRunner This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] iemejia commented on issue #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set
iemejia commented on issue #4170: Make ParDoLifecycleTest exception tests part of the ValidatesRunner set URL: https://github.com/apache/beam/pull/4170#issuecomment-346872620 Run Gearpump ValidatesRunner This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] iemejia opened a new pull request #4174: [BEAM-3244] Ensure execution of teardown method on Flink's DoFnOperator
iemejia opened a new pull request #4174: [BEAM-3244] Ensure execution of teardown method on Flink's DoFnOperator URL: https://github.com/apache/beam/pull/4174 Follow this checklist to help us incorporate your contribution quickly and easily: - [x] Make sure there is a [JIRA issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes. - [x] Each commit in the pull request should have a meaningful subject line and body. - [x] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue. - [x] Write a pull request description that is detailed enough to understand what the pull request does, how, and why. - [x] Run `mvn clean verify` to make sure basic checks pass. A more thorough check will be performed on your pull request automatically. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). --- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] iemejia commented on issue #4174: [BEAM-3244] Ensure execution of teardown method on Flink's DoFnOperator
iemejia commented on issue #4174: [BEAM-3244] Ensure execution of teardown method on Flink's DoFnOperator URL: https://github.com/apache/beam/pull/4174#issuecomment-346875689 Run Flink ValidatesRunner This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] iemejia commented on a change in pull request #4174: [BEAM-3244] Ensure execution of teardown method on Flink's DoFnOperator
iemejia commented on a change in pull request #4174: [BEAM-3244] Ensure execution of teardown method on Flink's DoFnOperator URL: https://github.com/apache/beam/pull/4174#discussion_r153012280 ## File path: runners/flink/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/DoFnOperator.java ## @@ -380,7 +386,6 @@ public void close() throws Exception { } } checkFinishBundleTimer.cancel(true); Review comment: @aljoscha I had the doubt if this one should be moved to dispose too, given that close can eventually not be called. WDYT ? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] reuvenlax commented on a change in pull request #4145: Many simplifications to WriteFiles
reuvenlax commented on a change in pull request #4145: Many simplifications to WriteFiles URL: https://github.com/apache/beam/pull/4145#discussion_r153034261 ## File path: sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java ## @@ -339,50 +297,189 @@ public boolean isWindowedWrites() { sink, computeNumShards, numShardsProvider, true, maxNumWritersPerBundle, sideInputs); } - private static class WriterKey { -private final BoundedWindow window; -private final PaneInfo paneInfo; -private final DestinationT destination; + @Override + public void validate(PipelineOptions options) { +sink.validate(options); + } -WriterKey(BoundedWindow window, PaneInfo paneInfo, DestinationT destination) { - this.window = window; - this.paneInfo = paneInfo; - this.destination = destination; + @Override + public WriteFilesResult expand(PCollection input) { +if (input.isBounded() == IsBounded.UNBOUNDED) { + checkArgument( + windowedWrites, + "Must use windowed writes when applying %s to an unbounded PCollection", + WriteFiles.class.getSimpleName()); +} +if (windowedWrites) { + // The reason for this is https://issues.apache.org/jira/browse/BEAM-1438 + // and similar behavior in other runners. + checkArgument( + computeNumShards != null || numShardsProvider != null, + "When using windowed writes, must specify number of output shards explicitly", + WriteFiles.class.getSimpleName()); } +this.writeOperation = sink.createWriteOperation(); +this.writeOperation.setWindowedWrites(windowedWrites); -@Override -public boolean equals(Object o) { - if (!(o instanceof WriterKey)) { -return false; - } - WriterKey other = (WriterKey) o; - return Objects.equal(window, other.window) - && Objects.equal(paneInfo, other.paneInfo) - && Objects.equal(destination, other.destination); +if (!windowedWrites) { + // Re-window the data into the global window and remove any existing triggers. + input = + input.apply( + "RewindowIntoGlobal", + Window.into(new GlobalWindows()) + .triggering(DefaultTrigger.of()) + .discardingFiredPanes()); +} + +Coder destinationCoder; +try { + destinationCoder = + getDynamicDestinations() + .getDestinationCoderWithDefault(input.getPipeline().getCoderRegistry()); + destinationCoder.verifyDeterministic(); +} catch (CannotProvideCoderException | NonDeterministicException e) { + throw new RuntimeException(e); +} +@SuppressWarnings("unchecked") +Coder windowCoder = +(Coder) input.getWindowingStrategy().getWindowFn().windowCoder(); +FileResultCoder fileResultCoder = +FileResultCoder.of(windowCoder, destinationCoder); + +PCollectionView numShardsView = +(computeNumShards == null) ? null : input.apply(computeNumShards); + +PCollection> tempFileResults = +(computeNumShards == null && numShardsProvider == null) +? input.apply( +"WriteUnshardedBundlesToTempFiles", Review comment: Unfortunately, refactoring into new PTransforms changes the name of every single sub step (since step names are hierarchical). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] reuvenlax commented on a change in pull request #4145: Many simplifications to WriteFiles
reuvenlax commented on a change in pull request #4145: Many simplifications to WriteFiles URL: https://github.com/apache/beam/pull/4145#discussion_r152652760 ## File path: sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java ## @@ -824,177 +826,78 @@ public void startBundle() { public void processElement(ProcessContext c) { fileResults.add(c.element()); if (fixedNumShards == null) { -if (numShardsView != null) { - fixedNumShards = c.sideInput(numShardsView); -} else if (numShardsProvider != null) { - fixedNumShards = numShardsProvider.get(); -} else { - throw new IllegalStateException( - "When finalizing a windowed write, should have set fixed sharding"); -} +fixedNumShards = getFixedNumShards.apply(c); +checkState(fixedNumShards != null, "Windowed write should have set fixed sharding"); Review comment: Windowed (non triggered) writes in batch do not need fixed sharding This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] chamikaramj commented on issue #4149: [BEAM-3060] Add Compressed TextIOIT
chamikaramj commented on issue #4149: [BEAM-3060] Add Compressed TextIOIT URL: https://github.com/apache/beam/pull/4149#issuecomment-346929621 LGTM This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] chamikaramj commented on a change in pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO
chamikaramj commented on a change in pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO URL: https://github.com/apache/beam/pull/4169#discussion_r153042053 ## File path: sdks/java/io/file-based-io-tests/pom.xml ## @@ -139,6 +139,24 @@ + + +google-cloud-storage + + +filesystem +GCS Review comment: Does this require GCS to be all caps ? If so is there a way to not require that ? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] chamikaramj commented on a change in pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO
chamikaramj commented on a change in pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO URL: https://github.com/apache/beam/pull/4169#discussion_r153042101 ## File path: sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java ## @@ -81,13 +81,38 @@ public static void setup() throws ParseException { .as(IOTestPipelineOptions.class); numberOfTextLines = options.getNumberOfRecords(); -filenamePrefix = appendTimestamp(options.getFilenamePrefix()); +filenamePrefix = resolveProtocolAndPath(options); } private static String appendTimestamp(String filenamePrefix) { return String.format("%s_%s", filenamePrefix, new Date().getTime()); } + private static String resolveProtocolAndPath(IOTestPipelineOptions options) { Review comment: I'm not sure why we need to parse and reassemble protocol here. We shouldn't have to do this if we ask user to give the full prefix that includes the protocol. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] chamikaramj commented on a change in pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO
chamikaramj commented on a change in pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO URL: https://github.com/apache/beam/pull/4169#discussion_r153042000 ## File path: sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/IOTestPipelineOptions.java ## @@ -100,4 +100,14 @@ String getFilenamePrefix(); void setFilenamePrefix(String prefix); + + @Description("Google cloud storage - bucket_name/path") + String getGcsLocation(); Review comment: Why don't we use fileNamePrefix for all file-systems instead of introducing a property per file-system ? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] jkff commented on issue #4172: [BEAM-3243] support multiple anonymous classes from the same enclosing class in a pipeline
jkff commented on issue #4172: [BEAM-3243] support multiple anonymous classes from the same enclosing class in a pipeline URL: https://github.com/apache/beam/pull/4172#issuecomment-347018533 Using multiple anonymous DoFn's with the same enclosing class within the same composite transform is already possible if you specify the transform name in .apply() - e.g.: .apply("Something", ParDo.of(new DoFn..)).apply("Something else", ParDo.of(new DoFn..)). This is a good thing rather than a bug, because using generated names like `Enclosing$1` is unstable w.r.t. pipeline update: any reordering of the anonymous classes, or adding a new one, or making an existing one be non-anonymous, will change the numbering and make the pipeline update-incompatible. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] rmannibucau commented on issue #4172: [BEAM-3243] support multiple anonymous classes from the same enclosing class in a pipeline
rmannibucau commented on issue #4172: [BEAM-3243] support multiple anonymous classes from the same enclosing class in a pipeline URL: https://github.com/apache/beam/pull/4172#issuecomment-347019445 I understand where it comes from but it is very boring when writing tests where it is not uncommon to write anonymous classes and the inline naming is not always doable when you write utility methods or reusable piece of pipelines. You would also note that anonymous algorithm with the "number" is somehow aligned on the fact to keep only the suffix in the nested class case which leads to as meaningless names (Important$Stuff leads to Stuff which is in general not very meaningful for the "task" context since it is hold by the enclosing class to avoid long and repeating names). The original issue is when it fails it is quite abstract and not very obvious. An alternative can be to enrich the error message with: 1. where the anonymous *classes* (all conflicting ones) are 2. how to fix it - passing a name to the apply I would be fine with this "not solution" fix as well, does it sound better for you? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] jkff commented on issue #4172: [BEAM-3243] support multiple anonymous classes from the same enclosing class in a pipeline
jkff commented on issue #4172: [BEAM-3243] support multiple anonymous classes from the same enclosing class in a pipeline URL: https://github.com/apache/beam/pull/4172#issuecomment-347029760 Yeah, I would prefer an improved error message. I suppose you mean the message that says `Transform Foo does not have a stable unique name. This will prevent reloading of pipelines.` - that message definitely should point to specifying the name in .apply() as a fix. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] rmannibucau commented on issue #4172: [BEAM-3243] support multiple anonymous classes from the same enclosing class in a pipeline
rmannibucau commented on issue #4172: [BEAM-3243] support multiple anonymous classes from the same enclosing class in a pipeline URL: https://github.com/apache/beam/pull/4172#issuecomment-347035725 Something like Transform Foo$1 conflicts with Foo$2 in pipeline defined in MyTest line 56. You can fix it adding a name in apply invocations line 54 and 52. The long name is important cause otherwise you have no clue of the provenance. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] tweise commented on issue #4074: [BEAM-3130] View.asMap() causes a ClassCastException in Apex runner
tweise commented on issue #4074: [BEAM-3130] View.asMap() causes a ClassCastException in Apex runner URL: https://github.com/apache/beam/pull/4074#issuecomment-347068452 @jkff tests are added This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] luke-zhu opened a new pull request #4176: [BEAM-3143] Type Inference Compatibility with Python 3
luke-zhu opened a new pull request #4176: [BEAM-3143] Type Inference Compatibility with Python 3 URL: https://github.com/apache/beam/pull/4176 Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] Make sure there is a [JIRA issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes. - [ ] Each commit in the pull request should have a meaningful subject line and body. - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue. - [ ] Write a pull request description that is detailed enough to understand what the pull request does, how, and why. - [ ] Run `mvn clean verify` to make sure basic checks pass. A more thorough check will be performed on your pull request automatically. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). --- Builds of Holden's work to get a type inference solution that passes precommit tests on Python 2 and type inference unit tests on Python 3.5. The disassembler code may need more changes if we aim for 3.6+ due to the byteword to quadword change. I've ported some code from CPython's lib/dis to disassembly.py This should make any future migration to 3.6+ easier. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] chamikaramj commented on a change in pull request #4136: [BEAM-3184] Added ProxyInfoFromEnvironmentVar() & GetNewHttp() methods for GCS
chamikaramj commented on a change in pull request #4136: [BEAM-3184] Added ProxyInfoFromEnvironmentVar() & GetNewHttp() methods for GCS URL: https://github.com/apache/beam/pull/4136#discussion_r153058028 ## File path: sdks/python/apache_beam/io/gcp/gcsio.py ## @@ -87,6 +87,50 @@ MAX_BATCH_OPERATION_SIZE = 100 +def ProxyInfoFromEnvironmentVar(proxy_env_var): + """Reads proxy info from the environment and converts to httplib2.ProxyInfo. + Args: +proxy_env_var: Environment variable string to read, such as http_proxy or + https_proxy. + Returns: +httplib2.ProxyInfo constructed from the environment string. + """ + proxy_url = os.environ.get(proxy_env_var) + if not proxy_url or not proxy_env_var.lower().startswith('http'): +return httplib2.ProxyInfo(httplib2.socks.PROXY_TYPE_HTTP, None, 0) + proxy_protocol = proxy_env_var.lower().split('_')[0] + if not proxy_url.lower().startswith('http'): +# proxy_info_from_url requires a protocol, which is always http or https. +proxy_url = proxy_protocol + '://' + proxy_url + return httplib2.proxy_info_from_url(proxy_url, method=proxy_protocol) + +def GetNewHttp(http_class=httplib2.Http, **kwargs): + """Creates and returns a new httplib2.Http instance. + Args: +http_class: Optional custom Http class to use. Review comment: Nit: s/Optiona/optional and s/Arguments/arguments This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] chamikaramj commented on a change in pull request #4136: [BEAM-3184] Added ProxyInfoFromEnvironmentVar() & GetNewHttp() methods for GCS
chamikaramj commented on a change in pull request #4136: [BEAM-3184] Added ProxyInfoFromEnvironmentVar() & GetNewHttp() methods for GCS URL: https://github.com/apache/beam/pull/4136#discussion_r153057996 ## File path: sdks/python/apache_beam/io/gcp/gcsio.py ## @@ -87,6 +87,50 @@ MAX_BATCH_OPERATION_SIZE = 100 +def ProxyInfoFromEnvironmentVar(proxy_env_var): + """Reads proxy info from the environment and converts to httplib2.ProxyInfo. + Args: +proxy_env_var: Environment variable string to read, such as http_proxy or + https_proxy. + Returns: +httplib2.ProxyInfo constructed from the environment string. + """ + proxy_url = os.environ.get(proxy_env_var) + if not proxy_url or not proxy_env_var.lower().startswith('http'): +return httplib2.ProxyInfo(httplib2.socks.PROXY_TYPE_HTTP, None, 0) + proxy_protocol = proxy_env_var.lower().split('_')[0] Review comment: Why '_' ? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] chamikaramj commented on a change in pull request #4136: [BEAM-3184] Added ProxyInfoFromEnvironmentVar() & GetNewHttp() methods for GCS
chamikaramj commented on a change in pull request #4136: [BEAM-3184] Added ProxyInfoFromEnvironmentVar() & GetNewHttp() methods for GCS URL: https://github.com/apache/beam/pull/4136#discussion_r153058033 ## File path: sdks/python/apache_beam/io/gcp/gcsio.py ## @@ -87,6 +87,50 @@ MAX_BATCH_OPERATION_SIZE = 100 +def ProxyInfoFromEnvironmentVar(proxy_env_var): + """Reads proxy info from the environment and converts to httplib2.ProxyInfo. + Args: +proxy_env_var: Environment variable string to read, such as http_proxy or + https_proxy. + Returns: +httplib2.ProxyInfo constructed from the environment string. + """ + proxy_url = os.environ.get(proxy_env_var) + if not proxy_url or not proxy_env_var.lower().startswith('http'): +return httplib2.ProxyInfo(httplib2.socks.PROXY_TYPE_HTTP, None, 0) + proxy_protocol = proxy_env_var.lower().split('_')[0] + if not proxy_url.lower().startswith('http'): +# proxy_info_from_url requires a protocol, which is always http or https. +proxy_url = proxy_protocol + '://' + proxy_url + return httplib2.proxy_info_from_url(proxy_url, method=proxy_protocol) + +def GetNewHttp(http_class=httplib2.Http, **kwargs): + """Creates and returns a new httplib2.Http instance. + Args: +http_class: Optional custom Http class to use. +**kwargs: Arguments to pass to http_class constructor. + Returns: +An initialized httplib2.Http instance. + """ + proxy_info = httplib2.ProxyInfo( +proxy_type=3, +proxy_host=None, Review comment: Why do we have to specify 'None' values here instead of leaving default ? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] chamikaramj commented on a change in pull request #4136: [BEAM-3184] Added ProxyInfoFromEnvironmentVar() & GetNewHttp() methods for GCS
chamikaramj commented on a change in pull request #4136: [BEAM-3184] Added ProxyInfoFromEnvironmentVar() & GetNewHttp() methods for GCS URL: https://github.com/apache/beam/pull/4136#discussion_r153058019 ## File path: sdks/python/apache_beam/io/gcp/gcsio.py ## @@ -87,6 +87,50 @@ MAX_BATCH_OPERATION_SIZE = 100 +def ProxyInfoFromEnvironmentVar(proxy_env_var): + """Reads proxy info from the environment and converts to httplib2.ProxyInfo. + Args: +proxy_env_var: Environment variable string to read, such as http_proxy or Review comment: Nit: s/Environment/environment This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] chamikaramj commented on a change in pull request #4136: [BEAM-3184] Added ProxyInfoFromEnvironmentVar() & GetNewHttp() methods for GCS
chamikaramj commented on a change in pull request #4136: [BEAM-3184] Added ProxyInfoFromEnvironmentVar() & GetNewHttp() methods for GCS URL: https://github.com/apache/beam/pull/4136#discussion_r153057981 ## File path: sdks/python/apache_beam/io/gcp/gcsio.py ## @@ -87,6 +87,50 @@ MAX_BATCH_OPERATION_SIZE = 100 +def ProxyInfoFromEnvironmentVar(proxy_env_var): + """Reads proxy info from the environment and converts to httplib2.ProxyInfo. + Args: +proxy_env_var: Environment variable string to read, such as http_proxy or Review comment: What should the format of this be ? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] chamikaramj commented on a change in pull request #4136: [BEAM-3184] Added ProxyInfoFromEnvironmentVar() & GetNewHttp() methods for GCS
chamikaramj commented on a change in pull request #4136: [BEAM-3184] Added ProxyInfoFromEnvironmentVar() & GetNewHttp() methods for GCS URL: https://github.com/apache/beam/pull/4136#discussion_r153058069 ## File path: sdks/python/apache_beam/io/gcp/gcsio.py ## @@ -87,6 +87,50 @@ MAX_BATCH_OPERATION_SIZE = 100 +def ProxyInfoFromEnvironmentVar(proxy_env_var): + """Reads proxy info from the environment and converts to httplib2.ProxyInfo. + Args: +proxy_env_var: Environment variable string to read, such as http_proxy or + https_proxy. + Returns: +httplib2.ProxyInfo constructed from the environment string. + """ + proxy_url = os.environ.get(proxy_env_var) + if not proxy_url or not proxy_env_var.lower().startswith('http'): +return httplib2.ProxyInfo(httplib2.socks.PROXY_TYPE_HTTP, None, 0) + proxy_protocol = proxy_env_var.lower().split('_')[0] + if not proxy_url.lower().startswith('http'): +# proxy_info_from_url requires a protocol, which is always http or https. +proxy_url = proxy_protocol + '://' + proxy_url + return httplib2.proxy_info_from_url(proxy_url, method=proxy_protocol) + +def GetNewHttp(http_class=httplib2.Http, **kwargs): + """Creates and returns a new httplib2.Http instance. + Args: +http_class: Optional custom Http class to use. +**kwargs: Arguments to pass to http_class constructor. + Returns: +An initialized httplib2.Http instance. + """ + proxy_info = httplib2.ProxyInfo( +proxy_type=3, +proxy_host=None, +proxy_port=None, +proxy_user=None, +proxy_pass=None, +proxy_rdns=None + ) + + for proxy_env_var in ['http_proxy', 'HTTP_PROXY', 'https_proxy', 'HTTPS_PROXY']: Review comment: Why do we have to mention all these values here ? Isn't it enough to use a single variable (say, http_proxy) and ask users to always use that ? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] chamikaramj commented on a change in pull request #4136: [BEAM-3184] Added ProxyInfoFromEnvironmentVar() & GetNewHttp() methods for GCS
chamikaramj commented on a change in pull request #4136: [BEAM-3184] Added ProxyInfoFromEnvironmentVar() & GetNewHttp() methods for GCS URL: https://github.com/apache/beam/pull/4136#discussion_r153058006 ## File path: sdks/python/apache_beam/io/gcp/gcsio.py ## @@ -87,6 +87,50 @@ MAX_BATCH_OPERATION_SIZE = 100 +def ProxyInfoFromEnvironmentVar(proxy_env_var): + """Reads proxy info from the environment and converts to httplib2.ProxyInfo. + Args: +proxy_env_var: Environment variable string to read, such as http_proxy or + https_proxy. + Returns: +httplib2.ProxyInfo constructed from the environment string. + """ + proxy_url = os.environ.get(proxy_env_var) + if not proxy_url or not proxy_env_var.lower().startswith('http'): +return httplib2.ProxyInfo(httplib2.socks.PROXY_TYPE_HTTP, None, 0) + proxy_protocol = proxy_env_var.lower().split('_')[0] + if not proxy_url.lower().startswith('http'): +# proxy_info_from_url requires a protocol, which is always http or https. Review comment: Log a warning or raise an error ? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] chamikaramj commented on a change in pull request #4136: [BEAM-3184] Added ProxyInfoFromEnvironmentVar() & GetNewHttp() methods for GCS
chamikaramj commented on a change in pull request #4136: [BEAM-3184] Added ProxyInfoFromEnvironmentVar() & GetNewHttp() methods for GCS URL: https://github.com/apache/beam/pull/4136#discussion_r153057990 ## File path: sdks/python/apache_beam/io/gcp/gcsio.py ## @@ -87,6 +87,50 @@ MAX_BATCH_OPERATION_SIZE = 100 +def ProxyInfoFromEnvironmentVar(proxy_env_var): + """Reads proxy info from the environment and converts to httplib2.ProxyInfo. + Args: +proxy_env_var: Environment variable string to read, such as http_proxy or + https_proxy. + Returns: +httplib2.ProxyInfo constructed from the environment string. + """ + proxy_url = os.environ.get(proxy_env_var) + if not proxy_url or not proxy_env_var.lower().startswith('http'): Review comment: Log a warning that proxy_env_var is ignored. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] jbonofre commented on issue #3808: [BEAM-1920] Add a Spark 2.x support in the Spark runner
jbonofre commented on issue #3808: [BEAM-1920] Add a Spark 2.x support in the Spark runner URL: https://github.com/apache/beam/pull/3808#issuecomment-347089270 Following the vote on the mailing lists, I updated the PR with Spark 2.x update only (no support of Spark 1 anymore). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] charlesccychen commented on a change in pull request #4136: [BEAM-3184] Added ProxyInfoFromEnvironmentVar() & GetNewHttp() methods for GCS
charlesccychen commented on a change in pull request #4136: [BEAM-3184] Added ProxyInfoFromEnvironmentVar() & GetNewHttp() methods for GCS URL: https://github.com/apache/beam/pull/4136#discussion_r153152127 ## File path: sdks/python/apache_beam/io/gcp/gcsio.py ## @@ -87,6 +87,50 @@ MAX_BATCH_OPERATION_SIZE = 100 +def ProxyInfoFromEnvironmentVar(proxy_env_var): + """Reads proxy info from the environment and converts to httplib2.ProxyInfo. + Args: +proxy_env_var: Environment variable string to read, such as http_proxy or + https_proxy. + Returns: +httplib2.ProxyInfo constructed from the environment string. + """ + proxy_url = os.environ.get(proxy_env_var) + if not proxy_url or not proxy_env_var.lower().startswith('http'): +return httplib2.ProxyInfo(httplib2.socks.PROXY_TYPE_HTTP, None, 0) + proxy_protocol = proxy_env_var.lower().split('_')[0] + if not proxy_url.lower().startswith('http'): +# proxy_info_from_url requires a protocol, which is always http or https. +proxy_url = proxy_protocol + '://' + proxy_url + return httplib2.proxy_info_from_url(proxy_url, method=proxy_protocol) + +def GetNewHttp(http_class=httplib2.Http, **kwargs): Review comment: http_class and kwargs are never used in any call to `GetNewHttp`. Please remove these arguments, as they are not necessary. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] charlesccychen commented on a change in pull request #4136: [BEAM-3184] Added ProxyInfoFromEnvironmentVar() & GetNewHttp() methods for GCS
charlesccychen commented on a change in pull request #4136: [BEAM-3184] Added ProxyInfoFromEnvironmentVar() & GetNewHttp() methods for GCS URL: https://github.com/apache/beam/pull/4136#discussion_r153152125 ## File path: sdks/python/apache_beam/io/gcp/gcsio.py ## @@ -87,6 +87,50 @@ MAX_BATCH_OPERATION_SIZE = 100 +def ProxyInfoFromEnvironmentVar(proxy_env_var): + """Reads proxy info from the environment and converts to httplib2.ProxyInfo. + Args: +proxy_env_var: Environment variable string to read, such as http_proxy or + https_proxy. + Returns: +httplib2.ProxyInfo constructed from the environment string. + """ + proxy_url = os.environ.get(proxy_env_var) + if not proxy_url or not proxy_env_var.lower().startswith('http'): +return httplib2.ProxyInfo(httplib2.socks.PROXY_TYPE_HTTP, None, 0) + proxy_protocol = proxy_env_var.lower().split('_')[0] + if not proxy_url.lower().startswith('http'): +# proxy_info_from_url requires a protocol, which is always http or https. +proxy_url = proxy_protocol + '://' + proxy_url + return httplib2.proxy_info_from_url(proxy_url, method=proxy_protocol) + +def GetNewHttp(http_class=httplib2.Http, **kwargs): + """Creates and returns a new httplib2.Http instance. + Args: +http_class: Optional custom Http class to use. +**kwargs: Arguments to pass to http_class constructor. + Returns: +An initialized httplib2.Http instance. + """ + proxy_info = httplib2.ProxyInfo( +proxy_type=3, +proxy_host=None, +proxy_port=None, +proxy_user=None, +proxy_pass=None, +proxy_rdns=None + ) + + for proxy_env_var in ['http_proxy', 'HTTP_PROXY', 'https_proxy', 'HTTPS_PROXY']: +if proxy_env_var in os.environ and os.environ[proxy_env_var]: + proxy_info = ProxyInfoFromEnvironmentVar(proxy_env_var) + break + + # Use a non-infinite SSL timeout to avoid hangs during network flakiness. + kwargs['timeout'] = DEFAULT_HTTP_TIMEOUT_SECONDS Review comment: kwargs is unnecessary--just pass the `timeout` kwarg directly. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] charlesccychen commented on a change in pull request #4136: [BEAM-3184] Added ProxyInfoFromEnvironmentVar() & GetNewHttp() methods for GCS
charlesccychen commented on a change in pull request #4136: [BEAM-3184] Added ProxyInfoFromEnvironmentVar() & GetNewHttp() methods for GCS URL: https://github.com/apache/beam/pull/4136#discussion_r153152122 ## File path: sdks/python/apache_beam/io/gcp/gcsio.py ## @@ -87,6 +87,50 @@ MAX_BATCH_OPERATION_SIZE = 100 +def ProxyInfoFromEnvironmentVar(proxy_env_var): + """Reads proxy info from the environment and converts to httplib2.ProxyInfo. + Args: +proxy_env_var: Environment variable string to read, such as http_proxy or + https_proxy. + Returns: +httplib2.ProxyInfo constructed from the environment string. + """ + proxy_url = os.environ.get(proxy_env_var) + if not proxy_url or not proxy_env_var.lower().startswith('http'): +return httplib2.ProxyInfo(httplib2.socks.PROXY_TYPE_HTTP, None, 0) + proxy_protocol = proxy_env_var.lower().split('_')[0] + if not proxy_url.lower().startswith('http'): +# proxy_info_from_url requires a protocol, which is always http or https. +proxy_url = proxy_protocol + '://' + proxy_url + return httplib2.proxy_info_from_url(proxy_url, method=proxy_protocol) + +def GetNewHttp(http_class=httplib2.Http, **kwargs): + """Creates and returns a new httplib2.Http instance. + Args: +http_class: Optional custom Http class to use. +**kwargs: Arguments to pass to http_class constructor. + Returns: +An initialized httplib2.Http instance. + """ + proxy_info = httplib2.ProxyInfo( +proxy_type=3, +proxy_host=None, +proxy_port=None, +proxy_user=None, +proxy_pass=None, +proxy_rdns=None + ) + + for proxy_env_var in ['http_proxy', 'HTTP_PROXY', 'https_proxy', 'HTTPS_PROXY']: +if proxy_env_var in os.environ and os.environ[proxy_env_var]: + proxy_info = ProxyInfoFromEnvironmentVar(proxy_env_var) + break + + # Use a non-infinite SSL timeout to avoid hangs during network flakiness. + kwargs['timeout'] = DEFAULT_HTTP_TIMEOUT_SECONDS + http = http_class(proxy_info=proxy_info, **kwargs) + return http Review comment: Directly return the object (no intermediate variable `http` necessary). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] charlesccychen commented on a change in pull request #4136: [BEAM-3184] Added ProxyInfoFromEnvironmentVar() & GetNewHttp() methods for GCS
charlesccychen commented on a change in pull request #4136: [BEAM-3184] Added ProxyInfoFromEnvironmentVar() & GetNewHttp() methods for GCS URL: https://github.com/apache/beam/pull/4136#discussion_r153152130 ## File path: sdks/python/apache_beam/io/gcp/gcsio.py ## @@ -87,6 +87,50 @@ MAX_BATCH_OPERATION_SIZE = 100 +def ProxyInfoFromEnvironmentVar(proxy_env_var): Review comment: Please use PEP8 function names thoughout (https://www.python.org/dev/peps/pep-0008/#function-names). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] aaltay commented on issue #4176: [BEAM-3143] Type Inference Compatibility with Python 3
aaltay commented on issue #4176: [BEAM-3143] Type Inference Compatibility with Python 3 URL: https://github.com/apache/beam/pull/4176#issuecomment-347254939 cc: @robertwb This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] mdvorsky commented on a change in pull request #4143: Faster implementation of nexmark.Generator.nextExactString().
mdvorsky commented on a change in pull request #4143: Faster implementation of nexmark.Generator.nextExactString(). URL: https://github.com/apache/beam/pull/4143#discussion_r153266699 ## File path: sdks/java/nexmark/src/main/java/org/apache/beam/sdk/nexmark/sources/Generator.java ## @@ -353,8 +353,16 @@ private static String nextString(Random random, int maxLength) { /** Return a random string of exactly {@code length}. */ private static String nextExactString(Random random, int length) { StringBuilder sb = new StringBuilder(); +int rnd = 0; +int n = 0; // number of random characters left in rnd while (length-- > 0) { - sb.append((char) ('a' + random.nextInt(26))); + if (n == 0) { +rnd = random.nextInt(); +n = 6; // log_26(2^31) + } + sb.append((char) ('a' + rnd % 26)); + rnd /= 26; Review comment: The way I imagine it is that we're generating a number in base 26. We used to generate it one "digit" at a time, now we generate 6 digits at the same time, so the probability of a given resulting string should be roughly the same. E.g., now generating rnd=0 should be the same as previously generating 0 six times in a row. To make it exactly the same, I'd need to use 26^6 as the upper value for rnd, instead of Integer.MAX_VALUE. I could make that change if you'd like. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] jkff commented on issue #4173: [DISCUSS] add a java profile to be able to skip python/go when not relevant for current work
jkff commented on issue #4173: [DISCUSS] add a java profile to be able to skip python/go when not relevant for current work URL: https://github.com/apache/beam/pull/4173#issuecomment-347278456 Seems like a good idea, though please bring it up on the mailing list - I think it's been discussed before and it'd be good if people who participated in that discussion chimed in. As a workaround, I typically use `-pl sdks/java/core -am -amd`. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] jkff commented on a change in pull request #4175: [BEAM-3247] fix Sample.any performance
jkff commented on a change in pull request #4175: [BEAM-3247] fix Sample.any performance URL: https://github.com/apache/beam/pull/4175#discussion_r153282900 ## File path: sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Sample.java ## @@ -209,29 +202,67 @@ public void populateDisplayData(DisplayData.Builder builder) { } /** - * A {@link DoFn} that returns up to limit elements from the side input PCollection. + * A {@link DoFn} that outputs up to limit elements. */ - private static class SampleAnyDoFn extends DoFn { -long limit; -final PCollectionView> iterableView; + private static class SampleAnyDoFn extends DoFn { Review comment: Is this DoFn really needed? I'm wondering if your implementation of the combiner is sufficient for performance. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] jkff commented on a change in pull request #4175: [BEAM-3247] fix Sample.any performance
jkff commented on a change in pull request #4175: [BEAM-3247] fix Sample.any performance URL: https://github.com/apache/beam/pull/4175#discussion_r153282640 ## File path: sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Sample.java ## @@ -209,29 +202,67 @@ public void populateDisplayData(DisplayData.Builder builder) { } /** - * A {@link DoFn} that returns up to limit elements from the side input PCollection. + * A {@link DoFn} that outputs up to limit elements. */ - private static class SampleAnyDoFn extends DoFn { -long limit; -final PCollectionView> iterableView; + private static class SampleAnyDoFn extends DoFn { +private long limit; -public SampleAnyDoFn(long limit, PCollectionView> iterableView) { +public SampleAnyDoFn(long limit) { this.limit = limit; - this.iterableView = iterableView; } @ProcessElement public void processElement(ProcessContext c) { - for (T i : c.sideInput(iterableView)) { -if (limit-- <= 0) { - break; -} -c.output(i); + if (--limit >= 0) { Review comment: Please make sure to reset the limit in startBundle/finishBundle. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] jkff commented on issue #3808: [BEAM-1920] Add a Spark 2.x support in the Spark runner
jkff commented on issue #3808: [BEAM-1920] Add a Spark 2.x support in the Spark runner URL: https://github.com/apache/beam/pull/3808#issuecomment-347279993 Please ping the thread when the PR is ready for review. I think you said on the mailing list that some tests are still pending? And, since this is a major change, please also run the ValidatesRunner tests and I think it'd be prudent to perform the typical manual release validation steps (quickstart, game examples) against a real Spark cluster. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] jkff opened a new pull request #4177: [BEAM-2870] Strips partition decorators when creating/patching tables in batch
jkff opened a new pull request #4177: [BEAM-2870] Strips partition decorators when creating/patching tables in batch URL: https://github.com/apache/beam/pull/4177 Addresses https://stackoverflow.com/questions/47351578/create-dynamic-side-outputs-in-apache-beam-dataflow?noredirect=1#comment81973314_47351578 R: @chamikaramj CC: @reuvenlax This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] kamszPolidea commented on a change in pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO
kamszPolidea commented on a change in pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO URL: https://github.com/apache/beam/pull/4169#discussion_r153293304 ## File path: sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/IOTestPipelineOptions.java ## @@ -100,4 +100,14 @@ String getFilenamePrefix(); void setFilenamePrefix(String prefix); + + @Description("Google cloud storage - bucket_name/path") + String getGcsLocation(); Review comment: We can use `--filenamePrefix`, but then we need to provide full communication scheme there for GCS or HDFS, for instance `gs://bucket/path/file` or `hdfs://hadoop-master:port/dfs-path/file`. If we assume that user running tests will know it then those two gcsLocation and hdfsLocation could be ommited. This is basically implementation of our proposal https://docs.google.com/document/d/1dA-5s6OHiP_cz-NRAbwapoKF5MEC1wKps4A5tFbIPKE/edit#heading=h.29mfbxd6kc64 . Do you think would be better to remove those two pipeline options and just depend on pipelinePrefix ? Should I also remove protocol resolving part then ? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] kamszPolidea commented on a change in pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO
kamszPolidea commented on a change in pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO URL: https://github.com/apache/beam/pull/4169#discussion_r153293304 ## File path: sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/IOTestPipelineOptions.java ## @@ -100,4 +100,14 @@ String getFilenamePrefix(); void setFilenamePrefix(String prefix); + + @Description("Google cloud storage - bucket_name/path") + String getGcsLocation(); Review comment: We can use `--filenamePrefix`, but then we need to provide full communication scheme there for GCS or HDFS, for instance `gs://bucket/path/file` or `hdfs://hadoop-master:port/dfs-path/file`. If we assume that user running tests will know it then those two gcsLocation and hdfsLocation could be ommited. This is basically implementation of our proposal https://docs.google.com/document/d/1dA-5s6OHiP_cz-NRAbwapoKF5MEC1wKps4A5tFbIPKE/edit#heading=h.29mfbxd6kc64 . Do you think would be better to remove those two pipeline options and just depend on pipelinePrefix ? Should I also remove protocol resolving part then ? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] szewi commented on a change in pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO
szewi commented on a change in pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO URL: https://github.com/apache/beam/pull/4169#discussion_r153293511 ## File path: sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/IOTestPipelineOptions.java ## @@ -100,4 +100,14 @@ String getFilenamePrefix(); void setFilenamePrefix(String prefix); + + @Description("Google cloud storage - bucket_name/path") + String getGcsLocation(); Review comment: We can use `--filenamePrefix`, but then we need to provide full communication scheme there for GCS or HDFS, for instance `gs://bucket/path/file` or `hdfs://hadoop-master:port/dfs-path/file`. If we assume that user running tests will know it then those two gcsLocation and hdfsLocation could be ommited. This is basically implementation of our proposal https://docs.google.com/document/d/1dA-5s6OHiP_cz-NRAbwapoKF5MEC1wKps4A5tFbIPKE/edit#heading=h.29mfbxd6kc64 . Do you think would be better to remove those two pipeline options and just depend on pipelinePrefix ? Should I also remove protocol resolving part then ? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] szewi commented on a change in pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO
szewi commented on a change in pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO URL: https://github.com/apache/beam/pull/4169#discussion_r153293998 ## File path: sdks/java/io/file-based-io-tests/pom.xml ## @@ -139,6 +139,24 @@ + + +google-cloud-storage + + +filesystem +GCS Review comment: When provided -Dfilesystem=gcs it won't activate this profile. We should make decision whether uppercased or lowercased value of property is better. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] szewi commented on a change in pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO
szewi commented on a change in pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO URL: https://github.com/apache/beam/pull/4169#discussion_r153293998 ## File path: sdks/java/io/file-based-io-tests/pom.xml ## @@ -139,6 +139,24 @@ + + +google-cloud-storage + + +filesystem +GCS Review comment: When provided `-Dfilesystem=gcs` it won't activate this profile. We should make decision whether uppercased or lowercased value of property is better. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] szewi commented on a change in pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO
szewi commented on a change in pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO URL: https://github.com/apache/beam/pull/4169#discussion_r153293511 ## File path: sdks/java/io/common/src/test/java/org/apache/beam/sdk/io/common/IOTestPipelineOptions.java ## @@ -100,4 +100,14 @@ String getFilenamePrefix(); void setFilenamePrefix(String prefix); + + @Description("Google cloud storage - bucket_name/path") + String getGcsLocation(); Review comment: We can use `--filenamePrefix`, but then we need to provide full communication scheme there for GCS or HDFS, for instance `gs://bucket/path/file` or `hdfs://hadoop-master:port/dfs-path/file`. If we assume that user running tests will know it then those two gcsLocation and hdfsLocation could be ommited. This is basically implementation of our proposal https://docs.google.com/document/d/1dA-5s6OHiP_cz-NRAbwapoKF5MEC1wKps4A5tFbIPKE/edit#heading=h.29mfbxd6kc64 . Do you think would be better to remove those two pipeline options and just depend on filenamePrefix ? Should I also remove protocol resolving part then ? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] szewi commented on a change in pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO
szewi commented on a change in pull request #4169: [BEAM-3060] Added support for multiple filesystems in TextIO URL: https://github.com/apache/beam/pull/4169#discussion_r153296166 ## File path: sdks/java/io/file-based-io-tests/src/test/java/org/apache/beam/sdk/io/text/TextIOIT.java ## @@ -81,13 +81,38 @@ public static void setup() throws ParseException { .as(IOTestPipelineOptions.class); numberOfTextLines = options.getNumberOfRecords(); -filenamePrefix = appendTimestamp(options.getFilenamePrefix()); +filenamePrefix = resolveProtocolAndPath(options); } private static String appendTimestamp(String filenamePrefix) { return String.format("%s_%s", filenamePrefix, new Date().getTime()); } + private static String resolveProtocolAndPath(IOTestPipelineOptions options) { Review comment: Sure. However if in the future we will have kind of "validator" that will validate test input parameters, then it would reassemble `--filenamePrefix` I guess. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] angoenka opened a new pull request #4178: [BEAM-3239] Adding debug server to sdk worker to get threaddumps
angoenka opened a new pull request #4178: [BEAM-3239] Adding debug server to sdk worker to get threaddumps URL: https://github.com/apache/beam/pull/4178 Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] Make sure there is a [JIRA issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes. - [ ] Each commit in the pull request should have a meaningful subject line and body. - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue. - [ ] Write a pull request description that is detailed enough to understand what the pull request does, how, and why. - [ ] Run `mvn clean verify` to make sure basic checks pass. A more thorough check will be performed on your pull request automatically. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). --- Adding debug server to dump thread stack on http request This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] tvalentyn opened a new pull request #4179: Fix repository path for Python Jenkins builds. Source directory was c?
tvalentyn opened a new pull request #4179: Fix repository path for Python Jenkins builds. Source directory was c? URL: https://github.com/apache/beam/pull/4179 ?hanged in PR/3976 This should fix BEAM-3120. Follow this checklist to help us incorporate your contribution quickly and easily: - [ ] Make sure there is a [JIRA issue](https://issues.apache.org/jira/projects/BEAM/issues/) filed for the change (usually before you start working on it). Trivial changes like typos do not require a JIRA issue. Your pull request should address just this issue, without pulling in other changes. - [ ] Each commit in the pull request should have a meaningful subject line and body. - [ ] Format the pull request title like `[BEAM-XXX] Fixes bug in ApproximateQuantiles`, where you replace `BEAM-XXX` with the appropriate JIRA issue. - [ ] Write a pull request description that is detailed enough to understand what the pull request does, how, and why. - [ ] Run `mvn clean verify` to make sure basic checks pass. A more thorough check will be performed on your pull request automatically. - [ ] If this contribution is large, please file an Apache [Individual Contributor License Agreement](https://www.apache.org/licenses/icla.pdf). --- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] tvalentyn commented on issue #4179: Fix repository path for Python Jenkins builds. Source directory was c?
tvalentyn commented on issue #4179: Fix repository path for Python Jenkins builds. Source directory was c? URL: https://github.com/apache/beam/pull/4179#issuecomment-347305706 Run Python PostCommit This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] tvalentyn commented on issue #4179: Fix repository path for Python Jenkins builds. Source directory was c?
tvalentyn commented on issue #4179: Fix repository path for Python Jenkins builds. Source directory was c? URL: https://github.com/apache/beam/pull/4179#issuecomment-347309489 @kennknowles @jasonkuster @lukecwik - Do we have a process to test groovy changes before the commit? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] kennknowles commented on issue #4179: Fix repository path for Python Jenkins builds. Source directory was c?
kennknowles commented on issue #4179: Fix repository path for Python Jenkins builds. Source directory was c? URL: https://github.com/apache/beam/pull/4179#issuecomment-347310170 You can `run seed job` and it will actually be installed right away. It can then be rolled back if broken. Other than that, you need to have a test Jenkins cluster, which is a bit of work. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] reuvenlax commented on issue #4116: [BEAM-2953] Part 1 of Multipart advanced timeseries examples
reuvenlax commented on issue #4116: [BEAM-2953] Part 1 of Multipart advanced timeseries examples URL: https://github.com/apache/beam/pull/4116#issuecomment-347321463 R: @kennknowles This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] pabloem opened a new pull request #4180: Updating dataflow API version to newer release.
pabloem opened a new pull request #4180: Updating dataflow API version to newer release. URL: https://github.com/apache/beam/pull/4180 r: @bjchambers This new release contains the changes to the Dataflow API protos for side input inter transform IO. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] tvalentyn commented on issue #4179: Fix repository path for Python Jenkins builds. Source directory was c?
tvalentyn commented on issue #4179: Fix repository path for Python Jenkins builds. Source directory was c? URL: https://github.com/apache/beam/pull/4179#issuecomment-347331807 run seed job This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] tvalentyn commented on issue #4179: Fix repository path for Python Jenkins builds. Source directory was c?
tvalentyn commented on issue #4179: Fix repository path for Python Jenkins builds. Source directory was c? URL: https://github.com/apache/beam/pull/4179#issuecomment-347331790 Thanks, that's helpful. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] tvalentyn commented on issue #4179: Fix repository path for Python Jenkins builds. Source directory was c?
tvalentyn commented on issue #4179: Fix repository path for Python Jenkins builds. Source directory was c? URL: https://github.com/apache/beam/pull/4179#issuecomment-347332806 Run Python PostCommit This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] tvalentyn commented on issue #4179: Fix repository path for Python Jenkins builds. Source directory was c?
tvalentyn commented on issue #4179: Fix repository path for Python Jenkins builds. Source directory was c? URL: https://github.com/apache/beam/pull/4179#issuecomment-347333086 Run Python PostCommit This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] nevillelyh commented on a change in pull request #4175: [BEAM-3247] fix Sample.any performance
nevillelyh commented on a change in pull request #4175: [BEAM-3247] fix Sample.any performance URL: https://github.com/apache/beam/pull/4175#discussion_r153329942 ## File path: sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Sample.java ## @@ -209,29 +202,67 @@ public void populateDisplayData(DisplayData.Builder builder) { } /** - * A {@link DoFn} that returns up to limit elements from the side input PCollection. + * A {@link DoFn} that outputs up to limit elements. */ - private static class SampleAnyDoFn extends DoFn { -long limit; -final PCollectionView> iterableView; + private static class SampleAnyDoFn extends DoFn { Review comment: Probably not, but that might change the semantics a bit. Without this, we'll combine each window into into `n` or fewer elements. But I guess that's more natural? This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] nevillelyh commented on a change in pull request #4175: [BEAM-3247] fix Sample.any performance
nevillelyh commented on a change in pull request #4175: [BEAM-3247] fix Sample.any performance URL: https://github.com/apache/beam/pull/4175#discussion_r153330138 ## File path: sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Sample.java ## @@ -209,29 +202,67 @@ public void populateDisplayData(DisplayData.Builder builder) { } /** - * A {@link DoFn} that returns up to limit elements from the side input PCollection. + * A {@link DoFn} that outputs up to limit elements. */ - private static class SampleAnyDoFn extends DoFn { -long limit; -final PCollectionView> iterableView; + private static class SampleAnyDoFn extends DoFn { Review comment: OTOH the iterable view approeach won't even work with unbounded collection so guess this new semantic is better. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] jkff commented on a change in pull request #4175: [BEAM-3247] fix Sample.any performance
jkff commented on a change in pull request #4175: [BEAM-3247] fix Sample.any performance URL: https://github.com/apache/beam/pull/4175#discussion_r153332162 ## File path: sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Sample.java ## @@ -209,29 +202,67 @@ public void populateDisplayData(DisplayData.Builder builder) { } /** - * A {@link DoFn} that returns up to limit elements from the side input PCollection. + * A {@link DoFn} that outputs up to limit elements. */ - private static class SampleAnyDoFn extends DoFn { -long limit; -final PCollectionView> iterableView; + private static class SampleAnyDoFn extends DoFn { Review comment: Not sure why you say that: views are also per-window, so I think it shouldn't matter whether the collection is bounded or unbounded. (though, of course, it'll be behaving weirdly in case of multiple trigger firings - see also https://issues.apache.org/jira/browse/BEAM-2305, maybe similar issues apply here too) This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] jkff commented on a change in pull request #4145: Many simplifications to WriteFiles
jkff commented on a change in pull request #4145: Many simplifications to WriteFiles URL: https://github.com/apache/beam/pull/4145#discussion_r153291513 ## File path: sdks/java/core/src/main/java/org/apache/beam/sdk/io/FileBasedSink.java ## @@ -686,40 +756,38 @@ public int compare( */ @VisibleForTesting @Experimental(Kind.FILESYSTEM) -final void copyToOutputFiles( +final void moveToOutputFiles( List, ResourceId>> resultsToFinalFilenames) throws IOException { int numFiles = resultsToFinalFilenames.size(); - if (numFiles > 0) { -LOG.debug("Copying {} files.", numFiles); -List srcFiles = new ArrayList<>(resultsToFinalFilenames.size()); -List dstFiles = new ArrayList<>(resultsToFinalFilenames.size()); -for (KV, ResourceId> entry : resultsToFinalFilenames) { - srcFiles.add(entry.getKey().getTempFilename()); - dstFiles.add(entry.getValue()); - LOG.info( - "Will copy temporary file {} to final location {}", - entry.getKey().getTempFilename(), - entry.getValue()); -} -// During a failure case, files may have been deleted in an earlier step. Thus -// we ignore missing files here. -FileSystems.copy(srcFiles, dstFiles, StandardMoveOptions.IGNORE_MISSING_FILES); - } else { -LOG.info("No output files to write."); + LOG.debug("Copying {} files.", numFiles); + List srcFiles = new ArrayList<>(resultsToFinalFilenames.size()); + List dstFiles = new ArrayList<>(resultsToFinalFilenames.size()); Review comment: Seems overkill - they are created right next to each other in code, and FileSystems.copy() already does that verification. I removed the size hints to make it a little simpler (preallocation probably doesn't matter here). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] jkff commented on a change in pull request #4145: Many simplifications to WriteFiles
jkff commented on a change in pull request #4145: Many simplifications to WriteFiles URL: https://github.com/apache/beam/pull/4145#discussion_r153294218 ## File path: sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java ## @@ -1011,14 +1002,19 @@ public void processElement(ProcessContext c) throws Exception { writer.open(uuid, destination); writer.close(); completeResults.add( - new FileResult<>(writer.getOutputFile(), shard, null, null, destination)); + new FileResult<>( + writer.getOutputFile(), + shard, + GlobalWindow.INSTANCE, + PaneInfo.ON_TIME_AND_ONLY_FIRING, Review comment: Previously the code was structured differently, and the values passed in this particular codepath ended up being ignored. I consolidated things somewhat to handle much of windowed and unwindowed case the same way, and made the requirements more strict, in particular that window and pane have to be always set. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] jkff commented on a change in pull request #4145: Many simplifications to WriteFiles
jkff commented on a change in pull request #4145: Many simplifications to WriteFiles URL: https://github.com/apache/beam/pull/4145#discussion_r153293639 ## File path: sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java ## @@ -527,13 +521,8 @@ public void processElement(ProcessContext c, BoundedWindow window) throws Except writer.cleanup(); throw e; } -int shardNumber = -shardNumberAssignment == ShardAssignment.ASSIGN_WHEN_WRITING -? c.element().getKey().getShardNumber() -: UNKNOWN_SHARDNUM; -c.output( -new FileResult<>( -writer.getOutputFile(), shardNumber, window, c.pane(), entry.getKey())); +int shard = c.element().getKey().getShardNumber(); Review comment: Done. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] jkff commented on a change in pull request #4145: Many simplifications to WriteFiles
jkff commented on a change in pull request #4145: Many simplifications to WriteFiles URL: https://github.com/apache/beam/pull/4145#discussion_r153291649 ## File path: sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java ## @@ -672,8 +661,11 @@ public void processElement(ProcessContext context) throws IOException { // PCollection. There is a dependency between this ParDo and the first (the // WriteOperation PCollection as a side input), so this will happen after the // initial ParDo. -PCollection> results; -final PCollectionView numShardsView; +PCollectionView numShardsView = +(computeNumShards == null) ? null : input.apply(computeNumShards); +List> shardingSideInputs = numShardsView == null Review comment: Done. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] jkff commented on a change in pull request #4145: Many simplifications to WriteFiles
jkff commented on a change in pull request #4145: Many simplifications to WriteFiles URL: https://github.com/apache/beam/pull/4145#discussion_r153291927 ## File path: sdks/java/core/src/main/java/org/apache/beam/sdk/io/WriteFiles.java ## @@ -824,177 +826,78 @@ public void startBundle() { public void processElement(ProcessContext c) { fileResults.add(c.element()); if (fixedNumShards == null) { -if (numShardsView != null) { - fixedNumShards = c.sideInput(numShardsView); -} else if (numShardsProvider != null) { - fixedNumShards = numShardsProvider.get(); -} else { - throw new IllegalStateException( - "When finalizing a windowed write, should have set fixed sharding"); -} +fixedNumShards = getFixedNumShards.apply(c); +checkState(fixedNumShards != null, "Windowed write should have set fixed sharding"); Review comment: After https://github.com/apache/beam/pull/4124 they do - see https://github.com/apache/beam/pull/4137 for explanation. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] nevillelyh commented on a change in pull request #4175: [BEAM-3247] fix Sample.any performance
nevillelyh commented on a change in pull request #4175: [BEAM-3247] fix Sample.any performance URL: https://github.com/apache/beam/pull/4175#discussion_r153334683 ## File path: sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Sample.java ## @@ -209,29 +202,67 @@ public void populateDisplayData(DisplayData.Builder builder) { } /** - * A {@link DoFn} that returns up to limit elements from the side input PCollection. + * A {@link DoFn} that outputs up to limit elements. */ - private static class SampleAnyDoFn extends DoFn { -long limit; -final PCollectionView> iterableView; + private static class SampleAnyDoFn extends DoFn { Review comment: With the current implementation, I got `java.lang.IllegalArgumentException: Attempted to get side input window for GlobalWindow from non-global WindowFn` with the following snippet. Am I doing something wrong? ```java public class Test { public static void main(String[] args) { PipelineOptions options = PipelineOptionsFactory.create(); Pipeline p = Pipeline.create(options); p.apply(Create.of(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)) .apply(ParDo.of(new DoFn() { @ProcessElement public void processElement(ProcessContext c) { Integer x = c.element(); c.outputWithTimestamp(x, new Instant(x * 1)); } })) .apply(Window.into(FixedWindows.of(Duration.standardMinutes(1 .apply(Sample.any(3)); p.run(); } } ``` This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] jkff commented on issue #4172: [BEAM-3243] support multiple anonymous classes from the same enclosing class in a pipeline
jkff commented on issue #4172: [BEAM-3243] support multiple anonymous classes from the same enclosing class in a pipeline URL: https://github.com/apache/beam/pull/4172#issuecomment-347341595 That would be optimal, and in general I really like the idea of capturing the stack trace of transform application and using it in all error messages. We already do this in PAssert, but it can be applied much more widely - e.g. errors during coder inference, validation errors, some runtime errors etc. I suspect that implementing that can have some unexpected stumbling blocks, so I would recommend to start with a simpler improvement - just make the error message clearly say that the fix is to specify name in apply(). However, if you're willing to drive an effort to use application stack traces in more places in Beam, starting with a discussion on the mailing list - that would be a wonderful contribution and I'd be happy to review it. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] jkff commented on a change in pull request #4175: [BEAM-3247] fix Sample.any performance
jkff commented on a change in pull request #4175: [BEAM-3247] fix Sample.any performance URL: https://github.com/apache/beam/pull/4175#discussion_r153335970 ## File path: sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/Sample.java ## @@ -209,29 +202,67 @@ public void populateDisplayData(DisplayData.Builder builder) { } /** - * A {@link DoFn} that returns up to limit elements from the side input PCollection. + * A {@link DoFn} that outputs up to limit elements. */ - private static class SampleAnyDoFn extends DoFn { -long limit; -final PCollectionView> iterableView; + private static class SampleAnyDoFn extends DoFn { Review comment: This seems like (another) bug in Sample.any. It should use Combine.globally().withoutDefaults() rather than simply Combine.globally(). This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] jkff commented on issue #4139: Update cloud spanner library to 0.29.0
jkff commented on issue #4139: Update cloud spanner library to 0.29.0 URL: https://github.com/apache/beam/pull/4139#issuecomment-347342299 retest this please This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] jkff commented on issue #4139: Update cloud spanner library to 0.29.0
jkff commented on issue #4139: Update cloud spanner library to 0.29.0 URL: https://github.com/apache/beam/pull/4139#issuecomment-347342562 Run Dataflow ValidatesRunner This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services