[jira] [Assigned] (BEAM-2271) Release guide or pom.xml needs update to avoid releasing Python binary artifacts
[ https://issues.apache.org/jira/browse/BEAM-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj reassigned BEAM-2271: --- Assignee: Ahmet Altay (was: Sourabh Bajaj) > Release guide or pom.xml needs update to avoid releasing Python binary > artifacts > > > Key: BEAM-2271 > URL: https://issues.apache.org/jira/browse/BEAM-2271 > Project: Beam > Issue Type: Bug > Components: sdk-py-core >Reporter: Daniel Halperin >Assignee: Ahmet Altay > Fix For: 2.3.0 > > > The following directories (and children) were discovered in 2.0.0-RC2 and > were present in 0.6.0. > {code} > sdks/python: build dist.eggs nose-1.3.7-py2.7.egg (and child > contents) > {code} > Ideally, these artifacts, which are created during setup and testing, would > get created in the {{sdks/python/target/}} subfolder where they will > automatically get ignored. More info below. > For 2.0.0, we will manually remove these files from the source release RC3+. > This should be fixed before the next release. > Here is a list of other paths that get excluded, should they be useful. > {code} > > > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/).*${project.build.directory}.*] > > > > > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?maven-eclipse\.xml] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.project] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.classpath] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?[^/]*\.iws] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.idea(/.*)?] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?out(/.*)?] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?[^/]*\.ipr] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?[^/]*\.iml] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.settings(/.*)?] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.externalToolBuilders(/.*)?] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.deployables(/.*)?] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.wtpmodules(/.*)?] > > > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?cobertura\.ser] > > > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?pom\.xml\.releaseBackup] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?release\.properties] > > {code} > This list is stored inside of this jar, which you can find by tracking > maven-assembly-plugin from the root apache pom: > https://mvnrepository.com/artifact/org.apache.apache.resources/apache-source-release-assembly-descriptor/1.0.6 > http://svn.apache.org/repos/asf/maven/pom/tags/apache-18/pom.xml -- This message was sent by Atlassian JIRA (v6.4.14#64029)
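An aside on the exclude patterns quoted above: the following is a rough Python re-check of the first regex with ${project.build.directory} interpolated to "target". It is only an approximation (Maven uses Java's regex engine and matches against the full path), and the sample paths are hypothetical.
{code:python}
import re

# First exclude from the list above, with ${project.build.directory}
# replaced by "target"; emulate Java's full-string matches() with fullmatch.
pattern = re.compile(r'(?!((?!target/)[^/]+/)*src/).*target.*')

# Excluded: build output under a target/ directory.
print(bool(pattern.fullmatch('sdks/python/target/classes/foo.class')))  # True

# Kept: the path reaches src/ before any target/ component.
print(bool(pattern.fullmatch('sdks/java/src/target/data.txt')))  # False
{code}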
[jira] [Closed] (BEAM-1585) Ability to add new file systems to beamFS in the python sdk
[ https://issues.apache.org/jira/browse/BEAM-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj closed BEAM-1585. --- > Ability to add new file systems to beamFS in the python sdk > --- > > Key: BEAM-1585 > URL: https://issues.apache.org/jira/browse/BEAM-1585 > Project: Beam > Issue Type: New Feature > Components: sdk-py-core >Reporter: Sourabh Bajaj >Assignee: Sourabh Bajaj > Fix For: Not applicable > > > BEAM-1441 implements the new BeamFileSystem in the Python SDK but currently > lacks the ability to add user-implemented file systems. > This needs to be executed in the worker, so it should be packaged correctly with > the pipeline code. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (BEAM-2673) BigQuery Sink should use the Load API
[ https://issues.apache.org/jira/browse/BEAM-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099103#comment-16099103 ] Sourabh Bajaj commented on BEAM-2673: - No, as there is no Truncate mode in streaming; you can only append to or create a new table. > BigQuery Sink should use the Load API > - > > Key: BEAM-2673 > URL: https://issues.apache.org/jira/browse/BEAM-2673 > Project: Beam > Issue Type: Bug > Components: sdk-py >Reporter: Sourabh Bajaj >Assignee: Ahmet Altay > > Currently the BigQuery sink is written to by using the streaming api in the > direct runner. Instead we should just use the load api and also simplify the > management of different create and write dispositions. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
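A hedged sketch of what this constraint means in pipeline code; the table name and element below are placeholders, not taken from the issue:
{code:python}
import apache_beam as beam
from apache_beam.io.gcp.bigquery import BigQueryDisposition, WriteToBigQuery

# Streaming inserts can only append to (or create) a table, so WRITE_APPEND
# is the effective choice; there is no streaming WRITE_TRUNCATE.
with beam.Pipeline() as p:
    _ = (p
         | beam.Create([{'word': 'beam', 'count': 1}])
         | WriteToBigQuery(
             'my_project:my_dataset.my_table',  # placeholder table
             write_disposition=BigQueryDisposition.WRITE_APPEND,
             create_disposition=BigQueryDisposition.CREATE_NEVER))
{code}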
[jira] [Created] (BEAM-2673) BigQuery Sink should use the Load API
Sourabh Bajaj created BEAM-2673: --- Summary: BigQuery Sink should use the Load API Key: BEAM-2673 URL: https://issues.apache.org/jira/browse/BEAM-2673 Project: Beam Issue Type: Bug Components: sdk-py Reporter: Sourabh Bajaj Assignee: Ahmet Altay Currently the BigQuery sink is written to by using the streaming api in the direct runner. Instead we should just use the load api and also simplify the management of different create and write dispositions. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (BEAM-2636) user_score on DataflowRunner is broken
[ https://issues.apache.org/jira/browse/BEAM-2636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093792#comment-16093792 ] Sourabh Bajaj commented on BEAM-2636: - This can be resolved now. > user_score on DataflowRunner is broken > -- > > Key: BEAM-2636 > URL: https://issues.apache.org/jira/browse/BEAM-2636 > Project: Beam > Issue Type: Bug > Components: sdk-py >Affects Versions: 2.1.0 >Reporter: Ahmet Altay >Assignee: Sourabh Bajaj > > UserScore has a custom transform named {{WriteToBigQuery}}; the Dataflow runner > has special code handling transforms with that name, so this will break for > all user transforms that have this name. > We can either: > - Handle this correctly > - Or document this as a reserved keyword and change the example. > cc: [~chamikara] -- This message was sent by Atlassian JIRA (v6.4.14#64029)
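Until the name handling is fixed, the simple workaround is to give the user transform a non-reserved name; a minimal, hypothetical sketch (the class body is illustrative, not the example's real logic):
{code:python}
import apache_beam as beam

# Renamed from WriteToBigQuery so the Dataflow runner does not confuse it
# with the built-in transform it special-cases by name.
class WriteUserScoreSums(beam.PTransform):
    def expand(self, pcoll):
        # Placeholder body; the real example formats rows before writing.
        return pcoll | beam.Map(lambda kv: {'user': kv[0], 'total_score': kv[1]})
{code}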
[jira] [Commented] (BEAM-2271) Release guide or pom.xml needs update to avoid releasing Python binary artifacts
[ https://issues.apache.org/jira/browse/BEAM-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16090671#comment-16090671 ] Sourabh Bajaj commented on BEAM-2271: - https://github.com/apache/beam/pull/3441 creates a new zip which omits the {{.tox }} files etc. but I haven't been able to figure out how to override the actual source release file that is being created as I only see one entry for the execution. [~jbonofre] do you have any ideas on what I might be doing wrong in the PR. > Release guide or pom.xml needs update to avoid releasing Python binary > artifacts > > > Key: BEAM-2271 > URL: https://issues.apache.org/jira/browse/BEAM-2271 > Project: Beam > Issue Type: Bug > Components: sdk-py >Reporter: Daniel Halperin >Assignee: Sourabh Bajaj > Fix For: 2.2.0 > > > The following directories (and children) were discovered in 2.0.0-RC2 and > were present in 0.6.0. > {code} > sdks/python: build dist.eggs nose-1.3.7-py2.7.egg (and child > contents) > {code} > Ideally, these artifacts, which are created during setup and testing, would > get created in the {{sdks/python/target/}} subfolder where they will > automatically get ignored. More info below. > For 2.0.0, we will manually remove these files from the source release RC3+. > This should be fixed before the next release. > Here is a list of other paths that get excluded, should they be useful. > {code} > > > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/).*${project.build.directory}.*] > > > > > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?maven-eclipse\.xml] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.project] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.classpath] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?[^/]*\.iws] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.idea(/.*)?] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?out(/.*)?] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?[^/]*\.ipr] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?[^/]*\.iml] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.settings(/.*)?] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.externalToolBuilders(/.*)?] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.deployables(/.*)?] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.wtpmodules(/.*)?] > > > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?cobertura\.ser] > > > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?pom\.xml\.releaseBackup] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?release\.properties] > > {code} > This list is stored inside of this jar, which you can find by tracking > maven-assembly-plugin from the root apache pom: > https://mvnrepository.com/artifact/org.apache.apache.resources/apache-source-release-assembly-descriptor/1.0.6 > http://svn.apache.org/repos/asf/maven/pom/tags/apache-18/pom.xml -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (BEAM-2271) Release guide or pom.xml needs update to avoid releasing Python binary artifacts
[ https://issues.apache.org/jira/browse/BEAM-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj updated BEAM-2271: Fix Version/s: (was: 2.1.0) 2.2.0 > Release guide or pom.xml needs update to avoid releasing Python binary > artifacts > > > Key: BEAM-2271 > URL: https://issues.apache.org/jira/browse/BEAM-2271 > Project: Beam > Issue Type: Bug > Components: sdk-py >Reporter: Daniel Halperin >Assignee: Sourabh Bajaj > Fix For: 2.2.0 > > > The following directories (and children) were discovered in 2.0.0-RC2 and > were present in 0.6.0. > {code} > sdks/python: build dist.eggs nose-1.3.7-py2.7.egg (and child > contents) > {code} > Ideally, these artifacts, which are created during setup and testing, would > get created in the {{sdks/python/target/}} subfolder where they will > automatically get ignored. More info below. > For 2.0.0, we will manually remove these files from the source release RC3+. > This should be fixed before the next release. > Here is a list of other paths that get excluded, should they be useful. > {code} > > > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/).*${project.build.directory}.*] > > > > > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?maven-eclipse\.xml] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.project] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.classpath] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?[^/]*\.iws] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.idea(/.*)?] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?out(/.*)?] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?[^/]*\.ipr] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?[^/]*\.iml] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.settings(/.*)?] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.externalToolBuilders(/.*)?] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.deployables(/.*)?] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.wtpmodules(/.*)?] > > > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?cobertura\.ser] > > > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?pom\.xml\.releaseBackup] > > %regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?release\.properties] > > {code} > This list is stored inside of this jar, which you can find by tracking > maven-assembly-plugin from the root apache pom: > https://mvnrepository.com/artifact/org.apache.apache.resources/apache-source-release-assembly-descriptor/1.0.6 > http://svn.apache.org/repos/asf/maven/pom/tags/apache-18/pom.xml -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (BEAM-2595) WriteToBigQuery does not work with nested json schema
[ https://issues.apache.org/jira/browse/BEAM-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16087712#comment-16087712 ] Sourabh Bajaj commented on BEAM-2595: - [~andrea.pierleoni] can you verify that your pipeline works with the latest master? > WriteToBigQuery does not work with nested json schema > - > > Key: BEAM-2595 > URL: https://issues.apache.org/jira/browse/BEAM-2595 > Project: Beam > Issue Type: Bug > Components: sdk-py >Affects Versions: 2.1.0 > Environment: mac os local runner, Python >Reporter: Andrea Pierleoni >Assignee: Sourabh Bajaj >Priority: Minor > Labels: gcp > Fix For: 2.1.0 > > > I am trying to use the new `WriteToBigQuery` PTransform added to > `apache_beam.io.gcp.bigquery` in version 2.1.0-RC1 > I need to write to a bigquery table with nested fields. > The only way to specify nested schemas in bigquery is with the json schema. > None of the classes in `apache_beam.io.gcp.bigquery` are able to parse the > json schema, but they accept a schema as an instance of the class > `apache_beam.io.gcp.internal.clients.bigquery.TableFieldSchema` > I am composing the `TableFieldSchema` as suggested here > [https://stackoverflow.com/questions/36127537/json-table-schema-to-bigquery-tableschema-for-bigquerysink/45039436#45039436], > and it looks fine when passed to the PTransform `WriteToBigQuery`. > The problem is that the base class `PTransformWithSideInputs` tries to pickle > and unpickle the function > [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/ptransform.py#L515] > (that includes the TableFieldSchema instance) and for some reason when the > class is unpickled some `FieldList` instances are converted to simple lists, > and the pickling validation fails. > Would it be possible to extend the test coverage to nested json objects for > bigquery? > They are also relatively easy to parse into a TableFieldSchema. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
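For reference, a minimal sketch of the TableFieldSchema composition under discussion, following the Stack Overflow approach linked in the report; the field names are illustrative:
{code:python}
from apache_beam.io.gcp.internal.clients import bigquery

# Build a schema with one nested RECORD field programmatically.
table_schema = bigquery.TableSchema()
address = bigquery.TableFieldSchema(name='address', type='RECORD', mode='NULLABLE')
address.fields.append(
    bigquery.TableFieldSchema(name='city', type='STRING', mode='NULLABLE'))
table_schema.fields.append(address)
{code}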
[jira] [Comment Edited] (BEAM-2572) Implement an S3 filesystem for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16082887#comment-16082887 ] Sourabh Bajaj edited comment on BEAM-2572 at 7/11/17 8:16 PM: -- I think overall this plan makes sense but step 4 might be of unclear difficulty here as it depends on the runners propagating the pipeline options to all the workers correctly. It would be great to start a thread on the dev mailing list around this because we don't have a very clear story for credential passing to PTransforms yet. was (Author: sb2nov): I think overall this plan makes sense but step 4 might be of unclear difficulty here as it depends on the runners propagating the pipeline options to all the workers correctly. It would be great to start a thread around this because we don't have a very clear story for credential passing to PTransforms yet. > Implement an S3 filesystem for Python SDK > - > > Key: BEAM-2572 > URL: https://issues.apache.org/jira/browse/BEAM-2572 > Project: Beam > Issue Type: Task > Components: sdk-py >Reporter: Dmitry Demeshchuk >Assignee: Ahmet Altay >Priority: Minor > > There are two paths worth exploring, to my understanding: > 1. Sticking to the HDFS-based approach (like it's done in Java). > 2. Using boto/boto3 for accessing S3 through its common API endpoints. > I personally prefer the second approach, for a few reasons: > 1. In real life, HDFS and S3 have different consistency guarantees, therefore > their behaviors may contradict each other in some edge cases (say, we write > something to S3, but it's not immediately accessible for reading from another > end). > 2. There are other AWS-based sources and sinks we may want to create in the > future: DynamoDB, Kinesis, SQS, etc. > 3. boto3 already provides somewhat good logic for basic things like > reattempting. > Whatever path we choose, there's another problem related to this: we > currently cannot pass any global settings (say, pipeline options, or just an > arbitrary kwarg) to a filesystem. Because of that, we'd have to setup the > runner nodes to have AWS keys set up in the environment, which is not trivial > to achieve and doesn't look too clean either (I'd rather see one single place > for configuring the runner options). > Also, it's worth mentioning that I already have a janky S3 filesystem > implementation that only supports DirectRunner at the moment (because of the > previous paragraph). I'm perfectly fine finishing it myself, with some > guidance from the maintainers. > Where should I move on from here, and whose input should I be looking for? > Thanks! -- This message was sent by Atlassian JIRA (v6.4.14#64029)
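For context, the extension point being discussed is a FileSystem subclass. A heavily abridged, hypothetical skeleton (the real base class has many more abstract methods than shown here):
{code:python}
from apache_beam.io.filesystem import FileSystem

class S3FileSystem(FileSystem):
    """Hypothetical boto3-backed filesystem; only the scheme hook is shown."""

    @classmethod
    def scheme(cls):
        # Paths like s3://bucket/key would be routed to this class.
        return 's3'
{code}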
[jira] [Commented] (BEAM-2572) Implement an S3 filesystem for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16082887#comment-16082887 ] Sourabh Bajaj commented on BEAM-2572: - I think overall this plan makes sense but step 4 might be of unclear difficulty here as it depends on the runners propagating the pipeline options to all the workers correctly. It would be great to start a thread around this because we don't have a very clear story for credential passing to PTransforms yet. > Implement an S3 filesystem for Python SDK > - > > Key: BEAM-2572 > URL: https://issues.apache.org/jira/browse/BEAM-2572 > Project: Beam > Issue Type: Task > Components: sdk-py >Reporter: Dmitry Demeshchuk >Assignee: Ahmet Altay >Priority: Minor > > There are two paths worth exploring, to my understanding: > 1. Sticking to the HDFS-based approach (like it's done in Java). > 2. Using boto/boto3 for accessing S3 through its common API endpoints. > I personally prefer the second approach, for a few reasons: > 1. In real life, HDFS and S3 have different consistency guarantees, therefore > their behaviors may contradict each other in some edge cases (say, we write > something to S3, but it's not immediately accessible for reading from another > end). > 2. There are other AWS-based sources and sinks we may want to create in the > future: DynamoDB, Kinesis, SQS, etc. > 3. boto3 already provides somewhat good logic for basic things like > reattempting. > Whatever path we choose, there's another problem related to this: we > currently cannot pass any global settings (say, pipeline options, or just an > arbitrary kwarg) to a filesystem. Because of that, we'd have to setup the > runner nodes to have AWS keys set up in the environment, which is not trivial > to achieve and doesn't look too clean either (I'd rather see one single place > for configuring the runner options). > Also, it's worth mentioning that I already have a janky S3 filesystem > implementation that only supports DirectRunner at the moment (because of the > previous paragraph). I'm perfectly fine finishing it myself, with some > guidance from the maintainers. > Where should I move on from here, and whose input should I be looking for? > Thanks! -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (BEAM-2573) Better filesystem discovery mechanism in Python SDK
[ https://issues.apache.org/jira/browse/BEAM-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081145#comment-16081145 ] Sourabh Bajaj edited comment on BEAM-2573 at 7/10/17 9:03 PM: -- I think the worker is specified in https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/internal/dependency.py#L81 so it is related to the version of the sdk being used. You don't need the release if you're on master as I can run the following command and see logs about failing to import the dataflow package on head and success for the builtin plugins. {code:none} python -m apache_beam.examples.wordcount --output $BUCKET/wc/output --project $PROJECT --staging_location $BUCKET/wc/staging --temp_location $BUCKET/wc/tmp --job_name "sourabhbajaj-wc-4" --runner DataflowRunner --sdk_location dist/apache-beam-2.1.0.dev0.tar.gz --beam_plugin dataflow.s3.aws.S3FS {code} {code:none} 13:54:41.574 Failed to import beam plugin dataflow.s3.aws.S3FS 13:54:41.573 Successfully imported beam plugin apache_beam.io.filesystem.FileSystem 13:54:41.573 Successfully imported beam plugin apache_beam.io.localfilesystem.LocalFileSystem 13:54:41.573 Successfully imported beam plugin apache_beam.io.gcp.gcsfilesystem.GCSFileSystem {code} was (Author: sb2nov): I think the worker is specified in https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/internal/dependency.py#L81 so it is related to the version of the sdk being used. I might be wrong about needing the release if you're on master as I can run the following command and see logs about failing to import the dataflow package on head and success for the builtin plugins. {code:none} python -m apache_beam.examples.wordcount --output $BUCKET/wc/output --project $PROJECT --staging_location $BUCKET/wc/staging --temp_location $BUCKET/wc/tmp --job_name "sourabhbajaj-wc-4" --runner DataflowRunner --sdk_location dist/apache-beam-2.1.0.dev0.tar.gz --beam_plugin dataflow.s3.aws.S3FS {code} {code:none} 13:54:41.574 Failed to import beam plugin dataflow.s3.aws.S3FS 13:54:41.573 Successfully imported beam plugin apache_beam.io.filesystem.FileSystem 13:54:41.573 Successfully imported beam plugin apache_beam.io.localfilesystem.LocalFileSystem 13:54:41.573 Successfully imported beam plugin apache_beam.io.gcp.gcsfilesystem.GCSFileSystem {code} > Better filesystem discovery mechanism in Python SDK > --- > > Key: BEAM-2573 > URL: https://issues.apache.org/jira/browse/BEAM-2573 > Project: Beam > Issue Type: Task > Components: runner-dataflow, sdk-py >Affects Versions: 2.0.0 >Reporter: Dmitry Demeshchuk >Priority: Minor > > It looks like right now custom filesystem classes have to be imported > explicitly: > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystems.py#L30 > Seems like the current implementation doesn't allow discovering filesystems > that come from side packages, not from apache_beam itself. Even if I put a > custom FileSystem-inheriting class into a package and explicitly import it in > the root __init__.py of that package, it still doesn't make the class > discoverable. > The problems I'm experiencing happen on Dataflow runner, while Direct runner > works just fine. 
Here's an example of Dataflow output: > {code} > (320418708fe777d7): Traceback (most recent call last): > File > "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line > 581, in do_work > work_executor.execute() > File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", > line 166, in execute > op.start() > File > "/usr/local/lib/python2.7/dist-packages/dataflow_worker/native_operations.py", > line 54, in start > self.output(windowed_value) > File "dataflow_worker/operations.py", line 138, in > dataflow_worker.operations.Operation.output > (dataflow_worker/operations.c:5768) > def output(self, windowed_value, output_index=0): > File "dataflow_worker/operations.py", line 139, in > dataflow_worker.operations.Operation.output > (dataflow_worker/operations.c:5654) > cython.cast(Receiver, > self.receivers[output_index]).receive(windowed_value) > File "dataflow_worker/operations.py", line 72, in > dataflow_worker.operations.ConsumerSet.receive > (dataflow_worker/operations.c:3506) > cython.cast(Operation, consumer).process(windowed_value) > File "dataflow_worker/operations.py", line 328, in > dataflow_worker.operations.DoOperation.process > (dataflow_worker/operations.c:11162) > with self.scoped_process_state: > File "dataflow_worker/operations.py", line 329, in > dataflow_worker.operations.DoOperation.process > (dataflow_worker/operations.c:6) > self.dofn_receiver.
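Since the plugin list travels to the workers as an ordinary pipeline option, it can also be set programmatically. A hedged sketch, assuming the flag name matches the command above and reusing its hypothetical class path:
{code:python}
from apache_beam.options.pipeline_options import PipelineOptions

# Equivalent to passing --beam_plugin on the command line; the class path
# is illustrative and must be importable on the workers.
options = PipelineOptions(['--beam_plugin', 'dataflow.s3.aws.S3FS'])
{code}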
[jira] [Comment Edited] (BEAM-2573) Better filesystem discovery mechanism in Python SDK
[ https://issues.apache.org/jira/browse/BEAM-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081145#comment-16081145 ] Sourabh Bajaj edited comment on BEAM-2573 at 7/10/17 9:03 PM: -- I think the worker is specified in https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/internal/dependency.py#L81 so it is related to the version of the sdk being used. I might be wrong about needing the release if you're on master as I can run the following command and see logs about failing to import the dataflow package on head and success for the builtin plugins. {code:none} python -m apache_beam.examples.wordcount --output $BUCKET/wc/output --project $PROJECT --staging_location $BUCKET/wc/staging --temp_location $BUCKET/wc/tmp --job_name "sourabhbajaj-wc-4" --runner DataflowRunner --sdk_location dist/apache-beam-2.1.0.dev0.tar.gz --beam_plugin dataflow.s3.aws.S3FS {code} {code:none} 13:54:41.574 Failed to import beam plugin dataflow.s3.aws.S3FS 13:54:41.573 Successfully imported beam plugin apache_beam.io.filesystem.FileSystem 13:54:41.573 Successfully imported beam plugin apache_beam.io.localfilesystem.LocalFileSystem 13:54:41.573 Successfully imported beam plugin apache_beam.io.gcp.gcsfilesystem.GCSFileSystem {code} was (Author: sb2nov): I think the worker is specified in https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/internal/dependency.py#L81 so it is related to the version of the sdk being used. I might be wrong about needing the release if you're on master as I can run the following command and see logs about failing to import the dataflow package on head and success for the builtin plugins. {code:shell} python -m apache_beam.examples.wordcount --output $BUCKET/wc/output --project $PROJECT --staging_location $BUCKET/wc/staging --temp_location $BUCKET/wc/tmp --job_name "sourabhbajaj-wc-4" --runner DataflowRunner --sdk_location dist/apache-beam-2.1.0.dev0.tar.gz --beam_plugin dataflow.s3.aws.S3FS {code} > Better filesystem discovery mechanism in Python SDK > --- > > Key: BEAM-2573 > URL: https://issues.apache.org/jira/browse/BEAM-2573 > Project: Beam > Issue Type: Task > Components: runner-dataflow, sdk-py >Affects Versions: 2.0.0 >Reporter: Dmitry Demeshchuk >Priority: Minor > > It looks like right now custom filesystem classes have to be imported > explicitly: > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystems.py#L30 > Seems like the current implementation doesn't allow discovering filesystems > that come from side packages, not from apache_beam itself. Even if I put a > custom FileSystem-inheriting class into a package and explicitly import it in > the root __init__.py of that package, it still doesn't make the class > discoverable. > The problems I'm experiencing happen on Dataflow runner, while Direct runner > works just fine. 
Here's an example of Dataflow output: > {code} > (320418708fe777d7): Traceback (most recent call last): > File > "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line > 581, in do_work > work_executor.execute() > File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", > line 166, in execute > op.start() > File > "/usr/local/lib/python2.7/dist-packages/dataflow_worker/native_operations.py", > line 54, in start > self.output(windowed_value) > File "dataflow_worker/operations.py", line 138, in > dataflow_worker.operations.Operation.output > (dataflow_worker/operations.c:5768) > def output(self, windowed_value, output_index=0): > File "dataflow_worker/operations.py", line 139, in > dataflow_worker.operations.Operation.output > (dataflow_worker/operations.c:5654) > cython.cast(Receiver, > self.receivers[output_index]).receive(windowed_value) > File "dataflow_worker/operations.py", line 72, in > dataflow_worker.operations.ConsumerSet.receive > (dataflow_worker/operations.c:3506) > cython.cast(Operation, consumer).process(windowed_value) > File "dataflow_worker/operations.py", line 328, in > dataflow_worker.operations.DoOperation.process > (dataflow_worker/operations.c:11162) > with self.scoped_process_state: > File "dataflow_worker/operations.py", line 329, in > dataflow_worker.operations.DoOperation.process > (dataflow_worker/operations.c:6) > self.dofn_receiver.receive(o) > File "apache_beam/runners/common.py", line 382, in > apache_beam.runners.common.DoFnRunner.receive > (apache_beam/runners/common.c:10156) > self.process(windowed_value) > File "apache_beam/runners/common.py", line 390, in > apache_beam.runners.common.DoFnRunner.process > (apache_beam/runners/common.c:10458) >
[jira] [Commented] (BEAM-2573) Better filesystem discovery mechanism in Python SDK
[ https://issues.apache.org/jira/browse/BEAM-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081145#comment-16081145 ] Sourabh Bajaj commented on BEAM-2573: - I think the worker is specified in https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/internal/dependency.py#L81 so it is related to the version of the sdk being used. I might be wrong about needing the release if you're on master as I can run the following command and see logs about failing to import the dataflow package on head and success for the builtin plugins. {code:shell} python -m apache_beam.examples.wordcount --output $BUCKET/wc/output --project $PROJECT --staging_location $BUCKET/wc/staging --temp_location $BUCKET/wc/tmp --job_name "sourabhbajaj-wc-4" --runner DataflowRunner --sdk_location dist/apache-beam-2.1.0.dev0.tar.gz --beam_plugin dataflow.s3.aws.S3FS {code} > Better filesystem discovery mechanism in Python SDK > --- > > Key: BEAM-2573 > URL: https://issues.apache.org/jira/browse/BEAM-2573 > Project: Beam > Issue Type: Task > Components: runner-dataflow, sdk-py >Affects Versions: 2.0.0 >Reporter: Dmitry Demeshchuk >Priority: Minor > > It looks like right now custom filesystem classes have to be imported > explicitly: > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystems.py#L30 > Seems like the current implementation doesn't allow discovering filesystems > that come from side packages, not from apache_beam itself. Even if I put a > custom FileSystem-inheriting class into a package and explicitly import it in > the root __init__.py of that package, it still doesn't make the class > discoverable. > The problems I'm experiencing happen on Dataflow runner, while Direct runner > works just fine. 
Here's an example of Dataflow output: > {code} > (320418708fe777d7): Traceback (most recent call last): > File > "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line > 581, in do_work > work_executor.execute() > File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", > line 166, in execute > op.start() > File > "/usr/local/lib/python2.7/dist-packages/dataflow_worker/native_operations.py", > line 54, in start > self.output(windowed_value) > File "dataflow_worker/operations.py", line 138, in > dataflow_worker.operations.Operation.output > (dataflow_worker/operations.c:5768) > def output(self, windowed_value, output_index=0): > File "dataflow_worker/operations.py", line 139, in > dataflow_worker.operations.Operation.output > (dataflow_worker/operations.c:5654) > cython.cast(Receiver, > self.receivers[output_index]).receive(windowed_value) > File "dataflow_worker/operations.py", line 72, in > dataflow_worker.operations.ConsumerSet.receive > (dataflow_worker/operations.c:3506) > cython.cast(Operation, consumer).process(windowed_value) > File "dataflow_worker/operations.py", line 328, in > dataflow_worker.operations.DoOperation.process > (dataflow_worker/operations.c:11162) > with self.scoped_process_state: > File "dataflow_worker/operations.py", line 329, in > dataflow_worker.operations.DoOperation.process > (dataflow_worker/operations.c:6) > self.dofn_receiver.receive(o) > File "apache_beam/runners/common.py", line 382, in > apache_beam.runners.common.DoFnRunner.receive > (apache_beam/runners/common.c:10156) > self.process(windowed_value) > File "apache_beam/runners/common.py", line 390, in > apache_beam.runners.common.DoFnRunner.process > (apache_beam/runners/common.c:10458) > self._reraise_augmented(exn) > File "apache_beam/runners/common.py", line 431, in > apache_beam.runners.common.DoFnRunner._reraise_augmented > (apache_beam/runners/common.c:11673) > raise new_exn, None, original_traceback > File "apache_beam/runners/common.py", line 388, in > apache_beam.runners.common.DoFnRunner.process > (apache_beam/runners/common.c:10371) > self.do_fn_invoker.invoke_process(windowed_value) > File "apache_beam/runners/common.py", line 281, in > apache_beam.runners.common.PerWindowInvoker.invoke_process > (apache_beam/runners/common.c:8626) > self._invoke_per_window(windowed_value) > File "apache_beam/runners/common.py", line 307, in > apache_beam.runners.common.PerWindowInvoker._invoke_per_window > (apache_beam/runners/common.c:9302) > windowed_value, self.process_method(*args_for_process)) > File > "/Users/dmitrydemeshchuk/miniconda2/lib/python2.7/site-packages/apache_beam/transforms/core.py", > line 749, in > File > "/Users/dmitrydemeshchuk/miniconda2/lib/python2.7/site-packages/apache_beam/io/iobase.py", > line 891, in > File > "/usr/local/lib/python2.7/dist-packages/apache_b
[jira] [Commented] (BEAM-2573) Better filesystem discovery mechanism in Python SDK
[ https://issues.apache.org/jira/browse/BEAM-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080974#comment-16080974 ] Sourabh Bajaj commented on BEAM-2573: - I think the plugins are working correctly if they are passing a list of class names to be imported at the start. You might need to wait for the next release as this required a change to the dataflow workers as they need to start importing the paths specified in the beam-plugins list. There is a release going on right now so that might happen in the next few days itself. I am not sure about the crash loop in windmillio [~charleschen] might know more. > Better filesystem discovery mechanism in Python SDK > --- > > Key: BEAM-2573 > URL: https://issues.apache.org/jira/browse/BEAM-2573 > Project: Beam > Issue Type: Task > Components: runner-dataflow, sdk-py >Affects Versions: 2.0.0 >Reporter: Dmitry Demeshchuk >Priority: Minor > > It looks like right now custom filesystem classes have to be imported > explicitly: > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystems.py#L30 > Seems like the current implementation doesn't allow discovering filesystems > that come from side packages, not from apache_beam itself. Even if I put a > custom FileSystem-inheriting class into a package and explicitly import it in > the root __init__.py of that package, it still doesn't make the class > discoverable. > The problems I'm experiencing happen on Dataflow runner, while Direct runner > works just fine. Here's an example of Dataflow output: > {code} > (320418708fe777d7): Traceback (most recent call last): > File > "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line > 581, in do_work > work_executor.execute() > File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", > line 166, in execute > op.start() > File > "/usr/local/lib/python2.7/dist-packages/dataflow_worker/native_operations.py", > line 54, in start > self.output(windowed_value) > File "dataflow_worker/operations.py", line 138, in > dataflow_worker.operations.Operation.output > (dataflow_worker/operations.c:5768) > def output(self, windowed_value, output_index=0): > File "dataflow_worker/operations.py", line 139, in > dataflow_worker.operations.Operation.output > (dataflow_worker/operations.c:5654) > cython.cast(Receiver, > self.receivers[output_index]).receive(windowed_value) > File "dataflow_worker/operations.py", line 72, in > dataflow_worker.operations.ConsumerSet.receive > (dataflow_worker/operations.c:3506) > cython.cast(Operation, consumer).process(windowed_value) > File "dataflow_worker/operations.py", line 328, in > dataflow_worker.operations.DoOperation.process > (dataflow_worker/operations.c:11162) > with self.scoped_process_state: > File "dataflow_worker/operations.py", line 329, in > dataflow_worker.operations.DoOperation.process > (dataflow_worker/operations.c:6) > self.dofn_receiver.receive(o) > File "apache_beam/runners/common.py", line 382, in > apache_beam.runners.common.DoFnRunner.receive > (apache_beam/runners/common.c:10156) > self.process(windowed_value) > File "apache_beam/runners/common.py", line 390, in > apache_beam.runners.common.DoFnRunner.process > (apache_beam/runners/common.c:10458) > self._reraise_augmented(exn) > File "apache_beam/runners/common.py", line 431, in > apache_beam.runners.common.DoFnRunner._reraise_augmented > (apache_beam/runners/common.c:11673) > raise new_exn, None, original_traceback > File "apache_beam/runners/common.py", line 
388, in > apache_beam.runners.common.DoFnRunner.process > (apache_beam/runners/common.c:10371) > self.do_fn_invoker.invoke_process(windowed_value) > File "apache_beam/runners/common.py", line 281, in > apache_beam.runners.common.PerWindowInvoker.invoke_process > (apache_beam/runners/common.c:8626) > self._invoke_per_window(windowed_value) > File "apache_beam/runners/common.py", line 307, in > apache_beam.runners.common.PerWindowInvoker._invoke_per_window > (apache_beam/runners/common.c:9302) > windowed_value, self.process_method(*args_for_process)) > File > "/Users/dmitrydemeshchuk/miniconda2/lib/python2.7/site-packages/apache_beam/transforms/core.py", > line 749, in > File > "/Users/dmitrydemeshchuk/miniconda2/lib/python2.7/site-packages/apache_beam/io/iobase.py", > line 891, in > File > "/usr/local/lib/python2.7/dist-packages/apache_beam/options/value_provider.py", > line 109, in _f > return fnc(self, *args, **kwargs) > File > "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filebasedsink.py", > line 146, in initialize_write > tmp_dir = self._create_temp_dir(file_path_prefix) >
[jira] [Commented] (BEAM-2573) Better filesystem discovery mechanism in Python SDK
[ https://issues.apache.org/jira/browse/BEAM-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16079848#comment-16079848 ] Sourabh Bajaj commented on BEAM-2573: - [~chamikara] The issue with the register approach is that it needs to run in the python process of the worker so something from pipeline construction needs to tell the worker process to import these files and we use beam plugins as a transport mechanism for that. If you're doing register in the pipeline construction you'll have to import the class anyway so that kind of makes the registration pointless as the import itself would make the FS available to the subclasses module. [~demeshchuk] Can you try importing dataflow in your pipeline submission process and see if that makes the plugin available. You can just look at the pipeline options to see if it worked correctly as it should try to pass the full S3FS path in the --beam-plugins option. Ideally I think the best way to do this would have been changing the Path to be FS specific and the ReadFromText would take in a specialized S3Path object but that is a really large breaking change so we have to live with this. > Better filesystem discovery mechanism in Python SDK > --- > > Key: BEAM-2573 > URL: https://issues.apache.org/jira/browse/BEAM-2573 > Project: Beam > Issue Type: Task > Components: runner-dataflow, sdk-py >Affects Versions: 2.0.0 >Reporter: Dmitry Demeshchuk >Priority: Minor > > It looks like right now custom filesystem classes have to be imported > explicitly: > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystems.py#L30 > Seems like the current implementation doesn't allow discovering filesystems > that come from side packages, not from apache_beam itself. Even if I put a > custom FileSystem-inheriting class into a package and explicitly import it in > the root __init__.py of that package, it still doesn't make the class > discoverable. > The problems I'm experiencing happen on Dataflow runner, while Direct runner > works just fine. 
Here's an example of Dataflow output: > {code} > (320418708fe777d7): Traceback (most recent call last): > File > "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line > 581, in do_work > work_executor.execute() > File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", > line 166, in execute > op.start() > File > "/usr/local/lib/python2.7/dist-packages/dataflow_worker/native_operations.py", > line 54, in start > self.output(windowed_value) > File "dataflow_worker/operations.py", line 138, in > dataflow_worker.operations.Operation.output > (dataflow_worker/operations.c:5768) > def output(self, windowed_value, output_index=0): > File "dataflow_worker/operations.py", line 139, in > dataflow_worker.operations.Operation.output > (dataflow_worker/operations.c:5654) > cython.cast(Receiver, > self.receivers[output_index]).receive(windowed_value) > File "dataflow_worker/operations.py", line 72, in > dataflow_worker.operations.ConsumerSet.receive > (dataflow_worker/operations.c:3506) > cython.cast(Operation, consumer).process(windowed_value) > File "dataflow_worker/operations.py", line 328, in > dataflow_worker.operations.DoOperation.process > (dataflow_worker/operations.c:11162) > with self.scoped_process_state: > File "dataflow_worker/operations.py", line 329, in > dataflow_worker.operations.DoOperation.process > (dataflow_worker/operations.c:6) > self.dofn_receiver.receive(o) > File "apache_beam/runners/common.py", line 382, in > apache_beam.runners.common.DoFnRunner.receive > (apache_beam/runners/common.c:10156) > self.process(windowed_value) > File "apache_beam/runners/common.py", line 390, in > apache_beam.runners.common.DoFnRunner.process > (apache_beam/runners/common.c:10458) > self._reraise_augmented(exn) > File "apache_beam/runners/common.py", line 431, in > apache_beam.runners.common.DoFnRunner._reraise_augmented > (apache_beam/runners/common.c:11673) > raise new_exn, None, original_traceback > File "apache_beam/runners/common.py", line 388, in > apache_beam.runners.common.DoFnRunner.process > (apache_beam/runners/common.c:10371) > self.do_fn_invoker.invoke_process(windowed_value) > File "apache_beam/runners/common.py", line 281, in > apache_beam.runners.common.PerWindowInvoker.invoke_process > (apache_beam/runners/common.c:8626) > self._invoke_per_window(windowed_value) > File "apache_beam/runners/common.py", line 307, in > apache_beam.runners.common.PerWindowInvoker._invoke_per_window > (apache_beam/runners/common.c:9302) > windowed_value, self.process_method(*args_for_process)) > File > "/Users/dmitrydemeshchuk/minicon
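A small illustration of the point about imports and subclass discovery: with only the builtin local filesystem imported, the lookup below finds just the 'file' scheme (output may vary with what else is installed):
{code:python}
from apache_beam.io.filesystem import FileSystem
# Importing a module that defines a FileSystem subclass is what registers it:
from apache_beam.io.localfilesystem import LocalFileSystem  # noqa: F401

# Discovery is subclass-based, so no explicit register() call is needed.
print([fs.scheme() for fs in FileSystem.__subclasses__()])  # e.g. ['file']
{code}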
[jira] [Commented] (BEAM-2572) Implement an S3 filesystem for Python SDK
[ https://issues.apache.org/jira/browse/BEAM-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16079846#comment-16079846 ] Sourabh Bajaj commented on BEAM-2572: - +1 to not changing the signature of the ReadFromText. > Implement an S3 filesystem for Python SDK > - > > Key: BEAM-2572 > URL: https://issues.apache.org/jira/browse/BEAM-2572 > Project: Beam > Issue Type: Task > Components: sdk-py >Reporter: Dmitry Demeshchuk >Assignee: Ahmet Altay >Priority: Minor > > There are two paths worth exploring, to my understanding: > 1. Sticking to the HDFS-based approach (like it's done in Java). > 2. Using boto/boto3 for accessing S3 through its common API endpoints. > I personally prefer the second approach, for a few reasons: > 1. In real life, HDFS and S3 have different consistency guarantees, therefore > their behaviors may contradict each other in some edge cases (say, we write > something to S3, but it's not immediately accessible for reading from another > end). > 2. There are other AWS-based sources and sinks we may want to create in the > future: DynamoDB, Kinesis, SQS, etc. > 3. boto3 already provides somewhat good logic for basic things like > reattempting. > Whatever path we choose, there's another problem related to this: we > currently cannot pass any global settings (say, pipeline options, or just an > arbitrary kwarg) to a filesystem. Because of that, we'd have to setup the > runner nodes to have AWS keys set up in the environment, which is not trivial > to achieve and doesn't look too clean either (I'd rather see one single place > for configuring the runner options). > Also, it's worth mentioning that I already have a janky S3 filesystem > implementation that only supports DirectRunner at the moment (because of the > previous paragraph). I'm perfectly fine finishing it myself, with some > guidance from the maintainers. > Where should I move on from here, and whose input should I be looking for? > Thanks! -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (BEAM-2437) quickstart.py docs is missing the path to MANIFEST.in
[ https://issues.apache.org/jira/browse/BEAM-2437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063939#comment-16063939 ] Sourabh Bajaj commented on BEAM-2437: - This can be closed now. > quickstart.py docs is missing the path to MANIFEST.in > - > > Key: BEAM-2437 > URL: https://issues.apache.org/jira/browse/BEAM-2437 > Project: Beam > Issue Type: Bug > Components: sdk-py >Reporter: Jonathan Bingham >Assignee: Sourabh Bajaj >Priority: Minor > Labels: easyfix > Fix For: 2.1.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > SUMMARY > The wordcount example in quickstart-py does not work with the sample code > without modification. > OBSERVED > Copy-pasting from the doc page doesn't work: > python -m apache_beam.examples.wordcount --input MANIFEST.in --output counts > Error message: IOError: No files found based on the file pattern MANIFEST.in > EXPECTED > The example tells me to set the path to MANIFEST.in, or gives a pseudo-path > that I can substitute in the right path prefix. > python -m apache_beam.examples.wordcount --input > /[path-to-git-clone-dir]/beam/sdks/python/MANIFEST.in --output counts -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (BEAM-2437) quickstart.py docs is missing the path to MANIFEST.in
[ https://issues.apache.org/jira/browse/BEAM-2437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj reassigned BEAM-2437: --- Assignee: Sourabh Bajaj (was: Ahmet Altay) > quickstart.py docs is missing the path to MANIFEST.in > - > > Key: BEAM-2437 > URL: https://issues.apache.org/jira/browse/BEAM-2437 > Project: Beam > Issue Type: Bug > Components: sdk-py >Reporter: Jonathan Bingham >Assignee: Sourabh Bajaj >Priority: Minor > Labels: easyfix > Fix For: 2.1.0 > > Original Estimate: 1h > Remaining Estimate: 1h > > SUMMARY > The wordcount example in quickstart-py does not work with the sample code > without modification. > OBSERVED > Copy-pasting from the doc page doesn't work: > python -m apache_beam.examples.wordcount --input MANIFEST.in --output counts > Error message: IOError: No files found based on the file pattern MANIFEST.in > EXPECTED > The example tells me to set the path to MANIFEST.in, or gives a pseudo-path > that I can substitute in the right path prefix. > python -m apache_beam.examples.wordcount --input > /[path-to-git-clone-dir]/beam/sdks/python/MANIFEST.in --output counts -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (BEAM-1585) Ability to add new file systems to beamFS in the python sdk
[ https://issues.apache.org/jira/browse/BEAM-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj resolved BEAM-1585. - Resolution: Fixed With the BeamPlugin system, we can now add additional filesystems and other plugins. > Ability to add new file systems to beamFS in the python sdk > --- > > Key: BEAM-1585 > URL: https://issues.apache.org/jira/browse/BEAM-1585 > Project: Beam > Issue Type: New Feature > Components: sdk-py >Reporter: Sourabh Bajaj >Assignee: Sourabh Bajaj > Fix For: Not applicable > > > BEAM-1441 implements the new BeamFileSystem in the Python SDK but currently > lacks the ability to add user-implemented file systems. > This needs to be executed in the worker, so it should be packaged correctly with > the pipeline code. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
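A minimal sketch of the mechanism this resolution refers to, assuming the base class lives at apache_beam.utils.plugin.BeamPlugin (FileSystem extends it, which is how custom filesystems get picked up):
{code:python}
from apache_beam.utils.plugin import BeamPlugin

# Any subclass can be listed in the beam_plugins pipeline option and will be
# imported at worker startup.
class MyPlugin(BeamPlugin):  # hypothetical extension point
    pass
{code}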
[jira] [Resolved] (BEAM-2405) Create a BigQuery sink for streaming using PTransform
[ https://issues.apache.org/jira/browse/BEAM-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj resolved BEAM-2405. - Resolution: Fixed Fix Version/s: 2.1.0 > Create a BigQuery sink for streaming using PTransform > - > > Key: BEAM-2405 > URL: https://issues.apache.org/jira/browse/BEAM-2405 > Project: Beam > Issue Type: Bug > Components: sdk-py >Reporter: Sourabh Bajaj >Assignee: Sourabh Bajaj > Fix For: 2.1.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (BEAM-2419) Refactor dependency.py as the types are inconsistently passed in the filecopy code
Sourabh Bajaj created BEAM-2419: --- Summary: Refactor dependency.py as the types are inconsistently passed in the filecopy code Key: BEAM-2419 URL: https://issues.apache.org/jira/browse/BEAM-2419 Project: Beam Issue Type: Improvement Components: sdk-py Reporter: Sourabh Bajaj -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (BEAM-2405) Create a BigQuery sink for streaming using PTransform
Sourabh Bajaj created BEAM-2405: --- Summary: Create a BigQuery sink for streaming using PTransform Key: BEAM-2405 URL: https://issues.apache.org/jira/browse/BEAM-2405 Project: Beam Issue Type: Bug Components: sdk-py Reporter: Sourabh Bajaj Assignee: Sourabh Bajaj -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-2338) GCS filepattern wildcard broken in Python SDK
[ https://issues.apache.org/jira/browse/BEAM-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16022292#comment-16022292 ] Sourabh Bajaj commented on BEAM-2338: - This can be resolved now. > GCS filepattern wildcard broken in Python SDK > - > > Key: BEAM-2338 > URL: https://issues.apache.org/jira/browse/BEAM-2338 > Project: Beam > Issue Type: Bug > Components: beam-model >Affects Versions: 2.0.0 >Reporter: Vilhelm von Ehrenheim >Assignee: Sourabh Bajaj > Fix For: 2.1.0 > > > Validation of file patterns containing wildcard (`*`) in GCS directories does > not always work. > Some kinds of patterns generates an error from here during validation: > https://github.com/apache/beam/blob/v2.0.0/sdks/python/apache_beam/io/filebasedsource.py#L168 > I've tried a few different FileSystems match commands which confuses be a bit. > Full path works: > {noformat} > >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF'], > >>> limits=[1])[0].metadata_list > [FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF, > 74721736)] > {noformat} > Glob star on directory does not > {noformat} > >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/*/LC80440342016259LGN00_B1.TIF'], > >>> limits=[1])[0].metadata_list > [] > {noformat} > If adding a star on the file level only searching for TIF files it works (all > tough we match a different file but that is fine) > {noformat} > >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/*/*.TIF'], > >>> limits=[1])[0].metadata_list > [FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342013106LGN01/LC80440342013106LGN01_B1.TIF, > 65862791)] > {noformat} > Ok, Here comes the even more strange case. > Looking for the same file we found with the patterns that but with a star on > the dir we find it!! > {noformat} > >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/*/LC80440342013106LGN01_B1.TIF'], > >>> limits=[1])[0].metadata_list > [FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342013106LGN01/LC80440342013106LGN01_B1.TIF, > 65862791)] > {noformat} > Also looking at the first case again we will match if the star is placed late > enough in the pattern to make the directory unique. > {noformat} > >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN*/LC80440342016259LGN00_B1.TIF'], > >>> limits=[1])[0].metadata_list > [FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF, > 74721736)] > {noformat} > but not if further up in the name > {noformat} > >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/LC8044034201*/LC80440342016259LGN00_B1.TIF'], > >>> limits=[1])[0].metadata_list > [] > {noformat} > My guess is that some folders are dropped from the list of matched > directories or something which is a bit concerning. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-2338) GCS filepattern wildcard broken in Python SDK
[ https://issues.apache.org/jira/browse/BEAM-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16021677#comment-16021677 ] Sourabh Bajaj commented on BEAM-2338: - You can work-around this issue by passing "validate=False" to the source and that would prevent the pipeline from hitting this bug > GCS filepattern wildcard broken in Python SDK > - > > Key: BEAM-2338 > URL: https://issues.apache.org/jira/browse/BEAM-2338 > Project: Beam > Issue Type: Bug > Components: beam-model >Affects Versions: 2.0.0 >Reporter: Vilhelm von Ehrenheim >Assignee: Sourabh Bajaj > Fix For: 2.1.0 > > > Validation of file patterns containing wildcard (`*`) in GCS directories does > not always work. > Some kinds of patterns generates an error from here during validation: > https://github.com/apache/beam/blob/v2.0.0/sdks/python/apache_beam/io/filebasedsource.py#L168 > I've tried a few different FileSystems match commands which confuses be a bit. > Full path works: > {noformat} > >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF'], > >>> limits=[1])[0].metadata_list > [FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF, > 74721736)] > {noformat} > Glob star on directory does not > {noformat} > >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/*/LC80440342016259LGN00_B1.TIF'], > >>> limits=[1])[0].metadata_list > [] > {noformat} > If adding a star on the file level only searching for TIF files it works (all > tough we match a different file but that is fine) > {noformat} > >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/*/*.TIF'], > >>> limits=[1])[0].metadata_list > [FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342013106LGN01/LC80440342013106LGN01_B1.TIF, > 65862791)] > {noformat} > Ok, Here comes the even more strange case. > Looking for the same file we found with the patterns that but with a star on > the dir we find it!! > {noformat} > >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/*/LC80440342013106LGN01_B1.TIF'], > >>> limits=[1])[0].metadata_list > [FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342013106LGN01/LC80440342013106LGN01_B1.TIF, > 65862791)] > {noformat} > Also looking at the first case again we will match if the star is placed late > enough in the pattern to make the directory unique. > {noformat} > >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN*/LC80440342016259LGN00_B1.TIF'], > >>> limits=[1])[0].metadata_list > [FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF, > 74721736)] > {noformat} > but not if further up in the name > {noformat} > >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/LC8044034201*/LC80440342016259LGN00_B1.TIF'], > >>> limits=[1])[0].metadata_list > [] > {noformat} > My guess is that some folders are dropped from the list of matched > directories or something which is a bit concerning. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
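A hedged sketch of that workaround; the bucket and pattern are placeholders:
{code:python}
import apache_beam as beam

p = beam.Pipeline()
# validate=False skips the construction-time glob match that trips this bug;
# files are still matched and read at execution time.
lines = p | beam.io.ReadFromText('gs://my-bucket/dir-*/part-*', validate=False)
{code}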
[jira] [Commented] (BEAM-1833) Restructure Python pipeline construction to better follow the Runner API
[ https://issues.apache.org/jira/browse/BEAM-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16020075#comment-16020075 ] Sourabh Bajaj commented on BEAM-1833: - Unassigning this as it is not a blocker for the Runner API > Restructure Python pipeline construction to better follow the Runner API > > > Key: BEAM-1833 > URL: https://issues.apache.org/jira/browse/BEAM-1833 > Project: Beam > Issue Type: Improvement > Components: sdk-py >Reporter: Robert Bradshaw >Assignee: Sourabh Bajaj > > The most important part is removing the runner.apply overrides, but there are > also various other improvements (e.g. all inputs and outputs should be named). -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (BEAM-1833) Restructure Python pipeline construction to better follow the Runner API
[ https://issues.apache.org/jira/browse/BEAM-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj reassigned BEAM-1833: --- Assignee: (was: Sourabh Bajaj) > Restructure Python pipeline construction to better follow the Runner API > > > Key: BEAM-1833 > URL: https://issues.apache.org/jira/browse/BEAM-1833 > Project: Beam > Issue Type: Improvement > Components: sdk-py >Reporter: Robert Bradshaw > > The most important part is removing the runner.apply overrides, but there are > also various other improvements (e.g. all inputs and outputs should be named). -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (BEAM-2250) Remove FnHarness code from PyDocs
[ https://issues.apache.org/jira/browse/BEAM-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj resolved BEAM-2250. - Resolution: Fixed Assignee: Sourabh Bajaj (was: Ahmet Altay) > Remove FnHarness code from PyDocs > - > > Key: BEAM-2250 > URL: https://issues.apache.org/jira/browse/BEAM-2250 > Project: Beam > Issue Type: Improvement > Components: sdk-py >Reporter: Sourabh Bajaj >Assignee: Sourabh Bajaj > Fix For: 2.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (BEAM-2250) Remove FnHarness code from PyDocs
Sourabh Bajaj created BEAM-2250: --- Summary: Remove FnHarness code from PyDocs Key: BEAM-2250 URL: https://issues.apache.org/jira/browse/BEAM-2250 Project: Beam Issue Type: Improvement Components: sdk-py Reporter: Sourabh Bajaj Assignee: Ahmet Altay Fix For: 2.0.0 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1283) DoFn finishBundle should be required to specify the window for output
[ https://issues.apache.org/jira/browse/BEAM-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16003119#comment-16003119 ] Sourabh Bajaj commented on BEAM-1283: - This can be closed now. > DoFn finishBundle should be required to specify the window for output > - > > Key: BEAM-1283 > URL: https://issues.apache.org/jira/browse/BEAM-1283 > Project: Beam > Issue Type: Bug > Components: beam-model, sdk-java-core, sdk-py >Reporter: Kenneth Knowles >Assignee: Sourabh Bajaj > Labels: backward-incompatible > Fix For: 2.0.0 > > > The spec is here in Javadoc: > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/DoFn.java#L128 > "If invoked from {{@StartBundle}} or {{@FinishBundle}}, this will attempt to > use the {{WindowFn}} of the input {{PCollection}} to determine what windows > the element should be in, throwing an exception if the {{WindowFn}} attempts > to access any information about the input element. The output element will > have a timestamp of negative infinity." > This is a collection of caveats that make this method not always technically > wrong, but quite a mess. Ideas that reasonable folks have suggested lately: > - The {{WindowFn}} cannot actually be applied because {{WindowFn}} is > allowed to see the element type. The spec just avoids this by limiting which > {{WindowFn}} can be used. > - There is no natural output timestamp, so it should always be provided. The > spec avoids this by specifying an arbitrary and fairly useless timestamp. > - If it is a merging {{WindowFn}} like sessions that has already been merged > then you'll just have a bogus proto window regardless of explicit timestamp > or not. > The use cases for these methods are best addressed by state plus window > expiry callback, so we should revisit this spec and probably just wipe it. > There are some rare cases where you might need to output from {{FinishBundle}} > in a way that is not _actually_ sensitive to bundling (perhaps modulo some > downstream notion of equivalence) in which case you had better know what > window you are outputting to. Often it should be the global window. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
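As a hedged sketch of what "specify the window for output" looks like on the Python side: a DoFn that buffers in process and emits explicitly windowed, explicitly timestamped values from finish_bundle, so none of the defaults discussed above apply. Module paths follow the Python SDK layout, but treat the details as illustrative.
{code}
import apache_beam as beam
from apache_beam.transforms.window import GlobalWindow
from apache_beam.utils.timestamp import MIN_TIMESTAMP
from apache_beam.utils.windowed_value import WindowedValue


class BufferingDoFn(beam.DoFn):
    def start_bundle(self):
        self._buffer = []

    def process(self, element):
        self._buffer.append(element)

    def finish_bundle(self):
        for element in self._buffer:
            # Pick the window and timestamp explicitly instead of
            # inheriting the negative-infinity default from the spec.
            yield WindowedValue(element, MIN_TIMESTAMP, [GlobalWindow()])
{code}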
[jira] [Created] (BEAM-2206) Move pipeline options into separate package from beam/utils
Sourabh Bajaj created BEAM-2206: --- Summary: Move pipeline options into separate package from beam/utils Key: BEAM-2206 URL: https://issues.apache.org/jira/browse/BEAM-2206 Project: Beam Issue Type: Improvement Components: sdk-py Reporter: Sourabh Bajaj Assignee: Sourabh Bajaj Fix For: 2.0.0 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
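For context, a sketch of what the move means for user code; the post-move module path shown is the one the Python SDK ended up with, and the flag is illustrative.
{code}
# Old location, under beam/utils (pre-move):
# from apache_beam.utils.pipeline_options import PipelineOptions

# New location, in a dedicated options package:
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(['--runner=DirectRunner'])
{code}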
[jira] [Commented] (BEAM-2189) ImportError get_filesystem
[ https://issues.apache.org/jira/browse/BEAM-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15999311#comment-15999311 ] Sourabh Bajaj commented on BEAM-2189: - This can be closed now > ImportError get_filesystem > -- > > Key: BEAM-2189 > URL: https://issues.apache.org/jira/browse/BEAM-2189 > Project: Beam > Issue Type: Bug > Components: sdk-py >Reporter: Ahmet Altay >Assignee: Sourabh Bajaj >Priority: Blocker > Fix For: 2.0.0 > > > Dataflow tests are broken after https://github.com/apache/beam/pull/2807 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-2143) (Mis)Running Dataflow Wordcount gives non-helpful errors
[ https://issues.apache.org/jira/browse/BEAM-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15997542#comment-15997542 ] Sourabh Bajaj commented on BEAM-2143: - Python is giving the right error here. [~vikasrk] do you want to take a stab at the Java part? > (Mis)Running Dataflow Wordcount gives non-helpful errors > > > Key: BEAM-2143 > URL: https://issues.apache.org/jira/browse/BEAM-2143 > Project: Beam > Issue Type: Bug > Components: runner-dataflow, sdk-java-gcp >Reporter: Ben Chambers >Assignee: Sourabh Bajaj >Priority: Blocker > Fix For: First stable release > > > If you run a pipeline and forget to specify `tempLocation` (but specify > something else, such as `stagingLocation`) you get two messages indicating > you forgot to specify `stagingLocation`. > One says "no stagingLocation specified, choosing ..." the other says "error, > the staging location isn't readable" (if you give it just a bucket and not an > object within a bucket). > This is surprising to me as a user, since (1) I specified a staging location > and (2) the flag I actually need to modify is `--tempLocation`. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
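To make the user-side fix concrete, a hedged sketch of spelling out both locations explicitly; the bucket and project names are made up, and the Python SDK spells Java's --tempLocation as --temp_location.
{code}
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=my-project',
    # The flag the error is actually about:
    '--temp_location=gs://my-bucket/temp',
    '--staging_location=gs://my-bucket/staging',
])
{code}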
[jira] [Updated] (BEAM-2143) (Mis)Running Dataflow Wordcount gives non-helpful errors
[ https://issues.apache.org/jira/browse/BEAM-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj updated BEAM-2143: Component/s: sdk-java-gcp > (Mis)Running Dataflow Wordcount gives non-helpful errors > > > Key: BEAM-2143 > URL: https://issues.apache.org/jira/browse/BEAM-2143 > Project: Beam > Issue Type: Bug > Components: runner-dataflow, sdk-java-gcp >Reporter: Ben Chambers >Assignee: Sourabh Bajaj >Priority: Blocker > Fix For: First stable release > > > If you run a pipeline and forget to specify `tempLocation` (but specify > something else, such as `stagingLocation`) you get two messages indicating > you forgot to specify `stagingLocation`. > One says "no stagingLocation specified, choosing ..." the other says "error, > the staging location isn't readable" (if you give it just a bucket and not an > object within a bucket). > This is surprising to me as a user, since (1) I specified a staging location > and (2) the flag I actually need to modify is `--tempLocation`. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-2143) (Mis)Running Dataflow Wordcount gives non-helpful errors
[ https://issues.apache.org/jira/browse/BEAM-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15997488#comment-15997488 ] Sourabh Bajaj commented on BEAM-2143: - [~bchambers] do you see this in both sdks? > (Mis)Running Dataflow Wordcount gives non-helpful errors > > > Key: BEAM-2143 > URL: https://issues.apache.org/jira/browse/BEAM-2143 > Project: Beam > Issue Type: Bug > Components: runner-dataflow >Reporter: Ben Chambers >Assignee: Sourabh Bajaj >Priority: Blocker > Fix For: First stable release > > > If you run a pipeline and forget to specify `tempLocation` (but specify > something else, such as `stagingLocation`) you get two messages indicating > you forgot to specify `stagingLocation`. > One says "no stagingLocation specified, choosing ..." the other says "error, > the staging location isn't readable" (if you give it just a bucket and not an > object within a bucket). > This is surprising to me as a user, since (1) I specified a staging location > and (2) the flag I actually need to modify is `--tempLocation`. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (BEAM-2163) Remove the dependency on examples from ptransform_test
[ https://issues.apache.org/jira/browse/BEAM-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj reassigned BEAM-2163: --- Assignee: (was: Ahmet Altay) > Remove the dependency on examples from ptransform_test > -- > > Key: BEAM-2163 > URL: https://issues.apache.org/jira/browse/BEAM-2163 > Project: Beam > Issue Type: Improvement > Components: sdk-py >Reporter: Sourabh Bajaj > Labels: newbie > > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/ptransform_test.py#L176 > This ValidatesRunner test depends on the Counting source snippet; the source > should be copied here instead of keeping this dependency. > The actual Beam code should not depend on the examples package at all. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (BEAM-2163) Remove the dependency on examples from ptransform_test
Sourabh Bajaj created BEAM-2163: --- Summary: Remove the dependency on examples from ptransform_test Key: BEAM-2163 URL: https://issues.apache.org/jira/browse/BEAM-2163 Project: Beam Issue Type: Improvement Components: sdk-py Reporter: Sourabh Bajaj Assignee: Ahmet Altay https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/ptransform_test.py#L176 This ValidatesRunner test depends on the Counting source snippet; the source should be copied here instead of keeping this dependency. The actual Beam code should not depend on the examples package at all. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (BEAM-2152) Authentication fails if there is an unauthenticated gcloud tool even if application default credentials are available
[ https://issues.apache.org/jira/browse/BEAM-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj reassigned BEAM-2152: --- Assignee: Sourabh Bajaj Fix Version/s: First stable release > Authentication fails if there is an unauthenticated gcloud tool even if > application default credentials are available > - > > Key: BEAM-2152 > URL: https://issues.apache.org/jira/browse/BEAM-2152 > Project: Beam > Issue Type: Bug > Components: sdk-py >Reporter: Ahmet Altay >Assignee: Sourabh Bajaj > Fix For: First stable release > > > On a machine that has valid application default credentials, authentication > works if the {{gcloud}} tool is not installed. If the {{gcloud}} tool was recently > installed but has not been authenticated yet, authentication fails with {{You do > not currently have an active account selected.}} > The authentication code should fall back to the default method in this case. (Or the > {{gcloud}}-based authentication needs to be fully removed.) > cc: [~lcwik] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
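A sketch of the fallback being asked for, using the google-auth library's Application Default Credentials lookup instead of shelling out to gcloud; this illustrates the idea rather than the eventual patch.
{code}
import google.auth

# Application Default Credentials resolve from the environment
# (GOOGLE_APPLICATION_CREDENTIALS, the GCE metadata server, etc.)
# and do not require the gcloud CLI to be installed or logged in.
credentials, project_id = google.auth.default()
{code}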
[jira] [Updated] (BEAM-1954) "test" extra needs nose in the requirements list
[ https://issues.apache.org/jira/browse/BEAM-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj updated BEAM-1954: Labels: newbie (was: ) > "test" extra needs nose in the requirements list > --- > > Key: BEAM-1954 > URL: https://issues.apache.org/jira/browse/BEAM-1954 > Project: Beam > Issue Type: Bug > Components: sdk-py >Affects Versions: 0.6.0 >Reporter: Marc Boissonneault >Assignee: Ahmet Altay >Priority: Minor > Labels: newbie > > When installing Beam with pip: > {code} > pip install apache-beam[gcp,test]==0.6.0 > {code} > And trying to use the TestPipeline: > {code} > from apache_beam.test_pipeline import TestPipeline > {code} > We receive the following error: > {panel} > from nose.plugins.skip import SkipTest > ImportError: No module named nose.plugins.skip > {panel} > Once nose is installed, everything works as expected. > See: https://github.com/apache/beam/blob/master/sdks/python/setup.py#L100-L102 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
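The fix amounts to declaring nose in the test extra. A minimal setup.py sketch of the idea; the package name and version pin are illustrative, not Beam's actual values.
{code}
from setuptools import setup

setup(
    name='example-package',
    extras_require={
        # 'pip install example-package[test]' now pulls in nose too.
        'test': ['nose>=1.3.7'],
    },
)
{code}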
[jira] [Updated] (BEAM-1316) DoFn#startBundle should not be able to output
[ https://issues.apache.org/jira/browse/BEAM-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj updated BEAM-1316: Component/s: (was: sdk-py) > DoFn#startBundle should not be able to output > - > > Key: BEAM-1316 > URL: https://issues.apache.org/jira/browse/BEAM-1316 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Thomas Groh >Assignee: Thomas Groh > Fix For: First stable release > > > While within startBundle and finishBundle, the window in which elements are > output is not generally defined. Elements must always be output from within a > windowed context, or the {{WindowFn}} used by the {{PCollection}} may not > operate appropriately. > startBundle and finishBundle are suitable for operational duties, similarly > to {{setup}} and {{teardown}}, but within the scope of some collection of > input elements. This includes actions such as clearing field state within a > DoFn and ensuring all live RPCs complete successfully before committing > inputs. > Sometimes it might be reasonable to output from {{@FinishBundle}} but it is > hard to imagine a situation where output from {{@StartBundle}} is useful in a > way that doesn't seriously abuse things. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1316) DoFn#startBundle should not be able to output
[ https://issues.apache.org/jira/browse/BEAM-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15993437#comment-15993437 ] Sourabh Bajaj commented on BEAM-1316: - [~altay] done. > DoFn#startBundle should not be able to output > - > > Key: BEAM-1316 > URL: https://issues.apache.org/jira/browse/BEAM-1316 > Project: Beam > Issue Type: Bug > Components: sdk-java-core >Reporter: Thomas Groh >Assignee: Thomas Groh > Fix For: First stable release > > > While within startBundle and finishBundle, the window in which elements are > output is not generally defined. Elements must always be output from within a > windowed context, or the {{WindowFn}} used by the {{PCollection}} may not > operate appropriately. > startBundle and finishBundle are suitable for operational duties, similarly > to {{setup}} and {{teardown}}, but within the scope of some collection of > input elements. This includes actions such as clearing field state within a > DoFn and ensuring all live RPCs complete successfully before committing > inputs. > Sometimes it might be reasonable to output from {{@FinishBundle}} but it is > hard to imagine a situation where output from {{@StartBundle}} is useful in a > way that doesn't seriously abuse things. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (BEAM-1316) DoFn#startBundle should not be able to output
[ https://issues.apache.org/jira/browse/BEAM-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj updated BEAM-1316: Component/s: sdk-py > DoFn#startBundle should not be able to output > - > > Key: BEAM-1316 > URL: https://issues.apache.org/jira/browse/BEAM-1316 > Project: Beam > Issue Type: Bug > Components: sdk-java-core, sdk-py >Reporter: Thomas Groh >Assignee: Thomas Groh > Fix For: First stable release > > > While within startBundle and finishBundle, the window in which elements are > output is not generally defined. Elements must always be output from within a > windowed context, or the {{WindowFn}} used by the {{PCollection}} may not > operate appropriately. > startBundle and finishBundle are suitable for operational duties, similarly > to {{setup}} and {{teardown}}, but within the scope of some collection of > input elements. This includes actions such as clearing field state within a > DoFn and ensuring all live RPCs complete successfully before committing > inputs. > Sometimes it might be reasonable to output from {{@FinishBundle}} but it is > hard to imagine a situation where output from {{@StartBundle}} is useful in a > way that doesn't seriously abuse things. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1989) clean SyntaxWarning
[ https://issues.apache.org/jira/browse/BEAM-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984118#comment-15984118 ] Sourabh Bajaj commented on BEAM-1989: - This can be closed now. > clean SyntaxWarning > --- > > Key: BEAM-1989 > URL: https://issues.apache.org/jira/browse/BEAM-1989 > Project: Beam > Issue Type: Bug > Components: sdk-py >Reporter: Ahmet Altay >Assignee: Sourabh Bajaj >Priority: Minor > > apache_beam/io/gcp/bigquery.py:326: SyntaxWarning: import * only allowed at > module level > def __init__(self, table=None, dataset=None, project=None, query=None, > apache_beam/io/gcp/bigquery.py:431: SyntaxWarning: import * only allowed at > module level > def __init__(self, table, dataset=None, project=None, schema=None, > cc: [~sb2nov][~chamikara] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
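For context, the warning fires when a star import appears inside a function or method body (Python 3 rejects that outright), and the fix is to import at module scope, ideally naming what is used. A runnable sketch of the fixed pattern, using a stdlib module for illustration:
{code}
# Module-level, explicit imports: no SyntaxWarning.
from os.path import join, splitext


def strip_extension(directory, filename):
    base, _ = splitext(filename)
    return join(directory, base)


print(strip_extension('/tmp', 'report.csv'))  # -> /tmp/report
{code}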
[jira] [Commented] (BEAM-1749) Upgrade pep8 to pycodestyle
[ https://issues.apache.org/jira/browse/BEAM-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984116#comment-15984116 ] Sourabh Bajaj commented on BEAM-1749: - This can be closed now. > Upgrade pep8 to pycodestyle > --- > > Key: BEAM-1749 > URL: https://issues.apache.org/jira/browse/BEAM-1749 > Project: Beam > Issue Type: Improvement > Components: sdk-py >Reporter: Ahmet Altay >Assignee: Sourabh Bajaj > Labels: newbie, starter > > pep8 was deprecated and replaced with pycodestyle > We should upgrade our linter to this module, and while doing that re-evaluate > our linter strategy and see if we can enable more rules. This is important > for keeping the code healthy as the community grows. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
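As a sketch of the mechanics: pycodestyle keeps pep8's Python API, so a lint step can be driven like this (the path and settings are illustrative).
{code}
import pycodestyle

style = pycodestyle.StyleGuide(max_line_length=100)
report = style.check_files(['apache_beam'])
if report.total_errors:
    raise SystemExit('pycodestyle found %d issue(s)' % report.total_errors)
{code}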
[jira] [Assigned] (BEAM-1989) clean SyntaxWarning
[ https://issues.apache.org/jira/browse/BEAM-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj reassigned BEAM-1989: --- Assignee: Sourabh Bajaj > clean SyntaxWarning > --- > > Key: BEAM-1989 > URL: https://issues.apache.org/jira/browse/BEAM-1989 > Project: Beam > Issue Type: Bug > Components: sdk-py >Reporter: Ahmet Altay >Assignee: Sourabh Bajaj >Priority: Minor > > apache_beam/io/gcp/bigquery.py:326: SyntaxWarning: import * only allowed at > module level > def __init__(self, table=None, dataset=None, project=None, query=None, > apache_beam/io/gcp/bigquery.py:431: SyntaxWarning: import * only allowed at > module level > def __init__(self, table, dataset=None, project=None, schema=None, > cc: [~sb2nov][~chamikara] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (BEAM-1749) Upgrade pep8 to pycodestyle
[ https://issues.apache.org/jira/browse/BEAM-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj reassigned BEAM-1749: --- Assignee: Sourabh Bajaj > Upgrade pep8 to pycodestyle > --- > > Key: BEAM-1749 > URL: https://issues.apache.org/jira/browse/BEAM-1749 > Project: Beam > Issue Type: Improvement > Components: sdk-py >Reporter: Ahmet Altay >Assignee: Sourabh Bajaj > Labels: newbie, starter > > pep8 was deprecated and replaced with pycodestyle > We should upgrade our linter to this module, and while doing that re-evaluate > our linter strategy and see if we can enable more rules. This is important > for keeping the code healthy as the community grows. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (BEAM-2068) Upgrade Google-Apitools to latest version
[ https://issues.apache.org/jira/browse/BEAM-2068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj resolved BEAM-2068. - Resolution: Fixed Fix Version/s: First stable release > Upgrade Google-Apitools to latest version > - > > Key: BEAM-2068 > URL: https://issues.apache.org/jira/browse/BEAM-2068 > Project: Beam > Issue Type: Improvement > Components: sdk-py >Reporter: Sourabh Bajaj >Assignee: Ahmet Altay >Priority: Minor > Fix For: First stable release > > > In 0.5.9, apitools is pinned to setuptools 18.5, which is really old, as the > current release is 35.0.1 at the time of creating the issue. Updating to > 0.5.9 causes issues for other dependencies, so we're going to try to address > this upstream first and then upgrade to the latest version in the future. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1283) DoFn finishBundle should be required to specify the window for output
[ https://issues.apache.org/jira/browse/BEAM-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983733#comment-15983733 ] Sourabh Bajaj commented on BEAM-1283: - [~kenn] have we reached a resolution on this? Currently in Python, if the user doesn't return a timestamped value, we assign the output a timestamp of -1 [1]. As we don't have state and timers completely implemented yet, should we just assert that we're getting TimestampedValues only? [1] https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/common.py#L310 > DoFn finishBundle should be required to specify the window for output > - > > Key: BEAM-1283 > URL: https://issues.apache.org/jira/browse/BEAM-1283 > Project: Beam > Issue Type: Bug > Components: beam-model, sdk-java-core, sdk-py >Reporter: Kenneth Knowles >Assignee: Thomas Groh > Labels: backward-incompatible > Fix For: First stable release > > > The spec is here in Javadoc: > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/DoFn.java#L128 > "If invoked from {{@StartBundle}} or {{@FinishBundle}}, this will attempt to > use the {{WindowFn}} of the input {{PCollection}} to determine what windows > the element should be in, throwing an exception if the {{WindowFn}} attempts > to access any information about the input element. The output element will > have a timestamp of negative infinity." > This is a collection of caveats that make this method not always technically > wrong, but quite a mess. Ideas that reasonable folks have suggested lately: > - The {{WindowFn}} cannot actually be applied because {{WindowFn}} is > allowed to see the element type. The spec just avoids this by limiting which > {{WindowFn}} can be used. > - There is no natural output timestamp, so it should always be provided. The > spec avoids this by specifying an arbitrary and fairly useless timestamp. > - If it is a merging {{WindowFn}} like sessions that has already been merged > then you'll just have a bogus proto window regardless of explicit timestamp > or not. > The use cases for these methods are best addressed by state plus window > expiry callback, so we should revisit this spec and probably just wipe it. > There are some rare cases where you might need to output from {{FinishBundle}} > in a way that is not _actually_ sensitive to bundling (perhaps modulo some > downstream notion of equivalence) in which case you had better know what > window you are outputting to. Often it should be the global window. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
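For reference, a sketch of what "TimestampedValues only" would look like from user code, re-wrapping each output with an explicit timestamp so the SDK never has to fall back to a default; treat the details as illustrative.
{code}
import apache_beam as beam
from apache_beam.transforms.window import TimestampedValue


class StampedOutput(beam.DoFn):
    def process(self, element, timestamp=beam.DoFn.TimestampParam):
        # Emit with an explicit timestamp instead of letting the SDK
        # assign a default one.
        yield TimestampedValue(element, timestamp)
{code}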
[jira] [Updated] (BEAM-1956) Flatten operation should respect input type hints.
[ https://issues.apache.org/jira/browse/BEAM-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj updated BEAM-1956: Fix Version/s: (was: First stable release) > Flatten operation should respect input type hints. > -- > > Key: BEAM-1956 > URL: https://issues.apache.org/jira/browse/BEAM-1956 > Project: Beam > Issue Type: Bug > Components: sdk-py >Reporter: Vikas Kedigehalli >Assignee: Ahmet Altay > > Input type hints are currently not respected by the Flatten operation and > instead `Any` type is chosen as a fallback. This could lead to using a pickle > coder even if there was a custom coder type hint provided for input > PCollections. > Also, this could lead to undesirable results, particularly, when a Flatten > operation is followed by a GroupByKey operation which requires the key coder > to be deterministic. Even if the user provides deterministic coder type hints > to their PCollections, defaulting to Any would result in using the pickle > coder (non-deterministic). As a result of this, CoGroupByKey is broken in > such scenarios where input PCollection coder is deterministic for the type > while pickle coder is not. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (BEAM-2068) Upgrade Google-Apitools to latest version
Sourabh Bajaj created BEAM-2068: --- Summary: Upgrade Google-Apitools to latest version Key: BEAM-2068 URL: https://issues.apache.org/jira/browse/BEAM-2068 Project: Beam Issue Type: Improvement Components: sdk-py Reporter: Sourabh Bajaj Assignee: Ahmet Altay Priority: Minor In 0.5.9, apitools is pinned to setuptools 18.5, which is really old, as the current release is 35.0.1 at the time of creating the issue. Updating to 0.5.9 causes issues for other dependencies, so we're going to try to address this upstream first and then upgrade to the latest version in the future. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (BEAM-1988) utils.path.join does not correctly handle GCS bucket roots
[ https://issues.apache.org/jira/browse/BEAM-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj updated BEAM-1988: Fix Version/s: First stable release > utils.path.join does not correctly handle GCS bucket roots > -- > > Key: BEAM-1988 > URL: https://issues.apache.org/jira/browse/BEAM-1988 > Project: Beam > Issue Type: Bug > Components: sdk-py >Reporter: Ahmet Altay >Assignee: Sourabh Bajaj > Fix For: First stable release > > > Here: > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/utils/path.py#L22 > Joining a bucket root with a filename, e.g. (gs://mybucket/, myfile), results > in the invalid 'gs://mybucket//myfile'; notice the double // between mybucket and > myfile. (It actually does not handle anything that already ends with {{/}} > correctly.) > [~sb2nov] could you take this one? Also, should the `join` operation move to > the BeamFileSystem-level code? > (cc: [~chamikara]) -- This message was sent by Atlassian JIRA (v6.3.15#6346)
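A sketch of the join behavior being asked for, stripping the trailing separator before concatenating; this illustrates the fix rather than the code Beam ultimately shipped.
{code}
def join(base, *paths):
    # Strip any trailing '/' before appending the next component so a
    # bucket root like 'gs://mybucket/' does not yield a double slash.
    result = base
    for p in paths:
        result = result.rstrip('/') + '/' + p
    return result


assert join('gs://mybucket/', 'myfile') == 'gs://mybucket/myfile'
assert join('gs://mybucket', 'dir/', 'myfile') == 'gs://mybucket/dir/myfile'
{code}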
[jira] [Commented] (BEAM-662) SlidingWindows should support sub-second periods
[ https://issues.apache.org/jira/browse/BEAM-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15977122#comment-15977122 ] Sourabh Bajaj commented on BEAM-662: This can be closed now. > SlidingWindows should support sub-second periods > > > Key: BEAM-662 > URL: https://issues.apache.org/jira/browse/BEAM-662 > Project: Beam > Issue Type: Bug > Components: sdk-py >Reporter: Daniel Mills >Assignee: Sourabh Bajaj >Priority: Minor > Fix For: First stable release > > > SlidingWindows periods are being rounded to seconds, see > http://stackoverflow.com/questions/39604646/can-slidingwindows-have-half-second-periods-in-python-apache-beam -- This message was sent by Atlassian JIRA (v6.3.15#6346)
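Once fixed, a fractional period should be usable directly; a minimal sketch (sizes are in seconds, values illustrative).
{code}
import apache_beam as beam
from apache_beam.transforms import window

with beam.Pipeline() as p:
    windowed = (
        p
        | beam.Create([1, 2, 3])
        | 'HalfSecondPeriod' >> beam.WindowInto(
            window.SlidingWindows(size=10, period=0.5)))
{code}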
[jira] [Commented] (BEAM-1964) Upgrade pylint to 1.7.0
[ https://issues.apache.org/jira/browse/BEAM-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969708#comment-15969708 ] Sourabh Bajaj commented on BEAM-1964: - [~altay] this can be resolved now. > Upgrade pylint to 1.7.0 > --- > > Key: BEAM-1964 > URL: https://issues.apache.org/jira/browse/BEAM-1964 > Project: Beam > Issue Type: Bug > Components: sdk-py >Reporter: Aviem Zur >Assignee: Sourabh Bajaj > > Pre-commit tests seem to all be failing on pylint > For example: > https://builds.apache.org/view/Beam/job/beam_PreCommit_Java_MavenInstall/9493/consoleFull -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (BEAM-1964) Upgrade pylint to 1.7.0
[ https://issues.apache.org/jira/browse/BEAM-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj reassigned BEAM-1964: --- Assignee: Sourabh Bajaj (was: Ahmet Altay) > Upgrade pylint to 1.7.0 > --- > > Key: BEAM-1964 > URL: https://issues.apache.org/jira/browse/BEAM-1964 > Project: Beam > Issue Type: Bug > Components: sdk-py >Reporter: Aviem Zur >Assignee: Sourabh Bajaj > > Pre-commit tests seem to all be failing on pylint > For example: > https://builds.apache.org/view/Beam/job/beam_PreCommit_Java_MavenInstall/9493/consoleFull -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-778) Make filesystem._CompressedFile seekable.
[ https://issues.apache.org/jira/browse/BEAM-778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15968309#comment-15968309 ] Sourabh Bajaj commented on BEAM-778: This can be closed now. > Make filesystem._CompressedFile seekable. > - > > Key: BEAM-778 > URL: https://issues.apache.org/jira/browse/BEAM-778 > Project: Beam > Issue Type: Improvement > Components: sdk-py >Reporter: Chamikara Jayalath >Assignee: Tibor Kiss > Fix For: Not applicable > > > We have a TODO to make filesystem._CompressedFile seekable. > https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/io/fileio.py#L692 > Without this, compressed file objects produced for FileBasedSource > implementations may not be usable with libraries that utilize the methods seek() > and tell(). > For example, tarfile.open(). -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1068) Service Account Credentials File Specified via Pipeline Option Ignored
[ https://issues.apache.org/jira/browse/BEAM-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15968307#comment-15968307 ] Sourabh Bajaj commented on BEAM-1068: - This can be closed now. > Service Account Credentials File Specified via Pipeline Option Ignored > -- > > Key: BEAM-1068 > URL: https://issues.apache.org/jira/browse/BEAM-1068 > Project: Beam > Issue Type: Bug > Components: sdk-py > Environment: CentOS Linux release 7.1.1503 (Core) > Python 2.7.5 >Reporter: Stephen Reichling >Assignee: Ahmet Altay >Priority: Minor > Fix For: First stable release > > > When writing a pipeline that authenticates with Google Dataflow APIs using a > service account, specifying the path to that service account's credentials > file in the {{PipelineOptions}} object passed in to the pipeline does not > work, it only works when passed as a command-line flag. > For example, if I write code like so: > {code} > pipelineOptions = options.PipelineOptions() > gcOptions = pipelineOptions.view_as(options.GoogleCloudOptions) > gcOptions.service_account_name = 'My Service Account Name' > gcOptions.service_account_key_file = '/some/path/keyfile.p12' > pipeline = beam.Pipeline(options=pipelineOptions) > # ... add stages to the pipeline > p.run() > {code} > and execute it like so: > {{python ./my_pipeline.py}} > ...the service account I specify will not be used. > Only if I were to execute the code like so: > {{python ./my_pipeline.py --service_account_name 'My Service Account Name' > --service_account_key_file /some/path/keyfile.p12}} > ...does it actually use the service account. > The problem appears to be rooted in `auth.py` which reconstructs the > {{PipelineOptions}} object directly from {{sys.argv}} rather than using the > instance passed in to the pipeline: > https://github.com/apache/incubator-beam/blob/9ded359daefc6040d61a1f33c77563474fcb09b6/sdks/python/apache_beam/internal/auth.py#L129-L130 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
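Given the root cause described above (auth.py re-parsing sys.argv), one purely hypothetical stopgap, shown only as a sketch, is to inject the flags into sys.argv before constructing the pipeline; the paths and account name are illustrative.
{code}
import sys

# Hypothetical workaround: flags appended to sys.argv are visible to the
# re-parse in auth.py even when the PipelineOptions instance is ignored.
sys.argv += [
    '--service_account_name', 'My Service Account Name',
    '--service_account_key_file', '/some/path/keyfile.p12',
]
{code}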
[jira] [Commented] (BEAM-1101) Remove inconsistencies in Python PipelineOptions
[ https://issues.apache.org/jira/browse/BEAM-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15968305#comment-15968305 ] Sourabh Bajaj commented on BEAM-1101: - This can be closed now. > Remove inconsistencies in Python PipelineOptions > > > Key: BEAM-1101 > URL: https://issues.apache.org/jira/browse/BEAM-1101 > Project: Beam > Issue Type: Bug > Components: sdk-py >Reporter: Pablo Estrada >Assignee: Sourabh Bajaj > Fix For: First stable release > > > Some options have been removed from Java, and some have different default > behavior in Java. Gotta make this consistent. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (BEAM-1708) Better error messages when GCP features are not installed
[ https://issues.apache.org/jira/browse/BEAM-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj resolved BEAM-1708. - Resolution: Fixed Fix Version/s: First stable release > Better error messages when GCP features are not installed > -- > > Key: BEAM-1708 > URL: https://issues.apache.org/jira/browse/BEAM-1708 > Project: Beam > Issue Type: Improvement > Components: sdk-py >Reporter: Sourabh Bajaj >Assignee: Sourabh Bajaj > Fix For: First stable release > > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (BEAM-1068) Service Account Credentials File Specified via Pipeline Option Ignored
[ https://issues.apache.org/jira/browse/BEAM-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj updated BEAM-1068: Fix Version/s: First stable release > Service Account Credentials File Specified via Pipeline Option Ignored > -- > > Key: BEAM-1068 > URL: https://issues.apache.org/jira/browse/BEAM-1068 > Project: Beam > Issue Type: Bug > Components: sdk-py > Environment: CentOS Linux release 7.1.1503 (Core) > Python 2.7.5 >Reporter: Stephen Reichling >Assignee: Ahmet Altay >Priority: Minor > Fix For: First stable release > > > When writing a pipeline that authenticates with Google Dataflow APIs using a > service account, specifying the path to that service account's credentials > file in the {{PipelineOptions}} object passed in to the pipeline does not > work, it only works when passed as a command-line flag. > For example, if I write code like so: > {code} > pipelineOptions = options.PipelineOptions() > gcOptions = pipelineOptions.view_as(options.GoogleCloudOptions) > gcOptions.service_account_name = 'My Service Account Name' > gcOptions.service_account_key_file = '/some/path/keyfile.p12' > pipeline = beam.Pipeline(options=pipelineOptions) > # ... add stages to the pipeline > p.run() > {code} > and execute it like so: > {{python ./my_pipeline.py}} > ...the service account I specify will not be used. > Only if I were to execute the code like so: > {{python ./my_pipeline.py --service_account_name 'My Service Account Name' > --service_account_key_file /some/path/keyfile.p12}} > ...does it actually use the service account. > The problem appears to be rooted in `auth.py` which reconstructs the > {{PipelineOptions}} object directly from {{sys.argv}} rather than using the > instance passed in to the pipeline: > https://github.com/apache/incubator-beam/blob/9ded359daefc6040d61a1f33c77563474fcb09b6/sdks/python/apache_beam/internal/auth.py#L129-L130 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1222) Move GCS specific constants out of fileio
[ https://issues.apache.org/jira/browse/BEAM-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15966746#comment-15966746 ] Sourabh Bajaj commented on BEAM-1222: - [~chamikara] this can be resolved now. > Move GCS specific constants out of fileio > - > > Key: BEAM-1222 > URL: https://issues.apache.org/jira/browse/BEAM-1222 > Project: Beam > Issue Type: Bug > Components: sdk-py >Reporter: Chamikara Jayalath >Assignee: Sourabh Bajaj > > Currently fileio.py has GCS specific constant gcsio.MAX_BATCH_OPERATION_SIZE > which should be moved out of it. This will be needed to implement > IOChannelFactory proposal [1] for Python SDK. > [1] > https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit#heading=h.kpqagzh8i11w -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1924) pip download failed in post commit
[ https://issues.apache.org/jira/browse/BEAM-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15963254#comment-15963254 ] Sourabh Bajaj commented on BEAM-1924: - One improvement that comes to mind is that we should pass the subprocess.PIPE options in this call similar to how we do it in the juliaset example > pip download failed in post commit > -- > > Key: BEAM-1924 > URL: https://issues.apache.org/jira/browse/BEAM-1924 > Project: Beam > Issue Type: Bug > Components: sdk-py >Reporter: Ahmet Altay > > A post commit failed with a pip failure: > https://builds.apache.org/job/beam_PostCommit_Python_Verify/1801/consoleFull > Captured output is not clear enough to tell what went wrong: > == > ERROR: test_par_do_with_multiple_outputs_and_using_return > (apache_beam.transforms.ptransform_test.PTransformTest) > -- > Traceback (most recent call last): > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/apache_beam/transforms/ptransform_test.py", > line 207, in test_par_do_with_multiple_outputs_and_using_return > pipeline.run() > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/apache_beam/test_pipeline.py", > line 91, in run > result = super(TestPipeline, self).run() > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/apache_beam/pipeline.py", > line 160, in run > self.to_runner_api(), self.runner, self.options).run(False) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/apache_beam/pipeline.py", > line 169, in run > return self.runner.run(self) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/apache_beam/runners/dataflow/test_dataflow_runner.py", > line 39, in run > self.result = super(TestDataflowRunner, self).run(pipeline) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/apache_beam/runners/dataflow/dataflow_runner.py", > line 175, in run > self.dataflow_client.create_job(self.job), self) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/apache_beam/utils/retry.py", > line 174, in wrapper > return fun(*args, **kwargs) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/apache_beam/runners/dataflow/internal/apiclient.py", > line 425, in create_job > self.create_job_description(job) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/apache_beam/runners/dataflow/internal/apiclient.py", > line 446, in create_job_description > job.options, file_copy=self._gcs_file_copy) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/apache_beam/runners/dataflow/internal/dependency.py", > line 306, in stage_job_resources > setup_options.requirements_file, requirements_cache_path) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/apache_beam/runners/dataflow/internal/dependency.py", > line 242, in _populate_requirements_cache > processes.check_call(cmd_args) > File > "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/apache_beam/utils/processes.py", > line 40, in check_call > return subprocess.check_call(*args, **kwargs) > File "/usr/lib/python2.7/subprocess.py", line 540, in check_call > raise CalledProcessError(retcode, cmd) > CalledProcessError: Command > 
'['/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/bin/python', > '-m', 'pip', 'install', '--download', '/tmp/dataflow-requirements-cache', > '-r', 'postcommit_requirements.txt', '--no-binary', ':all:']' returned > non-zero exit status 2 > Any ideas on how we can debug this failure or improve for the future? (cc: > [~markflyhigh] [~sb2nov]) -- This message was sent by Atlassian JIRA (v6.3.15#6346)
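A sketch of the suggested improvement: run the command with its output captured so a failure reports the child's error text rather than only an exit status. The helper name is made up; the juliaset example mentioned above is the model.
{code}
import subprocess


def check_call_with_output(cmd_args):
    # Capture stdout/stderr so a non-zero exit can be reported with the
    # child's actual error text instead of just the return code.
    proc = subprocess.Popen(
        cmd_args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError('Command %s failed with code %d: %s'
                           % (cmd_args, proc.returncode, stderr))
    return stdout
{code}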
[jira] [Resolved] (BEAM-1892) Log progress during size estimation in filebasedsource
[ https://issues.apache.org/jira/browse/BEAM-1892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj resolved BEAM-1892. - Resolution: Fixed Fix Version/s: Not applicable > Log progress during size estimation in filebasedsource > - > > Key: BEAM-1892 > URL: https://issues.apache.org/jira/browse/BEAM-1892 > Project: Beam > Issue Type: Improvement > Components: sdk-py >Reporter: Sourabh Bajaj >Assignee: Sourabh Bajaj > Fix For: Not applicable > > > http://stackoverflow.com/questions/43095445/how-to-iterate-all-files-in-google-cloud-storage-to-be-used-as-dataflow-input > The user mentioned that there was no output and a huge delay in submitting > the pipeline. The file size estimation process can be slow for really large > datasets, and it reports no progress to the end user right now. We should be > logging progress and thresholding the pre-submission size estimation as well. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (BEAM-1892) Log progress during size estimation in filebasedsource
Sourabh Bajaj created BEAM-1892: --- Summary: Log progress during size estimation in filebasedsource Key: BEAM-1892 URL: https://issues.apache.org/jira/browse/BEAM-1892 Project: Beam Issue Type: Improvement Components: sdk-py Reporter: Sourabh Bajaj Assignee: Sourabh Bajaj http://stackoverflow.com/questions/43095445/how-to-iterate-all-files-in-google-cloud-storage-to-be-used-as-dataflow-input The user mentioned that there was no output and a huge delay in submitting the pipeline. The file size estimation process can be slow for really large datasets, and it reports no progress to the end user right now. We should be logging progress and thresholding the pre-submission size estimation as well. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1711) Document extra features on quick start guide
[ https://issues.apache.org/jira/browse/BEAM-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954289#comment-15954289 ] Sourabh Bajaj commented on BEAM-1711: - This can be closed now. > Document extra features on quick start guide > > > Key: BEAM-1711 > URL: https://issues.apache.org/jira/browse/BEAM-1711 > Project: Beam > Issue Type: Bug > Components: sdk-py >Reporter: Ahmet Altay >Assignee: Sourabh Bajaj > > Add something like below to avoid confusion > """ > You may need extra packages for some additional features and this is the list > of extra_features and what they do. > feature1: required for x, y, z > """ -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1441) Add an IOChannelFactory interface to Python SDK
[ https://issues.apache.org/jira/browse/BEAM-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953980#comment-15953980 ] Sourabh Bajaj commented on BEAM-1441: - This can be closed now. > Add an IOChannelFactory interface to Python SDK > --- > > Key: BEAM-1441 > URL: https://issues.apache.org/jira/browse/BEAM-1441 > Project: Beam > Issue Type: New Feature > Components: sdk-py >Reporter: Chamikara Jayalath >Assignee: Sourabh Bajaj > Fix For: First stable release > > > Based on proposal [1], an IOChannelFactory interface was added to Java SDK > [2]. > We should add a similar interface to Python SDK and provide proper > implementations for native files, GCS, and other useful formats. > Python SDK currently has a basic ChannelFactory interface [3] which is used > by FileBasedSource [4]. > [1] > https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit#heading=h.kpqagzh8i11w > [2] > https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/IOChannelFactory.java > [3] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio.py#L107 > [4] > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filebasedsource.py > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1264) Python ChannelFactory Raises Inconsistent Error for Local FS and GCS
[ https://issues.apache.org/jira/browse/BEAM-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953978#comment-15953978 ] Sourabh Bajaj commented on BEAM-1264: - Yes, this should be closed now that BeamFileSystem (BFS) is merged. > Python ChannelFactory Raises Inconsistent Error for Local FS and GCS > --- > > Key: BEAM-1264 > URL: https://issues.apache.org/jira/browse/BEAM-1264 > Project: Beam > Issue Type: Bug > Components: sdk-py >Reporter: Mark Liu >Assignee: Sourabh Bajaj > > ChannelFactory raises different errors for the local FS (RuntimeError) and GCS > (IOError) when reading fails. > We want to return a consistent error for both. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1793) Frequent python post commit errors
[ https://issues.apache.org/jira/browse/BEAM-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15943806#comment-15943806 ] Sourabh Bajaj commented on BEAM-1793: - Should be ready to close now. > Frequent python post commit errors > -- > > Key: BEAM-1793 > URL: https://issues.apache.org/jira/browse/BEAM-1793 > Project: Beam > Issue Type: Bug > Components: sdk-py >Reporter: Ahmet Altay >Assignee: Sourabh Bajaj >Priority: Critical > > 1. Failed because virtualenv was already installed, and the postcommit script > made a wrong assumption about its location. > https://builds.apache.org/view/Beam/job/beam_PostCommit_Python_Verify/1499/consoleFull > 2. Failed because a really old version of pip was installed: > https://builds.apache.org/view/Beam/job/beam_PostCommit_Python_Verify/1585/consoleFull > (Also #1586 and #1597) > 3. Failed because a file was already in the cache and pip did not want to > overwrite it, even though this is not usually a problem: > https://builds.apache.org/view/Beam/job/beam_PostCommit_Python_Verify/1596/consoleFull > (Might be related: https://issues.apache.org/jira/browse/BEAM-1788) -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (BEAM-1817) Python environment is different on some nodes
Sourabh Bajaj created BEAM-1817: --- Summary: Python environment is different on some nodes Key: BEAM-1817 URL: https://issues.apache.org/jira/browse/BEAM-1817 Project: Beam Issue Type: Bug Components: testing Reporter: Sourabh Bajaj Assignee: Jason Kuster We've pinned the Python post commits to specific nodes, as pip on some of the new nodes seems to be installed differently from the ones before. https://github.com/apache/beam/pull/2308/files Please remove the changes in the above PR once this is resolved. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-541) Add more documentation on DoFn Annotations
[ https://issues.apache.org/jira/browse/BEAM-541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15939234#comment-15939234 ] Sourabh Bajaj commented on BEAM-541: Yes, I can help with the documentation of the NewDoFn in Python. > Add more documentation on DoFn Annotations > -- > > Key: BEAM-541 > URL: https://issues.apache.org/jira/browse/BEAM-541 > Project: Beam > Issue Type: Wish > Components: website >Reporter: Frances Perry >Assignee: Kenneth Knowles >Priority: Minor > Fix For: First stable release > > > https://github.com/apache/incubator-beam-site/pull/36 made the basic > documentation changes that correspond to BEAM-498, but we should add more > details on how to use the advanced configurations for window access, etc. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1730) Python post commits timing out
[ https://issues.apache.org/jira/browse/BEAM-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15930762#comment-15930762 ] Sourabh Bajaj commented on BEAM-1730: - Please close > Python post commits timing out > -- > > Key: BEAM-1730 > URL: https://issues.apache.org/jira/browse/BEAM-1730 > Project: Beam > Issue Type: Bug > Components: sdk-py >Reporter: Ahmet Altay >Assignee: Sourabh Bajaj > Fix For: Not applicable > > > https://builds.apache.org/view/Beam/job/beam_PostCommit_Python_Verify/1514/ > Dataflow worker logs: > https://pantheon.corp.google.com/logs/viewer?logName=projects%2Fapache-beam-testing%2Flogs%2Fdataflow.googleapis.com%252Fworker-startup&project=apache-beam-testing&resource=dataflow_step%2Fjob_id%2F2017-03-15_10_30_20-1232538583050549167&minLogLevel=0&expandAll=false×tamp=2017-03-15T19:15:33.0Z > Stack trace: > I Executing: /usr/bin/python -m dataflow_worker.start > -Djob_id=2017-03-15_10_30_20-1232538583050549167 > -Dproject_id=apache-beam-testing -Dreporting_enabled=True > -Droot_url=https://dataflow.googleapis.com > -Dservice_path=https://dataflow.googleapis.com/ > -Dtemp_gcs_directory=gs://unused > -Dworker_id=py-validatesrunner-148959-03151030-f51b-harness-t79h > -Ddataflow.worker.logging.location=/var/log/dataflow/python-dataflow-0-json.log > -Dlocal_staging_directory=/var/opt/google/dataflow > -Dsdk_pipeline_options={"display_data":[{"key":"requirements_file","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"STRING","value":"postcommit_requirements.txt"},{"key":"sdk_location","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"STRING","value":"dist/apache-beam-0.7.0.dev0.tar.gz"},{"key":"num_workers","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"INTEGER","value":1},{"key":"runner","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"STRING","value":"TestDataflowRunner"},{"key":"staging_location","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"STRING","value":"gs://temp-storage-for-end-to-end-tests/staging-validatesrunner-test/py-validatesrunner-1489599007.1489599009.190097"},{"key":"project","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"STRING","value":"apache-beam-testing"},{"key":"temp_location","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"STRING","value":"gs://temp-storage-for-end-to-end-tests/temp-validatesrunner-test/py-validatesrunner-1489599007.1489599009.190097"},{"key":"job_name","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"STRING","value":"py-validatesrunner-1489599007"}],"options":{"autoscalingAlgorithm":"NONE","dataflowJobId":"2017-03-15_10_30_20-1232538583050549167","dataflow_endpoint":"https://dataflow.googleapis.com","direct_runner_use_stacked_bundle":true,"gcpTempLocation":"gs://temp-storage-for-end-to-end-tests/temp-validatesrunner-test/py-validatesrunner-1489599007.1489599009.190097","job_name":"py-validatesrunner-1489599007","male":false,"maxNumWorkers":0,"mock_flag":false,"no_auth":false,"numWorkers":1,"num_workers":1,"pipeline_type_check":true,"profile_cpu":false,"profile_memory":false,"project":"apache-beam-testing","requirements_file":"postcommit_requirements.txt","runner":"TestDataflowRunner","runtime_type_check":false,"save_main_session":false,"sdk_location":"dist/apache-beam-0.7.0.dev0.tar.gz","staging_location":"gs://temp-storage-for-end-to-end-tests/staging-validatesrunner-test/py-validatesrunner-14895990
07.1489599009.190097","streaming":false,"style":"scrambled","temp_location":"gs://temp-storage-for-end-to-end-tests/temp-validatesrunner-test/py-validatesrunner-1489599007.1489599009.190097","type_check_strictness":"DEFAULT_TO_ANY"}} > > I Traceback (most recent call last): > IFile "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main > I > I "__main__", fname, loader, pkg_name) > IFile "/usr/lib/python2.7/runpy.py", line 72, in _run_code > I > I exec code in run_globals > IFile "/usr/local/lib/python2.7/dist-packages/dataflow_worker/start.py", > line 26, in > I > I from dataflow_worker import batchworker > IFile > "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line > 51, in > I > I from apache_beam.internal import pickler > IFile "/usr/local/lib/python2.7/dist-packages/apache_beam/__init__.py", > line 85, in > I > I __version__ = version.get_version() > IFile "/usr/local/lib/python2.7/dist-packages/apache_beam/version.py", > line 30, in get_version > I > I __version__ = get_version_from_pom() > IFile "/usr/local/lib/python2.7/dist-packages/apache_beam/version.py", > line 36, in get_version_from_pom
[jira] [Commented] (BEAM-1365) Bigquery dataset names should allow for . inside them
[ https://issues.apache.org/jira/browse/BEAM-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15930760#comment-15930760 ] Sourabh Bajaj commented on BEAM-1365: - We didn't reach a conclusion on whether we want this or not. Can you mark it as "Won't Fix"? > Bigquery dataset names should allow for . inside them > - > > Key: BEAM-1365 > URL: https://issues.apache.org/jira/browse/BEAM-1365 > Project: Beam > Issue Type: Bug > Components: sdk-py >Reporter: Sourabh Bajaj >Assignee: Sourabh Bajaj >Priority: Minor > > BigQuery datasets allow for . inside their names, but the regex in > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/bigquery.py#L305 > doesn't allow for this. > Change the regex to > r'^((?P<project>.+):)?(?P<dataset>[\w\.]+)\.(?P<table>[\w\$]+)$' -- This message was sent by Atlassian JIRA (v6.3.15#6346)
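To see the proposed pattern in action (the named groups are reconstructed from the SDK source, and the table string is made up):
{code}
import re

TABLE_PATTERN = re.compile(
    r'^((?P<project>.+):)?(?P<dataset>[\w\.]+)\.(?P<table>[\w\$]+)$')

m = TABLE_PATTERN.match('my-project:my.dataset.my_table')
print(m.group('project'), m.group('dataset'), m.group('table'))
# -> my-project my.dataset my_table
{code}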
[jira] [Commented] (BEAM-693) pydoc is not working
[ https://issues.apache.org/jira/browse/BEAM-693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15930756#comment-15930756 ] Sourabh Bajaj commented on BEAM-693: Yes > pydoc is not working > > > Key: BEAM-693 > URL: https://issues.apache.org/jira/browse/BEAM-693 > Project: Beam > Issue Type: Bug > Components: sdk-py >Reporter: Ahmet Altay >Assignee: Sourabh Bajaj >Priority: Minor > > Repro: > Start the pydoc server (pydoc -p ) and navigate to the apache_beam root: > http://localhost:/apache_beam.html > Following errors are shown instead of the actual documentation: > problem in apache_beam - : No module named avro > problem in apache_beam - : cannot import name > coders -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-804) Python Pipeline Option save_main_session non-functional
[ https://issues.apache.org/jira/browse/BEAM-804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15930759#comment-15930759 ] Sourabh Bajaj commented on BEAM-804: Yes > Python Pipeline Option save_main_session non-functional > --- > > Key: BEAM-804 > URL: https://issues.apache.org/jira/browse/BEAM-804 > Project: Beam > Issue Type: Bug > Components: sdk-py > Environment: CentOS Linux release 7.1.1503 (Core) > Python 2.7.5 >Reporter: Zoran Bujila >Assignee: Sourabh Bajaj >Priority: Critical > > When trying to use the option --save_main_session a pickling error occurs. > pickle.PicklingError: Can't pickle <class 'apache_beam.internal.clients.dataflow.dataflow_v1b3_messages.TypeValueValuesEnum'>: > it's not found as > apache_beam.internal.clients.dataflow.dataflow_v1b3_messages.TypeValueValuesEnum > This prevents the use of this option which is desirable as there is an > expensive object that needs to be created on each worker in my pipeline and I > would like to have this object created only once per worker. It is not > practical to have it inline with the ParDo function unless I make the batch > size sent to the ParDo quite large. Doing this seems to lead to idle workers > and I would ideally want to bring the batch size way down. > The "Affects Version" option above doesn't have a 0.4.3 version in the drop > down so I did not populate it. However, this was a problem with 0.4.1 and has > not been corrected with 0.4.3. > I don't see where I can attach a file, so here is the entire error. > 2016-10-24 10:00:16,071 The > oauth2client.contrib.multistore_file module has been deprecated and will be > removed in the next release of oauth2client. Please migrate to > multiprocess_file_storage. > 2016-10-24 10:00:16,127 __init__Direct usage of TextFileSink is > deprecated. Please use 'textio.WriteToText()' instead of directly > instantiating a TextFileSink object. 
> Traceback (most recent call last): > File "00test.py", line 41, in > p.run() > File "/usr/local/lib/python2.7/site-packages/apache_beam/pipeline.py", line > 159, in run > return self.runner.run(self) > File > "/usr/local/lib/python2.7/site-packages/apache_beam/runners/dataflow_runner.py", > line 172, in run > self.dataflow_client.create_job(self.job)) > File "/usr/local/lib/python2.7/site-packages/apache_beam/utils/retry.py", > line 160, in wrapper > return fun(*args, **kwargs) > File > "/usr/local/lib/python2.7/site-packages/apache_beam/internal/apiclient.py", > line 375, in create_job > job.options, file_copy=self._gcs_file_copy) > File > "/usr/local/lib/python2.7/site-packages/apache_beam/utils/dependency.py", > line 325, in stage_job_resources > pickler.dump_session(pickled_session_file) > File > "/usr/local/lib/python2.7/site-packages/apache_beam/internal/pickler.py", > line 204, in dump_session > return dill.dump_session(file_path) > File "/usr/local/lib/python2.7/site-packages/dill/dill.py", line 333, in > dump_session > pickler.dump(main) > File > "/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", > line 224, in dump > self.save(obj) > File > "/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", > line 286, in save > f(self, obj) # Call unbound method with explicit self > File > "/usr/local/lib/python2.7/site-packages/apache_beam/internal/pickler.py", > line 123, in save_module > return old_save_module(pickler, obj) > File "/usr/local/lib/python2.7/site-packages/dill/dill.py", line 1168, in > save_module > state=_main_dict) > File > "/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", > line 425, in save_reduce > save(state) > File > "/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", > line 286, in save > f(self, obj) # Call unbound method with explicit self > File > "/usr/local/lib/python2.7/site-packages/apache_beam/internal/pickler.py", > line 159, in new_save_module_dict > return old_save_module_dict(pickler, obj) > File "/usr/local/lib/python2.7/site-packages/dill/dill.py", line 835, in > save_module_dict > StockPickler.save_dict(pickler, obj) > File > "/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", > line 655, in save_dict > self._batch_setitems(obj.iteritems()) > File > "/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py", > line 687, in _batch_setitems > save(v) > File > "/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/V
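A workaround for the per-worker object described above that does not depend on --save_main_session: build the object lazily inside the DoFn so it is constructed once per worker process instead of being pickled from the main session. A minimal sketch, where build_expensive_object and transform are hypothetical placeholders:
{code}
import apache_beam as beam

class ExpensiveDoFn(beam.DoFn):
    _helper = None  # cached per worker process; never pickled with the pipeline

    def start_bundle(self):
        # Constructed at most once per worker, on first use.
        if ExpensiveDoFn._helper is None:
            ExpensiveDoFn._helper = build_expensive_object()  # hypothetical factory

    def process(self, element):
        yield ExpensiveDoFn._helper.transform(element)  # hypothetical method
{code}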
[jira] [Commented] (BEAM-782) Resolve runners in a case-insensitive manner.
[ https://issues.apache.org/jira/browse/BEAM-782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15930754#comment-15930754 ] Sourabh Bajaj commented on BEAM-782: Yes > Resolve runners in a case-insensitive manner. > -- > > Key: BEAM-782 > URL: https://issues.apache.org/jira/browse/BEAM-782 > Project: Beam > Issue Type: New Feature > Components: sdk-py >Reporter: Ahmet Altay >Assignee: Sourabh Bajaj > Labels: sdk-consistency > > See: > https://github.com/apache/incubator-beam/pull/1087 > https://issues.apache.org/jira/browse/BEAM-770 > e.g. the DirectRunner can be specified with (among others) any of > "--runner=direct", "--runner=directrunner", "--runner=DirectRunner", > "--runner=Direct", or "--runner=directRunner" -- This message was sent by Atlassian JIRA (v6.3.15#6346)
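A minimal sketch of the requested normalization; the mapping and helper name are illustrative, not Beam's actual resolution code:
{code}
_RUNNER_CLASSES = {
    'directrunner': 'DirectRunner',
    'dataflowrunner': 'DataflowRunner',
}  # illustrative subset

def resolve_runner_name(name):
    key = name.lower()
    if not key.endswith('runner'):
        key += 'runner'  # accept 'direct' as well as 'directrunner'
    if key not in _RUNNER_CLASSES:
        raise ValueError('Unknown runner: %r' % name)
    return _RUNNER_CLASSES[key]

assert resolve_runner_name('direct') == resolve_runner_name('DirectRunner')
{code}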
[jira] [Commented] (BEAM-547) Align Python SDK version with Maven
[ https://issues.apache.org/jira/browse/BEAM-547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15927160#comment-15927160 ] Sourabh Bajaj commented on BEAM-547: We cannot read the version directly from pom.xml, as it is not part of the Python package when a user does `pip install apache_beam`, so I think the simplest approach is to store the version string directly in version.py. > Align Python SDK version with Maven > --- > > Key: BEAM-547 > URL: https://issues.apache.org/jira/browse/BEAM-547 > Project: Beam > Issue Type: New Feature > Components: sdk-py >Affects Versions: 0.3.0-incubating >Reporter: Sergio Fernández >Assignee: Frances Perry >Priority: Minor > Original Estimate: 24h > Remaining Estimate: 24h > > In BEAM-378 we've integrated the Python SDK in the main Maven build. > Initially I wanted to also align versions, but after discussing it with > [~silv...@google.com] we kept that aside for the moment. > Closing [PR #537|https://github.com/apache/incubator-beam/pull/537] [~altay] > brings the issue back. So it may make sense to revisit that idea. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
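A minimal sketch of that single-source approach; the file layout and value are assumptions, not the final change:
{code}
# apache_beam/version.py
__version__ = '0.7.0.dev0'  # illustrative value, bumped by the release process

# setup.py (excerpt): read the string without importing the package
import re

with open('apache_beam/version.py') as f:
    version = re.search(r"__version__ = '([^']*)'", f.read()).group(1)
{code}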
[jira] [Created] (BEAM-1731) RuntimeError when running wordcount with ValueProviders
Sourabh Bajaj created BEAM-1731: --- Summary: RuntimeError when running wordcount with ValueProviders Key: BEAM-1731 URL: https://issues.apache.org/jira/browse/BEAM-1731 Project: Beam Issue Type: Bug Components: sdk-py Reporter: Sourabh Bajaj Assignee: María GH Running: python -m apache_beam.examples.wordcount INFO:root:Job 2017-03-15_13_39_59-3092873759767386 is in state JOB_STATE_FAILED Traceback (most recent call last): File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 162, in _run_module_as_main "__main__", fname, loader, pkg_name) File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 72, in _run_code exec code in run_globals File "/Users/sourabhbajaj/Projects/incubator-beam/sdks/python/apache_beam/examples/wordcount.py", line 119, in run() File "/Users/sourabhbajaj/Projects/incubator-beam/sdks/python/apache_beam/examples/wordcount.py", line 109, in run result.wait_until_finish() File "apache_beam/runners/dataflow/dataflow_runner.py", line 711, in wait_until_finish (self.state, getattr(self._runner, 'last_error_msg', None)), self) apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error: (e22fabbb61bfae00): Traceback (most recent call last): File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 544, in do_work work_executor.execute() File "dataflow_worker/executor.py", line 1013, in dataflow_worker.executor.CustomSourceSplitExecutor.execute (dataflow_worker/executor.c:31501) self.response = self._perform_source_split_considering_api_limits( File "dataflow_worker/executor.py", line 1021, in dataflow_worker.executor.CustomSourceSplitExecutor._perform_source_split_considering_api_limits (dataflow_worker/executor.c:31703) split_response = self._perform_source_split(source_operation_split_task, File "dataflow_worker/executor.py", line 1059, in dataflow_worker.executor.CustomSourceSplitExecutor._perform_source_split (dataflow_worker/executor.c:32341) for split in source.split(desired_bundle_size): File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filebasedsource.py", line 192, in split return self._get_concat_source().split( File "/usr/local/lib/python2.7/dist-packages/apache_beam/utils/value_provider.py", line 105, in _f raise RuntimeError('%s not accessible' % obj) RuntimeError: RuntimeValueProvider(option: input, type: str, default_value: 'gs://dataflow-samples/shakespeare/kinglear.txt') not accessible -- This message was sent by Atlassian JIRA (v6.3.15#6346)
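The failure above is the intended guard: a RuntimeValueProvider has no value at pipeline-construction time, so .get() must be deferred until the pipeline is executing on the worker. A minimal sketch with an illustrative DoFn:
{code}
import apache_beam as beam

class UseInput(beam.DoFn):
    def __init__(self, input_vp):
        self.input_vp = input_vp  # store the ValueProvider, not input_vp.get()

    def process(self, element):
        # Safe here: the pipeline is running, so the option has a value.
        yield (self.input_vp.get(), element)
{code}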
[jira] [Commented] (BEAM-1730) Python post commits timing out
[ https://issues.apache.org/jira/browse/BEAM-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15926928#comment-15926928 ] Sourabh Bajaj commented on BEAM-1730: - Reverting the changes in the commit as this needs broader discussion. > Python post commits timing out > -- > > Key: BEAM-1730 > URL: https://issues.apache.org/jira/browse/BEAM-1730 > Project: Beam > Issue Type: Bug > Components: sdk-py >Reporter: Ahmet Altay >Assignee: Sourabh Bajaj > > https://builds.apache.org/view/Beam/job/beam_PostCommit_Python_Verify/1514/ > Dataflow worker logs: > https://pantheon.corp.google.com/logs/viewer?logName=projects%2Fapache-beam-testing%2Flogs%2Fdataflow.googleapis.com%252Fworker-startup&project=apache-beam-testing&resource=dataflow_step%2Fjob_id%2F2017-03-15_10_30_20-1232538583050549167&minLogLevel=0&expandAll=false×tamp=2017-03-15T19:15:33.0Z > Stack trace: > I Executing: /usr/bin/python -m dataflow_worker.start > -Djob_id=2017-03-15_10_30_20-1232538583050549167 > -Dproject_id=apache-beam-testing -Dreporting_enabled=True > -Droot_url=https://dataflow.googleapis.com > -Dservice_path=https://dataflow.googleapis.com/ > -Dtemp_gcs_directory=gs://unused > -Dworker_id=py-validatesrunner-148959-03151030-f51b-harness-t79h > -Ddataflow.worker.logging.location=/var/log/dataflow/python-dataflow-0-json.log > -Dlocal_staging_directory=/var/opt/google/dataflow > -Dsdk_pipeline_options={"display_data":[{"key":"requirements_file","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"STRING","value":"postcommit_requirements.txt"},{"key":"sdk_location","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"STRING","value":"dist/apache-beam-0.7.0.dev0.tar.gz"},{"key":"num_workers","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"INTEGER","value":1},{"key":"runner","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"STRING","value":"TestDataflowRunner"},{"key":"staging_location","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"STRING","value":"gs://temp-storage-for-end-to-end-tests/staging-validatesrunner-test/py-validatesrunner-1489599007.1489599009.190097"},{"key":"project","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"STRING","value":"apache-beam-testing"},{"key":"temp_location","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"STRING","value":"gs://temp-storage-for-end-to-end-tests/temp-validatesrunner-test/py-validatesrunner-1489599007.1489599009.190097"},{"key":"job_name","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"STRING","value":"py-validatesrunner-1489599007"}],"options":{"autoscalingAlgorithm":"NONE","dataflowJobId":"2017-03-15_10_30_20-1232538583050549167","dataflow_endpoint":"https://dataflow.googleapis.com","direct_runner_use_stacked_bundle":true,"gcpTempLocation":"gs://temp-storage-for-end-to-end-tests/temp-validatesrunner-test/py-validatesrunner-1489599007.1489599009.190097","job_name":"py-validatesrunner-1489599007","male":false,"maxNumWorkers":0,"mock_flag":false,"no_auth":false,"numWorkers":1,"num_workers":1,"pipeline_type_check":true,"profile_cpu":false,"profile_memory":false,"project":"apache-beam-testing","requirements_file":"postcommit_requirements.txt","runner":"TestDataflowRunner","runtime_type_check":false,"save_main_session":false,"sdk_location":"dist/apache-beam-0.7.0.dev0.tar.gz","staging_location":"gs://temp-storage-for-end-to-end-tests/staging-validatesrunner-tes
t/py-validatesrunner-1489599007.1489599009.190097","streaming":false,"style":"scrambled","temp_location":"gs://temp-storage-for-end-to-end-tests/temp-validatesrunner-test/py-validatesrunner-1489599007.1489599009.190097","type_check_strictness":"DEFAULT_TO_ANY"}} > > I Traceback (most recent call last): > IFile "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main > I > I "__main__", fname, loader, pkg_name) > IFile "/usr/lib/python2.7/runpy.py", line 72, in _run_code > I > I exec code in run_globals > IFile "/usr/local/lib/python2.7/dist-packages/dataflow_worker/start.py", > line 26, in > I > I from dataflow_worker import batchworker > IFile > "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line > 51, in > I > I from apache_beam.internal import pickler > IFile "/usr/local/lib/python2.7/dist-packages/apache_beam/__init__.py", > line 85, in > I > I __version__ = version.get_version() > IFile "/usr/local/lib/python2.7/dist-packages/apache_beam/version.py", > line 30, in get_version > I > I __version__ = get_version_from_pom() > IFile "/usr/local/lib/python2.7/dist-packages/apache_beam/version.py", > line 36, in get
[jira] [Created] (BEAM-1708) Better error messages when GCP features are not installed
Sourabh Bajaj created BEAM-1708: --- Summary: Better error messages when GCP features are not installed Key: BEAM-1708 URL: https://issues.apache.org/jira/browse/BEAM-1708 Project: Beam Issue Type: Bug Components: sdk-py Reporter: Sourabh Bajaj Assignee: Sourabh Bajaj -- This message was sent by Atlassian JIRA (v6.3.15#6346)
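One plausible shape for this; the import path, helper name, and message text are illustrative assumptions, not the merged fix:
{code}
# Guarded import at module level; fail with an actionable message on first use.
try:
    from apache_beam.io.gcp import gcsio  # path may differ by SDK version
except ImportError:
    gcsio = None

def _require_gcp():
    if gcsio is None:
        raise ImportError(
            'Google Cloud Platform dependencies are not installed. '
            'Install them with: pip install apache-beam[gcp]')
{code}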
[jira] [Commented] (BEAM-1320) Add sphinx or pydocs documentation for python-sdk
[ https://issues.apache.org/jira/browse/BEAM-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893064#comment-15893064 ] Sourabh Bajaj commented on BEAM-1320: - [~altay] can you resolve this? > Add sphinx or pydocs documentation for python-sdk > - > > Key: BEAM-1320 > URL: https://issues.apache.org/jira/browse/BEAM-1320 > Project: Beam > Issue Type: Task > Components: sdk-py >Reporter: Ahmet Altay >Assignee: Sourabh Bajaj > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1570) broken link on website (ostatic.com)
[ https://issues.apache.org/jira/browse/BEAM-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893060#comment-15893060 ] Sourabh Bajaj commented on BEAM-1570: - [~sisk] can you resolve this? > broken link on website (ostatic.com) > > > Key: BEAM-1570 > URL: https://issues.apache.org/jira/browse/BEAM-1570 > Project: Beam > Issue Type: Bug > Components: website >Reporter: Stephen Sisk >Assignee: Sourabh Bajaj > > I see the following error when running the website test suite locally and on > jenkins: > * External link > http://ostatic.com/blog/apache-beam-unifies-batch-and-streaming-data-processing > failed: got a time out (response code 0) > It looks like ostatic's website is down. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1264) Python ChannelFactory Raise Inconsistent Error for Local FS and GCS
[ https://issues.apache.org/jira/browse/BEAM-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893057#comment-15893057 ] Sourabh Bajaj commented on BEAM-1264: - [~markflyhigh] do you have an example for this? I think I have resolved most of it in https://github.com/apache/beam/pull/2136 but wanted to verify it before resolving. > Python ChannelFactory Raise Inconsistent Error for Local FS and GCS > --- > > Key: BEAM-1264 > URL: https://issues.apache.org/jira/browse/BEAM-1264 > Project: Beam > Issue Type: Bug > Components: beam-model, sdk-py >Reporter: Mark Liu >Assignee: Sourabh Bajaj > > ChannelFactory raises different errors for local fs (RuntimeError) and GCS > (IOError) when a read fails. > We want to return a consistent error for both. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
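A sketch of the consistent surface being asked for. BeamIOError is an assumption borrowed from the later FileSystems work, and _open_underlying is a hypothetical per-filesystem helper:
{code}
from apache_beam.io.filesystem import BeamIOError

def open_for_read(path):
    try:
        return _open_underlying(path)  # hypothetical: local or GCS open
    except (IOError, RuntimeError) as e:
        # Same error type whether path is local or gs://, with the cause attached.
        raise BeamIOError('Failed to open %s' % path, {path: e})
{code}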
[jira] [Assigned] (BEAM-994) Support for S3 file source and sink
[ https://issues.apache.org/jira/browse/BEAM-994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj reassigned BEAM-994: -- Assignee: (was: Sourabh Bajaj) > Support for S3 file source and sink > --- > > Key: BEAM-994 > URL: https://issues.apache.org/jira/browse/BEAM-994 > Project: Beam > Issue Type: New Feature > Components: sdk-java-extensions >Reporter: Sourabh Bajaj >Priority: Minor > > http://stackoverflow.com/questions/40624544/what-is-best-practice-of-the-the-case-of-writing-text-output-into-s3-bucket > is one of the examples of the need for such a feature. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Closed] (BEAM-1082) Use use_standard_sql flag everywhere instead of use_legacy_sql
[ https://issues.apache.org/jira/browse/BEAM-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj closed BEAM-1082. --- Resolution: Won't Fix Fix Version/s: Not applicable It was decided to leave this as it is. > Use use_standard_sql flag everywhere instead of use_legacy_sql > -- > > Key: BEAM-1082 > URL: https://issues.apache.org/jira/browse/BEAM-1082 > Project: Beam > Issue Type: Improvement > Components: sdk-py >Reporter: Sourabh Bajaj >Assignee: Sourabh Bajaj > Fix For: Not applicable > > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (BEAM-994) Support for S3 file source and sink
[ https://issues.apache.org/jira/browse/BEAM-994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj reassigned BEAM-994: -- Assignee: Sourabh Bajaj > Support for S3 file source and sink > --- > > Key: BEAM-994 > URL: https://issues.apache.org/jira/browse/BEAM-994 > Project: Beam > Issue Type: New Feature > Components: sdk-java-extensions >Reporter: Sourabh Bajaj >Assignee: Sourabh Bajaj >Priority: Minor > > http://stackoverflow.com/questions/40624544/what-is-best-practice-of-the-the-case-of-writing-text-output-into-s3-bucket > is one of the examples of the need for such a feature. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (BEAM-1528) Remove all sphinx warnings from the pydocs
[ https://issues.apache.org/jira/browse/BEAM-1528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj resolved BEAM-1528. - Resolution: Fixed > Remove all sphinx warnings from the pydocs > -- > > Key: BEAM-1528 > URL: https://issues.apache.org/jira/browse/BEAM-1528 > Project: Beam > Issue Type: Improvement > Components: sdk-py >Reporter: Sourabh Bajaj >Assignee: Ahmet Altay >Priority: Minor > Fix For: Not applicable > > > There are still 4 open warnings in the pydoc generation which should be > fixed. Once this is over please add a test to the generation script that > changes all warnings to errors. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1528) Remove all sphinx warnings from the pydocs
[ https://issues.apache.org/jira/browse/BEAM-1528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892896#comment-15892896 ] Sourabh Bajaj commented on BEAM-1528: - Yes, I'll resolve it. > Remove all sphinx warnings from the pydocs > -- > > Key: BEAM-1528 > URL: https://issues.apache.org/jira/browse/BEAM-1528 > Project: Beam > Issue Type: Improvement > Components: sdk-py >Reporter: Sourabh Bajaj >Assignee: Ahmet Altay >Priority: Minor > Fix For: Not applicable > > > There are still 4 open warnings in the pydoc generation which should be > fixed. Once this is over please add a test to the generation script that > changes all warnings to errors. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (BEAM-1585) Ability to add new file systems to beamFS in the python sdk
[ https://issues.apache.org/jira/browse/BEAM-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj updated BEAM-1585: Description: BEAM-1441 implements the new BeamFileSystem in the python SDK but currently lacks the ability to add user implemented file systems. This needs to be executed in the worker so should be packaged correctly with the pipeline code. > Ability to add new file systems to beamFS in the python sdk > --- > > Key: BEAM-1585 > URL: https://issues.apache.org/jira/browse/BEAM-1585 > Project: Beam > Issue Type: New Feature > Components: sdk-py >Reporter: Sourabh Bajaj >Assignee: Sourabh Bajaj > > BEAM-1441 implements the new BeamFileSystem in the python SDK but currently > lacks the ability to add user implemented file systems. > This needs to be executed in the worker so should be packaged correctly with > the pipeline code. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
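A sketch of the extension point being requested. The scheme()-based registration mirrors the FileSystems design from BEAM-1441; the exact plugin mechanism shown here is an assumption, not the final API:
{code}
from apache_beam.io.filesystem import FileSystem

class MyFileSystem(FileSystem):
    """User-supplied file system handling myfs:// paths (illustrative)."""

    @classmethod
    def scheme(cls):
        return 'myfs'

    # The remaining abstract methods (match, create, open, rename, delete,
    # ...) would be implemented against the custom backend; omitted here.
{code}
As the description notes, such a class has to be staged with the pipeline code so the workers can import it.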
[jira] [Created] (BEAM-1585) Ability to add new file systems to beamFS in the python sdk
Sourabh Bajaj created BEAM-1585: --- Summary: Ability to add new file systems to beamFS in the python sdk Key: BEAM-1585 URL: https://issues.apache.org/jira/browse/BEAM-1585 Project: Beam Issue Type: New Feature Components: sdk-py Reporter: Sourabh Bajaj Assignee: Sourabh Bajaj -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1570) broken link on website (ostatic.com)
[ https://issues.apache.org/jira/browse/BEAM-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890755#comment-15890755 ] Sourabh Bajaj commented on BEAM-1570: - This is fixed. > broken link on website (ostatic.com) > > > Key: BEAM-1570 > URL: https://issues.apache.org/jira/browse/BEAM-1570 > Project: Beam > Issue Type: Bug > Components: website >Reporter: Stephen Sisk >Assignee: Sourabh Bajaj > > I see the following error when running the website test suite locally and on > jenkins: > * External link > http://ostatic.com/blog/apache-beam-unifies-batch-and-streaming-data-processing > failed: got a time out (response code 0) > It looks like ostatic's website is down. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1545) Python sdk example run failed
[ https://issues.apache.org/jira/browse/BEAM-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890756#comment-15890756 ] Sourabh Bajaj commented on BEAM-1545: - This is fixed so we should close this. > Python sdk example run failed > - > > Key: BEAM-1545 > URL: https://issues.apache.org/jira/browse/BEAM-1545 > Project: Beam > Issue Type: Bug > Components: sdk-py >Reporter: Haoxiang >Assignee: Sourabh Bajaj > > When I run the python sdk example with > https://beam.apache.org/get-started/quickstart-py/ show, run the command: > python -m apache_beam.examples.wordcount --input > gs://dataflow-samples/shakespeare/kinglear.txt --output output.txt > it was failed by the logs: > INFO:root:Missing pipeline option (runner). Executing pipeline using the > default runner: DirectRunner. > Traceback (most recent call last): > File > "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", > line 162, in _run_module_as_main > "__main__", fname, loader, pkg_name) > File > "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", > line 72, in _run_code > exec code in run_globals > File > "/Users/haoxiang/InterestingGitProject/beam/sdks/python/apache_beam/examples/wordcount.py", > line 107, in > run() > File > "/Users/haoxiang/InterestingGitProject/beam/sdks/python/apache_beam/examples/wordcount.py", > line 83, in run > lines = p | 'read' >> ReadFromText(known_args.input) > File "apache_beam/io/textio.py", line 378, in __init__ > skip_header_lines=skip_header_lines) > File "apache_beam/io/textio.py", line 87, in __init__ > validate=validate) > File "apache_beam/io/filebasedsource.py", line 97, in __init__ > self._validate() > File "apache_beam/io/filebasedsource.py", line 171, in _validate > if len(fileio.ChannelFactory.glob(self._pattern, limit=1)) <= 0: > File "apache_beam/io/fileio.py", line 281, in glob > return gcsio.GcsIO().glob(path, limit) > AttributeError: 'NoneType' object has no attribute 'GcsIO' -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (BEAM-1545) Python sdk example run failed
[ https://issues.apache.org/jira/browse/BEAM-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15883455#comment-15883455 ] Sourabh Bajaj commented on BEAM-1545: - You should run `pip install -e .[gcp,test]` instead of `python setup.py install`, as Ahmet suggested. That should fix the issue you're facing. > Python sdk example run failed > - > > Key: BEAM-1545 > URL: https://issues.apache.org/jira/browse/BEAM-1545 > Project: Beam > Issue Type: Bug > Components: sdk-py >Reporter: Haoxiang >Assignee: Sourabh Bajaj > > When I run the python sdk example with > https://beam.apache.org/get-started/quickstart-py/ show, run the command: > python -m apache_beam.examples.wordcount --input > gs://dataflow-samples/shakespeare/kinglear.txt --output output.txt > it was failed by the logs: > INFO:root:Missing pipeline option (runner). Executing pipeline using the > default runner: DirectRunner. > Traceback (most recent call last): > File > "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", > line 162, in _run_module_as_main > "__main__", fname, loader, pkg_name) > File > "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", > line 72, in _run_code > exec code in run_globals > File > "/Users/haoxiang/InterestingGitProject/beam/sdks/python/apache_beam/examples/wordcount.py", > line 107, in > run() > File > "/Users/haoxiang/InterestingGitProject/beam/sdks/python/apache_beam/examples/wordcount.py", > line 83, in run > lines = p | 'read' >> ReadFromText(known_args.input) > File "apache_beam/io/textio.py", line 378, in __init__ > skip_header_lines=skip_header_lines) > File "apache_beam/io/textio.py", line 87, in __init__ > validate=validate) > File "apache_beam/io/filebasedsource.py", line 97, in __init__ > self._validate() > File "apache_beam/io/filebasedsource.py", line 171, in _validate > if len(fileio.ChannelFactory.glob(self._pattern, limit=1)) <= 0: > File "apache_beam/io/fileio.py", line 281, in glob > return gcsio.GcsIO().glob(path, limit) > AttributeError: 'NoneType' object has no attribute 'GcsIO' -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (BEAM-1528) Remove all sphinx warnings from the pydocs
Sourabh Bajaj created BEAM-1528: --- Summary: Remove all sphinx warnings from the pydocs Key: BEAM-1528 URL: https://issues.apache.org/jira/browse/BEAM-1528 Project: Beam Issue Type: Improvement Components: sdk-py Reporter: Sourabh Bajaj Assignee: Ahmet Altay Priority: Minor Fix For: Not applicable There are still 4 open warnings in the pydoc generation which should be fixed. Once this is over please add a test to the generation script that changes all warnings to errors. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (BEAM-1527) Check if the dependencies protorpc and python-gflags can be removed
Sourabh Bajaj created BEAM-1527: --- Summary: Check if the dependencies protorpc and python-gflags can be removed Key: BEAM-1527 URL: https://issues.apache.org/jira/browse/BEAM-1527 Project: Beam Issue Type: Improvement Components: sdk-py Reporter: Sourabh Bajaj Assignee: Ahmet Altay Priority: Minor Fix For: Not applicable These dependencies might be unused; if so, they should be removed. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (BEAM-1509) Python sdk pom runs only on unix
Sourabh Bajaj created BEAM-1509: --- Summary: Python sdk pom runs only on unix Key: BEAM-1509 URL: https://issues.apache.org/jira/browse/BEAM-1509 Project: Beam Issue Type: Bug Components: sdk-py Affects Versions: Not applicable Reporter: Sourabh Bajaj Assignee: Ahmet Altay Priority: Minor https://github.com/apache/beam/blob/master/sdks/python/pom.xml#L133 The pom.xml file in sdks/python contains lines like this one that won't work on Windows, so we should investigate how to run them correctly. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (BEAM-1473) Remove unused windmill proto files from python sdk
[ https://issues.apache.org/jira/browse/BEAM-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj resolved BEAM-1473. - Resolution: Fixed > Remove unused windmill proto files from python sdk > -- > > Key: BEAM-1473 > URL: https://issues.apache.org/jira/browse/BEAM-1473 > Project: Beam > Issue Type: Improvement > Components: sdk-py >Affects Versions: Not applicable >Reporter: Sourabh Bajaj >Assignee: Sourabh Bajaj >Priority: Minor > Fix For: Not applicable > > > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/internal/windmill_pb2.py > > There are two unused windmill files in beam that should be cleaned. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (BEAM-1473) Remove unused windmill proto files from python sdk
[ https://issues.apache.org/jira/browse/BEAM-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj updated BEAM-1473: Description: https://github.com/apache/beam/blob/master/sdks/python/apache_beam/internal/windmill_pb2.py There are two unused windmill files in beam that should be cleaned. > Remove unused windmill proto files from python sdk > -- > > Key: BEAM-1473 > URL: https://issues.apache.org/jira/browse/BEAM-1473 > Project: Beam > Issue Type: Improvement > Components: sdk-py >Affects Versions: Not applicable >Reporter: Sourabh Bajaj >Assignee: Sourabh Bajaj >Priority: Minor > Fix For: Not applicable > > > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/internal/windmill_pb2.py > > There are two unused windmill files in beam that should be cleaned. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (BEAM-1473) Remove unused windmill proto files from python sdk
Sourabh Bajaj created BEAM-1473: --- Summary: Remove unused windmill proto files from python sdk Key: BEAM-1473 URL: https://issues.apache.org/jira/browse/BEAM-1473 Project: Beam Issue Type: Improvement Components: sdk-py Affects Versions: Not applicable Reporter: Sourabh Bajaj Assignee: Sourabh Bajaj Priority: Minor Fix For: Not applicable -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (BEAM-1464) Make sure all tests use the TestPipeline class
[ https://issues.apache.org/jira/browse/BEAM-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj resolved BEAM-1464. - Resolution: Fixed > Make sure all tests use the TestPipeline class > -- > > Key: BEAM-1464 > URL: https://issues.apache.org/jira/browse/BEAM-1464 > Project: Beam > Issue Type: Improvement > Components: sdk-py >Affects Versions: Not applicable >Reporter: Sourabh Bajaj >Assignee: Sourabh Bajaj >Priority: Minor > Fix For: Not applicable > > > Some tests still specify using the DirectRunner. They should all use the > TestRunner in those cases. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
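The cleanup pattern, as a before/after sketch; the import path reflects the SDK at the time and is an assumption (it later moved under apache_beam.testing):
{code}
import apache_beam as beam
from apache_beam.test_pipeline import TestPipeline

# Before: the runner is hard-coded in the test.
# p = beam.Pipeline('DirectRunner')

# After: the test infrastructure chooses the runner.
p = TestPipeline()  # honors --test-pipeline-options when provided
{code}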
[jira] [Created] (BEAM-1464) Make sure all tests use the TestPipeline class
Sourabh Bajaj created BEAM-1464: --- Summary: Make sure all tests use the TestPipeline class Key: BEAM-1464 URL: https://issues.apache.org/jira/browse/BEAM-1464 Project: Beam Issue Type: Improvement Components: sdk-py Affects Versions: Not applicable Reporter: Sourabh Bajaj Assignee: Sourabh Bajaj Priority: Minor Fix For: Not applicable Some tests still specify using the DirectRunner. They should all use the TestRunner in those cases. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (BEAM-1430) PartitionFn should access only the element and not the context
[ https://issues.apache.org/jira/browse/BEAM-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj resolved BEAM-1430. - Resolution: Fixed Fix Version/s: Not applicable > PartitionFn should access only the element and not the context > -- > > Key: BEAM-1430 > URL: https://issues.apache.org/jira/browse/BEAM-1430 > Project: Beam > Issue Type: Improvement > Components: sdk-py >Affects Versions: Not applicable >Reporter: Sourabh Bajaj >Assignee: Sourabh Bajaj >Priority: Minor > Fix For: Not applicable > > > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/core.py#L1182 > should only access the element and not the full context. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
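A sketch of the narrowed signature this issue proposes; class and pipeline contents are illustrative:
{code}
import apache_beam as beam
from apache_beam.transforms.core import PartitionFn

class ByParity(PartitionFn):
    def partition_for(self, element, num_partitions):
        # Only the element is needed; no access to the full process context.
        return element % num_partitions

p = beam.Pipeline()
parts = p | beam.Create([1, 2, 3, 4]) | beam.Partition(ByParity(), 2)
# parts[0] holds the even elements, parts[1] the odd ones.
{code}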
[jira] [Resolved] (BEAM-1429) WindowFn assign should only have access to the element and timestamp
[ https://issues.apache.org/jira/browse/BEAM-1429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sourabh Bajaj resolved BEAM-1429. - Resolution: Fixed Fix Version/s: Not applicable > WindowFn assign should only have access to the element and timestamp > > > Key: BEAM-1429 > URL: https://issues.apache.org/jira/browse/BEAM-1429 > Project: Beam > Issue Type: Improvement > Components: sdk-py >Affects Versions: Not applicable >Reporter: Sourabh Bajaj >Assignee: Sourabh Bajaj >Priority: Minor > Fix For: Not applicable > > > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/window.py#L97 > does not really use access to multiple windows at the same time. > Also as we move to the NewDoFn we have moved away from accessing the > windowSet together. Change the function here to only take the element and > timestamp as the input. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
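A sketch of the narrowed assign() contract described above; it mirrors what FixedWindows already does, and the exact context fields are assumptions:
{code}
from apache_beam.transforms import window

class DailyWindows(window.NonMergingWindowFn):
    SIZE = 24 * 60 * 60  # one day, in seconds

    def assign(self, context):
        # Uses only the element's timestamp; no access to a set of windows.
        start = context.timestamp - (context.timestamp % self.SIZE)
        return [window.IntervalWindow(start, start + self.SIZE)]

    # get_window_coder() and other overrides omitted for brevity.
{code}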