[jira] [Assigned] (BEAM-2271) Release guide or pom.xml needs update to avoid releasing Python binary artifacts

2017-11-05 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj reassigned BEAM-2271:
---

Assignee: Ahmet Altay  (was: Sourabh Bajaj)

> Release guide or pom.xml needs update to avoid releasing Python binary 
> artifacts
> 
>
> Key: BEAM-2271
> URL: https://issues.apache.org/jira/browse/BEAM-2271
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py-core
>Reporter: Daniel Halperin
>Assignee: Ahmet Altay
> Fix For: 2.3.0
>
>
> The following directories (and children) were discovered in 2.0.0-RC2 and 
> were present in 0.6.0.
> {code}
> sdks/python: build  dist  .eggs  nose-1.3.7-py2.7.egg  (and child contents)
> {code}
> Ideally, these artifacts, which are created during setup and testing, would 
> get created in the {{sdks/python/target/}} subfolder where they will 
> automatically get ignored. More info below.
> For 2.0.0, we will manually remove these files from the source release RC3+. 
> This should be fixed before the next release.
> Here is a list of other paths that get excluded, should they be useful.
> {code}
> <excludes>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/).*${project.build.directory}.*]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?maven-eclipse\.xml]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.project]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.classpath]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?[^/]*\.iws]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.idea(/.*)?]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?out(/.*)?]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?[^/]*\.ipr]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?[^/]*\.iml]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.settings(/.*)?]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.externalToolBuilders(/.*)?]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.deployables(/.*)?]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.wtpmodules(/.*)?]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?cobertura\.ser]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?pom\.xml\.releaseBackup]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?release\.properties]</exclude>
> </excludes>
> {code}
> This list is stored inside of this jar, which you can find by tracking 
> maven-assembly-plugin from the root apache pom: 
> https://mvnrepository.com/artifact/org.apache.apache.resources/apache-source-release-assembly-descriptor/1.0.6
> http://svn.apache.org/repos/asf/maven/pom/tags/apache-18/pom.xml



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Closed] (BEAM-1585) Ability to add new file systems to beamFS in the python sdk

2017-11-05 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj closed BEAM-1585.
---

> Ability to add new file systems to beamFS in the python sdk
> ---
>
> Key: BEAM-1585
> URL: https://issues.apache.org/jira/browse/BEAM-1585
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py-core
>Reporter: Sourabh Bajaj
>Assignee: Sourabh Bajaj
> Fix For: Not applicable
>
>
> BEAM-1441 implements the new BeamFileSystem in the python SDK but currently 
> lacks the ability to add user implemented file systems.
> This needs to be executed in the worker so should be packaged correctly with 
> the pipeline code. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2673) BigQuery Sink should use the Load API

2017-07-24 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099103#comment-16099103
 ] 

Sourabh Bajaj commented on BEAM-2673:
-

No, as there is no Truncate mode in streaming; you can only append to or create 
a new table.
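
To make that concrete, here is a minimal sketch (not from this thread; the 
table name is hypothetical) of a streaming-style write, which can only create 
or append:

{code:python}
import apache_beam as beam
from apache_beam.io.gcp.bigquery import BigQueryDisposition

with beam.Pipeline() as p:
  _ = (
      p
      | beam.Create([{'word': 'beam', 'count': 1}])
      | beam.io.WriteToBigQuery(
          'my-project:my_dataset.word_counts',  # hypothetical table
          schema='word:STRING,count:INTEGER',
          # Streaming writes can create the table or append to it;
          # there is no truncate equivalent.
          create_disposition=BigQueryDisposition.CREATE_IF_NEEDED,
          write_disposition=BigQueryDisposition.WRITE_APPEND))
{code}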

> BigQuery Sink should use the Load API
> -
>
> Key: BEAM-2673
> URL: https://issues.apache.org/jira/browse/BEAM-2673
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Sourabh Bajaj
>Assignee: Ahmet Altay
>
> Currently the BigQuery sink is written to using the streaming API in the 
> direct runner. Instead, we should just use the load API and also simplify the 
> management of the different create and write dispositions.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-2673) BigQuery Sink should use the Load API

2017-07-24 Thread Sourabh Bajaj (JIRA)
Sourabh Bajaj created BEAM-2673:
---

 Summary: BigQuery Sink should use the Load API
 Key: BEAM-2673
 URL: https://issues.apache.org/jira/browse/BEAM-2673
 Project: Beam
  Issue Type: Bug
  Components: sdk-py
Reporter: Sourabh Bajaj
Assignee: Ahmet Altay


Currently the BigQuery sink is written to using the streaming API in the 
direct runner. Instead, we should just use the load API and also simplify the 
management of the different create and write dispositions.
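
As a hedged sketch of the proposed direction: in SDK releases later than the 
one under discussion here, {{WriteToBigQuery}} gained a {{method}} parameter 
that routes writes through load jobs rather than streaming inserts (the table 
name below is hypothetical):

{code:python}
import apache_beam as beam

with beam.Pipeline() as p:
  _ = (
      p
      | beam.Create([{'word': 'beam', 'count': 1}])
      | beam.io.WriteToBigQuery(
          'my-project:my_dataset.word_counts',  # hypothetical table
          schema='word:STRING,count:INTEGER',
          # FILE_LOADS stages files and issues a BigQuery load job
          # instead of using the streaming inserts API.
          method=beam.io.WriteToBigQuery.Method.FILE_LOADS))
{code}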



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2636) user_score on DataflowRunner is broken

2017-07-19 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16093792#comment-16093792
 ] 

Sourabh Bajaj commented on BEAM-2636:
-

This can be resolved now.

> user_score on DataflowRunner is broken
> --
>
> Key: BEAM-2636
> URL: https://issues.apache.org/jira/browse/BEAM-2636
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Affects Versions: 2.1.0
>Reporter: Ahmet Altay
>Assignee: Sourabh Bajaj
>
> UserScore has a custom transform named {{WriteToBigQuery}}; the Dataflow 
> runner has special code handling transforms with that name, and this will 
> break for all user transforms that have this name (see the sketch below).
> We can either:
> - Handle this correctly
> - Or document this as a reserved keyword and change the example.
> cc: [~chamikara]
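
A minimal sketch of the clash described above; the transform body is 
hypothetical, not the example's actual code:

{code:python}
import apache_beam as beam

class WriteToBigQuery(beam.PTransform):
  """A user transform whose name collides with the runner's special-cased one."""

  def expand(self, pcoll):
    # Stand-in body; the real example formats rows and writes them to BigQuery.
    return pcoll | beam.Map(lambda row: row)
{code}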



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2271) Release guide or pom.xml needs update to avoid releasing Python binary artifacts

2017-07-17 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16090671#comment-16090671
 ] 

Sourabh Bajaj commented on BEAM-2271:
-

https://github.com/apache/beam/pull/3441 creates a new zip which omits the 
{{.tox}} files etc., but I haven't been able to figure out how to override the 
actual source release file that is being created, as I only see one entry for 
the execution.

[~jbonofre] do you have any ideas on what I might be doing wrong in the PR?

> Release guide or pom.xml needs update to avoid releasing Python binary 
> artifacts
> 
>
> Key: BEAM-2271
> URL: https://issues.apache.org/jira/browse/BEAM-2271
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Daniel Halperin
>Assignee: Sourabh Bajaj
> Fix For: 2.2.0
>
>
> The following directories (and children) were discovered in 2.0.0-RC2 and 
> were present in 0.6.0.
> {code}
> sdks/python: build  dist  .eggs  nose-1.3.7-py2.7.egg  (and child contents)
> {code}
> Ideally, these artifacts, which are created during setup and testing, would 
> get created in the {{sdks/python/target/}} subfolder where they will 
> automatically get ignored. More info below.
> For 2.0.0, we will manually remove these files from the source release RC3+. 
> This should be fixed before the next release.
> Here is a list of other paths that get excluded, should they be useful.
> {code}
> <excludes>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/).*${project.build.directory}.*]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?maven-eclipse\.xml]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.project]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.classpath]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?[^/]*\.iws]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.idea(/.*)?]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?out(/.*)?]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?[^/]*\.ipr]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?[^/]*\.iml]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.settings(/.*)?]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.externalToolBuilders(/.*)?]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.deployables(/.*)?]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.wtpmodules(/.*)?]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?cobertura\.ser]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?pom\.xml\.releaseBackup]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?release\.properties]</exclude>
> </excludes>
> {code}
> This list is stored inside of this jar, which you can find by tracking 
> maven-assembly-plugin from the root apache pom: 
> https://mvnrepository.com/artifact/org.apache.apache.resources/apache-source-release-assembly-descriptor/1.0.6
> http://svn.apache.org/repos/asf/maven/pom/tags/apache-18/pom.xml



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (BEAM-2271) Release guide or pom.xml needs update to avoid releasing Python binary artifacts

2017-07-17 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj updated BEAM-2271:

Fix Version/s: (was: 2.1.0)
   2.2.0

> Release guide or pom.xml needs update to avoid releasing Python binary 
> artifacts
> 
>
> Key: BEAM-2271
> URL: https://issues.apache.org/jira/browse/BEAM-2271
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Daniel Halperin
>Assignee: Sourabh Bajaj
> Fix For: 2.2.0
>
>
> The following directories (and children) were discovered in 2.0.0-RC2 and 
> were present in 0.6.0.
> {code}
> sdks/python: build  dist  .eggs  nose-1.3.7-py2.7.egg  (and child contents)
> {code}
> Ideally, these artifacts, which are created during setup and testing, would 
> get created in the {{sdks/python/target/}} subfolder where they will 
> automatically get ignored. More info below.
> For 2.0.0, we will manually remove these files from the source release RC3+. 
> This should be fixed before the next release.
> Here is a list of other paths that get excluded, should they be useful.
> {code}
> <excludes>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/).*${project.build.directory}.*]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?maven-eclipse\.xml]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.project]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.classpath]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?[^/]*\.iws]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.idea(/.*)?]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?out(/.*)?]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?[^/]*\.ipr]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?[^/]*\.iml]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.settings(/.*)?]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.externalToolBuilders(/.*)?]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.deployables(/.*)?]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?\.wtpmodules(/.*)?]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?cobertura\.ser]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?pom\.xml\.releaseBackup]</exclude>
>   <exclude>%regex[(?!((?!${project.build.directory}/)[^/]+/)*src/)(.*/)?release\.properties]</exclude>
> </excludes>
> {code}
> This list is stored inside of this jar, which you can find by tracking 
> maven-assembly-plugin from the root apache pom: 
> https://mvnrepository.com/artifact/org.apache.apache.resources/apache-source-release-assembly-descriptor/1.0.6
> http://svn.apache.org/repos/asf/maven/pom/tags/apache-18/pom.xml



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2595) WriteToBigQuery does not work with nested json schema

2017-07-14 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16087712#comment-16087712
 ] 

Sourabh Bajaj commented on BEAM-2595:
-

[~andrea.pierleoni] can you verify that your pipeline works with the latest 
master?

> WriteToBigQuery does not work with nested json schema
> -
>
> Key: BEAM-2595
> URL: https://issues.apache.org/jira/browse/BEAM-2595
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Affects Versions: 2.1.0
> Environment: mac os local runner, Python
>Reporter: Andrea Pierleoni
>Assignee: Sourabh Bajaj
>Priority: Minor
>  Labels: gcp
> Fix For: 2.1.0
>
>
> I am trying to use the new `WriteToBigQuery` PTransform added to 
> `apache_beam.io.gcp.bigquery` in version 2.1.0-RC1
> I need to write to a bigquery table with nested fields.
> The only way to specify nested schemas in bigquery is with the json schema.
> None of the classes in `apache_beam.io.gcp.bigquery` are able to parse the 
> json schema, but they accept a schema as an instance of the class 
> `apache_beam.io.gcp.internal.clients.bigquery.TableFieldSchema`
> I am composing the `TableFieldSchema` as suggested here 
> [https://stackoverflow.com/questions/36127537/json-table-schema-to-bigquery-tableschema-for-bigquerysink/45039436#45039436],
>  and it looks fine when passed to the PTransform `WriteToBigQuery`. 
> The problem is that the base class `PTransformWithSideInputs` tries to pickle 
> and unpickle the function 
> [https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/ptransform.py#L515]
>  (which includes the TableFieldSchema instance), and for some reason when the 
> class is unpickled some `FieldList` instances are converted to simple lists, 
> and the pickling validation fails.
> Would it be possible to extend the test coverage to nested json objects for 
> bigquery?
> They are also relatively easy to parse into a TableFieldSchema.
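
For reference, a minimal sketch of composing a nested schema by hand, along 
the lines of the StackOverflow answer linked above (field names are 
hypothetical):

{code:python}
from apache_beam.io.gcp.internal.clients import bigquery

# One nested (RECORD) field containing a leaf field.
city = bigquery.TableFieldSchema(name='city', type='STRING', mode='NULLABLE')
address = bigquery.TableFieldSchema(
    name='address', type='RECORD', mode='NULLABLE', fields=[city])
schema = bigquery.TableSchema(fields=[address])
{code}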



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (BEAM-2572) Implement an S3 filesystem for Python SDK

2017-07-11 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16082887#comment-16082887
 ] 

Sourabh Bajaj edited comment on BEAM-2572 at 7/11/17 8:16 PM:
--

I think overall this plan makes sense, but step 4's difficulty is unclear, as 
it depends on the runners propagating the pipeline options to all the workers 
correctly. It would be great to start a thread on the dev mailing list around 
this, because we don't have a very clear story for credential passing to 
PTransforms yet.


was (Author: sb2nov):
I think overall this plan makes sense, but step 4's difficulty is unclear, as 
it depends on the runners propagating the pipeline options to all the workers 
correctly. It would be great to start a thread around this, because we don't 
have a very clear story for credential passing to PTransforms yet.
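
A hedged sketch of what step 4 could look like: credentials surfaced as custom 
pipeline options that a runner would then have to propagate to every worker 
(the class and option names are hypothetical):

{code:python}
from apache_beam.options.pipeline_options import PipelineOptions

class S3Options(PipelineOptions):
  @classmethod
  def _add_argparse_args(cls, parser):
    # Hypothetical flags; a real design also needs a secure way to ship these.
    parser.add_argument('--s3_access_key_id', default=None)
    parser.add_argument('--s3_secret_access_key', default=None)
{code}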

> Implement an S3 filesystem for Python SDK
> -
>
> Key: BEAM-2572
> URL: https://issues.apache.org/jira/browse/BEAM-2572
> Project: Beam
>  Issue Type: Task
>  Components: sdk-py
>Reporter: Dmitry Demeshchuk
>Assignee: Ahmet Altay
>Priority: Minor
>
> There are two paths worth exploring, to my understanding:
> 1. Sticking to the HDFS-based approach (like it's done in Java).
> 2. Using boto/boto3 for accessing S3 through its common API endpoints.
> I personally prefer the second approach, for a few reasons:
> 1. In real life, HDFS and S3 have different consistency guarantees, therefore 
> their behaviors may contradict each other in some edge cases (say, we write 
> something to S3, but it's not immediately accessible for reading from another 
> end).
> 2. There are other AWS-based sources and sinks we may want to create in the 
> future: DynamoDB, Kinesis, SQS, etc.
> 3. boto3 already provides somewhat good logic for basic things like 
> reattempting.
> Whatever path we choose, there's another problem related to this: we 
> currently cannot pass any global settings (say, pipeline options, or just an 
> arbitrary kwarg) to a filesystem. Because of that, we'd have to set up the 
> runner nodes to have AWS keys set up in the environment, which is not trivial 
> to achieve and doesn't look too clean either (I'd rather see one single place 
> for configuring the runner options).
> Also, it's worth mentioning that I already have a janky S3 filesystem 
> implementation that only supports DirectRunner at the moment (because of the 
> previous paragraph). I'm perfectly fine finishing it myself, with some 
> guidance from the maintainers.
> Where should I move on from here, and whose input should I be looking for?
> Thanks!
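
As a tiny illustration of approach 2 above (bucket and key are hypothetical; 
credentials come from boto3's usual environment/config chain):

{code:python}
import boto3

# Read one object through S3's common API, as a filesystem implementation would.
s3 = boto3.client('s3')
body = s3.get_object(Bucket='my-bucket', Key='path/to/file.txt')['Body'].read()
{code}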



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2572) Implement an S3 filesystem for Python SDK

2017-07-11 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16082887#comment-16082887
 ] 

Sourabh Bajaj commented on BEAM-2572:
-

I think overall this plan makes sense, but step 4's difficulty is unclear, as 
it depends on the runners propagating the pipeline options to all the workers 
correctly. It would be great to start a thread around this, because we don't 
have a very clear story for credential passing to PTransforms yet.

> Implement an S3 filesystem for Python SDK
> -
>
> Key: BEAM-2572
> URL: https://issues.apache.org/jira/browse/BEAM-2572
> Project: Beam
>  Issue Type: Task
>  Components: sdk-py
>Reporter: Dmitry Demeshchuk
>Assignee: Ahmet Altay
>Priority: Minor
>
> There are two paths worth exploring, to my understanding:
> 1. Sticking to the HDFS-based approach (like it's done in Java).
> 2. Using boto/boto3 for accessing S3 through its common API endpoints.
> I personally prefer the second approach, for a few reasons:
> 1. In real life, HDFS and S3 have different consistency guarantees, therefore 
> their behaviors may contradict each other in some edge cases (say, we write 
> something to S3, but it's not immediately accessible for reading from another 
> end).
> 2. There are other AWS-based sources and sinks we may want to create in the 
> future: DynamoDB, Kinesis, SQS, etc.
> 3. boto3 already provides somewhat good logic for basic things like 
> reattempting.
> Whatever path we choose, there's another problem related to this: we 
> currently cannot pass any global settings (say, pipeline options, or just an 
> arbitrary kwarg) to a filesystem. Because of that, we'd have to set up the 
> runner nodes to have AWS keys set up in the environment, which is not trivial 
> to achieve and doesn't look too clean either (I'd rather see one single place 
> for configuring the runner options).
> Also, it's worth mentioning that I already have a janky S3 filesystem 
> implementation that only supports DirectRunner at the moment (because of the 
> previous paragraph). I'm perfectly fine finishing it myself, with some 
> guidance from the maintainers.
> Where should I move on from here, and whose input should I be looking for?
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (BEAM-2573) Better filesystem discovery mechanism in Python SDK

2017-07-10 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081145#comment-16081145
 ] 

Sourabh Bajaj edited comment on BEAM-2573 at 7/10/17 9:03 PM:
--

I think the worker is specified in 
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/internal/dependency.py#L81
 so it is related to the version of the sdk being used.

You don't need the release if you're on master: I can run the following 
command and see logs about failing to import the dataflow package on head, and 
success for the built-in plugins.

{code:none}
python -m apache_beam.examples.wordcount --output $BUCKET/wc/output --project 
$PROJECT --staging_location $BUCKET/wc/staging --temp_location $BUCKET/wc/tmp  
--job_name "sourabhbajaj-wc-4" --runner DataflowRunner --sdk_location 
dist/apache-beam-2.1.0.dev0.tar.gz  --beam_plugin dataflow.s3.aws.S3FS
{code}


{code:none}
13:54:41.574
Failed to import beam plugin dataflow.s3.aws.S3FS
13:54:41.573
Successfully imported beam plugin apache_beam.io.filesystem.FileSystem
13:54:41.573
Successfully imported beam plugin apache_beam.io.localfilesystem.LocalFileSystem
13:54:41.573
Successfully imported beam plugin apache_beam.io.gcp.gcsfilesystem.GCSFileSystem
{code}




was (Author: sb2nov):
I think the worker is specified in 
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/internal/dependency.py#L81
 so it is related to the version of the sdk being used.

I might be wrong about needing the release if you're on master, as I can run 
the following command and see logs about failing to import the dataflow package 
on head, and success for the built-in plugins.

{code:none}
python -m apache_beam.examples.wordcount --output $BUCKET/wc/output --project 
$PROJECT --staging_location $BUCKET/wc/staging --temp_location $BUCKET/wc/tmp  
--job_name "sourabhbajaj-wc-4" --runner DataflowRunner --sdk_location 
dist/apache-beam-2.1.0.dev0.tar.gz  --beam_plugin dataflow.s3.aws.S3FS
{code}


{code:none}
13:54:41.574
Failed to import beam plugin dataflow.s3.aws.S3FS
13:54:41.573
Successfully imported beam plugin apache_beam.io.filesystem.FileSystem
13:54:41.573
Successfully imported beam plugin apache_beam.io.localfilesystem.LocalFileSystem
13:54:41.573
Successfully imported beam plugin apache_beam.io.gcp.gcsfilesystem.GCSFileSystem
{code}



> Better filesystem discovery mechanism in Python SDK
> ---
>
> Key: BEAM-2573
> URL: https://issues.apache.org/jira/browse/BEAM-2573
> Project: Beam
>  Issue Type: Task
>  Components: runner-dataflow, sdk-py
>Affects Versions: 2.0.0
>Reporter: Dmitry Demeshchuk
>Priority: Minor
>
> It looks like right now custom filesystem classes have to be imported 
> explicitly: 
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystems.py#L30
> Seems like the current implementation doesn't allow discovering filesystems 
> that come from side packages, not from apache_beam itself. Even if I put a 
> custom FileSystem-inheriting class into a package and explicitly import it in 
> the root __init__.py of that package, it still doesn't make the class 
> discoverable.
> The problems I'm experiencing happen on Dataflow runner, while Direct runner 
> works just fine. Here's an example of Dataflow output:
> {code}
>   (320418708fe777d7): Traceback (most recent call last):
>   File 
> "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 
> 581, in do_work
> work_executor.execute()
>   File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", 
> line 166, in execute
> op.start()
>   File 
> "/usr/local/lib/python2.7/dist-packages/dataflow_worker/native_operations.py",
>  line 54, in start
> self.output(windowed_value)
>   File "dataflow_worker/operations.py", line 138, in 
> dataflow_worker.operations.Operation.output 
> (dataflow_worker/operations.c:5768)
> def output(self, windowed_value, output_index=0):
>   File "dataflow_worker/operations.py", line 139, in 
> dataflow_worker.operations.Operation.output 
> (dataflow_worker/operations.c:5654)
> cython.cast(Receiver, 
> self.receivers[output_index]).receive(windowed_value)
>   File "dataflow_worker/operations.py", line 72, in 
> dataflow_worker.operations.ConsumerSet.receive 
> (dataflow_worker/operations.c:3506)
> cython.cast(Operation, consumer).process(windowed_value)
>   File "dataflow_worker/operations.py", line 328, in 
> dataflow_worker.operations.DoOperation.process 
> (dataflow_worker/operations.c:11162)
> with self.scoped_process_state:
>   File "dataflow_worker/operations.py", line 329, in 
> dataflow_worker.operations.DoOperation.process 
> (dataflow_worker/operations.c:6)
> self.dofn_receiver.

[jira] [Comment Edited] (BEAM-2573) Better filesystem discovery mechanism in Python SDK

2017-07-10 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081145#comment-16081145
 ] 

Sourabh Bajaj edited comment on BEAM-2573 at 7/10/17 9:03 PM:
--

I think the worker is specified in 
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/internal/dependency.py#L81
 so it is related to the version of the sdk being used.

I might be wrong about needing the release if you're on master, as I can run 
the following command and see logs about failing to import the dataflow package 
on head, and success for the built-in plugins.

{code:none}
python -m apache_beam.examples.wordcount --output $BUCKET/wc/output --project 
$PROJECT --staging_location $BUCKET/wc/staging --temp_location $BUCKET/wc/tmp  
--job_name "sourabhbajaj-wc-4" --runner DataflowRunner --sdk_location 
dist/apache-beam-2.1.0.dev0.tar.gz  --beam_plugin dataflow.s3.aws.S3FS
{code}


{code:none}
13:54:41.574
Failed to import beam plugin dataflow.s3.aws.S3FS
13:54:41.573
Successfully imported beam plugin apache_beam.io.filesystem.FileSystem
13:54:41.573
Successfully imported beam plugin apache_beam.io.localfilesystem.LocalFileSystem
13:54:41.573
Successfully imported beam plugin apache_beam.io.gcp.gcsfilesystem.GCSFileSystem
{code}




was (Author: sb2nov):
I think the worker is specified in 
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/internal/dependency.py#L81
 so it is related to the version of the sdk being used.

I might be wrong about needing the release if you're on master, as I can run 
the following command and see logs about failing to import the dataflow package 
on head, and success for the built-in plugins.

{code:shell}
python -m apache_beam.examples.wordcount --output $BUCKET/wc/output --project 
$PROJECT --staging_location $BUCKET/wc/staging --temp_location $BUCKET/wc/tmp  
--job_name "sourabhbajaj-wc-4" --runner DataflowRunner --sdk_location 
dist/apache-beam-2.1.0.dev0.tar.gz  --beam_plugin dataflow.s3.aws.S3FS
{code}


> Better filesystem discovery mechanism in Python SDK
> ---
>
> Key: BEAM-2573
> URL: https://issues.apache.org/jira/browse/BEAM-2573
> Project: Beam
>  Issue Type: Task
>  Components: runner-dataflow, sdk-py
>Affects Versions: 2.0.0
>Reporter: Dmitry Demeshchuk
>Priority: Minor
>
> It looks like right now custom filesystem classes have to be imported 
> explicitly: 
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystems.py#L30
> Seems like the current implementation doesn't allow discovering filesystems 
> that come from side packages, not from apache_beam itself. Even if I put a 
> custom FileSystem-inheriting class into a package and explicitly import it in 
> the root __init__.py of that package, it still doesn't make the class 
> discoverable.
> The problems I'm experiencing happen on Dataflow runner, while Direct runner 
> works just fine. Here's an example of Dataflow output:
> {code}
>   (320418708fe777d7): Traceback (most recent call last):
>   File 
> "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 
> 581, in do_work
> work_executor.execute()
>   File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", 
> line 166, in execute
> op.start()
>   File 
> "/usr/local/lib/python2.7/dist-packages/dataflow_worker/native_operations.py",
>  line 54, in start
> self.output(windowed_value)
>   File "dataflow_worker/operations.py", line 138, in 
> dataflow_worker.operations.Operation.output 
> (dataflow_worker/operations.c:5768)
> def output(self, windowed_value, output_index=0):
>   File "dataflow_worker/operations.py", line 139, in 
> dataflow_worker.operations.Operation.output 
> (dataflow_worker/operations.c:5654)
> cython.cast(Receiver, 
> self.receivers[output_index]).receive(windowed_value)
>   File "dataflow_worker/operations.py", line 72, in 
> dataflow_worker.operations.ConsumerSet.receive 
> (dataflow_worker/operations.c:3506)
> cython.cast(Operation, consumer).process(windowed_value)
>   File "dataflow_worker/operations.py", line 328, in 
> dataflow_worker.operations.DoOperation.process 
> (dataflow_worker/operations.c:11162)
> with self.scoped_process_state:
>   File "dataflow_worker/operations.py", line 329, in 
> dataflow_worker.operations.DoOperation.process 
> (dataflow_worker/operations.c:6)
> self.dofn_receiver.receive(o)
>   File "apache_beam/runners/common.py", line 382, in 
> apache_beam.runners.common.DoFnRunner.receive 
> (apache_beam/runners/common.c:10156)
> self.process(windowed_value)
>   File "apache_beam/runners/common.py", line 390, in 
> apache_beam.runners.common.DoFnRunner.process 
> (apache_beam/runners/common.c:10458)
> 

[jira] [Commented] (BEAM-2573) Better filesystem discovery mechanism in Python SDK

2017-07-10 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16081145#comment-16081145
 ] 

Sourabh Bajaj commented on BEAM-2573:
-

I think the worker is specified in 
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/dataflow/internal/dependency.py#L81
 so it is related to the version of the sdk being used.

I might be wrong about needing the release if you're on master, as I can run 
the following command and see logs about failing to import the dataflow package 
on head, and success for the built-in plugins.

{code:shell}
python -m apache_beam.examples.wordcount --output $BUCKET/wc/output --project 
$PROJECT --staging_location $BUCKET/wc/staging --temp_location $BUCKET/wc/tmp  
--job_name "sourabhbajaj-wc-4" --runner DataflowRunner --sdk_location 
dist/apache-beam-2.1.0.dev0.tar.gz  --beam_plugin dataflow.s3.aws.S3FS
{code}


> Better filesystem discovery mechanism in Python SDK
> ---
>
> Key: BEAM-2573
> URL: https://issues.apache.org/jira/browse/BEAM-2573
> Project: Beam
>  Issue Type: Task
>  Components: runner-dataflow, sdk-py
>Affects Versions: 2.0.0
>Reporter: Dmitry Demeshchuk
>Priority: Minor
>
> It looks like right now custom filesystem classes have to be imported 
> explicitly: 
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystems.py#L30
> Seems like the current implementation doesn't allow discovering filesystems 
> that come from side packages, not from apache_beam itself. Even if I put a 
> custom FileSystem-inheriting class into a package and explicitly import it in 
> the root __init__.py of that package, it still doesn't make the class 
> discoverable.
> The problems I'm experiencing happen on Dataflow runner, while Direct runner 
> works just fine. Here's an example of Dataflow output:
> {code}
>   (320418708fe777d7): Traceback (most recent call last):
>   File 
> "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 
> 581, in do_work
> work_executor.execute()
>   File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", 
> line 166, in execute
> op.start()
>   File 
> "/usr/local/lib/python2.7/dist-packages/dataflow_worker/native_operations.py",
>  line 54, in start
> self.output(windowed_value)
>   File "dataflow_worker/operations.py", line 138, in 
> dataflow_worker.operations.Operation.output 
> (dataflow_worker/operations.c:5768)
> def output(self, windowed_value, output_index=0):
>   File "dataflow_worker/operations.py", line 139, in 
> dataflow_worker.operations.Operation.output 
> (dataflow_worker/operations.c:5654)
> cython.cast(Receiver, 
> self.receivers[output_index]).receive(windowed_value)
>   File "dataflow_worker/operations.py", line 72, in 
> dataflow_worker.operations.ConsumerSet.receive 
> (dataflow_worker/operations.c:3506)
> cython.cast(Operation, consumer).process(windowed_value)
>   File "dataflow_worker/operations.py", line 328, in 
> dataflow_worker.operations.DoOperation.process 
> (dataflow_worker/operations.c:11162)
> with self.scoped_process_state:
>   File "dataflow_worker/operations.py", line 329, in 
> dataflow_worker.operations.DoOperation.process 
> (dataflow_worker/operations.c:6)
> self.dofn_receiver.receive(o)
>   File "apache_beam/runners/common.py", line 382, in 
> apache_beam.runners.common.DoFnRunner.receive 
> (apache_beam/runners/common.c:10156)
> self.process(windowed_value)
>   File "apache_beam/runners/common.py", line 390, in 
> apache_beam.runners.common.DoFnRunner.process 
> (apache_beam/runners/common.c:10458)
> self._reraise_augmented(exn)
>   File "apache_beam/runners/common.py", line 431, in 
> apache_beam.runners.common.DoFnRunner._reraise_augmented 
> (apache_beam/runners/common.c:11673)
> raise new_exn, None, original_traceback
>   File "apache_beam/runners/common.py", line 388, in 
> apache_beam.runners.common.DoFnRunner.process 
> (apache_beam/runners/common.c:10371)
> self.do_fn_invoker.invoke_process(windowed_value)
>   File "apache_beam/runners/common.py", line 281, in 
> apache_beam.runners.common.PerWindowInvoker.invoke_process 
> (apache_beam/runners/common.c:8626)
> self._invoke_per_window(windowed_value)
>   File "apache_beam/runners/common.py", line 307, in 
> apache_beam.runners.common.PerWindowInvoker._invoke_per_window 
> (apache_beam/runners/common.c:9302)
> windowed_value, self.process_method(*args_for_process))
>   File 
> "/Users/dmitrydemeshchuk/miniconda2/lib/python2.7/site-packages/apache_beam/transforms/core.py",
>  line 749, in 
>   File 
> "/Users/dmitrydemeshchuk/miniconda2/lib/python2.7/site-packages/apache_beam/io/iobase.py",
>  line 891, in 
>   File 
> "/usr/local/lib/python2.7/dist-packages/apache_b

[jira] [Commented] (BEAM-2573) Better filesystem discovery mechanism in Python SDK

2017-07-10 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16080974#comment-16080974
 ] 

Sourabh Bajaj commented on BEAM-2573:
-

I think the plugins are working correctly if they are passing a list of class 
names to be imported at the start. You might need to wait for the next release, 
as this required a change to the Dataflow workers: they need to start importing 
the paths specified in the beam-plugins list. There is a release going on right 
now, so that might happen in the next few days.

I am not sure about the crash loop in windmillio; [~charleschen] might know 
more.
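
A small sketch of that transport mechanism, assuming the {{--beam_plugin}} 
flag and the {{SetupOptions}} view behave as in current SDKs; the S3FS path is 
the hypothetical one from this thread:

{code:python}
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

opts = PipelineOptions(['--beam_plugin', 'dataflow.s3.aws.S3FS'])
# The worker reads this list and imports each path before running the job.
print(opts.view_as(SetupOptions).beam_plugins)
{code}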



> Better filesystem discovery mechanism in Python SDK
> ---
>
> Key: BEAM-2573
> URL: https://issues.apache.org/jira/browse/BEAM-2573
> Project: Beam
>  Issue Type: Task
>  Components: runner-dataflow, sdk-py
>Affects Versions: 2.0.0
>Reporter: Dmitry Demeshchuk
>Priority: Minor
>
> It looks like right now custom filesystem classes have to be imported 
> explicitly: 
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystems.py#L30
> Seems like the current implementation doesn't allow discovering filesystems 
> that come from side packages, not from apache_beam itself. Even if I put a 
> custom FileSystem-inheriting class into a package and explicitly import it in 
> the root __init__.py of that package, it still doesn't make the class 
> discoverable.
> The problems I'm experiencing happen on Dataflow runner, while Direct runner 
> works just fine. Here's an example of Dataflow output:
> {code}
>   (320418708fe777d7): Traceback (most recent call last):
>   File 
> "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 
> 581, in do_work
> work_executor.execute()
>   File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", 
> line 166, in execute
> op.start()
>   File 
> "/usr/local/lib/python2.7/dist-packages/dataflow_worker/native_operations.py",
>  line 54, in start
> self.output(windowed_value)
>   File "dataflow_worker/operations.py", line 138, in 
> dataflow_worker.operations.Operation.output 
> (dataflow_worker/operations.c:5768)
> def output(self, windowed_value, output_index=0):
>   File "dataflow_worker/operations.py", line 139, in 
> dataflow_worker.operations.Operation.output 
> (dataflow_worker/operations.c:5654)
> cython.cast(Receiver, 
> self.receivers[output_index]).receive(windowed_value)
>   File "dataflow_worker/operations.py", line 72, in 
> dataflow_worker.operations.ConsumerSet.receive 
> (dataflow_worker/operations.c:3506)
> cython.cast(Operation, consumer).process(windowed_value)
>   File "dataflow_worker/operations.py", line 328, in 
> dataflow_worker.operations.DoOperation.process 
> (dataflow_worker/operations.c:11162)
> with self.scoped_process_state:
>   File "dataflow_worker/operations.py", line 329, in 
> dataflow_worker.operations.DoOperation.process 
> (dataflow_worker/operations.c:6)
> self.dofn_receiver.receive(o)
>   File "apache_beam/runners/common.py", line 382, in 
> apache_beam.runners.common.DoFnRunner.receive 
> (apache_beam/runners/common.c:10156)
> self.process(windowed_value)
>   File "apache_beam/runners/common.py", line 390, in 
> apache_beam.runners.common.DoFnRunner.process 
> (apache_beam/runners/common.c:10458)
> self._reraise_augmented(exn)
>   File "apache_beam/runners/common.py", line 431, in 
> apache_beam.runners.common.DoFnRunner._reraise_augmented 
> (apache_beam/runners/common.c:11673)
> raise new_exn, None, original_traceback
>   File "apache_beam/runners/common.py", line 388, in 
> apache_beam.runners.common.DoFnRunner.process 
> (apache_beam/runners/common.c:10371)
> self.do_fn_invoker.invoke_process(windowed_value)
>   File "apache_beam/runners/common.py", line 281, in 
> apache_beam.runners.common.PerWindowInvoker.invoke_process 
> (apache_beam/runners/common.c:8626)
> self._invoke_per_window(windowed_value)
>   File "apache_beam/runners/common.py", line 307, in 
> apache_beam.runners.common.PerWindowInvoker._invoke_per_window 
> (apache_beam/runners/common.c:9302)
> windowed_value, self.process_method(*args_for_process))
>   File 
> "/Users/dmitrydemeshchuk/miniconda2/lib/python2.7/site-packages/apache_beam/transforms/core.py",
>  line 749, in 
>   File 
> "/Users/dmitrydemeshchuk/miniconda2/lib/python2.7/site-packages/apache_beam/io/iobase.py",
>  line 891, in 
>   File 
> "/usr/local/lib/python2.7/dist-packages/apache_beam/options/value_provider.py",
>  line 109, in _f
> return fnc(self, *args, **kwargs)
>   File 
> "/usr/local/lib/python2.7/dist-packages/apache_beam/io/filebasedsink.py", 
> line 146, in initialize_write
> tmp_dir = self._create_temp_dir(file_path_prefix)
> 

[jira] [Commented] (BEAM-2573) Better filesystem discovery mechanism in Python SDK

2017-07-09 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16079848#comment-16079848
 ] 

Sourabh Bajaj commented on BEAM-2573:
-

[~chamikara]
The issue with the register approach is that it needs to run in the Python 
process of the worker, so something from pipeline construction needs to tell 
the worker process to import these files; we use beam plugins as a transport 
mechanism for that. 
If you're doing register at pipeline construction you'll have to import the 
class anyway, which makes the registration somewhat pointless, as the import 
itself would make the FS available to the subclasses module.

[~demeshchuk] Can you try importing dataflow in your pipeline submission 
process and see if that makes the plugin available? You can just look at the 
pipeline options to see if it worked correctly, as it should try to pass the 
full S3FS path in the --beam-plugins option.

Ideally, I think the best way to do this would have been to make the Path 
FS-specific, so that ReadFromText would take in a specialized S3Path object, 
but that is a really large breaking change, so we have to live with this.
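
For context, a minimal sketch of the kind of side-package filesystem under 
discussion; the scheme and class name are hypothetical, and a real 
implementation must fill in the remaining abstract methods:

{code:python}
from apache_beam.io.filesystem import FileSystem

class S3FileSystem(FileSystem):
  """Hypothetical side-package filesystem; only the scheme hook is shown."""

  @classmethod
  def scheme(cls):
    return 's3'
  # match(), create(), open(), delete(), etc. are omitted for brevity.
{code}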

> Better filesystem discovery mechanism in Python SDK
> ---
>
> Key: BEAM-2573
> URL: https://issues.apache.org/jira/browse/BEAM-2573
> Project: Beam
>  Issue Type: Task
>  Components: runner-dataflow, sdk-py
>Affects Versions: 2.0.0
>Reporter: Dmitry Demeshchuk
>Priority: Minor
>
> It looks like right now custom filesystem classes have to be imported 
> explicitly: 
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filesystems.py#L30
> Seems like the current implementation doesn't allow discovering filesystems 
> that come from side packages, not from apache_beam itself. Even if I put a 
> custom FileSystem-inheriting class into a package and explicitly import it in 
> the root __init__.py of that package, it still doesn't make the class 
> discoverable.
> The problems I'm experiencing happen on Dataflow runner, while Direct runner 
> works just fine. Here's an example of Dataflow output:
> {code}
>   (320418708fe777d7): Traceback (most recent call last):
>   File 
> "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 
> 581, in do_work
> work_executor.execute()
>   File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", 
> line 166, in execute
> op.start()
>   File 
> "/usr/local/lib/python2.7/dist-packages/dataflow_worker/native_operations.py",
>  line 54, in start
> self.output(windowed_value)
>   File "dataflow_worker/operations.py", line 138, in 
> dataflow_worker.operations.Operation.output 
> (dataflow_worker/operations.c:5768)
> def output(self, windowed_value, output_index=0):
>   File "dataflow_worker/operations.py", line 139, in 
> dataflow_worker.operations.Operation.output 
> (dataflow_worker/operations.c:5654)
> cython.cast(Receiver, 
> self.receivers[output_index]).receive(windowed_value)
>   File "dataflow_worker/operations.py", line 72, in 
> dataflow_worker.operations.ConsumerSet.receive 
> (dataflow_worker/operations.c:3506)
> cython.cast(Operation, consumer).process(windowed_value)
>   File "dataflow_worker/operations.py", line 328, in 
> dataflow_worker.operations.DoOperation.process 
> (dataflow_worker/operations.c:11162)
> with self.scoped_process_state:
>   File "dataflow_worker/operations.py", line 329, in 
> dataflow_worker.operations.DoOperation.process 
> (dataflow_worker/operations.c:6)
> self.dofn_receiver.receive(o)
>   File "apache_beam/runners/common.py", line 382, in 
> apache_beam.runners.common.DoFnRunner.receive 
> (apache_beam/runners/common.c:10156)
> self.process(windowed_value)
>   File "apache_beam/runners/common.py", line 390, in 
> apache_beam.runners.common.DoFnRunner.process 
> (apache_beam/runners/common.c:10458)
> self._reraise_augmented(exn)
>   File "apache_beam/runners/common.py", line 431, in 
> apache_beam.runners.common.DoFnRunner._reraise_augmented 
> (apache_beam/runners/common.c:11673)
> raise new_exn, None, original_traceback
>   File "apache_beam/runners/common.py", line 388, in 
> apache_beam.runners.common.DoFnRunner.process 
> (apache_beam/runners/common.c:10371)
> self.do_fn_invoker.invoke_process(windowed_value)
>   File "apache_beam/runners/common.py", line 281, in 
> apache_beam.runners.common.PerWindowInvoker.invoke_process 
> (apache_beam/runners/common.c:8626)
> self._invoke_per_window(windowed_value)
>   File "apache_beam/runners/common.py", line 307, in 
> apache_beam.runners.common.PerWindowInvoker._invoke_per_window 
> (apache_beam/runners/common.c:9302)
> windowed_value, self.process_method(*args_for_process))
>   File 
> "/Users/dmitrydemeshchuk/minicon

[jira] [Commented] (BEAM-2572) Implement an S3 filesystem for Python SDK

2017-07-09 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16079846#comment-16079846
 ] 

Sourabh Bajaj commented on BEAM-2572:
-

+1 to not changing the signature of the ReadFromText.


> Implement an S3 filesystem for Python SDK
> -
>
> Key: BEAM-2572
> URL: https://issues.apache.org/jira/browse/BEAM-2572
> Project: Beam
>  Issue Type: Task
>  Components: sdk-py
>Reporter: Dmitry Demeshchuk
>Assignee: Ahmet Altay
>Priority: Minor
>
> There are two paths worth exploring, to my understanding:
> 1. Sticking to the HDFS-based approach (like it's done in Java).
> 2. Using boto/boto3 for accessing S3 through its common API endpoints.
> I personally prefer the second approach, for a few reasons:
> 1. In real life, HDFS and S3 have different consistency guarantees, therefore 
> their behaviors may contradict each other in some edge cases (say, we write 
> something to S3, but it's not immediately accessible for reading from another 
> end).
> 2. There are other AWS-based sources and sinks we may want to create in the 
> future: DynamoDB, Kinesis, SQS, etc.
> 3. boto3 already provides somewhat good logic for basic things like 
> reattempting.
> Whatever path we choose, there's another problem related to this: we 
> currently cannot pass any global settings (say, pipeline options, or just an 
> arbitrary kwarg) to a filesystem. Because of that, we'd have to setup the 
> runner nodes to have AWS keys set up in the environment, which is not trivial 
> to achieve and doesn't look too clean either (I'd rather see one single place 
> for configuring the runner options).
> Also, it's worth mentioning that I already have a janky S3 filesystem 
> implementation that only supports DirectRunner at the moment (because of the 
> previous paragraph). I'm perfectly fine finishing it myself, with some 
> guidance from the maintainers.
> Where should I move on from here, and whose input should I be looking for?
> Thanks!



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (BEAM-2437) quickstart.py docs is missing the path to MANIFEST.in

2017-06-26 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063939#comment-16063939
 ] 

Sourabh Bajaj commented on BEAM-2437:
-

This can be closed now.

> quickstart.py docs is missing the path to MANIFEST.in
> -
>
> Key: BEAM-2437
> URL: https://issues.apache.org/jira/browse/BEAM-2437
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Jonathan Bingham
>Assignee: Sourabh Bajaj
>Priority: Minor
>  Labels: easyfix
> Fix For: 2.1.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> SUMMARY
> The wordcount example in quickstart-py does not work with the sample code 
> without modification.
> OBSERVED
> Copy-pasting from the doc page doesn't work:
> python -m apache_beam.examples.wordcount --input MANIFEST.in --output counts
> Error message: IOError: No files found based on the file pattern MANIFEST.in
> EXPECTED
> The example should tell me to set the path to MANIFEST.in, or give a 
> pseudo-path into which I can substitute the right path prefix.
> python -m apache_beam.examples.wordcount --input 
> /[path-to-git-clone-dir]/beam/sdks/python/MANIFEST.in --output counts



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (BEAM-2437) quickstart.py docs is missing the path to MANIFEST.in

2017-06-26 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2437?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj reassigned BEAM-2437:
---

Assignee: Sourabh Bajaj  (was: Ahmet Altay)

> quickstart.py docs is missing the path to MANIFEST.in
> -
>
> Key: BEAM-2437
> URL: https://issues.apache.org/jira/browse/BEAM-2437
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Jonathan Bingham
>Assignee: Sourabh Bajaj
>Priority: Minor
>  Labels: easyfix
> Fix For: 2.1.0
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> SUMMARY
> The wordcount example in quickstart-py does not work with the sample code 
> without modification.
> OBSERVED
> Copy-pasting from the doc page doesn't work:
> python -m apache_beam.examples.wordcount --input MANIFEST.in --output counts
> Error message: IOError: No files found based on the file pattern MANIFEST.in
> EXPECTED
> The example should tell me to set the path to MANIFEST.in, or give a 
> pseudo-path into which I can substitute the right path prefix.
> python -m apache_beam.examples.wordcount --input 
> /[path-to-git-clone-dir]/beam/sdks/python/MANIFEST.in --output counts



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (BEAM-1585) Ability to add new file systems to beamFS in the python sdk

2017-06-15 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj resolved BEAM-1585.
-
Resolution: Fixed

With the BeamPlugin system, we can now add additional filesystems and other 
plugins.
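
A quick sketch of the discovery side, assuming {{BeamPlugin}}'s class-tracking 
helpers work as in current SDKs: once a module defining a subclass is imported, 
the plugin (including any FileSystem, which subclasses BeamPlugin) becomes 
visible.

{code:python}
from apache_beam.utils.plugin import BeamPlugin

# Prints the fully qualified class paths of every loaded plugin subclass.
print(BeamPlugin.get_all_plugin_paths())
{code}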

> Ability to add new file systems to beamFS in the python sdk
> ---
>
> Key: BEAM-1585
> URL: https://issues.apache.org/jira/browse/BEAM-1585
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py
>Reporter: Sourabh Bajaj
>Assignee: Sourabh Bajaj
> Fix For: Not applicable
>
>
> BEAM-1441 implements the new BeamFileSystem in the python SDK but currently 
> lacks the ability to add user implemented file systems.
> This needs to be executed in the worker so should be packaged correctly with 
> the pipeline code. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Resolved] (BEAM-2405) Create a BigQuery sink for streaming using PTransform

2017-06-12 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2405?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj resolved BEAM-2405.
-
   Resolution: Fixed
Fix Version/s: 2.1.0

> Create a BigQuery sink for streaming using PTransform
> -
>
> Key: BEAM-2405
> URL: https://issues.apache.org/jira/browse/BEAM-2405
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Sourabh Bajaj
>Assignee: Sourabh Bajaj
> Fix For: 2.1.0
>
>




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (BEAM-2419) Refactor dependency.py as the types are inconsistently passed in the filecopy code

2017-06-06 Thread Sourabh Bajaj (JIRA)
Sourabh Bajaj created BEAM-2419:
---

 Summary: Refactor dependency.py as the types are inconsistently 
passed in the filecopy code
 Key: BEAM-2419
 URL: https://issues.apache.org/jira/browse/BEAM-2419
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py
Reporter: Sourabh Bajaj






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (BEAM-2405) Create a BigQuery sink for streaming using PTransform

2017-06-02 Thread Sourabh Bajaj (JIRA)
Sourabh Bajaj created BEAM-2405:
---

 Summary: Create a BigQuery sink for streaming using PTransform
 Key: BEAM-2405
 URL: https://issues.apache.org/jira/browse/BEAM-2405
 Project: Beam
  Issue Type: Bug
  Components: sdk-py
Reporter: Sourabh Bajaj
Assignee: Sourabh Bajaj






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-2338) GCS filepattern wildcard broken in Python SDK

2017-05-23 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16022292#comment-16022292
 ] 

Sourabh Bajaj commented on BEAM-2338:
-

This can be resolved now.

> GCS filepattern wildcard broken in Python SDK
> -
>
> Key: BEAM-2338
> URL: https://issues.apache.org/jira/browse/BEAM-2338
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model
>Affects Versions: 2.0.0
>Reporter: Vilhelm von Ehrenheim
>Assignee: Sourabh Bajaj
> Fix For: 2.1.0
>
>
> Validation of file patterns containing a wildcard (`*`) in GCS directories 
> does not always work. 
> Some kinds of patterns generate an error from here during validation:
> https://github.com/apache/beam/blob/v2.0.0/sdks/python/apache_beam/io/filebasedsource.py#L168
> I've tried a few different FileSystems match commands, which confuse me a bit.
> Full path works:
> {noformat}
> >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF'],
> >>>  limits=[1])[0].metadata_list
> [FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF,
>  74721736)]
> {noformat}
> Glob star on directory does not
> {noformat}
> >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/*/LC80440342016259LGN00_B1.TIF'],
> >>>  limits=[1])[0].metadata_list
> []
> {noformat}
> If we add a star at the file level, searching only for TIF files, it works 
> (although we match a different file, that is fine)
> {noformat}
> >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/*/*.TIF'],
> >>>  limits=[1])[0].metadata_list
> [FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342013106LGN01/LC80440342013106LGN01_B1.TIF,
>  65862791)]
> {noformat}
> OK, here comes the even stranger case: 
> looking for the same file we found with the pattern above, but with a star on 
> the dir, we find it!
> {noformat}
> >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/*/LC80440342013106LGN01_B1.TIF'],
> >>>  limits=[1])[0].metadata_list
> [FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342013106LGN01/LC80440342013106LGN01_B1.TIF,
>  65862791)]
> {noformat}
> Also looking at the first case again we will match if the star is placed late 
> enough in the pattern to make the directory unique.
> {noformat}
> >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN*/LC80440342016259LGN00_B1.TIF'],
> >>>  limits=[1])[0].metadata_list
> [FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF,
>  74721736)]
> {noformat}
> but not if further up in the name
> {noformat}
> >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/LC8044034201*/LC80440342016259LGN00_B1.TIF'],
> >>>  limits=[1])[0].metadata_list
> []
> {noformat}
> My guess is that some folders are dropped from the list of matched 
> directories, or something similar, which is a bit concerning. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-2338) GCS filepattern wildcard broken in Python SDK

2017-05-23 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16021677#comment-16021677
 ] 

Sourabh Bajaj commented on BEAM-2338:
-

You can work around this issue by passing "validate=False" to the source; that 
will prevent the pipeline from hitting this bug.
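
Concretely, the workaround looks like this, a sketch using one of the patterns 
from the report:

{code:python}
import apache_beam as beam

p = beam.Pipeline()
lines = p | beam.io.ReadFromText(
    'gs://gcp-public-data-landsat/LC08/PRE/044/034/*/LC80440342016259LGN00_B1.TIF',
    validate=False)  # skip construction-time pattern validation
{code}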

> GCS filepattern wildcard broken in Python SDK
> -
>
> Key: BEAM-2338
> URL: https://issues.apache.org/jira/browse/BEAM-2338
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model
>Affects Versions: 2.0.0
>Reporter: Vilhelm von Ehrenheim
>Assignee: Sourabh Bajaj
> Fix For: 2.1.0
>
>
> Validation of file patterns containing a wildcard (`*`) in GCS directories 
> does not always work. 
> Some kinds of patterns generate an error from here during validation:
> https://github.com/apache/beam/blob/v2.0.0/sdks/python/apache_beam/io/filebasedsource.py#L168
> I've tried a few different FileSystems match commands, which confuse me a bit.
> Full path works:
> {noformat}
> >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF'],
> >>>  limits=[1])[0].metadata_list
> [FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF,
>  74721736)]
> {noformat}
> Glob star on directory does not
> {noformat}
> >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/*/LC80440342016259LGN00_B1.TIF'],
> >>>  limits=[1])[0].metadata_list
> []
> {noformat}
> If we add a star at the file level, searching only for TIF files, it works
> (although we match a different file, that is fine)
> {noformat}
> >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/*/*.TIF'],
> >>>  limits=[1])[0].metadata_list
> [FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342013106LGN01/LC80440342013106LGN01_B1.TIF,
>  65862791)]
> {noformat}
> OK, here comes the even stranger case.
> Looking for the same file we found above, but with a star on
> the directory, we find it!
> {noformat}
> >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/*/LC80440342013106LGN01_B1.TIF'],
> >>>  limits=[1])[0].metadata_list
> [FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342013106LGN01/LC80440342013106LGN01_B1.TIF,
>  65862791)]
> {noformat}
> Also, looking at the first case again, we will match if the star is placed late
> enough in the pattern to make the directory unique.
> {noformat}
> >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN*/LC80440342016259LGN00_B1.TIF'],
> >>>  limits=[1])[0].metadata_list
> [FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF,
>  74721736)]
> {noformat}
> but not if the star is placed further up in the name
> {noformat}
> >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/LC8044034201*/LC80440342016259LGN00_B1.TIF'],
> >>>  limits=[1])[0].metadata_list
> []
> {noformat}
> My guess is that some folders are dropped from the list of matched
> directories, which is a bit concerning.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1833) Restructure Python pipeline construction to better follow the Runner API

2017-05-22 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16020075#comment-16020075
 ] 

Sourabh Bajaj commented on BEAM-1833:
-

Un-assigning this as it is not a blocker for the Runner API.

> Restructure Python pipeline construction to better follow the Runner API
> 
>
> Key: BEAM-1833
> URL: https://issues.apache.org/jira/browse/BEAM-1833
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py
>Reporter: Robert Bradshaw
>Assignee: Sourabh Bajaj
>
> The most important part is removing the runner.apply overrides, but there are 
> also various other improvements (e.g. all inputs and outputs should be named).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (BEAM-1833) Restructure Python pipeline construction to better follow the Runner API

2017-05-22 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1833?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj reassigned BEAM-1833:
---

Assignee: (was: Sourabh Bajaj)

> Restructure Python pipeline construction to better follow the Runner API
> 
>
> Key: BEAM-1833
> URL: https://issues.apache.org/jira/browse/BEAM-1833
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py
>Reporter: Robert Bradshaw
>
> The most important part is removing the runner.apply overrides, but there are 
> also various other improvements (e.g. all inputs and outputs should be named).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (BEAM-2250) Remove FnHarness code from PyDocs

2017-05-10 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2250?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj resolved BEAM-2250.
-
Resolution: Fixed
  Assignee: Sourabh Bajaj  (was: Ahmet Altay)

> Remove FnHarness code from PyDocs
> -
>
> Key: BEAM-2250
> URL: https://issues.apache.org/jira/browse/BEAM-2250
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py
>Reporter: Sourabh Bajaj
>Assignee: Sourabh Bajaj
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (BEAM-2250) Remove FnHarness code from PyDocs

2017-05-10 Thread Sourabh Bajaj (JIRA)
Sourabh Bajaj created BEAM-2250:
---

 Summary: Remove FnHarness code from PyDocs
 Key: BEAM-2250
 URL: https://issues.apache.org/jira/browse/BEAM-2250
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py
Reporter: Sourabh Bajaj
Assignee: Ahmet Altay
 Fix For: 2.0.0






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1283) DoFn finishBundle should be required to specify the window for output

2017-05-09 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16003119#comment-16003119
 ] 

Sourabh Bajaj commented on BEAM-1283:
-

This can be closed now.

> DoFn finishBundle should be required to specify the window for output
> -
>
> Key: BEAM-1283
> URL: https://issues.apache.org/jira/browse/BEAM-1283
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model, sdk-java-core, sdk-py
>Reporter: Kenneth Knowles
>Assignee: Sourabh Bajaj
>  Labels: backward-incompatible
> Fix For: 2.0.0
>
>
> The spec is here in Javadoc: 
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/DoFn.java#L128
> "If invoked from {{@StartBundle}} or {{@FinishBundle}}, this will attempt to 
> use the {{WindowFn}} of the input {{PCollection}} to determine what windows 
> the element should be in, throwing an exception if the {{WindowFn}} attempts 
> to access any information about the input element. The output element will 
> have a timestamp of negative infinity."
> This is a collection of caveats that make this method not always technically 
> wrong, but quite a mess. Ideas that reasonable folks have suggested lately:
>  - The {{WindowFn}} cannot actually be applied because {{WindowFn}} is 
> allowed to see the element type. The spec just avoids this by limiting which 
> {{WindowFn}} can be used.
>  - There is no natural output timestamp, so it should always be provided. The 
> spec avoids this by specifying an arbitrary and fairly useless timestamp.
>  - If it is a merging {{WindowFn}} like sessions that has already been merged 
> then you'll just have a bogus proto window regardless of explicit timestamp 
> or not.
> The use cases for these methods are best addressed by state plus window 
> expiry callback, so we should revisit this spec and probably just wipe it.
> There are some rare cases where you might need to output from {{FinishBundle}}
> in a way that is not _actually_ sensitive to bundling (perhaps modulo some 
> downstream notion of equivalence) in which case you had better know what 
> window you are outputting to. Often it should be the global window.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (BEAM-2206) Move pipeline options into separate package from beam/utils

2017-05-07 Thread Sourabh Bajaj (JIRA)
Sourabh Bajaj created BEAM-2206:
---

 Summary: Move pipeline options into separate package from 
beam/utils
 Key: BEAM-2206
 URL: https://issues.apache.org/jira/browse/BEAM-2206
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py
Reporter: Sourabh Bajaj
Assignee: Sourabh Bajaj
 Fix For: 2.0.0






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-2189) ImportError get_filesystem

2017-05-05 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15999311#comment-15999311
 ] 

Sourabh Bajaj commented on BEAM-2189:
-

This can be closed now

> ImportError get_filesystem
> --
>
> Key: BEAM-2189
> URL: https://issues.apache.org/jira/browse/BEAM-2189
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Ahmet Altay
>Assignee: Sourabh Bajaj
>Priority: Blocker
> Fix For: 2.0.0
>
>
> Dataflow tests are broken after https://github.com/apache/beam/pull/2807



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-2143) (Mis)Running Dataflow Wordcount gives non-helpful errors

2017-05-04 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15997542#comment-15997542
 ] 

Sourabh Bajaj commented on BEAM-2143:
-

Python is giving the right error here. [~vikasrk] do you want to take a stab at 
the Java part?

> (Mis)Running Dataflow Wordcount gives non-helpful errors
> 
>
> Key: BEAM-2143
> URL: https://issues.apache.org/jira/browse/BEAM-2143
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow, sdk-java-gcp
>Reporter: Ben Chambers
>Assignee: Sourabh Bajaj
>Priority: Blocker
> Fix For: First stable release
>
>
> If you run a pipeline and forget to specify `tempLocation` (but specify 
> something else, such as `stagingLocation`) you get two messages indicating 
> you forgot to specify `stagingLocation`. 
> One says "no stagingLocation specified, choosing ..." the other says "error, 
> the staging location isn't readable" (if you give it just a bucket and not an 
> object within a bucket).
> This is surprising to me as a user, since (1) I specified a staging location 
> and (2) the flag I actually need to modify is `--tempLocation`.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (BEAM-2143) (Mis)Running Dataflow Wordcount gives non-helpful errors

2017-05-04 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj updated BEAM-2143:

Component/s: sdk-java-gcp

> (Mis)Running Dataflow Wordcount gives non-helpful errors
> 
>
> Key: BEAM-2143
> URL: https://issues.apache.org/jira/browse/BEAM-2143
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow, sdk-java-gcp
>Reporter: Ben Chambers
>Assignee: Sourabh Bajaj
>Priority: Blocker
> Fix For: First stable release
>
>
> If you run a pipeline and forget to specify `tempLocation` (but specify 
> something else, such as `stagingLocation`) you get two messages indicating 
> you forgot to specify `stagingLocation`. 
> One says "no stagingLocation specified, choosing ..." the other says "error, 
> the staging location isn't readable" (if you give it just a bucket and not an 
> object within a bucket).
> This is surprising to me as a user, since (1) I specified a staging location 
> and (2) the flag I actually need to modify is `--tempLocation`.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-2143) (Mis)Running Dataflow Wordcount gives non-helpful errors

2017-05-04 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15997488#comment-15997488
 ] 

Sourabh Bajaj commented on BEAM-2143:
-

[~bchambers] do you see this in both SDKs?

> (Mis)Running Dataflow Wordcount gives non-helpful errors
> 
>
> Key: BEAM-2143
> URL: https://issues.apache.org/jira/browse/BEAM-2143
> Project: Beam
>  Issue Type: Bug
>  Components: runner-dataflow
>Reporter: Ben Chambers
>Assignee: Sourabh Bajaj
>Priority: Blocker
> Fix For: First stable release
>
>
> If you run a pipeline and forget to specify `tempLocation` (but specify 
> something else, such as `stagingLocation`) you get two messages indicating 
> you forgot to specify `stagingLocation`. 
> One says "no stagingLocation specified, choosing ..." the other says "error, 
> the staging location isn't readable" (if you give it just a bucket and not an 
> object within a bucket).
> This is surprising to me as a user, since (1) I specified a staging location 
> and (2) the flag I actually need to modify is `--tempLocation`.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (BEAM-2163) Remove the dependency on examples from ptransform_test

2017-05-03 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2163?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj reassigned BEAM-2163:
---

Assignee: (was: Ahmet Altay)

> Remove the dependency on examples from ptransform_test
> --
>
> Key: BEAM-2163
> URL: https://issues.apache.org/jira/browse/BEAM-2163
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py
>Reporter: Sourabh Bajaj
>  Labels: newbie
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/ptransform_test.py#L176
> This ValidatesRunner test depends on the Counting source snippet; the
> source should be copied here instead of keeping this dependency.
> The actual beam code should not depend on the examples package at all.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (BEAM-2163) Remove the dependency on examples from ptransform_test

2017-05-03 Thread Sourabh Bajaj (JIRA)
Sourabh Bajaj created BEAM-2163:
---

 Summary: Remove the dependency on examples from ptransform_test
 Key: BEAM-2163
 URL: https://issues.apache.org/jira/browse/BEAM-2163
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py
Reporter: Sourabh Bajaj
Assignee: Ahmet Altay


https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/ptransform_test.py#L176

This ValidatesRunner test depends on the Counting source snippet; the
source should be copied here instead of keeping this dependency.

The actual beam code should not depend on the examples package at all.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (BEAM-2152) Authentication fails if there is an unauthenticated gcloud tool even if application default credentials are available

2017-05-03 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2152?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj reassigned BEAM-2152:
---

 Assignee: Sourabh Bajaj
Fix Version/s: First stable release

> Authentication fails if there is an unauthenticated gcloud tool even if 
> application default credentials are available
> -
>
> Key: BEAM-2152
> URL: https://issues.apache.org/jira/browse/BEAM-2152
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Ahmet Altay
>Assignee: Sourabh Bajaj
> Fix For: First stable release
>
>
> On a machine that has valid application default credentials, authentication
> works if the {{gcloud}} tool is not installed. If the {{gcloud}} tool was recently
> installed but has not been authenticated yet, authentication fails with {{You do
> not currently have an active account selected.}}
> The authentication code should fall back to the default method in this case. (Or the
> {{gcloud}}-based authentication needs to be fully removed.)
> cc: [~lcwik] 
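> A minimal sketch of the suggested fallback (function names here are
> illustrative, not the SDK's actual auth API):
> {code}
> from oauth2client.client import GoogleCredentials
>
> def get_credentials():
>   try:
>     # Hypothetical gcloud-backed path that fails when gcloud is
>     # installed but not yet authenticated.
>     return _get_gcloud_credentials()
>   except Exception:
>     # Fall back to application default credentials instead of failing.
>     return GoogleCredentials.get_application_default()
> {code}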



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (BEAM-1954) "test" extra need nose in the requirements list

2017-05-02 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1954?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj updated BEAM-1954:

Labels: newbie  (was: )

> "test" extra need nose in the requirements list
> ---
>
> Key: BEAM-1954
> URL: https://issues.apache.org/jira/browse/BEAM-1954
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Affects Versions: 0.6.0
>Reporter: Marc Boissonneault
>Assignee: Ahmet Altay
>Priority: Minor
>  Labels: newbie
>
> When installing beam with pip:
> {code}
> pip install apache-beam[gcp,test]==0.6.0
> {code}
> And trying to use the TestPipeline:
> {code}
> from apache_beam.test_pipeline import TestPipeline
> {code}
> We receive the following error:
> {panel}
> from nose.plugins.skip import SkipTest
> ImportError: No module named nose.plugins.skip
> {panel}
> Once nose is installed everything works as expected
> See: https://github.com/apache/beam/blob/master/sdks/python/setup.py#L100-L102
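> A minimal sketch of the fix in setup.py (the exact extras layout in the
> repo may differ):
> {code}
> REQUIRED_TEST_PACKAGES = [
>     'nose>=1.3.7',  # needed by apache_beam.test_pipeline (SkipTest import)
>     ]
>
> setuptools.setup(
>     # ...,
>     extras_require={'test': REQUIRED_TEST_PACKAGES},
>     )
> {code}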



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (BEAM-1316) DoFn#startBundle should not be able to output

2017-05-02 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj updated BEAM-1316:

Component/s: (was: sdk-py)

> DoFn#startBundle should not be able to output
> -
>
> Key: BEAM-1316
> URL: https://issues.apache.org/jira/browse/BEAM-1316
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Thomas Groh
>Assignee: Thomas Groh
> Fix For: First stable release
>
>
> While within startBundle and finishBundle, the window in which elements are 
> output is not generally defined. Elements must always be output from within a 
> windowed context, or the {{WindowFn}} used by the {{PCollection}} may not 
> operate appropriately.
> startBundle and finishBundle are suitable for operational duties, similarly 
> to {{setup}} and {{teardown}}, but within the scope of some collection of 
> input elements. This includes actions such as clearing field state within a 
> DoFn and ensuring all live RPCs complete successfully before committing 
> inputs.
> Sometimes it might be reasonable to output from {{@FinishBundle}} but it is 
> hard to imagine a situation where output from {{@StartBundle}} is useful in a 
> way that doesn't seriously abuse things.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1316) DoFn#startBundle should not be able to output

2017-05-02 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15993437#comment-15993437
 ] 

Sourabh Bajaj commented on BEAM-1316:
-

[~altay] done.

> DoFn#startBundle should not be able to output
> -
>
> Key: BEAM-1316
> URL: https://issues.apache.org/jira/browse/BEAM-1316
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core
>Reporter: Thomas Groh
>Assignee: Thomas Groh
> Fix For: First stable release
>
>
> While within startBundle and finishBundle, the window in which elements are 
> output is not generally defined. Elements must always be output from within a 
> windowed context, or the {{WindowFn}} used by the {{PCollection}} may not 
> operate appropriately.
> startBundle and finishBundle are suitable for operational duties, similarly 
> to {{setup}} and {{teardown}}, but within the scope of some collection of 
> input elements. This includes actions such as clearing field state within a 
> DoFn and ensuring all live RPCs complete successfully before committing 
> inputs.
> Sometimes it might be reasonable to output from {{@FinishBundle}} but it is 
> hard to imagine a situation where output from {{@StartBundle}} is useful in a 
> way that doesn't seriously abuse things.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (BEAM-1316) DoFn#startBundle should not be able to output

2017-04-26 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj updated BEAM-1316:

Component/s: sdk-py

> DoFn#startBundle should not be able to output
> -
>
> Key: BEAM-1316
> URL: https://issues.apache.org/jira/browse/BEAM-1316
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-java-core, sdk-py
>Reporter: Thomas Groh
>Assignee: Thomas Groh
> Fix For: First stable release
>
>
> While within startBundle and finishBundle, the window in which elements are 
> output is not generally defined. Elements must always be output from within a 
> windowed context, or the {{WindowFn}} used by the {{PCollection}} may not 
> operate appropriately.
> startBundle and finishBundle are suitable for operational duties, similarly 
> to {{setup}} and {{teardown}}, but within the scope of some collection of 
> input elements. This includes actions such as clearing field state within a 
> DoFn and ensuring all live RPCs complete successfully before committing 
> inputs.
> Sometimes it might be reasonable to output from {{@FinishBundle}} but it is 
> hard to imagine a situation where output from {{@StartBundle}} is useful in a 
> way that doesn't seriously abuse things.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1989) clean SyntaxWarning

2017-04-25 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984118#comment-15984118
 ] 

Sourabh Bajaj commented on BEAM-1989:
-

This can be closed now.

> clean SyntaxWarning
> ---
>
> Key: BEAM-1989
> URL: https://issues.apache.org/jira/browse/BEAM-1989
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Ahmet Altay
>Assignee: Sourabh Bajaj
>Priority: Minor
>
> apache_beam/io/gcp/bigquery.py:326: SyntaxWarning: import * only allowed at 
> module level
>   def __init__(self, table=None, dataset=None, project=None, query=None,
> apache_beam/io/gcp/bigquery.py:431: SyntaxWarning: import * only allowed at 
> module level
>   def __init__(self, table, dataset=None, project=None, schema=None,
> cc: [~sb2nov][~chamikara]
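> For reference, a minimal sketch of what triggers this Python 2 warning and
> the usual fix (explicit module-level imports):
> {code}
> # Triggers "SyntaxWarning: import * only allowed at module level" in Python 2:
> def load():
>   from json import *  # noqa
>
> # Fix: import explicitly at module level instead.
> from json import loads, dumps
> {code}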



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1749) Upgrade pep8 to pycodestyle

2017-04-25 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15984116#comment-15984116
 ] 

Sourabh Bajaj commented on BEAM-1749:
-

This can be closed now.

> Upgrade pep8 to pycodestyle
> ---
>
> Key: BEAM-1749
> URL: https://issues.apache.org/jira/browse/BEAM-1749
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py
>Reporter: Ahmet Altay
>Assignee: Sourabh Bajaj
>  Labels: newbie, starter
>
> pep8 was deprecated and replaced with pycodestyle
> We should upgrade our linter to this module, and while doing that re-evaluate 
> our linter strategy and see if we can enable more rules. This is important 
> for keeping the code healthy as the community grows.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (BEAM-1989) clean SyntaxWarning

2017-04-25 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj reassigned BEAM-1989:
---

Assignee: Sourabh Bajaj

> clean SyntaxWarning
> ---
>
> Key: BEAM-1989
> URL: https://issues.apache.org/jira/browse/BEAM-1989
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Ahmet Altay
>Assignee: Sourabh Bajaj
>Priority: Minor
>
> apache_beam/io/gcp/bigquery.py:326: SyntaxWarning: import * only allowed at 
> module level
>   def __init__(self, table=None, dataset=None, project=None, query=None,
> apache_beam/io/gcp/bigquery.py:431: SyntaxWarning: import * only allowed at 
> module level
>   def __init__(self, table, dataset=None, project=None, schema=None,
> cc: [~sb2nov][~chamikara]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (BEAM-1749) Upgrade pep8 to pycodestyle

2017-04-25 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1749?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj reassigned BEAM-1749:
---

Assignee: Sourabh Bajaj

> Upgrade pep8 to pycodestyle
> ---
>
> Key: BEAM-1749
> URL: https://issues.apache.org/jira/browse/BEAM-1749
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py
>Reporter: Ahmet Altay
>Assignee: Sourabh Bajaj
>  Labels: newbie, starter
>
> pep8 was deprecated and replaced with pycodestyle
> We should upgrade our linter to this module, and while doing that re-evaluate 
> our linter strategy and see if we can enable more rules. This is important 
> for keeping the code healthy as the community grows.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (BEAM-2068) Upgrade Google-Apitools to latest version

2017-04-25 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-2068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj resolved BEAM-2068.
-
   Resolution: Fixed
Fix Version/s: First stable release

> Upgrade Google-Apitools to latest version
> -
>
> Key: BEAM-2068
> URL: https://issues.apache.org/jira/browse/BEAM-2068
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py
>Reporter: Sourabh Bajaj
>Assignee: Ahmet Altay
>Priority: Minor
> Fix For: First stable release
>
>
> In 0.5.9, apitools is pinned to setuptools 18.5, which is really old; the
> current release is 35.0.1 at the time of creating this issue. Updating to
> 0.5.9 causes issues for other dependencies, so we're going to try to address
> this upstream first and then upgrade to the latest version in the future.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1283) DoFn finishBundle should be required to specify the window for output

2017-04-25 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1283?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15983733#comment-15983733
 ] 

Sourabh Bajaj commented on BEAM-1283:
-

[~kenn] have we reached a resolution on this?

Currently in Python, if the user doesn't return a timestamped value, we assign
-1 as the output timestamp [1]. As we don't have state and timers implemented
completely yet, should we just assert that we're getting TimestampedValues
only?

[1] 
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/common.py#L310

> DoFn finishBundle should be required to specify the window for output
> -
>
> Key: BEAM-1283
> URL: https://issues.apache.org/jira/browse/BEAM-1283
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model, sdk-java-core, sdk-py
>Reporter: Kenneth Knowles
>Assignee: Thomas Groh
>  Labels: backward-incompatible
> Fix For: First stable release
>
>
> The spec is here in Javadoc: 
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/transforms/DoFn.java#L128
> "If invoked from {{@StartBundle}} or {{@FinishBundle}}, this will attempt to 
> use the {{WindowFn}} of the input {{PCollection}} to determine what windows 
> the element should be in, throwing an exception if the {{WindowFn}} attempts 
> to access any information about the input element. The output element will 
> have a timestamp of negative infinity."
> This is a collection of caveats that make this method not always technically 
> wrong, but quite a mess. Ideas that reasonable folks have suggested lately:
>  - The {{WindowFn}} cannot actually be applied because {{WindowFn}} is 
> allowed to see the element type. The spec just avoids this by limiting which 
> {{WindowFn}} can be used.
>  - There is no natural output timestamp, so it should always be provided. The 
> spec avoids this by specifying an arbitrary and fairly useless timestamp.
>  - If it is a merging {{WindowFn}} like sessions that has already been merged 
> then you'll just have a bogus proto window regardless of explicit timestamp 
> or not.
> The use cases for these methods are best addressed by state plus window 
> expiry callback, so we should revisit this spec and probably just wipe it.
> There are some rare cases where you might need to output from {{FinishBundle}}
> in a way that is not _actually_ sensitive to bundling (perhaps modulo some 
> downstream notion of equivalence) in which case you had better know what 
> window you are outputting to. Often it should be the global window.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (BEAM-1956) Flatten operation should respect input type hints.

2017-04-25 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj updated BEAM-1956:

Fix Version/s: (was: First stable release)

> Flatten operation should respect input type hints.
> --
>
> Key: BEAM-1956
> URL: https://issues.apache.org/jira/browse/BEAM-1956
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Vikas Kedigehalli
>Assignee: Ahmet Altay
>
> Input type hints are currently not respected by the Flatten operation and 
> instead `Any` type is chosen as a fallback. This could lead to using a pickle 
> coder even if there was a custom coder type hint provided for input 
> PCollections. 
> Also, this could lead to undesirable results, particularly when a Flatten
> operation is followed by a GroupByKey operation, which requires the key coder
> to be deterministic. Even if the user provides deterministic coder type hints
> for their PCollections, defaulting to Any would result in using the pickle
> coder (non-deterministic). As a result, CoGroupByKey is broken in
> scenarios where the input PCollection coder is deterministic for the type
> while the pickle coder is not.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (BEAM-2068) Upgrade Google-Apitools to latest version

2017-04-24 Thread Sourabh Bajaj (JIRA)
Sourabh Bajaj created BEAM-2068:
---

 Summary: Upgrade Google-Apitools to latest version
 Key: BEAM-2068
 URL: https://issues.apache.org/jira/browse/BEAM-2068
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py
Reporter: Sourabh Bajaj
Assignee: Ahmet Altay
Priority: Minor


In 0.5.9, apitools is pinned to setuptools 18.5, which is really old; the
current release is 35.0.1 at the time of creating this issue. Updating to
0.5.9 causes issues for other dependencies, so we're going to try to address
this upstream first and then upgrade to the latest version in the future.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (BEAM-1988) utils.path.join does not correctly handle GCS bucket roots

2017-04-21 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj updated BEAM-1988:

Fix Version/s: First stable release

> utils.path.join does not correctly handle GCS bucket roots
> --
>
> Key: BEAM-1988
> URL: https://issues.apache.org/jira/browse/BEAM-1988
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Ahmet Altay
>Assignee: Sourabh Bajaj
> Fix For: First stable release
>
>
> Here:
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/utils/path.py#L22
> Joining a bucket root with a filename, e.g. (gs://mybucket/, myfile), results
> in the invalid 'gs://mybucket//myfile'; notice the double // between mybucket and
> myfile. (It actually does not handle anything that already ends with {{/}}
> correctly.)
> [~sb2nov] could you take this one? Also, should the `join` operation move to
> BeamFileSystem-level code?
> (cc: [~chamikara])
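> A minimal sketch of a trailing-slash-aware join (illustrative only; the
> real fix may belong at the FileSystem level as suggested above):
> {code}
> def join(path, *paths):
>   # Tolerate a trailing '/' on the base path and on the parts.
>   return '/'.join([path.rstrip('/')] + [p.strip('/') for p in paths])
>
> # join('gs://mybucket/', 'myfile') -> 'gs://mybucket/myfile'
> {code}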



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-662) SlidingWindows should support sub-second periods

2017-04-20 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15977122#comment-15977122
 ] 

Sourabh Bajaj commented on BEAM-662:


This can be closed now.

> SlidingWindows should support sub-second periods
> 
>
> Key: BEAM-662
> URL: https://issues.apache.org/jira/browse/BEAM-662
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Daniel Mills
>Assignee: Sourabh Bajaj
>Priority: Minor
> Fix For: First stable release
>
>
> SlidingWindows periods are being rounded to seconds, see 
> http://stackoverflow.com/questions/39604646/can-slidingwindows-have-half-second-periods-in-python-apache-beam



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1964) Upgrade pylint to 1.7.0

2017-04-14 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15969708#comment-15969708
 ] 

Sourabh Bajaj commented on BEAM-1964:
-

[~altay] this can be resolved now.

> Upgrade pylint to 1.7.0
> ---
>
> Key: BEAM-1964
> URL: https://issues.apache.org/jira/browse/BEAM-1964
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Aviem Zur
>Assignee: Sourabh Bajaj
>
> Pre-commit tests seem to all be failing on pylint
> For example: 
> https://builds.apache.org/view/Beam/job/beam_PreCommit_Java_MavenInstall/9493/consoleFull



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (BEAM-1964) Upgrade pylint to 1.7.0

2017-04-14 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj reassigned BEAM-1964:
---

Assignee: Sourabh Bajaj  (was: Ahmet Altay)

> Upgrade pylint to 1.7.0
> ---
>
> Key: BEAM-1964
> URL: https://issues.apache.org/jira/browse/BEAM-1964
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Aviem Zur
>Assignee: Sourabh Bajaj
>
> Pre-commit tests seem to all be failing on pylint
> For example: 
> https://builds.apache.org/view/Beam/job/beam_PreCommit_Java_MavenInstall/9493/consoleFull



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-778) Make filesystem._CompressedFile seekable.

2017-04-13 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15968309#comment-15968309
 ] 

Sourabh Bajaj commented on BEAM-778:


This can be closed now.

> Make filesystem._CompressedFile seekable.
> -
>
> Key: BEAM-778
> URL: https://issues.apache.org/jira/browse/BEAM-778
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py
>Reporter: Chamikara Jayalath
>Assignee: Tibor Kiss
> Fix For: Not applicable
>
>
> We have a TODO to make filesystem._CompressedFile seekable.
> https://github.com/apache/incubator-beam/blob/python-sdk/sdks/python/apache_beam/io/fileio.py#L692
> Without this, compressed file objects produced for FileBasedSource
> implementations may not be able to use libraries that utilize the methods seek()
> and tell().
> For example tarfile.open().
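> A minimal sketch of the failure mode (open_compressed is a hypothetical
> stand-in for obtaining a _CompressedFile):
> {code}
> import tarfile
>
> # tarfile seeks on the file object it is handed; a non-seekable
> # compressed stream makes this call fail.
> compressed = open_compressed('gs://bucket/archive.tar.gz')
> tar = tarfile.open(fileobj=compressed)
> {code}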



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1068) Service Account Credentials File Specified via Pipeline Option Ignored

2017-04-13 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15968307#comment-15968307
 ] 

Sourabh Bajaj commented on BEAM-1068:
-

This can be closed now.

> Service Account Credentials File Specified via Pipeline Option Ignored
> --
>
> Key: BEAM-1068
> URL: https://issues.apache.org/jira/browse/BEAM-1068
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
> Environment: CentOS Linux release 7.1.1503 (Core)
> Python 2.7.5
>Reporter: Stephen Reichling
>Assignee: Ahmet Altay
>Priority: Minor
> Fix For: First stable release
>
>
> When writing a pipeline that authenticates with Google Dataflow APIs using a
> service account, specifying the path to that service account's credentials
> file in the {{PipelineOptions}} object passed in to the pipeline does not
> work; it only works when passed as a command-line flag.
> For example, if I write code like so:
> {code}
> pipelineOptions = options.PipelineOptions()
> gcOptions = pipelineOptions.view_as(options.GoogleCloudOptions)
> gcOptions.service_account_name = 'My Service Account Name'
> gcOptions.service_account_key_file = '/some/path/keyfile.p12'
> pipeline = beam.Pipeline(options=pipelineOptions)
> # ... add stages to the pipeline
> pipeline.run()
> {code}
> and execute it like so:
> {{python ./my_pipeline.py}}
> ...the service account I specify will not be used.
> Only if I were to execute the code like so:
> {{python ./my_pipeline.py --service_account_name 'My Service Account Name' 
> --service_account_key_file /some/path/keyfile.p12}}
> ...does it actually use the service account.
> The problem appears to be rooted in `auth.py` which reconstructs the 
> {{PipelineOptions}} object directly from {{sys.argv}} rather than using the 
> instance passed in to the pipeline: 
> https://github.com/apache/incubator-beam/blob/9ded359daefc6040d61a1f33c77563474fcb09b6/sdks/python/apache_beam/internal/auth.py#L129-L130
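> A minimal sketch of the fix direction (names are illustrative; the actual
> plumbing in the SDK differs):
> {code}
> # Instead of re-parsing sys.argv inside auth.py, thread the user's
> # options object through to the credentials code:
> def get_service_credentials(pipeline_options):
>   gc_options = pipeline_options.view_as(GoogleCloudOptions)
>   return _credentials_from_key_file(  # hypothetical helper
>       gc_options.service_account_name,
>       gc_options.service_account_key_file)
> {code}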



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1101) Remove inconsistencies in Python PipelineOptions

2017-04-13 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1101?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15968305#comment-15968305
 ] 

Sourabh Bajaj commented on BEAM-1101:
-

This can be closed now.

> Remove inconsistencies in Python PipelineOptions
> 
>
> Key: BEAM-1101
> URL: https://issues.apache.org/jira/browse/BEAM-1101
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Pablo Estrada
>Assignee: Sourabh Bajaj
> Fix For: First stable release
>
>
> Some options have been removed from Java, and some have different default 
> behavior in Java. Gotta make this consistent.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (BEAM-1708) Better error messages when GCP features are not installed

2017-04-13 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj resolved BEAM-1708.
-
   Resolution: Fixed
Fix Version/s: First stable release

> Better error messages when GCP features are not installed 
> --
>
> Key: BEAM-1708
> URL: https://issues.apache.org/jira/browse/BEAM-1708
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py
>Reporter: Sourabh Bajaj
>Assignee: Sourabh Bajaj
> Fix For: First stable release
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (BEAM-1068) Service Account Credentials File Specified via Pipeline Option Ignored

2017-04-13 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj updated BEAM-1068:

Fix Version/s: First stable release

> Service Account Credentials File Specified via Pipeline Option Ignored
> --
>
> Key: BEAM-1068
> URL: https://issues.apache.org/jira/browse/BEAM-1068
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
> Environment: CentOS Linux release 7.1.1503 (Core)
> Python 2.7.5
>Reporter: Stephen Reichling
>Assignee: Ahmet Altay
>Priority: Minor
> Fix For: First stable release
>
>
> When writing a pipeline that authenticates with Google Dataflow APIs using a
> service account, specifying the path to that service account's credentials
> file in the {{PipelineOptions}} object passed in to the pipeline does not
> work; it only works when passed as a command-line flag.
> For example, if I write code like so:
> {code}
> pipelineOptions = options.PipelineOptions()
> gcOptions = pipelineOptions.view_as(options.GoogleCloudOptions)
> gcOptions.service_account_name = 'My Service Account Name'
> gcOptions.service_account_key_file = '/some/path/keyfile.p12'
> pipeline = beam.Pipeline(options=pipelineOptions)
> # ... add stages to the pipeline
> pipeline.run()
> {code}
> and execute it like so:
> {{python ./my_pipeline.py}}
> ...the service account I specify will not be used.
> Only if I were to execute the code like so:
> {{python ./my_pipeline.py --service_account_name 'My Service Account Name' 
> --service_account_key_file /some/path/keyfile.p12}}
> ...does it actually use the service account.
> The problem appears to be rooted in `auth.py` which reconstructs the 
> {{PipelineOptions}} object directly from {{sys.argv}} rather than using the 
> instance passed in to the pipeline: 
> https://github.com/apache/incubator-beam/blob/9ded359daefc6040d61a1f33c77563474fcb09b6/sdks/python/apache_beam/internal/auth.py#L129-L130



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1222) Move GCS specific constants out of fileio

2017-04-12 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15966746#comment-15966746
 ] 

Sourabh Bajaj commented on BEAM-1222:
-

[~chamikara] this can be resolved now.

> Move GCS specific constants out of fileio
> -
>
> Key: BEAM-1222
> URL: https://issues.apache.org/jira/browse/BEAM-1222
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Chamikara Jayalath
>Assignee: Sourabh Bajaj
>
> Currently fileio.py has GCS specific constant gcsio.MAX_BATCH_OPERATION_SIZE 
> which should be moved out of it. This will be needed to implement 
> IOChannelFactory proposal [1] for Python SDK.
> [1] 
> https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit#heading=h.kpqagzh8i11w



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1924) pip download failed in post commit

2017-04-10 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15963254#comment-15963254
 ] 

Sourabh Bajaj commented on BEAM-1924:
-

One improvement that comes to mind: we should pass the subprocess.PIPE
options in this call, similar to how we do it in the juliaset example.
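A minimal sketch of that change (capturing the child process output so
failures are diagnosable):

{code}
import subprocess

p = subprocess.Popen(cmd_args, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
stdout, stderr = p.communicate()
if p.returncode != 0:
  raise RuntimeError('Command %s failed:\n%s' % (cmd_args, stderr))
{code}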

> pip download failed in post commit
> --
>
> Key: BEAM-1924
> URL: https://issues.apache.org/jira/browse/BEAM-1924
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Ahmet Altay
>
> A post commit failed with a pip failure:
> https://builds.apache.org/job/beam_PostCommit_Python_Verify/1801/consoleFull
> Captured output is not clear enough to tell what went wrong:
> ==
> ERROR: test_par_do_with_multiple_outputs_and_using_return 
> (apache_beam.transforms.ptransform_test.PTransformTest)
> --
> Traceback (most recent call last):
>   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/apache_beam/transforms/ptransform_test.py",
>  line 207, in test_par_do_with_multiple_outputs_and_using_return
> pipeline.run()
>   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/apache_beam/test_pipeline.py",
>  line 91, in run
> result = super(TestPipeline, self).run()
>   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/apache_beam/pipeline.py",
>  line 160, in run
> self.to_runner_api(), self.runner, self.options).run(False)
>   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/apache_beam/pipeline.py",
>  line 169, in run
> return self.runner.run(self)
>   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/apache_beam/runners/dataflow/test_dataflow_runner.py",
>  line 39, in run
> self.result = super(TestDataflowRunner, self).run(pipeline)
>   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/apache_beam/runners/dataflow/dataflow_runner.py",
>  line 175, in run
> self.dataflow_client.create_job(self.job), self)
>   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/apache_beam/utils/retry.py",
>  line 174, in wrapper
> return fun(*args, **kwargs)
>   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/apache_beam/runners/dataflow/internal/apiclient.py",
>  line 425, in create_job
> self.create_job_description(job)
>   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/apache_beam/runners/dataflow/internal/apiclient.py",
>  line 446, in create_job_description
> job.options, file_copy=self._gcs_file_copy)
>   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/apache_beam/runners/dataflow/internal/dependency.py",
>  line 306, in stage_job_resources
> setup_options.requirements_file, requirements_cache_path)
>   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/apache_beam/runners/dataflow/internal/dependency.py",
>  line 242, in _populate_requirements_cache
> processes.check_call(cmd_args)
>   File 
> "/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/apache_beam/utils/processes.py",
>  line 40, in check_call
> return subprocess.check_call(*args, **kwargs)
>   File "/usr/lib/python2.7/subprocess.py", line 540, in check_call
> raise CalledProcessError(retcode, cmd)
> CalledProcessError: Command 
> '['/home/jenkins/jenkins-slave/workspace/beam_PostCommit_Python_Verify/sdks/python/bin/python',
>  '-m', 'pip', 'install', '--download', '/tmp/dataflow-requirements-cache', 
> '-r', 'postcommit_requirements.txt', '--no-binary', ':all:']' returned 
> non-zero exit status 2
> Any ideas on how we can debug this failure or improve for the future? (cc:
>  [~markflyhigh] [~sb2nov])



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (BEAM-1892) Log process during size estimation in filebasedsource

2017-04-06 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1892?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj resolved BEAM-1892.
-
   Resolution: Fixed
Fix Version/s: Not applicable

> Log process during size estimation in filebasedsource
> -
>
> Key: BEAM-1892
> URL: https://issues.apache.org/jira/browse/BEAM-1892
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py
>Reporter: Sourabh Bajaj
>Assignee: Sourabh Bajaj
> Fix For: Not applicable
>
>
> http://stackoverflow.com/questions/43095445/how-to-iterate-all-files-in-google-cloud-storage-to-be-used-as-dataflow-input
> The user mentioned that there was no output and a huge delay in submitting
> the pipeline. The file size estimation process can be slow for really large
> datasets, and it reports no progress to the end user right now. We should be
> logging progress and thresholding the pre-submission size estimation as well.
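> A minimal sketch of the kind of progress logging suggested (names are
> illustrative):
> {code}
> import logging
>
> def estimate_total_size(file_paths):
>   total = 0
>   for i, path in enumerate(file_paths):
>     total += get_file_size(path)  # hypothetical per-file size lookup
>     if i % 10000 == 0:
>       logging.info('Size estimation: processed %d files so far.', i)
>   return total
> {code}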



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (BEAM-1892) Log process during size estimation in filebasedsource

2017-04-05 Thread Sourabh Bajaj (JIRA)
Sourabh Bajaj created BEAM-1892:
---

 Summary: Log process during size estimation in filebasedsource
 Key: BEAM-1892
 URL: https://issues.apache.org/jira/browse/BEAM-1892
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py
Reporter: Sourabh Bajaj
Assignee: Sourabh Bajaj


http://stackoverflow.com/questions/43095445/how-to-iterate-all-files-in-google-cloud-storage-to-be-used-as-dataflow-input

The user mentioned that there was no output and a huge delay in submitting the
pipeline. The file size estimation process can be slow for really large
datasets, and it reports no progress to the end user right now. We should be
logging progress and thresholding the pre-submission size estimation as well.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1711) Document extra features on quick start guide

2017-04-03 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15954289#comment-15954289
 ] 

Sourabh Bajaj commented on BEAM-1711:
-

This can be closed now.

> Document extra features on quick start guide
> 
>
> Key: BEAM-1711
> URL: https://issues.apache.org/jira/browse/BEAM-1711
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Ahmet Altay
>Assignee: Sourabh Bajaj
>
> Add something like below to avoid confusion
> """
> You may need extra packages for some additional features and this is the list 
> of extra_features and what they do.
> feature1: required for x, y,  z
> """



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1441) Add an IOChannelFactory interface to Python SDK

2017-04-03 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1441?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953980#comment-15953980
 ] 

Sourabh Bajaj commented on BEAM-1441:
-

This can be closed now.

> Add an IOChannelFactory interface to Python SDK
> ---
>
> Key: BEAM-1441
> URL: https://issues.apache.org/jira/browse/BEAM-1441
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py
>Reporter: Chamikara Jayalath
>Assignee: Sourabh Bajaj
> Fix For: First stable release
>
>
> Based on proposal [1], an IOChannelFactory interface was added to Java SDK  
> [2].
> We should add a similar interface to Python SDK and provide proper 
> implementations for native files, GCS, and other useful formats.
> Python SDK currently has a basic ChannelFactory interface [3] which is used 
> by FileBasedSource [4].
> [1] 
> https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit#heading=h.kpqagzh8i11w
> [2] 
> https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/util/IOChannelFactory.java
> [3] 
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/fileio.py#L107
> [4] 
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/filebasedsource.py
>  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1264) Python ChannelFactory Raise Inconsistent Error for Local FS and GCS

2017-04-03 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15953978#comment-15953978
 ] 

Sourabh Bajaj commented on BEAM-1264:
-

Yes should be closed now as BFS is merged now.

> Python ChannelFactory Raise Inconsistent Error for Local FS and GCS
> ---
>
> Key: BEAM-1264
> URL: https://issues.apache.org/jira/browse/BEAM-1264
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Mark Liu
>Assignee: Sourabh Bajaj
>
> ChannelFactory raises different errors for local fs (RuntimeError) and GCS 
> (IOError) when reading failed. 
> We want to return consistent error for both.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1793) Frequent python post commit errors

2017-03-27 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15943806#comment-15943806
 ] 

Sourabh Bajaj commented on BEAM-1793:
-

Should be ready to close now.

> Frequent python post commit errors
> --
>
> Key: BEAM-1793
> URL: https://issues.apache.org/jira/browse/BEAM-1793
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Ahmet Altay
>Assignee: Sourabh Bajaj
>Priority: Critical
>
> 1. Failed because virtualenv was already installed, and the postcommit script
> made a wrong assumption about its location.
> https://builds.apache.org/view/Beam/job/beam_PostCommit_Python_Verify/1499/consoleFull
> 2. Failed because a really old version of pip is installed:
> https://builds.apache.org/view/Beam/job/beam_PostCommit_Python_Verify/1585/consoleFull
> (Also #1586 and #1597)
> 3. Failed because a file was already in the cache and pip did not want to 
> override it, even though this is not usually a problem:
> https://builds.apache.org/view/Beam/job/beam_PostCommit_Python_Verify/1596/consoleFull
> (Might be related: https://issues.apache.org/jira/browse/BEAM-1788)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (BEAM-1817) Python environment is different on some nodes

2017-03-27 Thread Sourabh Bajaj (JIRA)
Sourabh Bajaj created BEAM-1817:
---

 Summary: Python environment is different on some nodes
 Key: BEAM-1817
 URL: https://issues.apache.org/jira/browse/BEAM-1817
 Project: Beam
  Issue Type: Bug
  Components: testing
Reporter: Sourabh Bajaj
Assignee: Jason Kuster


We've pinned the Python post-commits to specific nodes, as pip on some of the new
nodes seems to be installed differently from the ones before.

https://github.com/apache/beam/pull/2308/files

Please remove the changes in above PR once this is resolved.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-541) Add more documentation on DoFn Annotations

2017-03-23 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15939234#comment-15939234
 ] 

Sourabh Bajaj commented on BEAM-541:


Yes, I can help with the documentation of the NewDoFn in Python.

> Add more documentation on DoFn Annotations
> --
>
> Key: BEAM-541
> URL: https://issues.apache.org/jira/browse/BEAM-541
> Project: Beam
>  Issue Type: Wish
>  Components: website
>Reporter: Frances Perry
>Assignee: Kenneth Knowles
>Priority: Minor
> Fix For: First stable release
>
>
> https://github.com/apache/incubator-beam-site/pull/36 made the basic 
> documentation changes that correspond to BEAM-498, but we should add more 
> details on how to use the advance configurations for window access, etc.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1730) Python post commits timing out

2017-03-17 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15930762#comment-15930762
 ] 

Sourabh Bajaj commented on BEAM-1730:
-

Please close

> Python post commits timing out
> --
>
> Key: BEAM-1730
> URL: https://issues.apache.org/jira/browse/BEAM-1730
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Ahmet Altay
>Assignee: Sourabh Bajaj
> Fix For: Not applicable
>
>
> https://builds.apache.org/view/Beam/job/beam_PostCommit_Python_Verify/1514/
> Dataflow worker logs:
> https://pantheon.corp.google.com/logs/viewer?logName=projects%2Fapache-beam-testing%2Flogs%2Fdataflow.googleapis.com%252Fworker-startup&project=apache-beam-testing&resource=dataflow_step%2Fjob_id%2F2017-03-15_10_30_20-1232538583050549167&minLogLevel=0&expandAll=false×tamp=2017-03-15T19:15:33.0Z
> Stack trace:
> I  Executing: /usr/bin/python -m dataflow_worker.start 
> -Djob_id=2017-03-15_10_30_20-1232538583050549167 
> -Dproject_id=apache-beam-testing -Dreporting_enabled=True 
> -Droot_url=https://dataflow.googleapis.com 
> -Dservice_path=https://dataflow.googleapis.com/ 
> -Dtemp_gcs_directory=gs://unused 
> -Dworker_id=py-validatesrunner-148959-03151030-f51b-harness-t79h 
> -Ddataflow.worker.logging.location=/var/log/dataflow/python-dataflow-0-json.log
>  -Dlocal_staging_directory=/var/opt/google/dataflow 
> -Dsdk_pipeline_options={"display_data":[{"key":"requirements_file","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"STRING","value":"postcommit_requirements.txt"},{"key":"sdk_location","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"STRING","value":"dist/apache-beam-0.7.0.dev0.tar.gz"},{"key":"num_workers","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"INTEGER","value":1},{"key":"runner","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"STRING","value":"TestDataflowRunner"},{"key":"staging_location","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"STRING","value":"gs://temp-storage-for-end-to-end-tests/staging-validatesrunner-test/py-validatesrunner-1489599007.1489599009.190097"},{"key":"project","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"STRING","value":"apache-beam-testing"},{"key":"temp_location","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"STRING","value":"gs://temp-storage-for-end-to-end-tests/temp-validatesrunner-test/py-validatesrunner-1489599007.1489599009.190097"},{"key":"job_name","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"STRING","value":"py-validatesrunner-1489599007"}],"options":{"autoscalingAlgorithm":"NONE","dataflowJobId":"2017-03-15_10_30_20-1232538583050549167","dataflow_endpoint":"https://dataflow.googleapis.com","direct_runner_use_stacked_bundle":true,"gcpTempLocation":"gs://temp-storage-for-end-to-end-tests/temp-validatesrunner-test/py-validatesrunner-1489599007.1489599009.190097","job_name":"py-validatesrunner-1489599007","male":false,"maxNumWorkers":0,"mock_flag":false,"no_auth":false,"numWorkers":1,"num_workers":1,"pipeline_type_check":true,"profile_cpu":false,"profile_memory":false,"project":"apache-beam-testing","requirements_file":"postcommit_requirements.txt","runner":"TestDataflowRunner","runtime_type_check":false,"save_main_session":false,"sdk_location":"dist/apache-beam-0.7.0.dev0.tar.gz","staging_location":"gs://temp-storage-for-end-to-end-tests/staging-validatesrunner-test/py-validatesrunner-1489599007.1489599009.190097","streaming":false,"style":"scrambled","temp_location":"gs://temp-storage-for-end-to-end-tests/temp-validatesrunner-test/py-validatesrunner-1489599007.1489599009.190097","type_check_strictness":"DEFAULT_TO_ANY"}}
>  
> I  Traceback (most recent call last): 
> IFile "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main 
> I   
> I  "__main__", fname, loader, pkg_name) 
> IFile "/usr/lib/python2.7/runpy.py", line 72, in _run_code 
> I   
> I  exec code in run_globals 
> IFile "/usr/local/lib/python2.7/dist-packages/dataflow_worker/start.py", 
> line 26, in <module> 
> I   
> I  from dataflow_worker import batchworker 
> IFile 
> "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 
> 51, in <module> 
> I   
> I  from apache_beam.internal import pickler 
> IFile "/usr/local/lib/python2.7/dist-packages/apache_beam/__init__.py", 
> line 85, in <module> 
> I   
> I  __version__ = version.get_version() 
> IFile "/usr/local/lib/python2.7/dist-packages/apache_beam/version.py", 
> line 30, in get_version 
> I   
> I  __version__ = get_version_from_pom() 
> IFile "/usr/local/lib/python2.7/dist-packages/apache_beam/version.py", 
> line 36, in get_version_from_pom

[jira] [Commented] (BEAM-1365) Bigquery dataset names should allow for . inside them

2017-03-17 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15930760#comment-15930760
 ] 

Sourabh Bajaj commented on BEAM-1365:
-

We didn't reach a conclusion on whether we want this or not. Can you mark this 
as "Won't Fix"?

> Bigquery dataset names should allow for . inside them
> -
>
> Key: BEAM-1365
> URL: https://issues.apache.org/jira/browse/BEAM-1365
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Sourabh Bajaj
>Assignee: Sourabh Bajaj
>Priority: Minor
>
> BigQuery datasets allow for . inside them, but the regex in 
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/io/bigquery.py#L305
>  doesn't allow for this. 
> Change the regex to 
> re.match(r'^((?P<project>.+):)?(?P<dataset>[\w\.]+)\.(?P<table>[\w\$]+)$', table)
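
As a quick sanity check, the amended pattern can be exercised against both
forms of table spec. A minimal sketch, assuming the group names project,
dataset, and table from the "project:dataset.table" format (not copied from
the SDK source):

{code}
import re

# Proposed pattern: dataset may contain '.', and the project is optional.
TABLE_SPEC = re.compile(
    r'^((?P<project>.+):)?(?P<dataset>[\w\.]+)\.(?P<table>[\w\$]+)$')

for spec in ('my-project:some.dataset.events', 'some.dataset.events'):
  m = TABLE_SPEC.match(spec)
  print('%s -> %s' % (spec, m.groupdict() if m else 'no match'))
{code}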



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-693) pydoc is not working

2017-03-17 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15930756#comment-15930756
 ] 

Sourabh Bajaj commented on BEAM-693:


Yes

> pydoc is not working
> 
>
> Key: BEAM-693
> URL: https://issues.apache.org/jira/browse/BEAM-693
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Ahmet Altay
>Assignee: Sourabh Bajaj
>Priority: Minor
>
> Repro:
> Start the pydoc server (pydoc -p <port>) and navigate to the apache_beam root:
> http://localhost:<port>/apache_beam.html
> Following errors are shown instead of the actual documentation:
> problem in apache_beam - <type 'exceptions.ImportError'>: No module named avro
> problem in apache_beam - <type 'exceptions.ImportError'>: cannot import name 
> coders



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-804) Python Pipeline Option save_main_session non-functional

2017-03-17 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-804?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15930759#comment-15930759
 ] 

Sourabh Bajaj commented on BEAM-804:


Yes

> Python Pipeline Option save_main_session non-functional
> ---
>
> Key: BEAM-804
> URL: https://issues.apache.org/jira/browse/BEAM-804
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
> Environment: OSX El Capitan, google-cloud-dataflow==0.4.3, python 
> 2.7.12
>Reporter: Zoran Bujila
>Assignee: Sourabh Bajaj
>Priority: Critical
>
> When trying to use the option --save_main_session a pickling error occurs.
> pickle.PicklingError: Can't pickle <class 
> 'apache_beam.internal.clients.dataflow.dataflow_v1b3_messages.TypeValueValuesEnum'>:
>  it's not found as 
> apache_beam.internal.clients.dataflow.dataflow_v1b3_messages.TypeValueValuesEnum
> This prevents the use of this option, which is desirable because there is an 
> expensive object that needs to be created on each worker in my pipeline, and I 
> would like to have this object created only once per worker. It is not 
> practical to create it inline with the ParDo function unless I make the batch 
> size sent to the ParDo quite large, and doing that seems to lead to idle 
> workers; I would ideally want to bring the batch size way down.
> The "Affects Version" option above doesn't have a 0.4.3 version in the drop 
> down so I did not populate it. However, this was a problem with 0.4.1 and has 
> not been corrected with 0.4.3.
> I don't see where I can attach a file, so here is the entire error.
> 2016-10-24 10:00:16,071 The 
> oauth2client.contrib.multistore_file module has been deprecated and will be 
> removed in the next release of oauth2client. Please migrate to 
> multiprocess_file_storage.
> 2016-10-24 10:00:16,127 __init__Direct usage of TextFileSink is 
> deprecated. Please use 'textio.WriteToText()' instead of directly 
> instantiating a TextFileSink object.
> Traceback (most recent call last):
>   File "00test.py", line 41, in 
> p.run()
>   File "/usr/local/lib/python2.7/site-packages/apache_beam/pipeline.py", line 
> 159, in run
> return self.runner.run(self)
>   File 
> "/usr/local/lib/python2.7/site-packages/apache_beam/runners/dataflow_runner.py",
>  line 172, in run
> self.dataflow_client.create_job(self.job))
>   File "/usr/local/lib/python2.7/site-packages/apache_beam/utils/retry.py", 
> line 160, in wrapper
> return fun(*args, **kwargs)
>   File 
> "/usr/local/lib/python2.7/site-packages/apache_beam/internal/apiclient.py", 
> line 375, in create_job
> job.options, file_copy=self._gcs_file_copy)
>   File 
> "/usr/local/lib/python2.7/site-packages/apache_beam/utils/dependency.py", 
> line 325, in stage_job_resources
> pickler.dump_session(pickled_session_file)
>   File 
> "/usr/local/lib/python2.7/site-packages/apache_beam/internal/pickler.py", 
> line 204, in dump_session
> return dill.dump_session(file_path)
>   File "/usr/local/lib/python2.7/site-packages/dill/dill.py", line 333, in 
> dump_session
> pickler.dump(main)
>   File 
> "/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py",
>  line 224, in dump
> self.save(obj)
>   File 
> "/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py",
>  line 286, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "/usr/local/lib/python2.7/site-packages/apache_beam/internal/pickler.py", 
> line 123, in save_module
> return old_save_module(pickler, obj)
>   File "/usr/local/lib/python2.7/site-packages/dill/dill.py", line 1168, in 
> save_module
> state=_main_dict)
>   File 
> "/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py",
>  line 425, in save_reduce
> save(state)
>   File 
> "/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py",
>  line 286, in save
> f(self, obj) # Call unbound method with explicit self
>   File 
> "/usr/local/lib/python2.7/site-packages/apache_beam/internal/pickler.py", 
> line 159, in new_save_module_dict
> return old_save_module_dict(pickler, obj)
>   File "/usr/local/lib/python2.7/site-packages/dill/dill.py", line 835, in 
> save_module_dict
> StockPickler.save_dict(pickler, obj)
>   File 
> "/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py",
>  line 655, in save_dict
> self._batch_setitems(obj.iteritems())
>   File 
> "/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pickle.py",
>  line 687, in _batch_setitems
> save(v)
>   File 
> "/usr/local/Cellar/python/2.7.12_2/Frameworks/Python.framework/V

[jira] [Commented] (BEAM-782) Resolve runners in a case-insensitive manner.

2017-03-17 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15930754#comment-15930754
 ] 

Sourabh Bajaj commented on BEAM-782:


Yes

>  Resolve runners in a case-insensitive manner.
> --
>
> Key: BEAM-782
> URL: https://issues.apache.org/jira/browse/BEAM-782
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py
>Reporter: Ahmet Altay
>Assignee: Sourabh Bajaj
>  Labels: sdk-consistency
>
> See:
> https://github.com/apache/incubator-beam/pull/1087
> https://issues.apache.org/jira/browse/BEAM-770
> e.g. the DirectRunner can be specified with (among others) any of
> "--runner=direct", "--runner=directrunner", "--runner=DirectRunner",
> "--runner=Direct", or "--runner=directRunner"



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-547) Align Python SDK version with Maven

2017-03-15 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-547?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15927160#comment-15927160
 ] 

Sourabh Bajaj commented on BEAM-547:


We cannot read the version directly from pom.xml, as it is not part of the 
Python package when a user does `pip install apache_beam`. I think simplifying 
the process by storing the version string directly in version.py is probably 
best.
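
A minimal sketch of that approach, assuming a module like
apache_beam/version.py (the helper name is illustrative, not the SDK's
actual code):

{code}
# Single source of truth for the SDK version, shipped inside the package,
# so it is available after `pip install apache_beam` with no pom.xml around.
__version__ = '0.7.0.dev0'


def get_version():
  return __version__
{code}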

> Align Python SDK version with Maven
> ---
>
> Key: BEAM-547
> URL: https://issues.apache.org/jira/browse/BEAM-547
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py
>Affects Versions: 0.3.0-incubating
>Reporter: Sergio Fernández
>Assignee: Frances Perry
>Priority: Minor
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> In BEAM-378 we've integrated the Python SDK in the main Maven build. 
> Initially I wanted to also align versions, but after discussing it with 
> [~silv...@google.com] we kept that aside for the moment. 
> Closing [PR #537|https://github.com/apache/incubator-beam/pull/537] [~altay] 
> brings the issue back. So it may make sense to revisit that idea.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (BEAM-1731) RuntimeError when running wordcount with ValueProviders

2017-03-15 Thread Sourabh Bajaj (JIRA)
Sourabh Bajaj created BEAM-1731:
---

 Summary: RuntimeError when running wordcount with ValueProviders
 Key: BEAM-1731
 URL: https://issues.apache.org/jira/browse/BEAM-1731
 Project: Beam
  Issue Type: Bug
  Components: sdk-py
Reporter: Sourabh Bajaj
Assignee: María GH


Running: python -m apache_beam.examples.wordcount

INFO:root:Job 2017-03-15_13_39_59-3092873759767386 is in state JOB_STATE_FAILED
Traceback (most recent call last):
  File 
"/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py",
 line 162, in _run_module_as_main
"__main__", fname, loader, pkg_name)
  File 
"/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py",
 line 72, in _run_code
exec code in run_globals
  File 
"/Users/sourabhbajaj/Projects/incubator-beam/sdks/python/apache_beam/examples/wordcount.py",
 line 119, in <module>
run()
  File 
"/Users/sourabhbajaj/Projects/incubator-beam/sdks/python/apache_beam/examples/wordcount.py",
 line 109, in run
result.wait_until_finish()
  File "apache_beam/runners/dataflow/dataflow_runner.py", line 711, in 
wait_until_finish
(self.state, getattr(self._runner, 'last_error_msg', None)), self)
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow 
pipeline failed. State: FAILED, Error:
(e22fabbb61bfae00): Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", 
line 544, in do_work
work_executor.execute()
  File "dataflow_worker/executor.py", line 1013, in 
dataflow_worker.executor.CustomSourceSplitExecutor.execute 
(dataflow_worker/executor.c:31501)
self.response = self._perform_source_split_considering_api_limits(
  File "dataflow_worker/executor.py", line 1021, in 
dataflow_worker.executor.CustomSourceSplitExecutor._perform_source_split_considering_api_limits
 (dataflow_worker/executor.c:31703)
split_response = self._perform_source_split(source_operation_split_task,
  File "dataflow_worker/executor.py", line 1059, in 
dataflow_worker.executor.CustomSourceSplitExecutor._perform_source_split 
(dataflow_worker/executor.c:32341)
for split in source.split(desired_bundle_size):
  File 
"/usr/local/lib/python2.7/dist-packages/apache_beam/io/filebasedsource.py", 
line 192, in split
return self._get_concat_source().split(
  File 
"/usr/local/lib/python2.7/dist-packages/apache_beam/utils/value_provider.py", 
line 105, in _f
raise RuntimeError('%s not accessible' % obj)
RuntimeError: RuntimeValueProvider(option: input, type: str, default_value: 
'gs://dataflow-samples/shakespeare/kinglear.txt') not accessible
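
A minimal sketch of the guard that produces this error, modeled on the
"%s not accessible" message above (illustrative, not the SDK source): the
provider refuses access until runtime options exist, and source splitting
happens before they do.

{code}
class RuntimeValueProvider(object):
  def __init__(self, option, default):
    self.option = option
    self.default = default
    self.runtime_options = None  # populated only once the job is running

  def get(self):
    if self.runtime_options is None:
      # Source splitting runs before execution, so this is what fires.
      raise RuntimeError('%s not accessible' % self.option)
    return self.runtime_options.get(self.option, self.default)
{code}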




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1730) Python post commits timing out

2017-03-15 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1730?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15926928#comment-15926928
 ] 

Sourabh Bajaj commented on BEAM-1730:
-

Reverting the changes in the commit as this needs broader discussion.

> Python post commits timing out
> --
>
> Key: BEAM-1730
> URL: https://issues.apache.org/jira/browse/BEAM-1730
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Ahmet Altay
>Assignee: Sourabh Bajaj
>
> https://builds.apache.org/view/Beam/job/beam_PostCommit_Python_Verify/1514/
> Dataflow worker logs:
> https://pantheon.corp.google.com/logs/viewer?logName=projects%2Fapache-beam-testing%2Flogs%2Fdataflow.googleapis.com%252Fworker-startup&project=apache-beam-testing&resource=dataflow_step%2Fjob_id%2F2017-03-15_10_30_20-1232538583050549167&minLogLevel=0&expandAll=false&timestamp=2017-03-15T19:15:33.0Z
> Stack trace:
> I  Executing: /usr/bin/python -m dataflow_worker.start 
> -Djob_id=2017-03-15_10_30_20-1232538583050549167 
> -Dproject_id=apache-beam-testing -Dreporting_enabled=True 
> -Droot_url=https://dataflow.googleapis.com 
> -Dservice_path=https://dataflow.googleapis.com/ 
> -Dtemp_gcs_directory=gs://unused 
> -Dworker_id=py-validatesrunner-148959-03151030-f51b-harness-t79h 
> -Ddataflow.worker.logging.location=/var/log/dataflow/python-dataflow-0-json.log
>  -Dlocal_staging_directory=/var/opt/google/dataflow 
> -Dsdk_pipeline_options={"display_data":[{"key":"requirements_file","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"STRING","value":"postcommit_requirements.txt"},{"key":"sdk_location","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"STRING","value":"dist/apache-beam-0.7.0.dev0.tar.gz"},{"key":"num_workers","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"INTEGER","value":1},{"key":"runner","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"STRING","value":"TestDataflowRunner"},{"key":"staging_location","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"STRING","value":"gs://temp-storage-for-end-to-end-tests/staging-validatesrunner-test/py-validatesrunner-1489599007.1489599009.190097"},{"key":"project","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"STRING","value":"apache-beam-testing"},{"key":"temp_location","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"STRING","value":"gs://temp-storage-for-end-to-end-tests/temp-validatesrunner-test/py-validatesrunner-1489599007.1489599009.190097"},{"key":"job_name","namespace":"apache_beam.utils.pipeline_options.PipelineOptions","type":"STRING","value":"py-validatesrunner-1489599007"}],"options":{"autoscalingAlgorithm":"NONE","dataflowJobId":"2017-03-15_10_30_20-1232538583050549167","dataflow_endpoint":"https://dataflow.googleapis.com","direct_runner_use_stacked_bundle":true,"gcpTempLocation":"gs://temp-storage-for-end-to-end-tests/temp-validatesrunner-test/py-validatesrunner-1489599007.1489599009.190097","job_name":"py-validatesrunner-1489599007","male":false,"maxNumWorkers":0,"mock_flag":false,"no_auth":false,"numWorkers":1,"num_workers":1,"pipeline_type_check":true,"profile_cpu":false,"profile_memory":false,"project":"apache-beam-testing","requirements_file":"postcommit_requirements.txt","runner":"TestDataflowRunner","runtime_type_check":false,"save_main_session":false,"sdk_location":"dist/apache-beam-0.7.0.dev0.tar.gz","staging_location":"gs://temp-storage-for-end-to-end-tests/staging-validatesrunner-test/py-validatesrunner-1489599007.1489599009.190097","streaming":false,"style":"scrambled","temp_location":"gs://temp-storage-for-end-to-end-tests/temp-validatesrunner-test/py-validatesrunner-1489599007.1489599009.190097","type_check_strictness":"DEFAULT_TO_ANY"}}
>  
> I  Traceback (most recent call last): 
> IFile "/usr/lib/python2.7/runpy.py", line 162, in _run_module_as_main 
> I   
> I  "__main__", fname, loader, pkg_name) 
> IFile "/usr/lib/python2.7/runpy.py", line 72, in _run_code 
> I   
> I  exec code in run_globals 
> IFile "/usr/local/lib/python2.7/dist-packages/dataflow_worker/start.py", 
> line 26, in <module> 
> I   
> I  from dataflow_worker import batchworker 
> IFile 
> "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 
> 51, in <module> 
> I   
> I  from apache_beam.internal import pickler 
> IFile "/usr/local/lib/python2.7/dist-packages/apache_beam/__init__.py", 
> line 85, in <module> 
> I   
> I  __version__ = version.get_version() 
> IFile "/usr/local/lib/python2.7/dist-packages/apache_beam/version.py", 
> line 30, in get_version 
> I   
> I  __version__ = get_version_from_pom() 
> IFile "/usr/local/lib/python2.7/dist-packages/apache_beam/version.py", 
> line 36, in get

[jira] [Created] (BEAM-1708) Better error messages when GCP features are not installed

2017-03-13 Thread Sourabh Bajaj (JIRA)
Sourabh Bajaj created BEAM-1708:
---

 Summary: Better error messages when GCP features are not installed 
 Key: BEAM-1708
 URL: https://issues.apache.org/jira/browse/BEAM-1708
 Project: Beam
  Issue Type: Bug
  Components: sdk-py
Reporter: Sourabh Bajaj
Assignee: Sourabh Bajaj






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1320) Add sphinx or pydocs documentation for python-sdk

2017-03-02 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893064#comment-15893064
 ] 

Sourabh Bajaj commented on BEAM-1320:
-

[~altay] can you resolve this?


> Add sphinx or pydocs documentation for python-sdk
> -
>
> Key: BEAM-1320
> URL: https://issues.apache.org/jira/browse/BEAM-1320
> Project: Beam
>  Issue Type: Task
>  Components: sdk-py
>Reporter: Ahmet Altay
>Assignee: Sourabh Bajaj
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1570) broken link on website (ostatic.com)

2017-03-02 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893060#comment-15893060
 ] 

Sourabh Bajaj commented on BEAM-1570:
-

[~sisk] can you resolve this?

> broken link on website (ostatic.com)
> 
>
> Key: BEAM-1570
> URL: https://issues.apache.org/jira/browse/BEAM-1570
> Project: Beam
>  Issue Type: Bug
>  Components: website
>Reporter: Stephen Sisk
>Assignee: Sourabh Bajaj
>
> I see the following error when running the website test suite locally and on 
> jenkins: 
>   *  External link 
> http://ostatic.com/blog/apache-beam-unifies-batch-and-streaming-data-processing
>  failed: got a time out (response code 0)
> It looks like ostatic's website is down. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1264) Python ChannelFactory Raise Inconsistent Error for Local FS and GCS

2017-03-02 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893057#comment-15893057
 ] 

Sourabh Bajaj commented on BEAM-1264:
-

[~markflyhigh] do you have an example of this? I think I have resolved most of 
it in https://github.com/apache/beam/pull/2136 but wanted to verify it before 
resolving.

> Python ChannelFactory Raise Inconsistent Error for Local FS and GCS
> ---
>
> Key: BEAM-1264
> URL: https://issues.apache.org/jira/browse/BEAM-1264
> Project: Beam
>  Issue Type: Bug
>  Components: beam-model, sdk-py
>Reporter: Mark Liu
>Assignee: Sourabh Bajaj
>
> ChannelFactory raises different errors for local fs (RuntimeError) and GCS 
> (IOError) when reading failed. 
> We want to return consistent error for both.
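
One way to get that consistency, sketched below: wrap both backends' failures
in a single exception type. The BeamIOError name is borrowed from later SDKs
and is an assumption here, as are the stand-in backends.

{code}
class BeamIOError(IOError):
  """Single error type surfaced for local FS and GCS failures alike."""


def read_bytes(path):
  try:
    if path.startswith('gs://'):
      raise IOError('GCS read failed')  # stand-in for the GCS client
    with open(path, 'rb') as f:         # local FS path
      return f.read()
  except (IOError, OSError, RuntimeError) as e:
    raise BeamIOError('Unable to read %s: %s' % (path, e))
{code}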



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (BEAM-994) Support for S3 file source and sink

2017-03-02 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj reassigned BEAM-994:
--

Assignee: (was: Sourabh Bajaj)

> Support for S3 file source and sink
> ---
>
> Key: BEAM-994
> URL: https://issues.apache.org/jira/browse/BEAM-994
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-extensions
>Reporter: Sourabh Bajaj
>Priority: Minor
>
> http://stackoverflow.com/questions/40624544/what-is-best-practice-of-the-the-case-of-writing-text-output-into-s3-bucket
>  is one of the examples of the need for such a feature. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Closed] (BEAM-1082) Use use_standard_sql flag everywhere instead of use_legacy_sql

2017-03-02 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj closed BEAM-1082.
---
   Resolution: Won't Fix
Fix Version/s: Not applicable

It was decided to leave this as it is.

> Use use_standard_sql flag everywhere instead of use_legacy_sql
> --
>
> Key: BEAM-1082
> URL: https://issues.apache.org/jira/browse/BEAM-1082
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py
>Reporter: Sourabh Bajaj
>Assignee: Sourabh Bajaj
> Fix For: Not applicable
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (BEAM-994) Support for S3 file source and sink

2017-03-02 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-994?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj reassigned BEAM-994:
--

Assignee: Sourabh Bajaj

> Support for S3 file source and sink
> ---
>
> Key: BEAM-994
> URL: https://issues.apache.org/jira/browse/BEAM-994
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-java-extensions
>Reporter: Sourabh Bajaj
>Assignee: Sourabh Bajaj
>Priority: Minor
>
> http://stackoverflow.com/questions/40624544/what-is-best-practice-of-the-the-case-of-writing-text-output-into-s3-bucket
>  is one of the examples of the need for such a feature. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (BEAM-1528) Remove all sphinx warnings from the pydocs

2017-03-02 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1528?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj resolved BEAM-1528.
-
Resolution: Fixed

> Remove all sphinx warnings from the pydocs
> --
>
> Key: BEAM-1528
> URL: https://issues.apache.org/jira/browse/BEAM-1528
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py
>Reporter: Sourabh Bajaj
>Assignee: Ahmet Altay
>Priority: Minor
> Fix For: Not applicable
>
>
> There are still 4 open warnings in the pydoc generation that should be 
> fixed. Once this is done, please add a test to the generation script that 
> turns all warnings into errors.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1528) Remove all sphinx warnings from the pydocs

2017-03-02 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1528?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15892896#comment-15892896
 ] 

Sourabh Bajaj commented on BEAM-1528:
-

Yes, I'll resolve it.

> Remove all sphinx warnings from the pydocs
> --
>
> Key: BEAM-1528
> URL: https://issues.apache.org/jira/browse/BEAM-1528
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py
>Reporter: Sourabh Bajaj
>Assignee: Ahmet Altay
>Priority: Minor
> Fix For: Not applicable
>
>
> There are still 4 open warnings in the pydoc generation that should be 
> fixed. Once this is done, please add a test to the generation script that 
> turns all warnings into errors.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (BEAM-1585) Ability to add new file systems to beamFS in the python sdk

2017-03-01 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj updated BEAM-1585:

Description: 
BEAM-1441 implements the new BeamFileSystem in the Python SDK, but it currently 
lacks the ability to add user-implemented file systems.

This code needs to be executed in the worker, so it should be packaged 
correctly with the pipeline code. 
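
A minimal sketch of a scheme-to-FileSystem registry of the kind described
above; the function names are illustrative, not the SDK's eventual API:

{code}
_FILE_SYSTEMS = {}


def register_file_system(scheme, fs_class):
  # User code calls this at import time; because the module travels with
  # the pipeline, the same registration happens again in the worker.
  _FILE_SYSTEMS[scheme] = fs_class


def get_file_system(path):
  scheme = path.split('://', 1)[0] if '://' in path else 'file'
  try:
    return _FILE_SYSTEMS[scheme]()
  except KeyError:
    raise ValueError('No FileSystem registered for scheme %r' % scheme)
{code}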

> Ability to add new file systems to beamFS in the python sdk
> ---
>
> Key: BEAM-1585
> URL: https://issues.apache.org/jira/browse/BEAM-1585
> Project: Beam
>  Issue Type: New Feature
>  Components: sdk-py
>Reporter: Sourabh Bajaj
>Assignee: Sourabh Bajaj
>
> BEAM-1441 implements the new BeamFileSystem in the Python SDK, but it 
> currently lacks the ability to add user-implemented file systems.
> This code needs to be executed in the worker, so it should be packaged 
> correctly with the pipeline code. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (BEAM-1585) Ability to add new file systems to beamFS in the python sdk

2017-03-01 Thread Sourabh Bajaj (JIRA)
Sourabh Bajaj created BEAM-1585:
---

 Summary: Ability to add new file systems to beamFS in the python 
sdk
 Key: BEAM-1585
 URL: https://issues.apache.org/jira/browse/BEAM-1585
 Project: Beam
  Issue Type: New Feature
  Components: sdk-py
Reporter: Sourabh Bajaj
Assignee: Sourabh Bajaj






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1570) broken link on website (ostatic.com)

2017-03-01 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890755#comment-15890755
 ] 

Sourabh Bajaj commented on BEAM-1570:
-

This is fixed.

> broken link on website (ostatic.com)
> 
>
> Key: BEAM-1570
> URL: https://issues.apache.org/jira/browse/BEAM-1570
> Project: Beam
>  Issue Type: Bug
>  Components: website
>Reporter: Stephen Sisk
>Assignee: Sourabh Bajaj
>
> I see the following error when running the website test suite locally and on 
> jenkins: 
>   *  External link 
> http://ostatic.com/blog/apache-beam-unifies-batch-and-streaming-data-processing
>  failed: got a time out (response code 0)
> It looks like ostatic's website is down. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1545) Python sdk example run failed

2017-03-01 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15890756#comment-15890756
 ] 

Sourabh Bajaj commented on BEAM-1545:
-

This is fixed, so we should close this.

> Python sdk example run failed
> -
>
> Key: BEAM-1545
> URL: https://issues.apache.org/jira/browse/BEAM-1545
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Haoxiang
>Assignee: Sourabh Bajaj
>
> When I run the python sdk example with 
> https://beam.apache.org/get-started/quickstart-py/ show, run the command:
> python -m apache_beam.examples.wordcount --input 
> gs://dataflow-samples/shakespeare/kinglear.txt --output output.txt
> it was failed by the logs:
> INFO:root:Missing pipeline option (runner). Executing pipeline using the 
> default runner: DirectRunner.
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py",
>  line 162, in _run_module_as_main
> "__main__", fname, loader, pkg_name)
>   File 
> "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py",
>  line 72, in _run_code
> exec code in run_globals
>   File 
> "/Users/haoxiang/InterestingGitProject/beam/sdks/python/apache_beam/examples/wordcount.py",
>  line 107, in <module>
> run()
>   File 
> "/Users/haoxiang/InterestingGitProject/beam/sdks/python/apache_beam/examples/wordcount.py",
>  line 83, in run
> lines = p | 'read' >> ReadFromText(known_args.input)
>   File "apache_beam/io/textio.py", line 378, in __init__
> skip_header_lines=skip_header_lines)
>   File "apache_beam/io/textio.py", line 87, in __init__
> validate=validate)
>   File "apache_beam/io/filebasedsource.py", line 97, in __init__
> self._validate()
>   File "apache_beam/io/filebasedsource.py", line 171, in _validate
> if len(fileio.ChannelFactory.glob(self._pattern, limit=1)) <= 0:
>   File "apache_beam/io/fileio.py", line 281, in glob
> return gcsio.GcsIO().glob(path, limit)
> AttributeError: 'NoneType' object has no attribute 'GcsIO'



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (BEAM-1545) Python sdk example run failed

2017-02-24 Thread Sourabh Bajaj (JIRA)

[ 
https://issues.apache.org/jira/browse/BEAM-1545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15883455#comment-15883455
 ] 

Sourabh Bajaj commented on BEAM-1545:
-

You should run `pip install -e .[gcp,test]` instead of `python setup.py 
install`, as Ahmet suggested. That should fix the issue you're facing. 
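
For context, a sketch of why the AttributeError appears at all when the GCP
extras are missing: the gcsio import is optional and the module global stays
None. The import path and structure here are assumptions, not the exact
fileio code.

{code}
try:
  from apache_beam.io import gcsio  # importable only with the [gcp] extras
except ImportError:
  gcsio = None


def glob(pattern):
  if pattern.startswith('gs://'):
    # Without the extras, gcsio is None: 'NoneType' has no attribute 'GcsIO'.
    return gcsio.GcsIO().glob(pattern)
{code}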

> Python sdk example run failed
> -
>
> Key: BEAM-1545
> URL: https://issues.apache.org/jira/browse/BEAM-1545
> Project: Beam
>  Issue Type: Bug
>  Components: sdk-py
>Reporter: Haoxiang
>Assignee: Sourabh Bajaj
>
> When I run the python sdk example with 
> https://beam.apache.org/get-started/quickstart-py/ show, run the command:
> python -m apache_beam.examples.wordcount --input 
> gs://dataflow-samples/shakespeare/kinglear.txt --output output.txt
> it was failed by the logs:
> INFO:root:Missing pipeline option (runner). Executing pipeline using the 
> default runner: DirectRunner.
> Traceback (most recent call last):
>   File 
> "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py",
>  line 162, in _run_module_as_main
> "__main__", fname, loader, pkg_name)
>   File 
> "/usr/local/Cellar/python/2.7.11/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py",
>  line 72, in _run_code
> exec code in run_globals
>   File 
> "/Users/haoxiang/InterestingGitProject/beam/sdks/python/apache_beam/examples/wordcount.py",
>  line 107, in <module>
> run()
>   File 
> "/Users/haoxiang/InterestingGitProject/beam/sdks/python/apache_beam/examples/wordcount.py",
>  line 83, in run
> lines = p | 'read' >> ReadFromText(known_args.input)
>   File "apache_beam/io/textio.py", line 378, in __init__
> skip_header_lines=skip_header_lines)
>   File "apache_beam/io/textio.py", line 87, in __init__
> validate=validate)
>   File "apache_beam/io/filebasedsource.py", line 97, in __init__
> self._validate()
>   File "apache_beam/io/filebasedsource.py", line 171, in _validate
> if len(fileio.ChannelFactory.glob(self._pattern, limit=1)) <= 0:
>   File "apache_beam/io/fileio.py", line 281, in glob
> return gcsio.GcsIO().glob(path, limit)
> AttributeError: 'NoneType' object has no attribute 'GcsIO'



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (BEAM-1528) Remove all sphinx warnings from the pydocs

2017-02-21 Thread Sourabh Bajaj (JIRA)
Sourabh Bajaj created BEAM-1528:
---

 Summary: Remove all sphinx warnings from the pydocs
 Key: BEAM-1528
 URL: https://issues.apache.org/jira/browse/BEAM-1528
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py
Reporter: Sourabh Bajaj
Assignee: Ahmet Altay
Priority: Minor
 Fix For: Not applicable


There are still 4 open warnings in the pydoc generation that should be fixed. 
Once this is done, please add a test to the generation script that turns all 
warnings into errors.
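
A sketch of such a check: sphinx-build's -W flag promotes warnings to errors
and fails the build. The directory paths are placeholders, not Beam's actual
layout.

{code}
import subprocess
import sys

sys.exit(subprocess.call(
    ['sphinx-build',
     '-W',                     # turn all warnings into errors
     'target/docs/source',     # placeholder: generated .rst sources
     'target/docs/_build']))   # placeholder: output directory
{code}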



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (BEAM-1527) Check if the dependencies protorpc and python-gflags can be removed

2017-02-21 Thread Sourabh Bajaj (JIRA)
Sourabh Bajaj created BEAM-1527:
---

 Summary: Check if the dependencies protorpc and python-gflags can 
be removed
 Key: BEAM-1527
 URL: https://issues.apache.org/jira/browse/BEAM-1527
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py
Reporter: Sourabh Bajaj
Assignee: Ahmet Altay
Priority: Minor
 Fix For: Not applicable


These dependencies might be unused and should be removed in that case. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (BEAM-1509) Python sdk pom runs only on unix

2017-02-17 Thread Sourabh Bajaj (JIRA)
Sourabh Bajaj created BEAM-1509:
---

 Summary: Python sdk pom runs only on unix
 Key: BEAM-1509
 URL: https://issues.apache.org/jira/browse/BEAM-1509
 Project: Beam
  Issue Type: Bug
  Components: sdk-py
Affects Versions: Not applicable
Reporter: Sourabh Bajaj
Assignee: Ahmet Altay
Priority: Minor


https://github.com/apache/beam/blob/master/sdks/python/pom.xml#L133

There are lines like this in the pom.xml file in sdks/python that won't work 
on Windows, so we should investigate how to run this correctly.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (BEAM-1473) Remove unused windmill proto files from python sdk

2017-02-16 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj resolved BEAM-1473.
-
Resolution: Fixed

> Remove unused windmill proto files from python sdk
> --
>
> Key: BEAM-1473
> URL: https://issues.apache.org/jira/browse/BEAM-1473
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py
>Affects Versions: Not applicable
>Reporter: Sourabh Bajaj
>Assignee: Sourabh Bajaj
>Priority: Minor
> Fix For: Not applicable
>
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/internal/windmill_pb2.py
>  
> There are two unused windmill proto files in Beam that should be cleaned up.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (BEAM-1473) Remove unused windmill proto files from python sdk

2017-02-13 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1473?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj updated BEAM-1473:

Description: 
https://github.com/apache/beam/blob/master/sdks/python/apache_beam/internal/windmill_pb2.py
 

There are two unused windmill proto files in Beam that should be cleaned up.

> Remove unused windmill proto files from python sdk
> --
>
> Key: BEAM-1473
> URL: https://issues.apache.org/jira/browse/BEAM-1473
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py
>Affects Versions: Not applicable
>Reporter: Sourabh Bajaj
>Assignee: Sourabh Bajaj
>Priority: Minor
> Fix For: Not applicable
>
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/internal/windmill_pb2.py
>  
> There are two unused windmill proto files in Beam that should be cleaned up.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (BEAM-1473) Remove unused windmill proto files from python sdk

2017-02-13 Thread Sourabh Bajaj (JIRA)
Sourabh Bajaj created BEAM-1473:
---

 Summary: Remove unused windmill proto files from python sdk
 Key: BEAM-1473
 URL: https://issues.apache.org/jira/browse/BEAM-1473
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py
Affects Versions: Not applicable
Reporter: Sourabh Bajaj
Assignee: Sourabh Bajaj
Priority: Minor
 Fix For: Not applicable






--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (BEAM-1464) Make sure all tests use the TestPipeline class

2017-02-13 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj resolved BEAM-1464.
-
Resolution: Fixed

> Make sure all tests use the TestPipeline class
> --
>
> Key: BEAM-1464
> URL: https://issues.apache.org/jira/browse/BEAM-1464
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py
>Affects Versions: Not applicable
>Reporter: Sourabh Bajaj
>Assignee: Sourabh Bajaj
>Priority: Minor
> Fix For: Not applicable
>
>
> Some tests still explicitly specify the DirectRunner. They should all use 
> the TestPipeline class in those cases.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (BEAM-1464) Make sure all tests use the TestPipeline class

2017-02-10 Thread Sourabh Bajaj (JIRA)
Sourabh Bajaj created BEAM-1464:
---

 Summary: Make sure all tests use the TestPipeline class
 Key: BEAM-1464
 URL: https://issues.apache.org/jira/browse/BEAM-1464
 Project: Beam
  Issue Type: Improvement
  Components: sdk-py
Affects Versions: Not applicable
Reporter: Sourabh Bajaj
Assignee: Sourabh Bajaj
Priority: Minor
 Fix For: Not applicable


Some tests still explicitly specify the DirectRunner. They should all use the 
TestPipeline class in those cases.
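
A minimal sketch of the pattern the issue asks for; the import path is an
assumption (it has moved between releases):

{code}
from apache_beam.test_pipeline import TestPipeline

# The runner comes from the test options rather than being hardcoded.
p = TestPipeline()
# ... build the pipeline under test ...
p.run()
{code}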



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (BEAM-1430) PartitionFn should access only the element and not the context

2017-02-10 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj resolved BEAM-1430.
-
   Resolution: Fixed
Fix Version/s: Not applicable

> PartitionFn should access only the element and not the context
> --
>
> Key: BEAM-1430
> URL: https://issues.apache.org/jira/browse/BEAM-1430
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py
>Affects Versions: Not applicable
>Reporter: Sourabh Bajaj
>Assignee: Sourabh Bajaj
>Priority: Minor
> Fix For: Not applicable
>
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/core.py#L1182
>  should only access the element and not the full context.
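
A minimal sketch of the narrower signature: the partition decision is made
from the element alone, with no process context (names illustrative):

{code}
def partition_fn(element, num_partitions):
  # Only the element is consulted; no DoFn context is passed in.
  return hash(element) % num_partitions
{code}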



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (BEAM-1429) WindowFn assign should only have access to the element and timestamp

2017-02-10 Thread Sourabh Bajaj (JIRA)

 [ 
https://issues.apache.org/jira/browse/BEAM-1429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sourabh Bajaj resolved BEAM-1429.
-
   Resolution: Fixed
Fix Version/s: Not applicable

> WindowFn assign should only have access to the element and timestamp
> 
>
> Key: BEAM-1429
> URL: https://issues.apache.org/jira/browse/BEAM-1429
> Project: Beam
>  Issue Type: Improvement
>  Components: sdk-py
>Affects Versions: Not applicable
>Reporter: Sourabh Bajaj
>Assignee: Sourabh Bajaj
>Priority: Minor
> Fix For: Not applicable
>
>
> https://github.com/apache/beam/blob/master/sdks/python/apache_beam/transforms/window.py#L97
>  does not really need access to multiple windows at the same time. 
> Also, as we move to the NewDoFn, we have moved away from accessing the 
> windowSet as a whole. Change the function here to take only the element and 
> timestamp as the input. 
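
A minimal sketch of the narrower assign() described above, for a fixed-size
windowing fn; the names and the tuple-as-window shorthand are illustrative,
not the SDK's types:

{code}
class FixedWindows(object):
  def __init__(self, size):
    self.size = size

  def assign(self, timestamp, element=None):
    # Only the element and its timestamp are needed; no window set.
    start = timestamp - (timestamp % self.size)
    return [(start, start + self.size)]
{code}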



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

