Re: Implementing @OnWindowExpiration in StatefulParDo [BEAM-1589]

2018-03-20 Thread Romain Manni-Bucau
Hi Huygaa,

Can't it be predefined timers?

Romain

On Mar 20, 2018 at 00:52, "Huygaa Batsaikhan" wrote:

Hi everyone, I am working on BEAM-1589. In short, currently,
there is no default way of saving/flushing state before a window is garbage
collected.

My current plan is to provide a method annotation, @OnWindowExpiration,
which allows a user-provided callback function to be executed before garbage
collection. This annotation behaves very similarly to @OnTimer; therefore, the
implementation will mostly be a copy of the @OnTimer code. Let me know if you
have any considerations or suggestions.

Here is an example usage:
```
@OnWindowExpiration
public void myCleanupFunction(OnWindowExpirationContext c, State state) {
  c.output(state.read());
}
```

Thanks, Huygaa


Build failed in Jenkins: beam_Release_NightlySnapshot #719

2018-03-20 Thread Apache Jenkins Server
See 


Changes:

[sidhom] Add ExecutableStagePayload to aid runner stage reconstruction

[sidhom] Fix typo

[tgroh] Use InstructionRequestHandler in RemoteEnvironment

[axelmagn] Add a generic interface for the state service.

[herohde] Remove WindowedValue on PCollections for Go SDK

[herohde] CR: Fix comments to remove old windowing

[mariagh] Add support for streaming side inputs in the DirectRunner

[ccy] Add support for PaneInfo in WindowedValues

[axelmagn] Write unit tests for GrpcStateService.

[XuMingmin] Bump calcite and avatica versions (#4887)

--
[...truncated 44.23 KB...]
2018-03-20T07:03:48.806 [INFO] This project has been banned from the build due to previous failures.
2018-03-20T07:03:48.806 [INFO] Skipping Apache Beam :: SDKs :: Python :: Container
2018-03-20T07:03:48.806 [INFO] This project has been banned from the build due to previous failures.
2018-03-20T07:03:48.806 [INFO] Skipping Apache Beam :: Runners :: Java Fn Execution
2018-03-20T07:03:48.806 [INFO] This project has been banned from the build due to previous failures.
2018-03-20T07:03:48.807 [INFO] Skipping Apache Beam :: Runners :: Java Local Artifact Service
2018-03-20T07:03:48.807 [INFO] This project has been banned from the build due to previous failures.
2018-03-20T07:03:48.807 [INFO] Skipping Apache Beam :: Runners :: Reference
2018-03-20T07:03:48.807 [INFO] This project has been banned from the build due to previous failures.
2018-03-20T07:03:48.807 [INFO] Skipping Apache Beam :: Runners :: Reference :: Java
2018-03-20T07:03:48.807 [INFO] This project has been banned from the build due to previous failures.
2018-03-20T07:03:48.807 [INFO] Skipping Apache Beam :: Runners :: Reference :: Job Orchestrator
2018-03-20T07:03:48.807 [INFO] This project has been banned from the build due to previous failures.
2018-03-20T07:03:48.807 [INFO] Skipping Apache Beam :: Runners :: Flink
2018-03-20T07:03:48.807 [INFO] This project has been banned from the build due to previous failures.
2018-03-20T07:03:48.808 [INFO] Skipping Apache Beam :: Runners :: Gearpump
2018-03-20T07:03:48.808 [INFO] This project has been banned from the build due to previous failures.
2018-03-20T07:03:48.808 [INFO] Skipping Apache Beam :: Runners :: Spark
2018-03-20T07:03:48.808 [INFO] This project has been banned from the build due to previous failures.
2018-03-20T07:03:48.808 [INFO] Skipping Apache Beam :: Runners :: Apex
2018-03-20T07:0

Re: Implementing @OnWindowExpiration in StatefulParDo [BEAM-1589]

2018-03-20 Thread Etienne Chauchot
Hi,
When coding the GroupIntoBatches transform, I had a similar need. I implemented
it with an @OnTimer callback set this way:
timer.set(window.maxTimestamp().plus(allowedLateness));
At the time, Kenn suggested setting up a new @OnWindowExpiration annotation to
do exactly that, but more easily.
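For reference, here is a minimal sketch of that timer-based workaround as a
stateful DoFn (the names and the allowedLateness value are illustrative, not
the actual GroupIntoBatches code):

```
import org.apache.beam.sdk.state.TimeDomain;
import org.apache.beam.sdk.state.Timer;
import org.apache.beam.sdk.state.TimerSpec;
import org.apache.beam.sdk.state.TimerSpecs;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.values.KV;
import org.joda.time.Duration;

class ExpiringBufferFn extends DoFn<KV<String, String>, KV<String, String>> {
  private final Duration allowedLateness = Duration.standardMinutes(10); // illustrative

  @TimerId("endOfWindow")
  private final TimerSpec endOfWindowSpec = TimerSpecs.timer(TimeDomain.EVENT_TIME);

  @ProcessElement
  public void process(ProcessContext c, BoundedWindow window,
      @TimerId("endOfWindow") Timer timer) {
    // Fire right before the window's state is garbage collected.
    timer.set(window.maxTimestamp().plus(allowedLateness));
    // ... buffer c.element() in a state cell ...
  }

  @OnTimer("endOfWindow")
  public void onWindowExpiration(OnTimerContext c) {
    // ... read the buffered state and output it ...
  }
}
```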
So big +1.
Etienne
On Monday, March 19, 2018 at 23:51, Huygaa Batsaikhan wrote:
> Hi everyone, I am working on BEAM-1589. In short, currently, there is no 
> default way of saving/flushing state before a
> window is garbage collected.
> 
> My current plan is to provide a method annotation, @OnWindowExpiration, which 
> allows user-provided callback function
> to be executed before garbage collection. This annotation behaves very 
> similar to @OnTimer, therefore, implementation
> will mostly be a copy of OnTimer code. Let me know if you have any 
> considerations and suggestions.
> 
> Here is an example usage:
> ```
> @OnWindowExpiration
> public void myCleanupFunction(OnWindowExpirationContext c, State state) {
>   c.output(state.read());
> }
> ```
> 
> Thanks, Huygaa

Re: Implementing @OnWindowExpiration in StatefulParDo [BEAM-1589]

2018-03-20 Thread Jean-Baptiste Onofré
+1

It sounds good to me.

Regards
JB

On Mar 20, 2018 at 00:52, Huygaa Batsaikhan wrote:
>Hi everyone, I am working on BEAM-1589. In short, currently,
>there is no default way of saving/flushing state before a window is
>garbage
>collected.
>
>My current plan is to provide a method annotation, @OnWindowExpiration,
>which allows user-provided callback function to be executed before
>garbage
>collection. This annotation behaves very similar to @OnTimer,
>therefore,
>implementation will mostly be a copy of OnTimer code. Let me know if
>you
>have any considerations and suggestions.
>
>Here is an example usage:
>```
>@OnWindowExpiration
>public void myCleanupFunction(OnWindowExpirationContext c, State state)
>{
>  c.output(state.read());
>}
>```
>
>Thanks, Huygaa


Re: [VOTE] Release 2.4.0, release candidate #3

2018-03-20 Thread Ismaël Mejía
+1 (binding)

- Validated hashes
- mvn clean verify -Prelease OK
- Ran nexmark on direct/flink/spark (it works, save for the regression
already tracked on RC2).

Thanks Robert for managing the release.

P.S. One 'future' question: after seeing the wheel artifacts (which seem
to be Python version / OS specific) introduced in this RC, I was
wondering how we can validate that those are correct for future
releases (probably something to add to the validation guide or to
automate).

On Tue, Mar 20, 2018 at 7:40 AM, Valentyn Tymofieiev
 wrote:
> I also tried to run python streaming mobile gaming examples
> (hourly_team_score, leader_board) on direct runner with little success:
> https://issues.apache.org/jira/browse/BEAM-3889. I think they escaped our
> attention on previous release validations. I just tried them with 2.3.0 and
> didn't have much luck either. Since I am not sure when these examples last
> worked, and warnings show the examples are using deprecated constructs, I
> suspect the issue is with the examples, and we can address it separately from
> the 2.4.0 release.
>
> On Mon, Mar 19, 2018 at 7:21 PM, Valentyn Tymofieiev 
> wrote:
>>
>> +1.
>> Ran a Python Streaming wordcount pipeline on Direct and Dataflow runners
>> and Batch mobile gaming examples on Dataflow runner.
>>
>> On Mon, Mar 19, 2018 at 6:02 PM, Alan Myrvold  wrote:
>>>
>>> +1 I ran the java quickstarts against 2.4.0 and they passed.
>>> ./gradlew :release:runQuickstartsJava
>>> -Prepourl=https://repository.apache.org/content/repositories/orgapachebeam-1031
>>> -Pver=2.4.0
>>>
>>>
>>> On Mon, Mar 19, 2018 at 5:41 PM Ahmet Altay  wrote:

 I was able to run hourly_team_score. I was passing a wrong argument. No
 need for an alarm. :)

 On Mon, Mar 19, 2018 at 5:33 PM, Ahmet Altay  wrote:
>
> +1 Thank you Robert.
>
> Verified python mobile gaming examples using the wheel files on direct
> runner. Got user_score working but hourly_team_score failed with
> (https://issues.apache.org/jira/browse/BEAM-3824). Since this is an 
> example,
> I think it is fine to continue with the release. I will work on fixing the
> example post release.
>
> On Mon, Mar 19, 2018 at 2:46 PM, Konstantinos Katsiapis
>  wrote:
>>
>> +1, since Tf.Transform 0.6 depends on (and is blocked by) Beam 2.4
>>
>> On Sat, Mar 17, 2018 at 2:19 AM, Robert Bradshaw 
>> wrote:
>>>
>>> Hi everyone,
>>>
>>> Please review and vote on the release candidate #3 for the version
>>> 2.4.0,
>>> as follows:
>>> [ ] +1, Approve the release
>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>
>>> The complete staging area is available for your review, which
>>> includes:
>>> * JIRA release notes [1],
>>> * the official Apache source release to be deployed to
>>> dist.apache.org [2],
>>> which is signed with the key with fingerprint BDC9 89B0 1BD2 A463
>>> 6010 A1CA
>>> 8F15 5E09 610D 69FB [3],
>>> * all artifacts to be deployed to the Maven Central Repository [4],
>>> * source code tag "v2.4.0-RC3" [5],
>>> * website pull request listing the release and publishing the API
>>> reference
>>> manual [6].
>>> * Java artifacts were built with Maven 3.2.5 and OpenJDK 1.8.0_112.
>>> * Python artifacts are deployed along with the source release to the
>>> dist.apache.org [2].
>>>
>>> The validation spreadsheet is available at
>>>
>>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit?ts=5a1c7310#gid=1663314475
>>>
>>> The vote will be open for at least 72 hours. It is adopted by
>>> majority
>>> approval, with at least 3 PMC affirmative votes.
>>>
>>> Thanks,
>>> - Robert
>>>
>>> [1]
>>>
>>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12342682&projectId=12319527
>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.4.0/
>>> [3] https://dist.apache.org/repos/dist/dev/beam/KEYS
>>> [4]
>>> https://repository.apache.org/content/repositories/orgapachebeam-1031/
>>> [5] https://github.com/apache/beam/tree/v2.4.0-RC3
>>> [6] https://github.com/apache/beam-site/pull/398
>>
>>
>>
>>
>> --
>> Gus Katsiapis | Software Engineer | katsia...@google.com |
>> 650-918-7487
>
>

>>
>


Help with Dynamic writing

2018-03-20 Thread OrielResearch Eila Arich-Landkof
Hello all,

It was nice to meet you last week!!!

I am writing a genomic PCollection that is created from BigQuery to a folder.
Following is the code with its output, so you can run it with any small BQ
table and let me know what your thoughts are:

rows = [{u'index': u'GSM2313641', u'SNRPCP14': 0},{u'index': u'GSM231',
u'SNRPCP14': 0},{u'index': u'GSM2312355', u'SNRPCP14': 0},{u'index':
u'GSM2312372', u'SNRPCP14': 0}]

rows[1].keys()
# output:  [u'index', u'SNRPCP14']

# you can change `archs4.results_20180308_*` to any other table name with an
# index column
queries2 = rows | beam.Map(lambda x: (
    beam.io.Read(beam.io.BigQuerySource(
        project='orielresearch-188115',
        use_standard_sql=False,
        query=str('SELECT * FROM `archs4.results_20180308_*` where index=\'%s\'' % (x["index"])))),
    str('gs://archs4/output/' + x["index"] + '/')))

queries2
# output: a list of PCollections and the paths to write the PCollection data
# to

[(<Read PTransform>,
  'gs://archs4/output/GSM2313641/'),
 (<Read PTransform>,
  'gs://archs4/output/GSM231/'),
 (<Read PTransform>,
  'gs://archs4/output/GSM2312355/'),
 (<Read PTransform>,
  'gs://archs4/output/GSM2312372/')]


*# this is my challenge*
queries2 | 'write to relevant path' >> beam.io.WriteToText("SECOND COLUMN")

Do you have any idea how to sink the data to a text file? I have tried a few
other options and got stuck at the write transform.

Any advice is very appreciated.

Thanks,
Eila



-- 
Eila
www.orielresearch.org
https://www.meetup.com/Deep-Learning-In-Production/


PipelineOptions strict mode broken?

2018-03-20 Thread Romain Manni-Bucau
Hi guys,

PipelineOptionsFactory has a nice strict mode validating the options you
pass.

Concretely, if you pass --sudoMakeItWork, you will likely see:

java.lang.IllegalArgumentException: Class interface
org.apache.beam.sdk.options.PipelineOptions missing a property named '
sudoMakeItWork'.

This is not bad; however, the way it is implemented just doesn't work, and its
design is wrong:

1. the pipeline options factory relies on a cache, so only the options
available when the class is instantiated are available. For example, in a
container (web container, OSGi, or other = not a flat classpath) you will not
be able to load the IO options if the IOs are not in the same classloader
as the sdk core, which is quite a pitfall.
2. the validation leaks between options. The validation is not "the options
are valid" but "there is some option matching your parameter".

A case which is broken but "green" today is (not using exact names for the
example):

--runner=DirectRunner --sparkMaster=spark://localhost

this will work if I have both runners in the classpath, but it should
actually fail because one is invalid for the other.

To fix that I see 3 options:

1. relax the strict mode to be false by default and lazily evaluate the
options
2. don't allow lazily casting the options, but enforce doing it when the
instance is created; this means that a user must know all the PipelineOptions
children it relies on at creation time
(PipelineOptionsFactory.fromArgs(args).create(DirectOptions.class,
S3Options.class, MyIOOptions.class))
3. use a namespace/prefix for nested options, this way the previous example
would become:

--runner=DirectRunner --spark.master=spark://localhost

in this case the PipelineOptions instantiation would validate the prefix ""
and if SparkOptions are requested they would validate the prefix spark.*.

This is not perfect, but the current state, with a single eager registry for
the validation, is very hard to use as soon as you try to industrialize Beam
in something other than a main :(.

Does anyone have an idea that does not break the API?


Romain Manni-Bucau
@rmannibucau  |  Blog
 | Old Blog
 | Github  |
LinkedIn  | Book



Re: PipelineOptions strict mode broken?

2018-03-20 Thread Lukasz Cwik
The only current validator is the @Required validator; there were some
ideas to integrate another system to perform validation on options, like >=0
for numbers. I'm not sure how much use this has gotten from users. I would
be for leaving it as is (if users get value out of it) or removing it and
marking it deprecated for removal in the next major version.
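For reference, a minimal sketch of that existing validator (the option name is
illustrative):

```
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.Validation;

public interface MyRequiredOptions extends PipelineOptions {
  @Validation.Required
  String getInputFile();
  void setInputFile(String value);
}

// withValidation() fails fast if --inputFile is missing:
// MyRequiredOptions options = PipelineOptionsFactory.fromArgs(args)
//     .withValidation().as(MyRequiredOptions.class);
```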

Have you taken a look at PipelineOptionsFactory.register(...)?
It allows you to register all PipelineOptions interfaces for
parsing/validation reasons similar to your suggestion for
PipelineOptionsFactory.fromArgs(args).create(DirectOptions.class,
S3Options.class, MyIOOptions.class). Note that the user can also have their
PipelineOptions interface extend all interfaces that they want to ensure
are visible.
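A minimal sketch of the register(...) approach, assuming a hypothetical
MyIOOptions interface:

```
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RegisterExample {
  /** Hypothetical IO-specific options. */
  public interface MyIOOptions extends PipelineOptions {
    String getIoEndpoint();
    void setIoEndpoint(String value);
  }

  public static void main(String[] args) {
    // Registering the interface up front makes its properties known to the
    // factory, so --ioEndpoint passes strict validation during parsing.
    PipelineOptionsFactory.register(MyIOOptions.class);
    MyIOOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(MyIOOptions.class);
    System.out.println(options.getIoEndpoint());
  }
}
```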

Note that PipelineOptions was created to be a flat bag of string to type-safe
accessor/serializer/deserializer mappings and nothing more. There has been
little discussion about making it hierarchical and what that would mean.

Also, wouldn't making PipelineOptionsFactory not a static singleton (as I
suggested on your PR) address most of the classloader issues that you
speak of?


On Tue, Mar 20, 2018 at 9:26 AM Romain Manni-Bucau 
wrote:

> Hi guys,
>
> PipelineOptionsFactory has a nice strict mode validating the options you
> pass.
>
> Concretely if you pass --sudoMakeItWork you will likely see:
>
> java.lang.IllegalArgumentException: Class interface
> org.apache.beam.sdk.options.PipelineOptions missing a property named '
> sudoMakeItWork'.
>
> This is not bad however the way it is implemented just doesn't work and
> its design is wrong:
>
> 1. the pipeline options factory relies on a cache so only the options
> available when the class is instantiated are available. For example in a
> container (web container, OSGi or other = not flat classpath) you will not
> be able to load the IO options if the IO are not in the same classloader
> than the sdk core which is quite a pitfall.
> 2. the validation leads between options. The validation is not "the
> options are valid" but "there is some option matching your parameter"
>
> A case which is broken but "green" today is (not using exact names for the
> example):
>
> --runner=DirectRunner --sparkMaster= spark://localhost
>
> this will work if I have both runner in the classpath but it should
> actually fail cause one is invalid for the other.
>
> To fix that I see 3 options:
>
> 1. relax the strict mode to be false by default and lazily evaluate the
> options
> 2. don't allow to lazily cast the options but enforce to do it when the
> instance is created, this means that a user must know all PipelineOptions
> children it relies on at creation time
> (PipelineOptionsFactory.fromArgs(args).create(DirectOptions.class,
> S3Options.class, MyIOOptions.class)
> 3. use a namespace/prefix for nested options, this way the previous
> example would become:
>
> --runner=DirectRunner --spark.master=spark://localhost
>
> in this case the PipelineOptions instantiation would validate the prefix
> "" and if SparkOptions are requested they would validate the prefix spark.*.
>
> This is not perfect but current state with a single eager registry for the
> validation is very hardly usable as soon as you try to industrialize beam
> in something else than a main :(.
>
> Any one has an idea to not break the API?
>
>
> Romain Manni-Bucau
> @rmannibucau  |  Blog
>  | Old Blog
>  | Github
>  | LinkedIn
>  | Book
> 
>


Re: Help with Dynamic writing

2018-03-20 Thread Chamikara Jayalath
Hi Eila,

Please find my comments inline.

On Tue, Mar 20, 2018 at 8:02 AM OrielResearch Eila Arich-Landkof <
e...@orielresearch.org> wrote:

> Hello all,
>
> It was nice to meet you last week!!!
>
>
It was nice to meet you as well :)


> I am writing genomic pCollection that is created from bigQuery to a
> folder. Following is the code with output so you can run it with any small
> BQ table and let me know what your thoughts are:
>
> rows = [{u'index': u'GSM2313641', u'SNRPCP14': 0},{u'index':
> u'GSM231', u'SNRPCP14': 0},{u'index': u'GSM2312355', u'SNRPCP14':
> 0},{u'index': u'GSM2312372', u'SNRPCP14': 0}]
>
> rows[1].keys()
> # output:  [u'index', u'SNRPCP14']
>
> # you can change `archs4.results_20180308_ to any other table name with
> index column
> queries2 = rows | beam.Map(lambda x:
> (beam.io.Read(beam.io.BigQuerySource(project='orielresearch-188115',
> use_standard_sql=False, query=str('SELECT * FROM
> `archs4.results_20180308_*` where index=\'%s\'' % (x["index"],
>str('gs://archs4/output/'+x["index"]+'/')))
>

I don't think the above code will work (it's not portable across runners, at
least). BigQuerySource (along with the Read transform) has to be applied to a
Pipeline object. So probably change this to a for loop that creates a set of
read transforms, and use Flatten to create a single PCollection.


>
> queries2
> # output: a list of pCollection and the path to write the pCollection data
> to
>
> [(,
>   'gs://archs4/output/GSM2313641/'),
>  (,
>   'gs://archs4/output/GSM231/'),
>  (,
>   'gs://archs4/output/GSM2312355/'),
>  (,
>   'gs://archs4/output/GSM2312372/')]
>
>
What you get here is a PCollection of PTransform objects, which is not
useful.


>
> *# this is my challenge*
> queries2 | 'write to relevant path' >> beam.io.WriteToText("SECOND COLUMN")
>
>
Once you update the above code, you will get a proper PCollection of elements
read from BigQuery. You can transform and write this (to files, BQ, or any
other sink) as needed.
Please see the programming guide on how to write to text files (section 5.3;
click the Python tab): https://beam.apache.org/documentation/programming-guide/

Thanks,
Cham


> Do you have any idea how to sink the data to a text file? I have tried few
> other options and was stuck at the write transform
>
> Any advice is very appreciated.
>
> Thanks,
> Eila
>
>
>
> --
> Eila
> www.orielresearch.org
> https://www.meetup.com/Deep-Learning-In-Production/
>


Re: PipelineOptions strict mode broken?

2018-03-20 Thread Romain Manni-Bucau
2018-03-20 17:53 GMT+01:00 Lukasz Cwik :

> The only current validator is the @Required validator, there were some
> ideas to integrate another system to perform validation on options like >=0
> for numbers. I'm not sure how much use this has gotten from users, I would
> be for leaving it as is (if users get value out of it) or removing it and
> mark it deprecated for removal in the next major version.
>
> Have you taken a look at PipelineOptionsFactory.register(...)?
> It allows you to register all PipelineOptions interfaces for
> parsing/validation reasons similar to your suggestion for
> PipelineOptionsFactory.fromArgs(args).create(DirectOptions.class,
> S3Options.class, MyIOOptions.class). Note that the user can also have their
> PipelineOptions interface extend all interfaces that they want to ensure
> are visible.
>

Sure, this is linked to the discussion we had on the cache in the POF PR. If it
moves to something that is not cached anymore, it should work OOTB, but this
requires deeper work since the cache is assumed ATM.

Also note that it would require one lookup per classloader, potentially not
shared between IOs (which is fine, just not as straightforward as reading
these lines can make it look).


>
> Note that PipelineOptions was created to be a flat bag of string to type
> safe accessor/serializer/deserializer and nothing more. There has been
> little discussion about making it hierarchical and what that would mean.
>
> Also, wouldn't making PipelineOptionsFactory not a static singleton (like
> how I suggested on your PR) address most of the classloader issues that you
> speak of?
>

Half, the "find it" case will work, the not wrongly failing validation case
would still be silent and not obvious if we dont split the configs in
namespaces which are easier to understand. That said it wouldnt be a
blocker anymore, you are right.


>
>
> On Tue, Mar 20, 2018 at 9:26 AM Romain Manni-Bucau 
> wrote:
>
>> Hi guys,
>>
>> PipelineOptionsFactory has a nice strict mode validating the options you
>> pass.
>>
>> Concretely if you pass --sudoMakeItWork you will likely see:
>>
>> java.lang.IllegalArgumentException: Class interface
>> org.apache.beam.sdk.options.PipelineOptions missing a property named '
>> sudoMakeItWork'.
>>
>> This is not bad however the way it is implemented just doesn't work and
>> its design is wrong:
>>
>> 1. the pipeline options factory relies on a cache so only the options
>> available when the class is instantiated are available. For example in a
>> container (web container, OSGi or other = not flat classpath) you will not
>> be able to load the IO options if the IO are not in the same classloader
>> than the sdk core which is quite a pitfall.
>> 2. the validation leads between options. The validation is not "the
>> options are valid" but "there is some option matching your parameter"
>>
>> A case which is broken but "green" today is (not using exact names for
>> the example):
>>
>> --runner=DirectRunner --sparkMaster= spark://localhost
>>
>> this will work if I have both runner in the classpath but it should
>> actually fail cause one is invalid for the other.
>>
>> To fix that I see 3 options:
>>
>> 1. relax the strict mode to be false by default and lazily evaluate the
>> options
>> 2. don't allow to lazily cast the options but enforce to do it when the
>> instance is created, this means that a user must know all PipelineOptions
>> children it relies on at creation time (PipelineOptionsFactory.
>> fromArgs(args).create(DirectOptions.class, S3Options.class,
>> MyIOOptions.class)
>> 3. use a namespace/prefix for nested options, this way the previous
>> example would become:
>>
>> --runner=DirectRunner --spark.master=spark://localhost
>>
>> in this case the PipelineOptions instantiation would validate the prefix
>> "" and if SparkOptions are requested they would validate the prefix spark.*.
>>
>> This is not perfect but current state with a single eager registry for
>> the validation is very hardly usable as soon as you try to industrialize
>> beam in something else than a main :(.
>>
>> Any one has an idea to not break the API?
>>
>>
>> Romain Manni-Bucau
>> @rmannibucau  |  Blog
>>  | Old Blog
>>  | Github
>>  | LinkedIn
>>  | Book
>> 
>>
>


Re: [VOTE] Release 2.4.0, release candidate #3

2018-03-20 Thread Robert Bradshaw
On Tue, Mar 20, 2018 at 4:08 AM Ismaël Mejía  wrote:

> +1 (binding)
>
> - Validated hashes
> - mvn clean verify -Prelease OK
> - Ran nexmark on direct/flink/spark (it works, save for the regression
> already tracked on RC2).
>
> Thanks Robert for managing the release.
>
> P.S. One 'future' question: after seeing the wheel artifacts (which seem
> to be Python version / OS specific) introduced in this RC, I was
> wondering how we can validate that those are correct for future
> releases (probably something to add to the validation guide or to
> automate).
>

Good question. We should automate this. If their generation is automated
(and runs tests) we can also have higher confidence in their correctness.

On Tue, Mar 20, 2018 at 7:40 AM, Valentyn Tymofieiev
>  wrote:
> > I also tried to run python streaming mobile gaming examples
> > (hourly_team_score, leader_board) on direct runner with little success:
> > https://issues.apache.org/jira/browse/BEAM-3889. I think  they escaped
> our
> > attention on previous release validations. I just tried them with 2.3.0
> and
> > didn't have much luck either. Since I am not sure when these examples
> last
> > worked and warnings shows the examples are using deprecated constructs, I
> > suspect the issue is with examples, and we can address it separately from
> > 2.4.0 release.
> >
> > On Mon, Mar 19, 2018 at 7:21 PM, Valentyn Tymofieiev <
> valen...@google.com>
> > wrote:
> >>
> >> +1.
> >> Ran a Python Streaming wordcount pipeline on Direct and Dataflow runners
> >> and Batch mobile gaming examples on Dataflow runner.
> >>
> >> On Mon, Mar 19, 2018 at 6:02 PM, Alan Myrvold 
> wrote:
> >>>
> >>> +1 I ran the java quickstarts against 2.4.0 and they passed.
> >>> ./gradlew :release:runQuickstartsJava
> >>> -Prepourl=
> https://repository.apache.org/content/repositories/orgapachebeam-1031
> >>> -Pver=2.4.0
> >>>
> >>>
> >>> On Mon, Mar 19, 2018 at 5:41 PM Ahmet Altay  wrote:
> 
>  I was able to run hourly_team_score. I was passing a wrong argument.
> No
>  need for an alarm. :)
> 
>  On Mon, Mar 19, 2018 at 5:33 PM, Ahmet Altay 
> wrote:
> >
> > +1 Thank you Robert.
> >
> > Verified python mobile gaming examples using the wheel files on
> direct
> > runner. Got user_score working but hourly_team_score failed with
> > (https://issues.apache.org/jira/browse/BEAM-3824). Since this is an
> example,
> > I think it is fine to continue with the release. I will work on
> fixing the
> > example post release.
> >
> > On Mon, Mar 19, 2018 at 2:46 PM, Konstantinos Katsiapis
> >  wrote:
> >>
> >> +1, since Tf.Transform 0.6 depends on (and is blocked by) Beam 2.4
> >>
> >> On Sat, Mar 17, 2018 at 2:19 AM, Robert Bradshaw <
> rober...@google.com>
> >> wrote:
> >>>
> >>> Hi everyone,
> >>>
> >>> Please review and vote on the release candidate #3 for the version
> >>> 2.4.0,
> >>> as follows:
> >>> [ ] +1, Approve the release
> >>> [ ] -1, Do not approve the release (please provide specific
> comments)
> >>>
> >>> The complete staging area is available for your review, which
> >>> includes:
> >>> * JIRA release notes [1],
> >>> * the official Apache source release to be deployed to
> >>> dist.apache.org [2],
> >>> which is signed with the key with fingerprint BDC9 89B0 1BD2 A463
> >>> 6010 A1CA
> >>> 8F15 5E09 610D 69FB [3],
> >>> * all artifacts to be deployed to the Maven Central Repository [4],
> >>> * source code tag "v2.4.0-RC3" [5],
> >>> * website pull request listing the release and publishing the API
> >>> reference
> >>> manual [6].
> >>> * Java artifacts were built with Maven 3.2.5 and OpenJDK 1.8.0_112.
> >>> * Python artifacts are deployed along with the source release to
> the
> >>> dist.apache.org [2].
> >>>
> >>> The validation spreadsheet is available at
> >>>
> >>>
> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit?ts=5a1c7310#gid=1663314475
> >>>
> >>> The vote will be open for at least 72 hours. It is adopted by
> >>> majority
> >>> approval, with at least 3 PMC affirmative votes.
> >>>
> >>> Thanks,
> >>> - Robert
> >>>
> >>> [1]
> >>>
> >>>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12342682&projectId=12319527
> >>> [2] https://dist.apache.org/repos/dist/dev/beam/2.4.0/
> >>> [3] https://dist.apache.org/repos/dist/dev/beam/KEYS
> >>> [4]
> >>>
> https://repository.apache.org/content/repositories/orgapachebeam-1031/
> >>> [5] https://github.com/apache/beam/tree/v2.4.0-RC3
> >>> [6] https://github.com/apache/beam-site/pull/398
> >>
> >>
> >>
> >>
> >> --
> >> Gus Katsiapis | Software Engineer | katsia...@google.com |
> >> 650-918-7487
> >
> >
> 
> >>
> >
>


Re: [VOTE] Release 2.4.0, release candidate #3

2018-03-20 Thread Reuven Lax
+1 (binding)


On Tue, Mar 20, 2018 at 10:46 AM Robert Bradshaw 
wrote:

> On Tue, Mar 20, 2018 at 4:08 AM Ismaël Mejía  wrote:
>
>> +1 (binding)
>>
>> - Validated hashs
>> - mvn clean verify -Prelease OK
>> - Run nexmark on direct/flink/spark (it works save the regression
>> already tracked on RC2).
>>
>> Thanks Robert for being managing the release.
>>
>> ps. One 'future'  question after seeing the wheel artifacts (that seem
>> to be python version / OS specific) introduced in this RC I was
>> wondering how can we validate that those are correct for future
>> releases (probably something to add to the validation guide or to
>> automatize).
>>
>
> Good question. We should automate this. If their generation is automated
> (and runs tests) we can also have higher confidence in their correctness.
>
> On Tue, Mar 20, 2018 at 7:40 AM, Valentyn Tymofieiev
>>  wrote:
>> > I also tried to run python streaming mobile gaming examples
>> > (hourly_team_score, leader_board) on direct runner with little success:
>> > https://issues.apache.org/jira/browse/BEAM-3889. I think  they escaped
>> our
>> > attention on previous release validations. I just tried them with 2.3.0
>> and
>> > didn't have much luck either. Since I am not sure when these examples
>> last
>> > worked and warnings shows the examples are using deprecated constructs,
>> I
>> > suspect the issue is with examples, and we can address it separately
>> from
>> > 2.4.0 release.
>> >
>> > On Mon, Mar 19, 2018 at 7:21 PM, Valentyn Tymofieiev <
>> valen...@google.com>
>> > wrote:
>> >>
>> >> +1.
>> >> Ran a Python Streaming wordcount pipeline on Direct and Dataflow
>> runners
>> >> and Batch mobile gaming examples on Dataflow runner.
>> >>
>> >> On Mon, Mar 19, 2018 at 6:02 PM, Alan Myrvold 
>> wrote:
>> >>>
>> >>> +1 I ran the java quickstarts against 2.4.0 and they passed.
>> >>> ./gradlew :release:runQuickstartsJava
>> >>> -Prepourl=
>> https://repository.apache.org/content/repositories/orgapachebeam-1031
>> >>> -Pver=2.4.0
>> >>>
>> >>>
>> >>> On Mon, Mar 19, 2018 at 5:41 PM Ahmet Altay  wrote:
>> 
>>  I was able to run hourly_team_score. I was passing a wrong argument.
>> No
>>  need for an alarm. :)
>> 
>>  On Mon, Mar 19, 2018 at 5:33 PM, Ahmet Altay 
>> wrote:
>> >
>> > +1 Thank you Robert.
>> >
>> > Verified python mobile gaming examples using the wheel files on
>> direct
>> > runner. Got user_score working but hourly_team_score failed with
>> > (https://issues.apache.org/jira/browse/BEAM-3824). Since this is
>> an example,
>> > I think it is fine to continue with the release. I will work on
>> fixing the
>> > example post release.
>> >
>> > On Mon, Mar 19, 2018 at 2:46 PM, Konstantinos Katsiapis
>> >  wrote:
>> >>
>> >> +1, since Tf.Transform 0.6 depends on (and is blocked by) Beam 2.4
>> >>
>> >> On Sat, Mar 17, 2018 at 2:19 AM, Robert Bradshaw <
>> rober...@google.com>
>> >> wrote:
>> >>>
>> >>> Hi everyone,
>> >>>
>> >>> Please review and vote on the release candidate #3 for the version
>> >>> 2.4.0,
>> >>> as follows:
>> >>> [ ] +1, Approve the release
>> >>> [ ] -1, Do not approve the release (please provide specific
>> comments)
>> >>>
>> >>> The complete staging area is available for your review, which
>> >>> includes:
>> >>> * JIRA release notes [1],
>> >>> * the official Apache source release to be deployed to
>> >>> dist.apache.org [2],
>> >>> which is signed with the key with fingerprint BDC9 89B0 1BD2 A463
>> >>> 6010 A1CA
>> >>> 8F15 5E09 610D 69FB [3],
>> >>> * all artifacts to be deployed to the Maven Central Repository
>> [4],
>> >>> * source code tag "v2.4.0-RC3" [5],
>> >>> * website pull request listing the release and publishing the API
>> >>> reference
>> >>> manual [6].
>> >>> * Java artifacts were built with Maven 3.2.5 and OpenJDK
>> 1.8.0_112.
>> >>> * Python artifacts are deployed along with the source release to
>> the
>> >>> dist.apache.org [2].
>> >>>
>> >>> The validation spreadsheet is available at
>> >>>
>> >>>
>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit?ts=5a1c7310#gid=1663314475
>> >>>
>> >>> The vote will be open for at least 72 hours. It is adopted by
>> >>> majority
>> >>> approval, with at least 3 PMC affirmative votes.
>> >>>
>> >>> Thanks,
>> >>> - Robert
>> >>>
>> >>> [1]
>> >>>
>> >>>
>> https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12342682&projectId=12319527
>> >>> [2] https://dist.apache.org/repos/dist/dev/beam/2.4.0/
>> >>> [3] https://dist.apache.org/repos/dist/dev/beam/KEYS
>> >>> [4]
>> >>>
>> https://repository.apache.org/content/repositories/orgapachebeam-1031/
>> >>> [5] https://github.com/apache/beam/tree/v2.4.0-RC3
>> >>> [6] https://github.com/apache/beam-site/pull/398
>

[RESULT] [VOTE] Release 2.4.0, release candidate #3

2018-03-20 Thread Robert Bradshaw
I'm happy to announce that we have unanimously approved this release.

There are 9 approving votes, 5 of which are binding:
* Lukasz Cwik
* Ahmet Altay
* Robert Bradshaw
* Jean-Baptiste Onofré
* Ismaël Mejía

The strong desire to get the teardown fixes in ASAP was also noted.

Thanks everyone!


Re: Pubsub API feedback

2018-03-20 Thread Ahmet Altay
Thank you Udi. Left some high level comments on the PR.


On Mon, Mar 19, 2018 at 5:13 PM, Udi Meiri  wrote:

> Hi,
> I wanted to get feedback about the upcoming Python Pubsub API. It is
> currently experimental and only supports reading and writing UTF-8 strings.
> My current proposal only concerns reading from Pubsub.
>
> Classes:
> - PubsubMessage: encapsulates Pubsub message payload and attributes.
>
> PTransforms:
> - ReadMessagesFromPubSub: Outputs elements of type ``PubsubMessage``.
>
> - ReadPayloadsFromPubSub: Outputs elements of type ``str``.
>
> - ReadStringsFromPubSub: Outputs elements of type ``unicode``, decoded
> from UTF-8.
>
> Description of common PTransform arguments:
>   topic: Cloud Pub/Sub topic in the form
> "projects/<project>/topics/<topic>". If provided, subscription must be None.
>   subscription: Existing Cloud Pub/Sub subscription to use in the form
> "projects/<project>/subscriptions/<subscription>". If not specified,
> a temporary subscription will be created from the specified topic. If
> provided, topic must be None.
>   id_label: The attribute on incoming Pub/Sub messages to use as a unique
> record identifier. When specified, the value of this attribute (which
> can be any string that uniquely identifies the record) will be used for
> deduplication of messages. If not provided, we cannot guarantee
> that no duplicate data will be delivered on the Pub/Sub stream. In this
> case, deduplication of the stream will be strictly best effort.
>   timestamp_attribute: Message value to use as element timestamp. If None,
> uses message publishing time as the timestamp.
> Timestamp values should be in one of two formats:
> - A numerical value representing the number of milliseconds since the
> Unix
>   epoch.
> - A string in RFC 3339 format. For example,
>   {@code 2015-10-29T23:41:41.123Z}. The sub-second component of the
>   timestamp is optional, and digits beyond the first three (i.e., time
> units
>   smaller than milliseconds) will be ignored.
>
> Code: https://github.com/udim/beam/blob/b981dd618e9e1f667597eec2a91c7265a389c405/sdks/python/apache_beam/io/gcp/pubsub.py
> PR: https://github.com/apache/beam/pull/4901
>
>
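For comparison, the Java SDK's existing PubsubIO already exposes analogous
knobs; a minimal sketch (the topic and attribute names are illustrative):

```
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubIO;
import org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class ReadPubsubExample {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    PCollection<PubsubMessage> messages =
        p.apply(PubsubIO.readMessagesWithAttributes()
            .fromTopic("projects/my-project/topics/my-topic") // illustrative
            .withIdAttribute("recordId")            // analogous to id_label
            .withTimestampAttribute("eventTime"));  // analogous to timestamp_attribute
    p.run().waitUntilFinish();
  }
}
```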


Re: Help with Dynamic writing

2018-03-20 Thread OrielResearch Eila Arich-Landkof
Hi Cham,

Please see inline. If possible, code / pseudo code will help a lot.
Thanks,
Eila

On Tue, Mar 20, 2018 at 1:15 PM, Chamikara Jayalath 
wrote:

> Hi Eila,
>
> Please find my comments inline.
>
> On Tue, Mar 20, 2018 at 8:02 AM OrielResearch Eila Arich-Landkof <
> e...@orielresearch.org> wrote:
>
>> Hello all,
>>
>> It was nice to meet you last week!!!
>>
>>
> It was nice to meet you as well :)
>
>
>> I am writing genomic pCollection that is created from bigQuery to a
>> folder. Following is the code with output so you can run it with any small
>> BQ table and let me know what your thoughts are:
>>
>> This init is only for debugging. In production I will use the pipeline
syntax

> rows = [{u'index': u'GSM2313641', u'SNRPCP14': 0},{u'index':
>> u'GSM231', u'SNRPCP14': 0},{u'index': u'GSM2312355', u'SNRPCP14':
>> 0},{u'index': u'GSM2312372', u'SNRPCP14': 0}]
>>
>> rows[1].keys()
>> # output:  [u'index', u'SNRPCP14']
>>
>> # you can change `archs4.results_20180308_ to any other table name with
>> index column
>> queries2 = rows | beam.Map(lambda x: (beam.io.Read(beam.io.
>> BigQuerySource(project='orielresearch-188115', use_standard_sql=False,
>> query=str('SELECT * FROM `archs4.results_20180308_*` where index=\'%s\'' %
>> (x["index"],
>>str('gs://archs4/output/'+x["
>> index"]+'/')))
>>
>
> I don't think above code will work (not portable across runners at least).
> BigQuerySource (along with Read transform) have to be applied to a Pipeline
> object. So probably change this to a for loop that creates a set of read
> transforms and use Flatten to create a single PCollection.
>
For debugging, I am running on the local Datalab runner. For production, I
will be running only the Dataflow runner. I think that I was able to query the
tables that way; I will double-check it. The indexes could go into the millions -
my concern is that I will not be able to leverage Beam's distribution
capability when I use the loop option. Any thoughts on that?

>
>
>>
>> queries2
>> # output: a list of pCollection and the path to write the pCollection
>> data to
>>
>> [(,
>>   'gs://archs4/output/GSM2313641/'),
>>  (,
>>   'gs://archs4/output/GSM231/'),
>>  (,
>>   'gs://archs4/output/GSM2312355/'),
>>  (,
>>   'gs://archs4/output/GSM2312372/')]
>>
>>
> What you got here is a PCollection of PTransform objects which is not
> useful.
>
>
>>
>> *# this is my challenge*
>> queries2 | 'write to relevant path' >> beam.io.WriteToText("SECOND
>> COLUMN")
>>
>>
> Once you update above code you will get a proper PCollection of elements
> read from BigQuery. You can transform and write this (to files, BQ, or any
> other sink) as needed.
>

It is a list of tuples with a PCollection and the path to write to. The path
is not unique, and I might have more than one PCollection written to the
same destination. How do I pass the path from the tuple list as a
parameter to the text file name? Could you please add the code that you
were thinking about?

> Please see programming guide on how to write to text files (section 5.3
> and click Python tab): https://beam.apache.org/documentation/programming-
> guide/
>
> Thanks,
> Cham
>
>
>> Do you have any idea how to sink the data to a text file? I have tried
>> few other options and was stuck at the write transform
>>
>> Any advice is very appreciated.
>>
>> Thanks,
>> Eila
>>
>>
>>
>> --
>> Eila
>> www.orielresearch.org
>> https://www.meetup.com/Deep-Learning-In-Production/
>>
>


-- 
Eila
www.orielresearch.org
https://www.meetup.com/Deep-Learning-In-Production/


Re: Help with Dynamic writing

2018-03-20 Thread Chamikara Jayalath
On Tue, Mar 20, 2018 at 12:54 PM OrielResearch Eila Arich-Landkof <
e...@orielresearch.org> wrote:

> Hi Cham,
>
> Please see inline. If possible, code / pseudo code will help a lot.
> Thanks,
> Eila
>
> On Tue, Mar 20, 2018 at 1:15 PM, Chamikara Jayalath 
> wrote:
>
>> Hi Eila,
>>
>> Please find my comments inline.
>>
>> On Tue, Mar 20, 2018 at 8:02 AM OrielResearch Eila Arich-Landkof <
>> e...@orielresearch.org> wrote:
>>
>>> Hello all,
>>>
>>> It was nice to meet you last week!!!
>>>
>>>
>> It was nice to meet you as well :)
>>
>>
>>> I am writing genomic pCollection that is created from bigQuery to a
>>> folder. Following is the code with output so you can run it with any small
>>> BQ table and let me know what your thoughts are:
>>>
>>> This init is only for debugging. In production I will use the pipeline
> syntax
>
>> rows = [{u'index': u'GSM2313641', u'SNRPCP14': 0},{u'index':
>>> u'GSM231', u'SNRPCP14': 0},{u'index': u'GSM2312355', u'SNRPCP14':
>>> 0},{u'index': u'GSM2312372', u'SNRPCP14': 0}]
>>>
>>> rows[1].keys()
>>> # output:  [u'index', u'SNRPCP14']
>>>
>>> # you can change `archs4.results_20180308_ to any other table name with
>>> index column
>>> queries2 = rows | beam.Map(lambda x: 
>>> (beam.io.Read(beam.io.BigQuerySource(project='orielresearch-188115',
>>> use_standard_sql=False, query=str('SELECT * FROM
>>> `archs4.results_20180308_*` where index=\'%s\'' % (x["index"],
>>>
>>>  str('gs://archs4/output/'+x["index"]+'/')))
>>>
>>
>> I don't think above code will work (not portable across runners at
>> least). BigQuerySource (along with Read transform) have to be applied to a
>> Pipeline object. So probably change this to a for loop that creates a set
>> of read transforms and use Flatten to create a single PCollection.
>>
> For debug, I am running on the local datalab runner. For the production, I
> will be running only dataflow runner. I think that I was able to query the
> tables that way, I will double check it. The indexes could go to millions -
> my concern is that I will not be able to leverage on Beam distribution
> capability when I use the the loop option. Any thoughts on that?
>

You mean you'll have millions of queries? That will not be scalable. My
suggestion was to loop over the queries. Can you reduce this to one or a small
number of queries and perform further processing in Beam?


>
>>
>>>
>>> queries2
>>> # output: a list of pCollection and the path to write the pCollection
>>> data to
>>>
>>> [(,
>>>   'gs://archs4/output/GSM2313641/'),
>>>  (,
>>>   'gs://archs4/output/GSM231/'),
>>>  (,
>>>   'gs://archs4/output/GSM2312355/'),
>>>  (,
>>>   'gs://archs4/output/GSM2312372/')]
>>>
>>>
>> What you got here is a PCollection of PTransform objects which is not
>> useful.
>>
>>
>>>
>>> *# this is my challenge*
>>> queries2 | 'write to relevant path' >> beam.io.WriteToText("SECOND
>>> COLUMN")
>>>
>>>
>> Once you update above code you will get a proper PCollection of elements
>> read from BigQuery. You can transform and write this (to files, BQ, or any
>> other sink) as needed.
>>
>
> it is a list of tupples with PCollection and the path to write to. the
> path is not unique and I might have more than one PCollection written to
> the same destination. How do I pass the path from the tupple list as a
> parameter to the text file name? Could you please add the code that you
> were thinking about?
>

The Python SDK does not support writing to different files based on the values
of the data (dynamic writes). So you'll have to either partition the data into
separate PCollections or write all the data into the same location.

Here's *pseudocode* (untested) for reading from a few queries, partitioning
into several PCollections, and writing to different destinations:

import apache_beam as beam

def my_partition_fn(element, num_partitions):
  # Placeholder: route each element to a partition, e.g. by its 'index' value.
  return hash(element['index']) % num_partitions

queries = ['select * from A', 'select * from B']
num_partitions = 2  # one partition per output destination

p = beam.Pipeline()
pcollections = []
for i, query in enumerate(queries):
  pc = p | 'Read%d' % i >> beam.io.Read(beam.io.BigQuerySource(query=query))
  pcollections.append(pc)

all_data = pcollections | beam.Flatten()
partitions = all_data | beam.Partition(my_partition_fn, num_partitions)
for i, partition in enumerate(partitions):
  # 'gs://my_bucket/part-%d' is an illustrative output prefix.
  partition | 'Write%d' % i >> beam.io.WriteToText('gs://my_bucket/part-%d' % i)
Hope this helps.

Thanks,
Cham



> Please see programming guide on how to write to text files (section 5.3
>> and click Python tab):
>> https://beam.apache.org/documentation/programming-guide/
>>
>> Thanks,
>> Cham
>>
>>
>>> Do you have any idea how to sink the data to a text file? I have tried
>>> few other options and was stuck at the write transform
>>>
>>> Any advice is very appreciated.
>>>
>>> Thanks,
>>> Eila
>>>
>>>
>>>
>>> --
>>> Eila
>>> www.orielresearch.org
>>> https://www.meetup.com/Deep-Learning-In-Production/
>>>
>>
>
>
> --
> Eila
> www.orielresearch.org
> https://www.meetup.com/Deep-Learning-In-Production/
>


Common model for runners

2018-03-20 Thread Ron Gonzalez
Hi,
  When I build a data flow using the Beam SDK, can someone point me to the
code that represents the underlying representation of the Beam model itself?
  Is there an API that lets me retrieve the underlying protobuf-based graph for
the data flow? Perhaps some pointers to what code in the runner retrieves this
model in order to execute it in the specific engine?
Thanks,
Ron

Re: Common model for runners

2018-03-20 Thread Robert Bradshaw
The proto representation isn't (yet) part of the public API, and is still
under active development. However, if you're curious you can see it via
calling

pipeline.to_runner_api()

in Python or manually invoking classes under


https://github.com/apache/beam/tree/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction

for Java. There's probably some equivalent for Go.

(Eventually, the way to get at this would be to implement a Runner whose
input would be the proto description itself, but currently both Python and
Java represent Pipelines as native objects when invoking runners.)
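For Java, a sketch of that manual invocation (PipelineTranslation is an
internal API, so this may change between releases):

```
import org.apache.beam.model.pipeline.v1.RunnerApi;
import org.apache.beam.runners.core.construction.PipelineTranslation;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class DumpPipelineProto {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    // ... apply transforms to build the pipeline ...

    // Translate the native pipeline object into its Runner API proto form.
    RunnerApi.Pipeline proto = PipelineTranslation.toProto(pipeline);
    System.out.println(proto);
  }
}
```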



On Tue, Mar 20, 2018 at 2:58 PM Ron Gonzalez  wrote:

> Hi,
>   When I build a data flow using the Beam SDK, can someone point me to the
> code that represents the underlying representation of the beam model itself?
>   Is there an API that lets me retrieve the underlying protobuf-based
> graph for the data flow? Perhaps some pointers to what code in the runner
> retrieves this model in order to execute it in the specific engine?
>
> Thanks,
> Ron
>


Re: Common model for runners

2018-03-20 Thread Henning Rohde
Go currently prints out the model pipeline (as well as the Dataflow
representation) if you use the Dataflow runner. Pass --dry_run=true to not
actually submit a job, but just print out the representations. The graphx
package can also be used to generate a model pipeline manually.


On Tue, Mar 20, 2018 at 3:19 PM Robert Bradshaw  wrote:

> The proto representation isn't (yet) part of the public API, and is still
> under active development. However, if you're curious you can see it via
> calling
>
> pipeline.to_runner_api()
>
> in Python or manually invoking classes under
>
>
> https://github.com/apache/beam/tree/master/runners/core-construction-java/src/main/java/org/apache/beam/runners/core/construction
>
> for Java. There's probably some equivalent for Go.
>
> (Eventually, the way to get at this would be to implement a Runner whose
> input would be the proto description itself, but currently both Python and
> Java represent Pipelines as native objects when invoking runners.)
>
>
>
> On Tue, Mar 20, 2018 at 2:58 PM Ron Gonzalez  wrote:
>
>> Hi,
>>   When I build a data flow using the Beam SDK, can someone point me to
>> the code that represents the underlying representation of the beam model
>> itself?
>>   Is there an API that lets me retrieve the underlying protobuf-based
>> graph for the data flow? Perhaps some pointers to what code in the runner
>> retrieves this model in order to execute it in the specific engine?
>>
>> Thanks,
>> Ron
>>
>


From Beam Summit - On SDKs and Contributor Experience

2018-03-20 Thread Pablo Estrada
Hello everyone,
at the Beam Summit in San Francisco, a number of folks had a breakout
session where we considered questions of the experience for new
contributors in Beam. First of all, I'd like to thank everyone for
participating with your insightful comments and discussion.
Along with Rafael, we tried to gather a few notes, and I'll start a couple of
email threads on the list to discuss them.

The general theme was documentation and attracting new contributors, as
well as communication and expectations around SDK features. Some themes
within this were:
- Around JIRAs and assignment of work
- Around PR reviews and latency
- Around SDKs, features and communication
- Some other isolated ideas

I've gathered notes in this doc, so please feel free to take a look and add
your comments or points that you want considered:
https://docs.google.com/document/d/1WaK39qrrG_P50FOMHifJhrdHZYmjOOf8MgoObwCZI50/edit


In the next few days I'll start email threads to tackle each one
independently, and try to discuss some possible action items with the
community.
Best!
-P.

-- 
Got feedback? go/pabloem-feedback


Re: Implementing @OnWindowExpiration in StatefulParDo [BEAM-1589]

2018-03-20 Thread Huygaa Batsaikhan
As echauchot@ mentioned, it will make this easier and less error-prone.


On Mon, Mar 19, 2018 at 11:59 PM Romain Manni-Bucau 
wrote:

> Hi Huygaa,
>
> Can't it be predefined timers?
>
> Romain
>
> On Mar 20, 2018 at 00:52, "Huygaa Batsaikhan" wrote:
>
> Hi everyone, I am working on BEAM-1589. In short, currently,
> there is no default way of saving/flushing state before a window is garbage
> collected.
>
> My current plan is to provide a method annotation, @OnWindowExpiration,
> which allows user-provided callback function to be executed before garbage
> collection. This annotation behaves very similar to @OnTimer, therefore,
> implementation will mostly be a copy of OnTimer code. Let me know if you
> have any considerations and suggestions.
>
> Here is an example usage:
> ```
> @OnWindowExpiration
> public void myCleanupFunction(OnWindowExpirationContext c, State state) {
>   c.output(state.read());
> }
> ```
>
> Thanks, Huygaa
>
>
>