Seattle Beam Meetup - Sep 26

2019-08-08 Thread Aizhamal Nurmamat kyzy
Howdy,

For everyone who is around the Seattle area, we will be hosting a Beam meetup
on the 26th of September at 6:00pm PST! Please join us for talks covering both
use cases and deep technical dives. It is also a great chance to get to
know local community members and meet contributors to Beam :) RSVP here [1].

Our first speaker, Brian Hulette, will talk about portability/Python
schemas and demo the SqlTransform in Python (cool stuff). Stay tuned for more
speaker announcements!

Also, if you wish to present your use case or talk at this event, please
send me an email.

Thanks,
Aizhamal

[1] https://www.meetup.com/Seattle-Apache-Beam-Meetup/events/263845364/


Re: Beam Python Portable Runner - Adding timeout to JobServer grpc calls

2019-08-08 Thread Ahmet Altay
Default plus a flag to override sounds reasonable. Although, from Dataflow
experience, I do not remember timeouts causing issues, and each newly added
flag adds complexity. What do others think?

On Thu, Aug 8, 2019 at 11:38 AM Kyle Weaver  wrote:

> If we do make a default, I still think it should be configurable via a
> flag. I can't think of why the prepare, stage artifact, job state, or job
> message requests might take more than 60 seconds, but you never know,
> particularly with artifact staging, which might be uploading artifacts to
> distributed storage.
>
> I assume the run request itself would not be subject to timeouts, as
> running the pipeline can be assumed to take significantly longer than the
> setup work.
>
> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
>
>
> On Thu, Aug 8, 2019 at 11:20 AM Enrico Canzonieri  wrote:
>
>> Default timeout with no flag may work as well.
>> The main consideration here is whether some api calls may take longer
>> than 60 seconds because of the complexity of the users' Beam pipeline. E.g.
>> Could job_service.Prepare() take longer than 60 seconds if the given Beam
>> pipeline is extremely complex?
>>
>> Basically, if there are cases where user code may cause the call
>> duration to increase to the point that the timeout prevents submitting the
>> app itself, then we should consider having a flag.
>>
>> On 2019/08/07 20:13:12, Ahmet Altay wrote:
>> > Could we pick a default timeout value instead of introducing a flag? We
>> > use 60 seconds as the default timeout for the http client [1], we can
>> > do the same here.
>> >
>> > [1]
>> > https://github.com/apache/beam/blob/3a182d64c86ad038692800f5c343659ab0b935b0/sdks/python/apache_beam/internal/http_client.py#L32
>> >
>> > On Wed, Aug 7, 2019 at 11:53 AM enrico canzonieri wrote:
>> >
>> > > Hello,
>> > >
>> > > I noticed that the calls to the JobServer from the Python SDK do not
>> > > have timeouts. If I'm not mistaken, that means that the call to
>> > > pipeline.run() could hang forever if the JobServer is not running (or
>> > > failing to start). E.g. in
>> > > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/portable_runner.py#L307
>> > > the call to Prepare() doesn't provide any timeout value, and the same
>> > > applies to other JobServer requests.
>> > > I was considering adding a --job-server-request-timeout to the
>> > > PortableOptions class, to be used in the JobServer interactions inside
>> > > portable_runner.py.
>> > > Is there any specific reason why the timeout is not currently
>> > > supported? Does anybody have any objection to adding the jobserver
>> > > timeout? I could volunteer to file a ticket and submit a PR for this.
>> > >
>> > > Cheers,
>> > > Enrico Canzonieri
>>
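
For reference, a rough sketch of what such a per-call timeout could look
like on the Python side. This is illustration only, not Beam code:
`job_service` stands in for the generated JobService gRPC stub, and the
helper name and 60-second default are just the suggestions from this thread.

    import grpc

    DEFAULT_TIMEOUT_SECS = 60  # suggested default, mirroring the http client

    def prepare_with_timeout(job_service, request,
                             timeout=DEFAULT_TIMEOUT_SECS):
        # gRPC stubs accept a per-call timeout; when it expires, the call
        # raises grpc.RpcError with StatusCode.DEADLINE_EXCEEDED instead of
        # hanging forever on an unreachable job server.
        try:
            return job_service.Prepare(request, timeout=timeout)
        except grpc.RpcError as error:
            if error.code() == grpc.StatusCode.DEADLINE_EXCEEDED:
                raise RuntimeError(
                    'Job server did not respond within %d seconds' % timeout)
            raise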


Re: Inconsistent Results with GroupIntoBatches PTransform

2019-08-08 Thread rahul patwari
I only ran in Direct runner. I will run in other runners and let you know
the results.
I am not setting "streaming" when executing.

On Fri 9 Aug, 2019, 2:56 AM Lukasz Cwik,  wrote:

> Have you tried running this on more than one runner (e.g. Dataflow, Flink,
> Direct)?
>
> Are you setting --streaming when executing?
>
> On Thu, Aug 8, 2019 at 10:23 AM rahul patwari 
> wrote:
>
>> Hi,
>>
>> I am getting inconsistent results when using the GroupIntoBatches
>> PTransform.
>> I am using the Create.of() PTransform to create a PCollection from
>> in-memory data.
>> When a coder is given to the Create.of() PTransform, I am facing the issue.
>> If the coder is not provided, the results are consistent and correct
>> (maybe this is just a coincidence and the problem is at some other place).
>> If the batch size is 1, the results are always consistent.
>>
>> Not sure if this is an issue with Serialization/Deserialization (or)
>> GroupIntoBatches (or) Create.of() PTransform.
>>
>> The Java code, expected correct results, and inconsistent results are
>> available at https://github.com/rahul8383/beam-examples
>>
>> Thanks,
>> Rahul
>>
>
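
(The code in question is Java; purely as an illustration of the transform
under discussion, and not the reporter's code, a rough Python analogue of a
keyed batching pipeline would be:)

    import apache_beam as beam
    from apache_beam.transforms.util import GroupIntoBatches

    # Batch values per key into groups of up to 2. On the direct runner
    # this prints batches like ('k', [1, 2]) and ('k', [3, 4]).
    with beam.Pipeline() as p:
        (p
         | beam.Create([('k', 1), ('k', 2), ('k', 3), ('k', 4)])
         | GroupIntoBatches(2)
         | beam.Map(print))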


Re: Jira email notifications

2019-08-08 Thread sridhar inuog
Thank you Udi Meiri and Pablo Estrada. It was a great opportunity to verify
Einstein's views on UI: Everything about UI should be made as simple as
possible but not simpler :)

There was a typo in my email address, and I couldn't see the address to
correct it. After correcting it, I am now getting emails.

Cheers

On Thu, Aug 8, 2019 at 12:06 PM Udi Meiri  wrote:

> Is your email set correctly?
> You can see it if you hit the edit button for profile details.
> [image: UaSgEcLBGeM.png]
>
> On Wed, Aug 7, 2019 at 5:16 PM sridhar inuog 
> wrote:
>
>> Yes, I am already on the "Watchers" list
>>
>> On Wed, Aug 7, 2019 at 7:13 PM Pablo Estrada  wrote:
>>
>>> Have you tried "watching" the particular JIRA issue? There's a "Watch"
>>> thing on the right-hand side of an issue page.
>>>
>>> Happy to help more if that's not helpful : )
>>> Best
>>> -P.
>>>
>>> On Wed, Aug 7, 2019 at 5:09 PM sridhar inuog 
>>> wrote:
>>>
 Hi,
Is there a way to get notifications whenever a jira issue is
 updated?  The only place I can see this can be enabled is

  profile -> Preferences -> My Changes (Notify me)

 Even though the description seems a little bit misleading, I don't see
 any other place to make any changes.

 ---
 Whether to email notifications of any changes you make.
 -

 Any other places I need to change to get notifications?

 Thanks,
 Sridhar

>>>


Re: (mini-doc) Beam (Flink) portable job templates

2019-08-08 Thread Thomas Weise
We would also need to consider cross-language pipelines, which (currently)
assume interaction with an expansion service at construction time.

On Thu, Aug 8, 2019, 4:38 PM Kyle Weaver  wrote:

> > It might also be useful to have the option to just output the proto and
> artifacts, as an alternative to the jar file.
>
> Sure, that wouldn't be too big a change if we were to decide to go the SDK
> route.
>
> > For the Flink entry point we would need to allow for the job server to
> be used as a library.
>
> We don't need the whole job server, we only need to add a main method to
> FlinkPipelineRunner [1] as the entry point, which would basically just do
> the setup described in the doc then call FlinkPipelineRunner::run.
>
> [1]
> https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineRunner.java#L53
>
> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
>
>
> On Thu, Aug 8, 2019 at 4:21 PM Thomas Weise  wrote:
>
>> Hi Kyle,
>>
>> It might also be useful to have the option to just output the proto and
>> artifacts, as an alternative to the jar file.
>>
>> For the Flink entry point we would need to allow for the job server to be
>> used as a library. It would probably not be too hard to have the Flink job
>> constructed via the context execution environment, which would require no
>> changes on the Flink side.
>>
>> Thanks,
>> Thomas
>>
>>
>> On Thu, Aug 8, 2019 at 9:52 AM Kyle Weaver  wrote:
>>
>>> Re Javaless/serverless solution:
>>> I take it this would probably mean that we would construct the jar
>>> directly from the SDK. There are advantages to this: full separation of
>>> Python and Java environments, no need for a job server, and likely a
>>> simpler implementation, since we'd no longer have to work within the
>>> constraints of the existing job server infrastructure. The only downside I
>>> can think of is the additional cost of implementing/maintaining jar
>>> creation code in each SDK, but that cost may be acceptable if it's simple
>>> enough.
>>>
>>> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
>>>
>>>
>>> On Thu, Aug 8, 2019 at 9:31 AM Thomas Weise  wrote:
>>>


 On Thu, Aug 8, 2019 at 8:29 AM Robert Bradshaw 
 wrote:

> > Before assembling the jar, the job server runs to create the
> ingredients. That requires the (matching) Java environment on the Python
> developer's machine.
>
> We can run the job server and have it create the jar (and if we keep
> the job server running we can use it to interact with the running
> job). However, if the jar layout is simple enough, there's no need to
> even build it from Java.
>
> Taken to the extreme, this is a one-shot, jar-based JobService API. We
> choose a standard layout of where to put the pipeline description and
> artifacts, and can "augment" an existing jar (that has a
> runner-specific main class whose entry point knows how to read this
> data to kick off a pipeline as if it were a user's driver code) into
> one that has a portable pipeline packaged into it for submission to a
> cluster.
>

 It would be nice if the Python developer doesn't have to run anything
 Java at all.

 As we just discussed offline, this could be accomplished by including
 the proto that is produced by the SDK into the pre-existing jar.

 And if the jar has an entry point that creates the Flink job in the
 prescribed manner [1], it can be directly submitted to the Flink REST API.
 That would allow for a Java-free client.

 [1]
 https://lists.apache.org/thread.html/6db869c53816f4e2917949a7c6992c2b90856d7d639d7f2e1cd13768@%3Cdev.flink.apache.org%3E
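
(To make the jar "augmentation" idea above concrete: a minimal sketch in
Python, since a jar is just a zip file. The BEAM-PIPELINE/ layout below is
purely hypothetical; the standard layout referred to above has not been
decided.)

    import zipfile

    def augment_jar(jar_path, pipeline_proto_bytes, artifacts):
        # A jar is just a zip file, so entries can be appended without a
        # Java build environment. The runner-specific main class would read
        # these entries back at submission time.
        with zipfile.ZipFile(jar_path, 'a') as jar:
            jar.writestr('BEAM-PIPELINE/pipeline.pb', pipeline_proto_bytes)
            for name, payload in artifacts.items():
                jar.writestr('BEAM-PIPELINE/artifacts/' + name, payload)

    # e.g.:
    # augment_jar('flink-job-template.jar',
    #             pipeline_proto.SerializeToString(),
    #             {'extra_dep.whl': open('extra_dep.whl', 'rb').read()})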




Re: (mini-doc) Beam (Flink) portable job templates

2019-08-08 Thread Kyle Weaver
> It might also be useful to have the option to just output the proto and
artifacts, as an alternative to the jar file.

Sure, that wouldn't be too big a change if we were to decide to go the SDK
route.

> For the Flink entry point we would need to allow for the job server to be
used as a library.

We don't need the whole job server, we only need to add a main method to
FlinkPipelineRunner [1] as the entry point, which would basically just do
the setup described in the doc then call FlinkPipelineRunner::run.

[1]
https://github.com/apache/beam/blob/master/runners/flink/src/main/java/org/apache/beam/runners/flink/FlinkPipelineRunner.java#L53

Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com


On Thu, Aug 8, 2019 at 4:21 PM Thomas Weise  wrote:

> Hi Kyle,
>
> It might also be useful to have the option to just output the proto and
> artifacts, as an alternative to the jar file.
>
> For the Flink entry point we would need to allow for the job server to be
> used as a library. It would probably not be too hard to have the Flink job
> constructed via the context execution environment, which would require no
> changes on the Flink side.
>
> Thanks,
> Thomas
>
>
> On Thu, Aug 8, 2019 at 9:52 AM Kyle Weaver  wrote:
>
>> Re Javaless/serverless solution:
>> I take it this would probably mean that we would construct the jar
>> directly from the SDK. There are advantages to this: full separation of
>> Python and Java environments, no need for a job server, and likely a
>> simpler implementation, since we'd no longer have to work within the
>> constraints of the existing job server infrastructure. The only downside I
>> can think of is the additional cost of implementing/maintaining jar
>> creation code in each SDK, but that cost may be acceptable if it's simple
>> enough.
>>
>> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
>>
>>
>> On Thu, Aug 8, 2019 at 9:31 AM Thomas Weise  wrote:
>>
>>>
>>>
>>> On Thu, Aug 8, 2019 at 8:29 AM Robert Bradshaw 
>>> wrote:
>>>
 > Before assembling the jar, the job server runs to create the
 ingredients. That requires the (matching) Java environment on the Python
 developer's machine.

 We can run the job server and have it create the jar (and if we keep
 the job server running we can use it to interact with the running
 job). However, if the jar layout is simple enough, there's no need to
 even build it from Java.

 Taken to the extreme, this is a one-shot, jar-based JobService API. We
 choose a standard layout of where to put the pipeline description and
 artifacts, and can "augment" an existing jar (that has a
 runner-specific main class whose entry point knows how to read this
 data to kick off a pipeline as if it were a user's driver code) into
 one that has a portable pipeline packaged into it for submission to a
 cluster.

>>>
>>> It would be nice if the Python developer doesn't have to run anything
>>> Java at all.
>>>
>>> As we just discussed offline, this could be accomplished by including
>>> the proto that is produced by the SDK into the pre-existing jar.
>>>
>>> And if the jar has an entry point that creates the Flink job in the
>>> prescribed manner [1], it can be directly submitted to the Flink REST API.
>>> That would allow for a Java-free client.
>>>
>>> [1]
>>> https://lists.apache.org/thread.html/6db869c53816f4e2917949a7c6992c2b90856d7d639d7f2e1cd13768@%3Cdev.flink.apache.org%3E
>>>
>>>


Re: (mini-doc) Beam (Flink) portable job templates

2019-08-08 Thread Thomas Weise
I also added this as an option for pipeline submission to the k8s discussion:

https://docs.google.com/document/d/1z3LNrRtr8kkiFHonZ5JJM_L4NWNBBNcqRc_yAf6G0VI/edit#heading=h.iov21d695qx5


On Thu, Aug 8, 2019 at 4:21 PM Thomas Weise  wrote:

> Hi Kyle,
>
> It might also be useful to have the option to just output the proto and
> artifacts, as an alternative to the jar file.
>
> For the Flink entry point we would need to allow for the job server to be
> used as a library. It would probably not be too hard to have the Flink job
> constructed via the context execution environment, which would require no
> changes on the Flink side.
>
> Thanks,
> Thomas
>
>
> On Thu, Aug 8, 2019 at 9:52 AM Kyle Weaver  wrote:
>
>> Re Javaless/serverless solution:
>> I take it this would probably mean that we would construct the jar
>> directly from the SDK. There are advantages to this: full separation of
>> Python and Java environments, no need for a job server, and likely a
>> simpler implementation, since we'd no longer have to work within the
>> constraints of the existing job server infrastructure. The only downside I
>> can think of is the additional cost of implementing/maintaining jar
>> creation code in each SDK, but that cost may be acceptable if it's simple
>> enough.
>>
>> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
>>
>>
>> On Thu, Aug 8, 2019 at 9:31 AM Thomas Weise  wrote:
>>
>>>
>>>
>>> On Thu, Aug 8, 2019 at 8:29 AM Robert Bradshaw 
>>> wrote:
>>>
 > Before assembling the jar, the job server runs to create the
 ingredients. That requires the (matching) Java environment on the Python
 developer's machine.

 We can run the job server and have it create the jar (and if we keep
 the job server running we can use it to interact with the running
 job). However, if the jar layout is simple enough, there's no need to
 even build it from Java.

 Taken to the extreme, this is a one-shot, jar-based JobService API. We
 choose a standard layout of where to put the pipeline description and
 artifacts, and can "augment" an existing jar (that has a
 runner-specific main class whose entry point knows how to read this
 data to kick off a pipeline as if it were a user's driver code) into
 one that has a portable pipeline packaged into it for submission to a
 cluster.

>>>
>>> It would be nice if the Python developer doesn't have to run anything
>>> Java at all.
>>>
>>> As we just discussed offline, this could be accomplished by including
>>> the proto that is produced by the SDK into the pre-existing jar.
>>>
>>> And if the jar has an entry point that creates the Flink job in the
>>> prescribed manner [1], it can be directly submitted to the Flink REST API.
>>> That would allow for a Java-free client.
>>>
>>> [1]
>>> https://lists.apache.org/thread.html/6db869c53816f4e2917949a7c6992c2b90856d7d639d7f2e1cd13768@%3Cdev.flink.apache.org%3E
>>>
>>>


Re: (mini-doc) Beam (Flink) portable job templates

2019-08-08 Thread Thomas Weise
Hi Kyle,

It might also be useful to have the option to just output the proto and
artifacts, as an alternative to the jar file.

For the Flink entry point we would need to allow for the job server to be
used as a library. It would probably not be too hard to have the Flink job
constructed via the context execution environment, which would require no
changes on the Flink side.

Thanks,
Thomas


On Thu, Aug 8, 2019 at 9:52 AM Kyle Weaver  wrote:

> Re Javaless/serverless solution:
> I take it this would probably mean that we would construct the jar
> directly from the SDK. There are advantages to this: full separation of
> Python and Java environments, no need for a job server, and likely a
> simpler implementation, since we'd no longer have to work within the
> constraints of the existing job server infrastructure. The only downside I
> can think of is the additional cost of implementing/maintaining jar
> creation code in each SDK, but that cost may be acceptable if it's simple
> enough.
>
> Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com
>
>
> On Thu, Aug 8, 2019 at 9:31 AM Thomas Weise  wrote:
>
>>
>>
>> On Thu, Aug 8, 2019 at 8:29 AM Robert Bradshaw 
>> wrote:
>>
>>> > Before assembling the jar, the job server runs to create the
>>> ingredients. That requires the (matching) Java environment on the Python
>>> developer's machine.
>>>
>>> We can run the job server and have it create the jar (and if we keep
>>> the job server running we can use it to interact with the running
>>> job). However, if the jar layout is simple enough, there's no need to
>>> even build it from Java.
>>>
>>> Taken to the extreme, this is a one-shot, jar-based JobService API. We
>>> choose a standard layout of where to put the pipeline description and
>>> artifacts, and can "augment" an existing jar (that has a
>>> runner-specific main class whose entry point knows how to read this
>>> data to kick off a pipeline as if it were a user's driver code) into
>>> one that has a portable pipeline packaged into it for submission to a
>>> cluster.
>>>
>>
>> It would be nice if the Python developer doesn't have to run anything
>> Java at all.
>>
>> As we just discussed offline, this could be accomplished by including
>> the proto that is produced by the SDK into the pre-existing jar.
>>
>> And if the jar has an entry point that creates the Flink job in the
>> prescribed manner [1], it can be directly submitted to the Flink REST API.
>> That would allow for a Java-free client.
>>
>> [1]
>> https://lists.apache.org/thread.html/6db869c53816f4e2917949a7c6992c2b90856d7d639d7f2e1cd13768@%3Cdev.flink.apache.org%3E
>>
>>


Re: [PROPOSAL] Standardize Gradle structure in Python SDK

2019-08-08 Thread Mark Liu
Adding my comments below.

On Mon, Jul 15, 2019 at 2:52 PM Kenneth Knowles  wrote:

> Gradle comments inline
>
> On Mon, Jul 15, 2019 at 2:30 AM Frederik Bode 
> wrote:
>
>> Hi Mark & others,
>>
>> +1 on using this structure. I don't see any other alternative to gradle
>> as some of the Python tasks have Java tasks as
>> a dependency. You can't debug that using just `python nosetests... or
>> tox`.  Parallelizing such tasks requires different
>> projects (and I don't think gradle supports multiple projects per
>> directory), so for each python
>> version we need a different folder. Having separate build.gradle files
>> for each python version would also enable the
>> different versions to diverge on which tests they execute (e.g. not
>> running some tests for python 3.5 and 3.6 to
>> reduce Jenkins footprint in a PreCommit).
>>
>> Valentyn, the code duplication problem can be addressed using a Gradle
>> script plugin, which is a build script (maybe another name is better?) in
>> /sdks/python/test-suites/[runner], which you then import in
>> /sdks/python/test-suites/[runner]/pyXX/build.gradle with the
>> correct pythonVersion set, using `apply from`. See [1] for an
>> example. The location of the python2 test-suite can be remedied by moving
>> it to your suggested location. +1 on that as well.
>>
>
> The `apply from` syntax is another way of authoring a gradle plugin, and a
> bit more limited. BeamModulePlugin used to be a script `build_rules.groovy`
> applied that way, and we moved it to the magical buildSrc/ directory so it
> could have its own clear dependencies (it's the buildSrc/build.gradle) and
> be refactored into pieces over time. It is just as easy to make a plugin
> there and I would recommend it. This is what we did with the vendored
> artifacts.
>

+1 on using a Gradle plugin. A new build.gradle also creates an unnecessary
Gradle project, which adds overhead in execution. Since we already have some
plugin examples in buildSrc/, adding an extra one for a particular purpose (or
using existing plugins) would be preferred.

>
> On the coupling of projects or the structure not being natural, I think
>> that we can look at
>> this differently. Right now, in the Python SDK, all common code for tests
>> that needs to be parallelized is
>> placed in BeamModulePlugin, which in turn couples all projects that
>> use it. It's one centralized
>> item that couples many different projects. It has code from Java, Python
>> and Go, and
>> currently has almost 2000 lines of code, which is IMHO not the way to go.
>>
>
> Exactly. Each of the BeamModulePlugin.applyXYZNature(...) methods would
> probably be good to separate out as its own plugin; for example, you could
> start a BeamPythonPlugin.
> The reason these are separate methods instead of separate plugins is
> because we started off in one big `build_rules.groovy` file; it is
> historical and totally OK to fix up.
>

One extra bonus of separating BeamModulePlugin per language is that we can
avoid irrelevant precommit tests being triggered when making changes in
BeamModulePlugin. This came up for me recently when I made Python-specific
changes: some Java and Go precommit tests were triggered because I touched
BeamModulePlugin. Sometimes those tests are flaky, so I had to
investigate/verify that the failures were not related to my changes.

>
>
Moving the python code
>> from this binary plugin to a script plugin file defined in the parent
>> directory of children
>> projects that use that code (as described in the paragraph above) moves
>> the coupling of a lot of projects through
>> BeamModulePlugin (a global coupling), to a per sub-tree coupling (a local
>> coupling).
>>
>
> Relative filesystem paths are also a weakness that has caused pain plenty of
> times. I highly recommend building a plugin with an id rather than using
> `apply from` with relative paths such as parent directories. (And if there is
> a way to `apply from` without a relative path, that is probably even more
> confusing.) buildSrc is the way to go. It is also much faster, since it gains
> incremental compilation of the plugin.
>
> Kenn
>
>
>> In short, yes it might be unnatural, but it's still better than before.
>>
>> [1] https://github.com/apache/beam/pull/8877
>> 
>>
>> Thanks,
>> Frederik Bode
>>
>> On Mon, Jun 3, 2019 at 5:13 PM Valentyn Tymofieiev 
>> wrote:
>>
>>> Hey Mark & others,
>>>
>>> We've been following the structure proposed in this thread to extend
>>> test coverage for Beam Python SDK on Python 3.5, 3.6, 3.7 interpreters, see
>>> [1].
>>>
>>> This structure allowed us to add 3.x suites without slowing down the
>>> pre/postcommit execution time. We can actually see a drop in precommit
>>> latency [2] around March 23, when we first made some Python 3.x suites run
>>> in parallel, and we have added more suites since then without slowing down
>>> pre/postcommits. Therefore I am in favor of this proposal, especially since
>>> AFAIK we don't have a better one. Thanks a lot!
>>>
>>> I do have some 

Re: Inconsistent Results with GroupIntoBatches PTransform

2019-08-08 Thread Lukasz Cwik
Have you tried running this on more than one runner (e.g. Dataflow, Flink,
Direct)?

Are you setting --streaming when executing?

On Thu, Aug 8, 2019 at 10:23 AM rahul patwari 
wrote:

> Hi,
>
> I am getting inconsistent results when using the GroupIntoBatches PTransform.
> I am using the Create.of() PTransform to create a PCollection from in-memory
> data. When a coder is given to the Create.of() PTransform, I am facing the
> issue. If the coder is not provided, the results are consistent and correct
> (maybe this is just a coincidence and the problem is at some other place).
> If the batch size is 1, the results are always consistent.
>
> Not sure if this is an issue with Serialization/Deserialization (or)
> GroupIntoBatches (or) Create.of() PTransform.
>
> The Java code, expected correct results, and inconsistent results are
> available at https://github.com/rahul8383/beam-examples
>
> Thanks,
> Rahul
>


Re: Dataflow worker overview graphs

2019-08-08 Thread Mikhail Gryzykhin
Unfortunately no, I don't have those for streaming explicitly.

However, most of the code is shared between streaming and batch, with the
main difference being in initialization. The same goes for the boilerplate
parts of legacy vs. FnApi.

If you happen to create anything similar for streaming, please update the
page and let me know. I'll also update this page with relevant changes once
I get back to the worker.

--Mikhail

On Thu, Aug 8, 2019 at 2:13 PM Ankur Goenka  wrote:

> Thanks Mikhail. This is really useful.
> Do you also have something similar for the streaming use case? More
> specifically, for portable (fn_api) based streaming pipelines.
>
>
> On Thu, Aug 8, 2019 at 2:08 PM Mikhail Gryzykhin 
> wrote:
>
>> Hello everybody,
>>
>> Just wanted to share that I have found some graphs for the dataflow worker
>> that I created while starting to work on it. They cover specific scenarios,
>> but may be useful for newcomers, so I put them into this wiki page.
>>
>> If you feel they belong to some other location, please let me know.
>>
>> Regards,
>> Mikhail.
>>
>


Re: Dataflow worker overview graphs

2019-08-08 Thread Ankur Goenka
Thanks Mikhail. This is really useful.
Do you also have something similar for the streaming use case? More
specifically, for portable (fn_api) based streaming pipelines.


On Thu, Aug 8, 2019 at 2:08 PM Mikhail Gryzykhin  wrote:

> Hello everybody,
>
> Just wanted to share that I have found some graphs for the dataflow worker
> that I created while starting to work on it. They cover specific scenarios,
> but may be useful for newcomers, so I put them into this wiki page.
>
> If you feel they belong to some other location, please let me know.
>
> Regards,
> Mikhail.
>


Dataflow worker overview graphs

2019-08-08 Thread Mikhail Gryzykhin
Hello everybody,

Just wanted to share that I have found some graphs for the dataflow worker
that I created while starting to work on it. They cover specific scenarios,
but may be useful for newcomers, so I put them into this wiki page.

If you feel they belong to some other location, please let me know.

Regards,
Mikhail.


Re: Proposal for SDFs in the Go SDK

2019-08-08 Thread Lukasz Cwik
Thanks for the informative doc. Added a bunch of questions/feedback.

On Thu, Aug 8, 2019 at 9:15 AM Robert Burke  wrote:

> Thanks for the spending the time writing this up! I'm looking forward to
> seeing how the prototype implementation plays out. In particular with the
> extensive section on how users will actually use the presented API to get
> their DoFns to scale.
>
>  (Disclosure: I helped pre-review the document, which is why I don't have
> any further commentary at this time.)
>
> On Wed, Aug 7, 2019, 11:57 AM Daniel Oliveira 
> wrote:
>
>> Hello Beam devs,
>>
>> I've been working on a proposal for implementing SDFs in the Go SDK. For
>> those who were unaware, the Go SDK hasn't supported SDFs in any capacity
>> yet, so my proposal covers the user-facing API and a basic look into how it
>> will work under the hood.
>>
>> I'd appreciate it if anyone interested in the Go SDK or anyone who's been
>> working with portable SDFs could give it a look and provide some feedback.
>> There's still a few open questions mentioned in the doc that I'd like to
>> get feedback on before deciding on anything.
>>
>>
>> https://docs.google.com/document/d/14IwJYEUpar5FmiPNBFvERADiShZjsrsMpgtlntPVCX0/edit?usp=sharing
>>
>> Thanks,
>> Daniel Oliveira
>>
>


Re: Java 11 compatibility question

2019-08-08 Thread Valentyn Tymofieiev
From a Python 3 migration standpoint, some high-level pillars that increase
our confidence are:
- Test coverage (PreCommit, PostCommit), and creating a system that makes it
easy to add test coverage in the new language for new functionality.
- Support of the new language version by core runners + ValidatesRunner test
coverage.
- Test of time: offer the new functionality in a few releases, monitor &
address user feedback.

Dependency audit and critical feature support in the new language, as
mentioned by others, are important points. If you are curious about the
detailed action items that went into Python 3 support, feel free to look
into BEAM-1251 or the Py3 Kanban Board (
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=245=detail
).

Thanks,
Valentyn


On Thu, Aug 8, 2019 at 7:24 PM Mark Liu  wrote:

> Some actions we did for py2 to py3 works:
> - Check and resolve incompatible dependencies.
> - Enable py3 lint.
> - Fill feature gaps between py2 and py3 (e.g. new py3 container, new
> solution for type hint)
> - Add unit tests, integration tests and other tests on py3 for coverage.
> - Release (p3) and deprecation (p2) plan.
>
> Hope this helps on Java upgrade.
>
> Mark
>
> On Wed, Aug 7, 2019 at 3:19 PM Ahmet Altay  wrote:
>
>>
>>
>> On Wed, Aug 7, 2019 at 12:21 PM Elliotte Rusty Harold 
>> wrote:
>>
>>> gRPC bug here: https://github.com/grpc/grpc-java/issues/3522
>>>
>>> google-cloud-java bug:
>>> https://github.com/googleapis/google-cloud-java/issues/5760
>>>
>>> Neither has a cheap or easy fix, I'm afraid. Commenting on these
>>> issues might help us prove that there's a demand to prioritize these
>>> compared to other work. If anyone has a support contract and could
>>> file a ticket asking for a fix, that would help even more.
>>>
>>> Those are the two I know about. There might be others elsewhere in the
>>> dependency tree.
>>>
>>>
>>> On Wed, Aug 7, 2019 at 2:25 PM Lukasz Cwik  wrote:
>>> >
>>> > Since java8 -> java11 is similar to python2 -> python3 migration, what
>>> was the acceptance criteria there?
>>>
>>
>> I do not remember formally discussing this. The bar used was, all
>> existing tests will pass for python2 and python3. New tests will be added
>> for python3 specific features. (To avoid any confusion this bar has not
>> been cleared yet.)
>>
>> cc: +Valentyn Tymofieiev  could add more details.
>>
>>
>>> >
>>> > On Wed, Aug 7, 2019 at 1:54 PM Elliotte Rusty Harold <
>>> elh...@ibiblio.org> wrote:
>>> >>
>>> >>
>>> >>
>>> >> On Wed, Aug 7, 2019 at 9:41 AM Michał Walenia <
>>> michal.wale...@polidea.com> wrote:
>>> >>>
>>> >>>
>>> >>> Are these tests sufficient to say that we’re java 11 compatible?
>>> What other aspects do we need to test to be able to say that?
>>> >>>
>>> >>>
>>> >>
>>> >> Are any packages split across multiple jar files, including packages
>> beam depends on? That's the one that's bitten some other projects,
>>> including google-cloud-java and gRPC. If so, beam is not going to work with
>>> the module system.
>>> >>
>> >> Work is ongoing to fix split packages in both gRPC and
>>> google-cloud-java, but we're not very far down that path and I think it's
>>> going to be an API breaking change.
>>> >>
>>> > Romain pointed this out earlier and I fixed the last case of packages
>>> being split across multiple jars within Apache Beam but as you point out
>>> our transitive dependencies are not ready.
>>> >>
>>> >>
>>> >> --
>>> >> Elliotte Rusty Harold
>>> >> elh...@ibiblio.org
>>>
>>>
>>>
>>> --
>>> Elliotte Rusty Harold
>>> elh...@ibiblio.org
>>>
>>


Re: Beam Python Portable Runner - Adding timeout to JobServer grpc calls

2019-08-08 Thread Kyle Weaver
If we do make a default, I still think it should be configurable via a
flag. I can't think of why the prepare, stage artifact, job state, or job
message requests might take more than 60 seconds, but you never know,
particularly with artifact staging, which might be uploading artifacts to
distributed storage.

I assume the run request itself would not be subject to timeouts, as
running the pipeline can be assumed to take significantly longer than the
setup work.

Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com


On Thu, Aug 8, 2019 at 11:20 AM Enrico Canzonieri  wrote:

> Default timeout with no flag may work as well.
> The main consideration here is whether some api calls may take longer than
> 60 seconds because of the complexity of the users' Beam pipeline. E.g.
> Could job_service.Prepare() take longer than 60 seconds if the given Beam
> pipeline is extremely complex?
>
> Basically, if there are cases where user code may cause the call
> duration to increase to the point that the timeout prevents submitting the
> app itself, then we should consider having a flag.
>
> On 2019/08/07 20:13:12, Ahmet Altay wrote:
> > Could we pick a default timeout value instead of introducing a flag? We
> > use 60 seconds as the default timeout for the http client [1], we can
> > do the same here.
> >
> > [1]
> > https://github.com/apache/beam/blob/3a182d64c86ad038692800f5c343659ab0b935b0/sdks/python/apache_beam/internal/http_client.py#L32
> >
> > On Wed, Aug 7, 2019 at 11:53 AM enrico canzonieri wrote:
> >
> > > Hello,
> > >
> > > I noticed that the calls to the JobServer from the Python SDK do not
> > > have timeouts. If I'm not mistaken, that means that the call to
> > > pipeline.run() could hang forever if the JobServer is not running (or
> > > failing to start). E.g. in
> > > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/portable_runner.py#L307
> > > the call to Prepare() doesn't provide any timeout value, and the same
> > > applies to other JobServer requests.
> > > I was considering adding a --job-server-request-timeout to the
> > > PortableOptions class, to be used in the JobServer interactions inside
> > > portable_runner.py.
> > > Is there any specific reason why the timeout is not currently
> > > supported? Does anybody have any objection to adding the jobserver
> > > timeout? I could volunteer to file a ticket and submit a PR for this.
> > >
> > > Cheers,
> > > Enrico Canzonieri


Re: Beam Python Portable Runner - Adding timeout to JobServer grpc calls

2019-08-08 Thread Enrico Canzonieri
Default timeout with no flag may work as well.

The main consideration here is whether some api calls may take longer than 60 
seconds because of the complexity of the users' Beam pipeline. E.g. Could 
job_service.Prepare() take longer than 60 seconds if the given Beam pipeline is 
extremely complex?

Basically, if there are cases where user code may cause the call duration to
increase to the point that the timeout prevents submitting the app itself,
then we should consider having a flag.

On 2019/08/07 20:13:12, Ahmet Altay wrote:

> Could we pick a default timeout value instead of introducing a flag? We
> use 60 seconds as the default timeout for the http client [1], we can do
> the same here.
>
> [1]
> https://github.com/apache/beam/blob/3a182d64c86ad038692800f5c343659ab0b935b0/sdks/python/apache_beam/internal/http_client.py#L32
>
> On Wed, Aug 7, 2019 at 11:53 AM enrico canzonieri wrote:
>
> > Hello,
> >
> > I noticed that the calls to the JobServer from the Python SDK do not have
> > timeouts. If I'm not mistaken, that means that the call to pipeline.run()
> > could hang forever if the JobServer is not running (or failing to start).
> > E.g. in
> > https://github.com/apache/beam/blob/master/sdks/python/apache_beam/runners/portability/portable_runner.py#L307
> > the call to Prepare() doesn't provide any timeout value, and the same
> > applies to other JobServer requests.
> > I was considering adding a --job-server-request-timeout to the
> > PortableOptions class, to be used in the JobServer interactions inside
> > portable_runner.py.
> > Is there any specific reason why the timeout is not currently supported?
> > Does anybody have any objection to adding the jobserver timeout? I could
> > volunteer to file a ticket and submit a PR for this.
> >
> > Cheers,
> > Enrico Canzonieri

Re: Allowing firewalled/offline builds of Beam

2019-08-08 Thread Lukasz Cwik
Udi beat me by a couple of mins.

We build a good portion of the Beam Java codebase internally within Google
by bypassing the gradle wrapper (gradlew) and executing the gradle command
from a full gradle installation at the root of a copy of the Beam codebase.

It does require your internal build system to use a version of gradle that
is compatible with the version [1] that gradlew uses, and you could create a
wrapper that figures out which version of gradle to use and selects the
appropriate one from many local gradle installations. This should allow you
to bypass the gradlew script entirely and any downloading it does.

Note that gradle does support an --offline flag, which we also use to ensure
that it doesn't pull stuff from the internet. Not sure if all the plugins
honor it, but it works well enough for us to build most of the Beam Java
codebase with it.

1:
https://github.com/apache/beam/blob/497bc77c0d53098887156a014a659184097ef021/gradle/wrapper/gradle-wrapper.properties#L20

On Thu, Aug 8, 2019 at 11:15 AM Udi Meiri  wrote:

> You can download it here: https://gradle.org/releases/
> and run it instead of using the wrapper.
>
> Example:
> $ cd
> $ unzip Downloads/gradle-5.5.1-bin.zip
> $ cd ~/src/beam
> $ ~/gradle-5.5.1/bin/gradle lint
>
>
> On Thu, Aug 8, 2019 at 10:52 AM Chad Dombrova  wrote:
>
>> This topic came up in another thread, so I wanted to highlight a few
>> things that we've discovered in our endeavors to build Beam behind a
>> firewall.
>>
>> Conceptually, in order to allow this, a user needs to provide alternate
>> mirrors for each "artifact" service required during build, and luckily I
>> think most of the toolchains used by Beam support this. For example, the
>> default PyPI mirror used by pip can be overridden via env var to an
>> internal mirror, and likewise for docker and its registry service.  I'm
>> currently looking into gogradle to see if we can provide an alternate
>> vendor directory as a shared resource behind our firewall. (I have a bigger
>> question here, which is why was it necessary to add a third language into
>> the python Beam ecosystem, just for the bootstrap process?  Couldn't the
>> boot code use python, or java?)
>>
>> But I'm getting ahead of myself.  We're actually stuck at the very
>> beginning, with gradlew.  The gradlew wrapper seems to unconditionally
>> download gradle, so you can't get past the first few hundred lines of code
>> in the build process without requiring internet access.  I made a ticket
>> here: https://issues.apache.org/jira/browse/BEAM-7931.  I'd love some
>> pointers on how to fix this, because the offending code lives inside
>> gradle-wrapper.jar, so I can't change it without access to the source.
>>
>> thanks,
>> -chad
>>
>>
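
(One note on BEAM-7931: the wrapper's download location is plain
configuration, so a firewalled build can point it at an internal mirror
rather than patching gradle-wrapper.jar. A sketch, where mirror.example.corp
is a placeholder host:)

    # gradle/wrapper/gradle-wrapper.properties
    # distributionUrl tells gradlew where to download Gradle from; pointing
    # it at an internal mirror keeps the bootstrap behind the firewall.
    distributionUrl=https\://mirror.example.corp/gradle/gradle-5.5.1-bin.zip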


Re: Allowing firewalled/offline builds of Beam

2019-08-08 Thread Udi Meiri
You can download it here: https://gradle.org/releases/
and run it instead of using the wrapper.

Example:
$ cd
$ unzip Downloads/gradle-5.5.1-bin.zip
$ cd ~/src/beam
$ ~/gradle-5.5.1/bin/gradle lint


On Thu, Aug 8, 2019 at 10:52 AM Chad Dombrova  wrote:

> This topic came up in another thread, so I wanted to highlight a few
> things that we've discovered in our endeavors to build Beam behind a
> firewall.
>
> Conceptually, in order to allow this, a user needs to provide alternate
> mirrors for each "artifact" service required during build, and luckily I
> think most of the toolchains used by Beam support this. For example, the
> default PyPI mirror used by pip can be overridden via env var to an
> internal mirror, and likewise for docker and its registry service.  I'm
> currently looking into gogradle to see if we can provide an alternate
> vendor directory as a shared resource behind our firewall. (I have a bigger
> question here, which is why was it necessary to add a third language into
> the python Beam ecosystem, just for the bootstrap process?  Couldn't the
> boot code use python, or java?)
>
> But I'm getting ahead of myself.  We're actually stuck at the very
> beginning, with gradlew.  The gradlew wrapper seems to unconditionally
> download gradle, so you can't get past the first few hundred lines of code
> in the build process without requiring internet access.  I made a ticket
> here: https://issues.apache.org/jira/browse/BEAM-7931.  I'd love some
> pointers on how to fix this, because the offending code lives inside
> gradle-wrapper.jar, so I can't change it without access to the source.
>
> thanks,
> -chad
>
>




Inconsistent Results with GroupIntoBatches PTransform

2019-08-08 Thread rahul patwari
Hi,

I am getting inconsistent results when using the GroupIntoBatches PTransform.
I am using the Create.of() PTransform to create a PCollection from in-memory
data. When a coder is given to the Create.of() PTransform, I am facing the
issue. If the coder is not provided, the results are consistent and correct
(maybe this is just a coincidence and the problem is at some other place).
If the batch size is 1, the results are always consistent.

Not sure if this is an issue with Serialization/Deserialization (or)
GroupIntoBatches (or) Create.of() PTransform.

The Java code, expected correct results, and inconsistent results are
available at https://github.com/rahul8383/beam-examples

Thanks,
Rahul


Re: Jira email notifications

2019-08-08 Thread Udi Meiri
Is your email set correctly?
You can see it if you hit the edit button for profile details.
[image: UaSgEcLBGeM.png]

On Wed, Aug 7, 2019 at 5:16 PM sridhar inuog  wrote:

> Yes, I am already on the "Watchers" list
>
> On Wed, Aug 7, 2019 at 7:13 PM Pablo Estrada  wrote:
>
>> Have you tried "watching" the particular JIRA issue? There's a "Watch"
>> thing on the right-hand side of an issue page.
>>
>> Happy to help more if that's not helpful : )
>> Best
>> -P.
>>
>> On Wed, Aug 7, 2019 at 5:09 PM sridhar inuog 
>> wrote:
>>
>>> Hi,
>>>Is there a way to get notifications whenever a jira issue is
>>> updated?  The only place I can see this can be enabled is
>>>
>>>  profile -> Preferences -> My Changes (Notify me)
>>>
>>> Even though the description seems a little bit misleading, I don't see
>>> any other place to make any changes.
>>>
>>> ---
>>> Whether to email notifications of any changes you make.
>>> -
>>>
>>> Any other places I need to change to get notifications?
>>>
>>> Thanks,
>>> Sridhar
>>>
>>




Re: (mini-doc) Beam (Flink) portable job templates

2019-08-08 Thread Kyle Weaver
Re Javaless/serverless solution:
I take it this would probably mean that we would construct the jar directly
from the SDK. There are advantages to this: full separation of Python and
Java environments, no need for a job server, and likely a simpler
implementation, since we'd no longer have to work within the constraints of
the existing job server infrastructure. The only downside I can think of is
the additional cost of implementing/maintaining jar creation code in each
SDK, but that cost may be acceptable if it's simple enough.

Kyle Weaver | Software Engineer | github.com/ibzib | kcwea...@google.com


On Thu, Aug 8, 2019 at 9:31 AM Thomas Weise  wrote:

>
>
> On Thu, Aug 8, 2019 at 8:29 AM Robert Bradshaw 
> wrote:
>
>> > Before assembling the jar, the job server runs to create the
>> ingredients. That requires the (matching) Java environment on the Python
>> developer's machine.
>>
>> We can run the job server and have it create the jar (and if we keep
>> the job server running we can use it to interact with the running
>> job). However, if the jar layout is simple enough, there's no need to
>> even build it from Java.
>>
>> Taken to the extreme, this is a one-shot, jar-based JobService API. We
>> choose a standard layout of where to put the pipeline description and
>> artifacts, and can "augment" an existing jar (that has a
>> runner-specific main class whose entry point knows how to read this
>> data to kick off a pipeline as if it were a user's driver code) into
>> one that has a portable pipeline packaged into it for submission to a
>> cluster.
>>
>
> It would be nice if the Python developer doesn't have to run anything Java
> at all.
>
> As we just discussed offline, this could be accomplished by including the
> proto that is produced by the SDK into the pre-existing jar.
>
> And if the jar has an entry point that creates the Flink job in the
> prescribed manner [1], it can be directly submitted to the Flink REST API.
> That would allow for a Java-free client.
>
> [1]
> https://lists.apache.org/thread.html/6db869c53816f4e2917949a7c6992c2b90856d7d639d7f2e1cd13768@%3Cdev.flink.apache.org%3E
>
>


Re: (mini-doc) Beam (Flink) portable job templates

2019-08-08 Thread Thomas Weise
On Thu, Aug 8, 2019 at 8:29 AM Robert Bradshaw  wrote:

> > Before assembling the jar, the job server runs to create the
> ingredients. That requires the (matching) Java environment on the Python
> developers machine.
>
> We can run the job server and have it create the jar (and if we keep
> the job server running we can use it to interact with the running
> job). However, if the jar layout is simple enough, there's no need to
> even build it from Java.
>
> Taken to the extreme, this is a one-shot, jar-based JobService API. We
> choose a standard layout of where to put the pipeline description and
> artifacts, and can "augment" an existing jar (that has a
> runner-specific main class whose entry point knows how to read this
> data to kick off a pipeline as if it were a user's driver code) into
> one that has a portable pipeline packaged into it for submission to a
> cluster.
>

It would be nice if the Python developer doesn't have to run anything Java
at all.

As we just discussed offline, this could be accomplished by including the
proto that is produced by the SDK into the pre-existing jar.

And if the jar has an entry point that creates the Flink job in the
prescribed manner [1], it can be directly submitted to the Flink REST API.
That would allow for a Java-free client.

[1]
https://lists.apache.org/thread.html/6db869c53816f4e2917949a7c6992c2b90856d7d639d7f2e1cd13768@%3Cdev.flink.apache.org%3E


Re: Java 11 compatibility question

2019-08-08 Thread Mark Liu
Some actions we did for py2 to py3 works:
- Check and resolve incompatible dependencies.
- Enable py3 lint.
- Fill feature gaps between py2 and py3 (e.g. new py3 container, new
solution for type hint)
- Add unit tests, integration tests and other tests on py3 for coverage.
- Release (p3) and deprecation (p2) plan.

Hope this helps on Java upgrade.

Mark

On Wed, Aug 7, 2019 at 3:19 PM Ahmet Altay  wrote:

>
>
> On Wed, Aug 7, 2019 at 12:21 PM Elliotte Rusty Harold 
> wrote:
>
>> gRPC bug here: https://github.com/grpc/grpc-java/issues/3522
>>
>> google-cloud-java bug:
>> https://github.com/googleapis/google-cloud-java/issues/5760
>>
>> Neither has a cheap or easy fix, I'm afraid. Commenting on these
>> issues might help us prove that there's a demand to prioritize these
>> compared to other work. If anyone has a support contract and could
>> file a ticket asking for a fix, that would help even more.
>>
>> Those are the two I know about. There might be others elsewhere in the
>> dependency tree.
>>
>>
>> On Wed, Aug 7, 2019 at 2:25 PM Lukasz Cwik  wrote:
>> >
>> > Since java8 -> java11 is similar to python2 -> python3 migration, what
>> was the acceptance criteria there?
>>
>
> I do not remember formally discussing this. The bar used was, all existing
> tests will pass for python2 and python3. New tests will be added for
> python3 specific features. (To avoid any confusion this bar has not been
> cleared yet.)
>
> cc: +Valentyn Tymofieiev  could add more details.
>
>
>> >
>> > On Wed, Aug 7, 2019 at 1:54 PM Elliotte Rusty Harold <
>> elh...@ibiblio.org> wrote:
>> >>
>> >>
>> >>
>> >> On Wed, Aug 7, 2019 at 9:41 AM Michał Walenia <
>> michal.wale...@polidea.com> wrote:
>> >>>
>> >>>
>> >>> Are these tests sufficient to say that we’re java 11 compatible? What
>> other aspects do we need to test to be able to say that?
>> >>>
>> >>>
>> >>
>> >> Are any packages split across multiple jar files, including packages
>> beam depends on? That's the one that's bitten some other projects,
>> including google-cloud-java and gRPC. If so, beam is not going to work with
>> the module system.
>> >>
>> >> Work is ongoing to fix split packages in both gRPC and
>> google-cloud-java, but we're not very far down that path and I think it's
>> going to be an API breaking change.
>> >>
>> > Romain pointed this out earlier and I fixed the last case of packages
>> being split across multiple jars within Apache Beam but as you point out
>> our transitive dependencies are not ready.
>> >>
>> >>
>> >> --
>> >> Elliotte Rusty Harold
>> >> elh...@ibiblio.org
>>
>>
>>
>> --
>> Elliotte Rusty Harold
>> elh...@ibiblio.org
>>
>


Re: Proposal for SDFs in the Go SDK

2019-08-08 Thread Robert Burke
Thanks for the spending the time writing this up! I'm looking forward to
seeing how the prototype implementation plays out. In particular with the
extensive section on how users will actually use the presented API to get
their DoFns to scale.

 (Disclosure: I helped pre-review the document, which is why I don't have
any further commentary at this time.)

On Wed, Aug 7, 2019, 11:57 AM Daniel Oliveira 
wrote:

> Hello Beam devs,
>
> I've been working on a proposal for implementing SDFs in the Go SDK. For
> those who were unaware, the Go SDK hasn't supported SDFs in any capacity
> yet, so my proposal covers the user-facing API and a basic look into how it
> will work under the hood.
>
> I'd appreciate it if anyone interested in the Go SDK or anyone who's been
> working with portable SDFs could give it a look and provide some feedback.
> There's still a few open questions mentioned in the doc that I'd like to
> get feedback on before deciding on anything.
>
>
> https://docs.google.com/document/d/14IwJYEUpar5FmiPNBFvERADiShZjsrsMpgtlntPVCX0/edit?usp=sharing
>
> Thanks,
> Daniel Oliveira
>


Re: [ANNOUNCE] New committer: Kyle Weaver

2019-08-08 Thread Robert Burke
Congrats! Also, thanks for getting the Go SDK integration tests running
against Flink and Spark as well :D

On Wed, Aug 7, 2019, 11:21 PM Rakesh Kumar  wrote:

> Congrats Kyle!!
>
> On Wed, Aug 7, 2019 at 11:30 AM Heejong Lee  wrote:
>
>> Congratulations!
>>
>> On Wed, Aug 7, 2019 at 11:05 AM Tanay Tummalapalli 
>> wrote:
>>
>>> Congratulations!
>>>
>>> On Wed, Aug 7, 2019 at 11:27 PM Robin Qiu  wrote:
>>>
 Congratulations, Kyle!

 On Wed, Aug 7, 2019 at 5:04 AM Valentyn Tymofieiev 
 wrote:

> Congrats, Kyle!
>
> On Wed, Aug 7, 2019 at 1:01 PM Ismaël Mejía  wrote:
>
>> Congrats Kyle, well deserved :clap: !
>>
>> On Wed, Aug 7, 2019, 11:22 AM Gleb Kanterov  wrote:
>>
>>> Congratulations!
>>>
>>> On Wed, Aug 7, 2019 at 7:01 AM Connell O'Callaghan <
>>> conne...@google.com> wrote:
>>>
 Well done congratulations Kyle!!!

 On Tue, Aug 6, 2019 at 21:58 Thomas Weise  wrote:

> Congrats!
>
> On Tue, Aug 6, 2019, 7:24 PM Reza Rokni  wrote:
>
>> Congratz!
>>
>> On Wed, 7 Aug 2019 at 06:40, Chamikara Jayalath <
>> chamik...@google.com> wrote:
>>
>>> Congrats!!
>>>
>>> On Tue, Aug 6, 2019 at 3:33 PM Udi Meiri 
>>> wrote:
>>>
 Congrats Kyle!

 On Tue, Aug 6, 2019 at 2:00 PM Melissa Pashniak <
 meliss...@google.com> wrote:

> Congratulations Kyle!
>
> On Tue, Aug 6, 2019 at 1:36 PM Yichi Zhang 
> wrote:
>
>> Congrats Kyle!
>>
>> On Tue, Aug 6, 2019 at 1:29 PM Aizhamal Nurmamat kyzy <
>> aizha...@google.com> wrote:
>>
>>> Thank you, Kyle! And congratulations :)
>>>
>>> On Tue, Aug 6, 2019 at 10:09 AM Hannah Jiang <
>>> hannahji...@google.com> wrote:
>>>
 Congrats Kyle!

 On Tue, Aug 6, 2019 at 9:52 AM David Morávek <
 david.mora...@gmail.com> wrote:

> Congratulations Kyle!!
>
> Sent from my iPhone
>
> On 6 Aug 2019, at 18:47, Anton Kedin 
> wrote:
>
> Congrats!
>
> On Tue, Aug 6, 2019, 9:37 AM Ankur Goenka <
> goe...@google.com> wrote:
>
>> Congratulations Kyle!
>>
>> On Tue, Aug 6, 2019 at 9:35 AM Ahmet Altay <
>> al...@google.com> wrote:
>>
>>> Hi,
>>>
>>> Please join me and the rest of the Beam PMC in welcoming
>>> a new committer: Kyle Weaver.
>>>
>>> Kyle has been contributing to Beam for a while now. And
>>> in that time period Kyle got the portable Spark runner
>>> feature-complete for batch processing. [1]
>>>
>>> In consideration of Kyle's contributions, the Beam PMC
>>> trusts him with the responsibilities of a Beam committer
>>>  [2].
>>>
>>> Thank you, Kyle, for your contributions and looking
>>> forward to many more!
>>>
>>> Ahmet, on behalf of the Apache Beam PMC
>>>
>>> [1]
>>> https://lists.apache.org/thread.html/c43678fc24c9a1dc9f48c51c51950aedcb9bc0fd3b633df16c3d595a@%3Cuser.beam.apache.org%3E
>>> [2] https://beam.apache.org/contribute/become-a-
>>> committer/#an-apache-beam-committer
>>>
>>
>>
>>
>
>>>
>>> --
>>> Cheers,
>>> Gleb
>>>
>>


Re: (mini-doc) Beam (Flink) portable job templates

2019-08-08 Thread Robert Bradshaw
On Wed, Aug 7, 2019 at 5:59 PM Thomas Weise  wrote:
>
>> > * The pipeline construction code itself may need access to cluster 
>> > resources. In such cases the jar file cannot be created offline.
>>
>> Could you elaborate?
>
>
> The entry point is arbitrary code written by the user, not limited to Beam 
> pipeline construction alone. For example, there could be access to a file 
> system or other service to fetch metadata that is required to build the 
> pipeline. Such services can be accessed when the code runs within the 
> infrastructure, but typically not in a development environment.

Yes, this may be limited to the case where the pipeline construction
can be done on the user's machine before submission (remotely staging
and executing the Python (or Go, or ...) code within the
infrastructure to build the pipeline and then running the job server
there is a bit more complicated). We control the entry point from then
on.

>> > * For k8s deployment, a container image with the SDK and application code 
>> > is required for the worker. The jar file (which is really a derived 
>> > artifact) would need to be built in addition to the container image.
>>
>> Yes. For standard use, a vanilla released Beam published SDK container
>> + staged artifacts should be sufficient.
>>
>> > * To build such jar file, the user would need a build environment with job 
>> > server and application code. Do we want to make that assumption?
>>
>> Actually, it's probably much easier than that. A jar file is just a
>> zip file with a standard structure, to which one can easily add (data)
>> files without having a full build environment. The (pre-compiled) main
>> class would know how to read this data to construct the pipeline and
>> kick off the job just like any other Flink job.
>
> Before assembling the jar, the job server runs to create the ingredients. 
> That requires the (matching) Java environment on the Python developer's
> machine.

We can run the job server and have it create the jar (and if we keep
the job server running we can use it to interact with the running
job). However, if the jar layout is simple enough, there's no need to
even build it from Java.

Taken to the extreme, this is a one-shot, jar-based JobService API. We
choose a standard layout of where to put the pipeline description and
artifacts, and can "augment" an existing jar (that has a
runner-specific main class whose entry point knows how to read this
data to kick off a pipeline as if it were the user's driver code) into
one that has a portable pipeline packaged into it for submission to a
cluster.
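
To make the jar-augmentation idea concrete, here is a minimal Python
sketch. The in-jar paths and the helper name are hypothetical, for
illustration only; the actual layout would be whatever the one-shot
jar-based JobService contract settles on:

import zipfile

# Assumed (hypothetical) in-jar layout for the portable pipeline; the
# real contract would be fixed by the jar-based JobService design above.
PIPELINE_ENTRY = "BEAM-PIPELINE/pipeline.pb"
ARTIFACT_DIR = "BEAM-PIPELINE/artifacts/"

def augment_jar(base_jar_path, out_jar_path, pipeline_proto_bytes, artifacts):
    """Copy a runner-provided jar and embed a portable pipeline in it.

    A jar is just a zip archive with a standard structure, so no Java
    build environment is needed. `artifacts` maps staged names to local
    file paths.
    """
    with zipfile.ZipFile(base_jar_path) as src, \
         zipfile.ZipFile(out_jar_path, "w", zipfile.ZIP_DEFLATED) as dst:
        # Copy the existing jar contents (including the runner-specific
        # main class whose entry point knows how to read this data).
        for item in src.infolist():
            dst.writestr(item, src.read(item.filename))
        # Add the serialized pipeline and its staged artifacts.
        dst.writestr(PIPELINE_ENTRY, pipeline_proto_bytes)
        for name, local_path in artifacts.items():
            dst.write(local_path, ARTIFACT_DIR + name)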


Re: [ANNOUNCE] New committer: Rui Wang

2019-08-08 Thread Thomas Weise
Congrats!

On Wed, Aug 7, 2019, 11:22 PM Rakesh Kumar  wrote:

> Congrats Rui!!
>
> On Wed, Aug 7, 2019 at 12:54 PM Rui Wang  wrote:
>
>> Thank you guys! Looking forward to contributing more to Beam community!
>>
>>
>> -Rui
>>
>> On Wed, Aug 7, 2019 at 11:05 AM Tanay Tummalapalli 
>> wrote:
>>
>>> Congratulations!
>>>
>>> On Wed, Aug 7, 2019 at 11:28 PM Robin Qiu  wrote:
>>>
 Congratulations, Rui!

 On Wed, Aug 7, 2019 at 5:03 AM Valentyn Tymofieiev 
 wrote:

> Congrats, Rui!
>
> On Wed, Aug 7, 2019 at 1:00 PM Ismaël Mejía  wrote:
>
>> Congrats Rui!
>>
>> On Wed, Aug 7, 2019, 11:37 AM Gleb Kanterov  wrote:
>>
>>> Congratulations Rui! Well done!
>>>
>>> On Wed, Aug 7, 2019 at 7:01 AM Connell O'Callaghan <
>>> conne...@google.com> wrote:
>>>
 Well done Rui!!!

 On Tue, Aug 6, 2019 at 15:41 Chamikara Jayalath <
 chamik...@google.com> wrote:

> Congrats Rui.
>
> On Tue, Aug 6, 2019 at 2:00 PM Melissa Pashniak <
> meliss...@google.com> wrote:
>
>> Congrats Rui!
>>
>> On Tue, Aug 6, 2019 at 1:37 PM Yichi Zhang 
>> wrote:
>>
>>> Congrats Rui!
>>>
>>> On Tue, Aug 6, 2019 at 1:29 PM Aizhamal Nurmamat kyzy <
>>> aizha...@google.com> wrote:
>>>
 Congratulations, Rui! Thank you for your contributions to Beam!

 On Tue, Aug 6, 2019 at 10:35 AM sridhar inuog <
 sridharin...@gmail.com> wrote:

> Congratulations Rui!
>
> On Tue, Aug 6, 2019 at 12:09 PM Hannah Jiang <
> hannahji...@google.com> wrote:
>
>> Congrats Rui!
>>
>> On Tue, Aug 6, 2019 at 9:50 AM Yifan Zou 
>> wrote:
>>
>>> Congratulations Rui!
>>>
>>> On Tue, Aug 6, 2019 at 9:47 AM Anton Kedin 
>>> wrote:
>>>
 Congrats!

 On Tue, Aug 6, 2019, 9:36 AM Ankur Goenka <
 goe...@google.com> wrote:

> Congratulations Rui!
> Well deserved 
>
> On Tue, Aug 6, 2019 at 9:35 AM Ahmet Altay <
> al...@google.com> wrote:
>
>> Hi,
>>
>> Please join me and the rest of the Beam PMC in welcoming
>> a new committer: Rui Wang.
>>
>> Rui has been an active contributor since May 2018. Rui
>> has been very active in Beam SQL [1] and continues to help 
>> out on user@
>> and StackOverflow. Rui is one of the top answerers for 
>> apache-beam tag [2].
>>
>> In consideration of Rui's contributions, the Beam PMC
>> trusts him with the responsibilities of a Beam committer
>>  [3].
>>
>> Thank you, Rui, for your contributions and looking
>> forward to many more!
>>
>> Ahmet, on behalf of the Apache Beam PMC
>>
>> [1]
>> https://github.com/apache/beam/pulls?q=is%3Apr+author%3Aamaliujia
>> [2] https://stackoverflow.com/tags/apache-beam/topusers
>> [3] https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>
>
>>>
>>> --
>>> Cheers,
>>> Gleb
>>>
>>


Re: [PROPOSAL] An initial Schema API in Python

2019-08-08 Thread Robert Bradshaw
On Wed, Aug 7, 2019 at 11:12 PM Brian Hulette  wrote:
>
> Thanks for all the suggestions, I've added responses inline.
>
> On Wed, Aug 7, 2019 at 12:52 PM Chad Dombrova  wrote:
>>
>> There’s a lot of ground to cover here, so I’m going to pull from a few 
>> different responses.
>>
>> 
>>
>> numpy ints
>>
>> A properly written library should accept any type implementing the __int__
>> (or __index__) methods in place of an int, rather than doing explicit type
>> checks.
>>
>> Yes, but the reality is that very very few actually do this, including Beam 
>> itself (check the code for Timestamp and Duration, to name a few).
>>
>> Which brings me to my next topic:
>>
>> I tested this out with mypy and it would not be compatible:
>>
>> def square(x: int):
>>     return x * x
>>
>> square(np.int16(32))  # mypy error
>>
>> The proper way to check this scenario is using typing.SupportsInt. Note that 
>> this only guarantees that __int__ exists, so you still need to cast to int 
>> if you want to do anything with the object:
>>
>> def square(x: typing.SupportsInt) -> int:
>>     if not isinstance(x, int):
>>         x = int(x)
>>     return x * x
>>
>> square('foo')  # error!
>> square(1.2)  # ok
>
>  Yep I came across this while writing my last reply. I agree though it seems 
> unlikely that many libraries actually do this.
>
>> 
>>
>> Native python ints
>>
>> Agreed on float since it seems to trivially map to a double, but I’m torn on 
>> int still. While I do want int type hints to work, it doesn’t seem 
>> appropriate to map it to AtomicType.INT64, since it has a completely 
>> different range of values.
>>
>> Let’s say we used native int for the runtime field type, not just as a 
>> schema declaration for numpy.int64. What is the real world fallout from 
>> this? Would there be data loss?
>
> I'm not sure I follow the question exactly; what is the interplay between int
> and numpy.int64 in this scenario? Are you saying that np.int64 is used in the 
> schema declaration, but we just use native int at runtime, and check the bit 
> width when encoding?
>
> In any case, I don't think the real world fallout of using int is nearly that 
> dire. I suppose data loss is possible if a poorly designed pipeline overflows 
> an int64 and crashes,

The primary risk is that it *won't* crash when overflowing an int64,
it'll just silently give the wrong answer. That's much less safe than
using a native int and then actually crashing in the case it's too
large at the point one tries to encode it.
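
A minimal numpy sketch of that failure mode (assuming, for
illustration, that INT64 schema fields are backed by np.int64 values
at runtime):

import numpy as np

# int64 arithmetic wraps around silently on overflow: 2**62 * 4 == 2**64,
# which is 0 modulo 2**64, and no exception is raised.
wrapped = np.array([2**62], dtype=np.int64) * 4
print(wrapped[0])  # -> 0

# A native Python int has arbitrary precision, so the value stays correct
# and an overflow can only surface (loudly) when encoding to int64.
exact = (2**62) * 4
print(exact)  # -> 18446744073709551616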

> but that's possible whether we use int or np.int64 at runtime. I'm just 
> saying that a user could be forgiven for thinking that they're safe from 
> overflows if they declare a schema as NamedTuple('Foo', 
> [('to_infinity_and_beyond', int)]), but they shouldn't make the same mistake 
> when they explicitly call it an int64.

Yes. But for schemas to be maximally useful, we'll want to be able to
infer them from all sorts of things that aren't written with Beam in
mind (e.g. external data classes, function annotations) and rejecting
the builtin int type will be a poor user experience here.

>> 
>>
>> Python3-only
>>
>> No need to worry about 2/3 compatibility for strings, we could just use str
>>
>> This is already widely handled throughout the Beam python SDK using the 
>> future/past library, so it seems silly to give up on this solution for 
>> schemas.
>>
>> On this topic, I added some comments to the PR about using 
>> past.builtins.unicode instead of numpy.unicode. They’re the same type, but 
>> there’s no reason to get this via numpy, when everywhere else in the code 
>> gets it from past.
>>
>> We could just use bytes for byte arrays (as a shorthand for 
>> typing.ByteString [1])
>>
>> Neat, but in my obviously very biased opinion it is not worth cutting off
>> Python 2 users over this.
>
> Ok, I won't do this :) I wasn't aware of typing.Sequence; that does seem like
> a good fit. The other two items are just nice-to-haves; I'm happy to work
> around those and use Sequence for arrays instead.

I would imagine that we could accept bytes or typing.ByteString for
BYTES, with only Python 2 users having to do the latter. (In both
Python 2 and Python 3 one would use str for STRING, it would decode to
past.builtins.unicode. This seems to capture the intent better than
mapping str to BYTES in Python 2 only.)
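
For reference, a sketch of what a typing-based schema declaration
might look like under the mapping discussed in this thread (the
type-to-schema correspondences in the comments reflect the proposal
under discussion, not a settled API):

import typing

class Transaction(typing.NamedTuple):
    item: str                   # STRING (unicode on both Python 2 and 3)
    price: float                # DOUBLE
    quantity: int               # would map to AtomicType.INT64 per this thread
    payload: bytes              # BYTES (typing.ByteString on Python 2)
    tags: typing.Sequence[str]  # an array of strings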


Re: What would be the best place for performance tests documentation?

2019-08-08 Thread Łukasz Gajowy
Thanks for the opinions. I'll publish it on Confluence in the above-mentioned
"testing guide". I also like this idea the best.

Thanks,
Łukasz

On Wed, Aug 7, 2019 at 19:55 Lukasz Cwik wrote:

> I also think confluence makes the most sense.
>
> On Wed, Aug 7, 2019 at 11:57 AM Alexey Romanenko 
> wrote:
>
>> I agree with Cyrus that a Confluence page should be a good place for that,
>> since it seems it will be very dev-oriented documentation.
>>
>>
>> On 7 Aug 2019, at 16:31, Cyrus Maden  wrote:
>>
>> Hi Łukasz,
>>
>> This sounds perfect for Confluence, since we already have the testing
>> guide in there. There's already a link to the testing guide in the
>> beam.apache.org/contribute section, which might help direct folks to the
>> new doc as well.
>>
>> Best,
>> Cyrus
>>
>> On Wed, Aug 7, 2019 at 7:54 AM Łukasz Gajowy  wrote:
>>
>>> Hi all,
>>>
>>> I'm currently working on documenting the load tests of Core Apache Beam
>>> operations, as we have some Jenkins jobs running and several dashboards
>>> for that.
>>>
>>> This is what I've got so far (work in progress, but comments are
>>> welcome): LINK
>>>
>>> I've got the following doubts on where I should put it:
>>>
>>>- the tests are rather dev-facing. Should the documentation for them
>>>be on the beam.apache.org page (where the user-facing documentation
>>>is located) or maybe only on Beam's Confluence?
>>>- if that belongs to Confluence only, should I simply place it under
>>>the "technical/design docs" section?
>>>- if this goes on the website, what is the best place to put it?
>>>
>>> What are your thoughts?
>>>
>>> Thanks,
>>> Łukasz
>>>
>>
>>


Re: Brief of interactive Beam

2019-08-08 Thread Robert Bradshaw
Thanks for the note. Are there any associated documents worth sharing as
well? More below.

On Wed, Aug 7, 2019 at 9:39 PM Ning Kang  wrote:

> To whom it may concern,
>
> This is Ning from Google. We are currently making efforts to leverage an
> interactive runner under the Python Beam SDK.
>
> There is already an interactive Beam (iBeam for short) runner with Jupyter
> notebook in the repo.
> Following the instructions on that page, one can set up an interactive
> environment to develop and execute Beam pipelines interactively.
>
> However, there are many issues with the existing iBeam. One issue is that it
> uses the concept of a leaf PCollection to cache and materialize intermediate
> PCollections. If the user wants to reuse/introspect a non-leaf PCollection,
> the interactive runner will run into errors.
>
> Our initial effort will be fixing the existing issues. We also want to
> make iBeam easy to use. Since iBeam uses the same model Beam uses, there
> isn't really any difference for users between creating a pipeline with the
> interactive runner and with other runners.
> So we want to minimize the interfaces a user needs to learn while giving
> the user some capability to interact with the interactive environment.
>
> See this initial PR; the interactive_beam module will provide mainly
> four interfaces:
>
>- For advanced users who define pipeline outside __main__, let them
>tell current interactive environment where they define their pipeline:
>watch()
>   - This is very useful for tests, where pipelines can be defined in
>   test methods.
>   - If the user simply creates pipeline in a Jupyter notebook or a
>   plain Python script, they don't have to know/use this feature at all.
>
>
This is for using visualize() below, or building further on the pipeline,
right?


>
>- Let users create an interactive pipeline: create_pipeline()
>   - invoking create_pipeline(), the user gets a Pipeline object that
>   works as any other Pipeline object created from apache_beam.Pipeline()
>   - However, when p.run() is invoked on the pipeline object p, it does
>   some extra interactive magic.
>   - We'll support interactive execution for DirectRunner at this
>   moment.
>
> How is this different than creating a pipeline with the interactive
runner? It'd be nice to reduce the number of new concepts a user needs to
know (and also reduce the number of changes needed to move from interactive
to non-interactive). Is there any need to limit this to the Direct runner?

>
>- Let users run the interactive pipeline as a normal pipeline:
>run_pipeline()
>   - In an interactive environment, a user only needs to add and
>   execute one line of code, run_pipeline(pipeline), to execute any existing
>   interactive pipeline object as a normal pipeline on any selected platform.
>   - We'll probably support Dataflow only. Other implementations can
>   be added though.
>
> Again, how is this different than pipeline.run()? What features require
limiting this to only certain runners?

>
>- Let users introspect any intermediate PCollection they have handler
>to: visualize()
>   - If a user ever writes pcoll = p | "Some Transform" >>
>   some_transform() ..., they can visualize(pcoll) once the pipeline p is
>   executed.
>   - p can be batch or streaming
>   - The visualization will be some plot of the data for the given
>   PCollection as if it were materialized. If the PCollection is unbounded,
>   the graph is dynamic.
>
> The PR will implement 1 and 2.
>
> We'll use https://issues.apache.org/jira/browse/BEAM-7923 as the top-level
> JIRA and add blocking JIRAs as development progresses.
>
> External Beam users will not need to worry about any of the underlying
> implementation details.
> Except the 4 interfaces above, they learn and write normal Beam code and
> can execute the pipeline immediately when they are done with prototyping.
>
> Ning.
>
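
For orientation, here is a sketch of how the four proposed interfaces
might fit together in a notebook session. The module path, import
alias, and exact signatures are assumptions based on the description
above, not a finalized API:

import apache_beam as beam
# Hypothetical import path for the proposed interactive_beam module.
from apache_beam.runners.interactive import interactive_beam as ib

# (1) Only needed when the pipeline is defined outside __main__, e.g. in
# test methods.
ib.watch(locals())

# (2) Create an interactive pipeline; it works like any other Pipeline,
# except that p.run() does some extra interactive magic.
p = ib.create_pipeline()
words = p | beam.Create(["to", "be", "or", "not", "to", "be"])
counts = words | beam.combiners.Count.PerElement()
p.run()  # interactive execution (DirectRunner at first)

# (4) Introspect any intermediate PCollection we hold a handle to.
ib.visualize(counts)

# (3) Run the same pipeline as a normal (non-interactive) job.
ib.run_pipeline(p)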


Re: [ANNOUNCE] New committer: Rui Wang

2019-08-08 Thread Rakesh Kumar
Congrats Rui!!

On Wed, Aug 7, 2019 at 12:54 PM Rui Wang  wrote:

> Thank you guys! Looking forward to contributing more to Beam community!
>
>
> -Rui
>
> On Wed, Aug 7, 2019 at 11:05 AM Tanay Tummalapalli 
> wrote:
>
>> Congratulations!
>>
>> On Wed, Aug 7, 2019 at 11:28 PM Robin Qiu  wrote:
>>
>>> Congratulations, Rui!
>>>
>>> On Wed, Aug 7, 2019 at 5:03 AM Valentyn Tymofieiev 
>>> wrote:
>>>
 Congrats, Rui!

 On Wed, Aug 7, 2019 at 1:00 PM Ismaël Mejía  wrote:

> Congrats Rui!
>
> On Wed, Aug 7, 2019, 11:37 AM Gleb Kanterov  wrote:
>
>> Congratulations Rui! Well done!
>>
>> On Wed, Aug 7, 2019 at 7:01 AM Connell O'Callaghan <
>> conne...@google.com> wrote:
>>
>>> Well done Rui!!!
>>>
>>> On Tue, Aug 6, 2019 at 15:41 Chamikara Jayalath <
>>> chamik...@google.com> wrote:
>>>
 Congrats Rui.

 On Tue, Aug 6, 2019 at 2:00 PM Melissa Pashniak <
 meliss...@google.com> wrote:

> Congrats Rui!
>
> On Tue, Aug 6, 2019 at 1:37 PM Yichi Zhang 
> wrote:
>
>> Congrats Rui!
>>
>> On Tue, Aug 6, 2019 at 1:29 PM Aizhamal Nurmamat kyzy <
>> aizha...@google.com> wrote:
>>
>>> Congratulations, Rui! Thank you for your contributions to Beam!
>>>
>>> On Tue, Aug 6, 2019 at 10:35 AM sridhar inuog <
>>> sridharin...@gmail.com> wrote:
>>>
 Congratulations Rui!

 On Tue, Aug 6, 2019 at 12:09 PM Hannah Jiang <
 hannahji...@google.com> wrote:

> Congrats Rui!
>
> On Tue, Aug 6, 2019 at 9:50 AM Yifan Zou 
> wrote:
>
>> Congratulations Rui!
>>
>> On Tue, Aug 6, 2019 at 9:47 AM Anton Kedin 
>> wrote:
>>
>>> Congrats!
>>>
>>> On Tue, Aug 6, 2019, 9:36 AM Ankur Goenka 
>>> wrote:
>>>
 Congratulations Rui!
 Well deserved 

 On Tue, Aug 6, 2019 at 9:35 AM Ahmet Altay <
 al...@google.com> wrote:

> Hi,
>
> Please join me and the rest of the Beam PMC in welcoming a
> new committer: Rui Wang.
>
> Rui has been an active contributor since May 2018. Rui has
> been very active in Beam SQL [1] and continues to help out on 
> user@
> and StackOverflow. Rui is one of the top answerers for 
> apache-beam tag [2].
>
> In consideration of Rui's contributions, the Beam PMC
> trusts him with the responsibilities of a Beam committer
>  [3].
>
> Thank you, Rui, for your contributions and looking forward
> to many more!
>
> Ahmet, on behalf of the Apache Beam PMC
>
> [1]
> https://github.com/apache/beam/pulls?q=is%3Apr+author%3Aamaliujia
> [2] https://stackoverflow.com/tags/apache-beam/topusers
> [3] https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>

>>
>> --
>> Cheers,
>> Gleb
>>
>


Re: [ANNOUNCE] New committer: Kyle Weaver

2019-08-08 Thread Rakesh Kumar
Congrats Kyle!!

On Wed, Aug 7, 2019 at 11:30 AM Heejong Lee  wrote:

> Congratulations!
>
> On Wed, Aug 7, 2019 at 11:05 AM Tanay Tummalapalli 
> wrote:
>
>> Congratulations!
>>
>> On Wed, Aug 7, 2019 at 11:27 PM Robin Qiu  wrote:
>>
>>> Congratulations, Kyle!
>>>
>>> On Wed, Aug 7, 2019 at 5:04 AM Valentyn Tymofieiev 
>>> wrote:
>>>
 Congrats, Kyle!

 On Wed, Aug 7, 2019 at 1:01 PM Ismaël Mejía  wrote:

> Congrats Kyle, well deserved :clap: !
>
> On Wed, Aug 7, 2019, 11:22 AM Gleb Kanterov  wrote:
>
>> Congratulations!
>>
>> On Wed, Aug 7, 2019 at 7:01 AM Connell O'Callaghan <
>> conne...@google.com> wrote:
>>
>>> Well done congratulations Kyle!!!
>>>
>>> On Tue, Aug 6, 2019 at 21:58 Thomas Weise  wrote:
>>>
 Congrats!

 On Tue, Aug 6, 2019, 7:24 PM Reza Rokni  wrote:

> Congratz!
>
> On Wed, 7 Aug 2019 at 06:40, Chamikara Jayalath <
> chamik...@google.com> wrote:
>
>> Congrats!!
>>
>> On Tue, Aug 6, 2019 at 3:33 PM Udi Meiri 
>> wrote:
>>
>>> Congrats Kyle!
>>>
>>> On Tue, Aug 6, 2019 at 2:00 PM Melissa Pashniak <
>>> meliss...@google.com> wrote:
>>>
 Congratulations Kyle!

 On Tue, Aug 6, 2019 at 1:36 PM Yichi Zhang 
 wrote:

> Congrats Kyle!
>
> On Tue, Aug 6, 2019 at 1:29 PM Aizhamal Nurmamat kyzy <
> aizha...@google.com> wrote:
>
>> Thank you, Kyle! And congratulations :)
>>
>> On Tue, Aug 6, 2019 at 10:09 AM Hannah Jiang <
>> hannahji...@google.com> wrote:
>>
>>> Congrats Kyle!
>>>
>>> On Tue, Aug 6, 2019 at 9:52 AM David Morávek <
>>> david.mora...@gmail.com> wrote:
>>>
 Congratulations Kyle!!

 Sent from my iPhone

 On 6 Aug 2019, at 18:47, Anton Kedin 
 wrote:

 Congrats!

 On Tue, Aug 6, 2019, 9:37 AM Ankur Goenka <
 goe...@google.com> wrote:

> Congratulations Kyle!
>
> On Tue, Aug 6, 2019 at 9:35 AM Ahmet Altay <
> al...@google.com> wrote:
>
>> Hi,
>>
>> Please join me and the rest of the Beam PMC in welcoming
>> a new committer: Kyle Weaver.
>>
>> Kyle has been contributing to Beam for a while now. And
>> in that time period Kyle got the portable spark runner 
>> feature complete for
>> batch processing. [1]
>>
>> In consideration of Kyle's contributions, the Beam PMC
>> trusts him with the responsibilities of a Beam committer
>>  [2].
>>
>> Thank you, Kyle, for your contributions and looking
>> forward to many more!
>>
>> Ahmet, on behalf of the Apache Beam PMC
>>
>> [1]
>> https://lists.apache.org/thread.html/c43678fc24c9a1dc9f48c51c51950aedcb9bc0fd3b633df16c3d595a@%3Cuser.beam.apache.org%3E
>> [2] https://beam.apache.org/contribute/become-a-committer/#an-apache-beam-committer
>>
>
>
>

>>
>> --
>> Cheers,
>> Gleb
>>
>


Re: [ANNOUNCE] New committer: Jan Lukavský

2019-08-08 Thread Rakesh Kumar
Congratulations Jan!!

On Wed, Aug 7, 2019 at 5:05 AM Valentyn Tymofieiev 
wrote:

> Congrats, Jan!
>
> On Wed, Aug 7, 2019 at 8:01 AM Connell O'Callaghan 
> wrote:
>
>> Well done Jan!!!
>>
>> On Tue, Aug 6, 2019 at 15:44 Chamikara Jayalath 
>> wrote:
>>
>>> Congrats Jan!
>>>
>>> On Thu, Aug 1, 2019 at 11:10 AM Thomas Weise  wrote:
>>>
 Congrats!


 On Thu, Aug 1, 2019 at 8:44 AM jincheng sun 
 wrote:

> Congratulations Jan!
> Best, Jincheng
>
> Gleb Kanterov wrote on Thu, Aug 1, 2019 at 5:09 PM:
>
>> Congratulations!
>>
>> On Thu, Aug 1, 2019 at 3:11 PM Reza Rokni  wrote:
>>
>>> Congratulations, awesome stuff!
>>>
>>> On Thu, 1 Aug 2019, 12:11 Maximilian Michels, 
>>> wrote:
>>>
 Congrats, Jan! Good to see you become a committer :)

 On 01.08.19 12:37, Łukasz Gajowy wrote:
 > Congratulations!
 >
 > On Thu, Aug 1, 2019 at 11:16 Robert Bradshaw wrote:
 >
 >> Congratulations!
 >>
 >> On Thu, Aug 1, 2019 at 9:59 AM Jan Lukavský wrote:
 >>
 >>> Thanks everyone!
 >>>
 >>> Looking forward to working with this great community! :-)
 >>>
 >>> Cheers,
 >>>  Jan
 >>>
 >>> On 8/1/19 12:18 AM, Rui Wang wrote:
 >>>> Congratulations!
 >>>>
 >>>> -Rui
 >>>>
 >>>> On Wed, Jul 31, 2019 at 10:51 AM Robin Qiu wrote:
 >>>>> Congrats!
 >>>>>
 >>>>> On Wed, Jul 31, 2019 at 10:31 AM Aizhamal Nurmamat kyzy wrote:
 >>>>>> Congratulations, Jan! Thank you for your contributions!
 >>>>>>
 >>>>>> On Wed, Jul 31, 2019 at 10:04 AM Tanay Tummalapalli wrote:
 >>>>>>> Congratulations!
 >>>>>>>
 >>>>>>> On Wed, Jul 31, 2019 at 10:05 PM Ahmet Altay wrote:
 >>>>>>>> Congratulations Jan! Thank you for your contributions!
 >>>>>>>>
 >>>>>>>> On Wed, Jul 31, 2019 at 2:30 AM Ankur Goenka wrote:
 >>>>>>>>> Congratulations Jan!
 >>>>>>>>>
 >>>>>>>>> On Wed, Jul 31, 2019, 1:23 AM David Morávek wrote:
 >>>>>>>>>> Congratulations Jan, well deserved! ;)
 >>>>>>>>>>
 >>>>>>>>>> D.
 >>>>>>>>>>
 >>>>>>>>>> On Wed, Jul 31, 2019 at 10:17 AM Ryan Skraba wrote:
 >>>>>>>>>>> Congratulations Jan!
 >>>>>>>>>>>
 >>>>>>>>>>> On Wed, Jul 31, 2019 at 10:10 AM Ismaël Mejía wrote:
 >>>>>>>>>>>> Hi,
 >>>>>>>>>>>>
 >>>>>>>>>>>> Please join me and the rest of the Beam PMC in welcoming a new
 >>>>>>>>>>>> committer: Jan Lukavský.
 >>>>>>>>>>>>
 >>>>>>>>>>>> Jan has been contributing to Beam for a while, he was part of the
 >>>>>>>>>>>> team that contributed the Euphoria DSL extension, and he has done
 >>>>>>>>>>>> interesting improvements for the Spark and Direct runner. He has
 >>>>>>>>>>>> also been active in the