Hazelcast Jet Runner

2019-02-14 Thread Can Gencer
We at Hazelcast are looking into writing a Beam runner for Hazelcast Jet (
https://github.com/hazelcast/hazelcast-jet). I wanted to introduce myself
as we'll likely have questions as we start development.

Some of the things I'm wondering about currently:

* There currently seems to be a guide available at
https://beam.apache.org/contribute/runner-guide/; is this up to date? Is
there anything specific to be aware of when starting on a new runner
that's not covered there?
* Should we be targeting the latest master, which is at 2.12-SNAPSHOT, or a
stable version?
* After a runner is developed, how is maintenance typically handled, given
that the runners seem to be part of the Beam codebase?


Re: Thoughts on a reference runner to invest in?

2019-02-14 Thread Kenneth Knowles
Interesting point about community and the fact that it didn't build a
Java-based ULR even though it has been a possibility for a long time.

It makes sense to me. A non-Java SDK needs portability to run on Beam's
distributed runners, so building the portable SDK harness is key, unlike
for Java. And to build it, a local portability-based runner is a great help
(can't really imagine doing it without one). And of course building it in
Python makes sense if you are steeped in Python.

Joking-but-not-Joking the best reference runner would probably be in some
less popular but very readable functional language so it is different from
every SDK :-). I've looked into it and discovered that gRPC support is not
great...

Kenn

On Thu, Feb 14, 2019 at 5:47 AM Robert Bradshaw  wrote:

> I think it's good to distinguish between direct runners (which would
> be good to have in every language, and can grow in sophistication with
> the userbase) and a fully universal reference runner. We should of
> course continue to grow and maintain the java-runners-core shared
> library, possibly as driven by the various production runners which
> has been the most productive to date. (The point about community is a
> good one. Unfortunately over the past 1.5 years the bigger Java
> community has not resulted in a more complete Java ULR (in terms of
> number of contributors or features/maturity), and it's unclear what
> would change that in the future.)
>
> It would be really great to have (at least) two completely separate
> implementations, but (at the moment at least) I see that as lower
> value than accelerating the efforts to get existing production runners
> onto portability.
>
> On Thu, Feb 14, 2019 at 2:01 PM Ismaël Mejía  wrote:
> >
> > This is a really interesting and important discussion. Having multiple
> > reference runners can have its pros and cons. It is all about
> > tradeoffs. From the end user point of view it can feel weird to deal
> > with tools and packaging of a different ecosystem, e.g. python devs
> > dealing with all the quirkiness of Java packaging, or, vice versa,
> > Java developers dealing with pip and friends. So having a reference
> > runner per language would be more natural and also help validate the
> > portability concept; however, having multiple reference runners sounds
> > harder from the maintenance point of view.
> >
> > Most of the software in the domain of Beam has traditionally been
> > written in Java, so there is a BIG advantage of ready-to-use (and
> > mature) libraries and reusable components (also, the reference runner
> > may profit from the libraries that Thomas and others in the community
> > have developed for multiple runners). This is a big win, but more
> > importantly, we can have more eyes looking at and contributing improvements
> > and fixes that will benefit the reference runner and others.
> >
> > Having a reference runner per language would be nice but if we must
> > choose only one language I prefer it to be Java just because we have a
> > bigger community that can contribute and improve it. We may work on
> > making the distribution of such a runner easier or friendlier for
> > users of different languages.
> >
> > On Wed, Feb 13, 2019 at 3:47 AM Robert Bradshaw 
> wrote:
> > >
> > > I agree, it's useful for runners that are used for tests (including
> testing SDKs) to push into the dark corners of what's allowed by the spec.
> I think this can be added (where they don't already exist) to existing
> non-production runners. (Whether a direct runner should be considered
> production or not depends on who you ask...)
> > >
> > > On Wed, Feb 13, 2019 at 2:49 AM Daniel Oliveira <
> danolive...@google.com> wrote:
> > >>
> > >> +1 to Kenn's point. Regardless of whether we go with a Python runner
> or a Java runner, I think we should have at least one portable runner that
> isn't a production runner for the reasons he outlined.
> > >>
> > >> As for the rest of the discussion, it sounds like people are
> generally supportive of having the Python FnApiRunner as that runner, and
> using Flink as a reference implementation for portability in Java.
> > >>
> > >> On Tue, Feb 12, 2019 at 4:37 PM Kenneth Knowles 
> wrote:
> > >>>
> > >>>
> > >>> On Tue, Feb 12, 2019 at 8:59 AM Thomas Weise  wrote:
> > 
> >  The Java ULR initially provided some value for the portability
> effort as Max mentions. It helped to develop the shared library for all
> Java runners and the job server functionality.
> > 
> >  However, I think the same could have been accomplished by
> developing the Flink runner instead of the Java ULR from the get go. This
> is also what happened later last year when support for state, timers and
> metrics was added to the portable Flink runner first and the ULR still does
> not support those features [1].
> > 
> >  Since all (or most) Java based runners that are based on another
> ASF project support embedded execution, I think it might make sense to
> discontinue separate direct runners for Java and instead focus efforts on
> making the runners that folks would also use in production better?

Re: [SQL] External schema providers

2019-02-14 Thread Kenneth Knowles
This is great. Being able to simply query existing schema-fied data is
such a huge win.

Kenn

On Thu, Feb 14, 2019 at 12:30 PM Rui Wang  wrote:

> Thanks Anton for contributing this! It's great progress that Beam SQL can
> connect to the Hive metastore! The HCatalogTableProvider implementation is also
> a good reference for people who want to implement a table provider for their
> metastore services.
>
> Just adding another design discussion that I am aware of:
> figuring out the better way to manage the AutoService table provider
> registration approach and the DDL approach in the JDBC driver code path.
>
> -Rui
>
> On Thu, Feb 14, 2019 at 11:42 AM Anton Kedin  wrote:
>
>> Hi dev@,
>>
>> A quick update about a new Beam SQL feature.
>>
>> In short, we have wired up support for plugging in table providers
>> through the Beam SQL API to allow obtaining table schemas from external sources.
>>
>> *What does it even mean?*
>>
>> Previously, in Java pipelines, you could apply a Beam SQL query to
>> existing PCollections. We have a special SqlTransform for that; it
>> converts a SQL query into an equivalent PTransform that is applied to the
>> PCollection of Rows.
>>
>> One major inconvenience of this approach is that to query something, it
>> has to be a PCollection, i.e. you have to read the data from a specific
>> source and then convert it to rows. This can mean multiple complications,
>> like potentially converting schemas from the source to Beam manually, or having
>> completely different logic when changing the source.
>>
>> The new API allows you to plug a schema provider that can resolve the
>> tables and schemas automatically if they already exist somewhere else. This
>> way Beam SQL, with the help of the provider, does the table lookup, then IO
>> configuration, and then schema conversion if needed.
>>
>> As an example, here's a query
>> [1]
>> that joins 2 existing PCollections with a table from Hive using
>> HCatalogTableProvider. Hive table lookup is automatic, the table
>> provider in this case will resolve the tables by talking to Hive Metastore
>> and will read the data by configuring and applying the HCatalogIO,
>> converting the records to Rows under the hood.
>>
>> *What's the status of this?*
>>
>> This is a working implementation, but the development is still ongoing,
>> there are bugs, the API might change, and there are a few more things I can see
>> coming related to this after further design discussions:
>>
>>  * refactor of the underlying table/metadata provider code;
>>  * working out the design for supporting creating / updating the tables
>> in the metadata provider;
>>  * creating a DDL syntax for it;
>>  * creating more providers;
>>
>> [1]
>> https://github.com/apache/beam/blob/116600f32013620e748723b8022a7023fa8e2528/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/BeamSqlHiveSchemaTest.java#L175,L190
>>
>


Re: Signing off

2019-02-14 Thread Kenneth Knowles
+1

Thanks for the contributions to community & code, and enjoy the new chapter!

Kenn

On Thu, Feb 14, 2019 at 3:25 PM Thomas Weise  wrote:

> Hi Scott,
>
> Thank you for the many contributions to Beam and best of luck with the new
> endeavor!
>
> Thomas
>
>
> On Thu, Feb 14, 2019 at 10:37 AM Scott Wegner  wrote:
>
>> I wanted to let you all know that I've decided to pursue a new adventure
>> in my career, which will take me away from Apache Beam development.
>>
>> It's been a fun and fulfilling journey. Apache Beam has been my first
>> significant experience working in open source. I'm inspired observing how
>> the community has come together to deliver something great.
>>
>> Thanks for everything. If you're curious what's next: I'll be working on
>> Federated Learning at Google:
>> https://ai.googleblog.com/2017/04/federated-learning-collaborative.html
>>
>> Take care,
>> Scott
>>
>>
>>
>> Got feedback? tinyurl.com/swegner-feedback
>>
>


Re: [PROPOSAL] Prepare Beam 2.11.0 release

2019-02-14 Thread Ahmet Altay
Update: Post-commit tests were mostly green and I cut the branch earlier
today. There are 27 issues tagged for this release [1]; I will start
triaging them. I would appreciate it if you could start moving
non-release-blocking issues off this list.

[1] https://issues.apache.org/jira/projects/BEAM/versions/12344775

On Mon, Feb 11, 2019 at 1:26 PM Sam Rohde  wrote:

> Thanks Ahmet! The 2.11.0 release will also be using the revised release
> process from PR-7529  that I
> authored. Let me know if you have any questions or if I can help in any
> way. I would love feedback on how to improve on the modifications I made
> and the release process in general.
>
> On Fri, Feb 8, 2019 at 9:20 AM Scott Wegner  wrote:
>
>> +1, thanks Ahmet! Let us know if you need any help.
>>
>>
>> On Fri, Feb 8, 2019 at 8:09 AM Alan Myrvold  wrote:
>>
>>> Thanks for volunteering. Glad to see the schedule being kept.
>>>
>>> On Fri, Feb 8, 2019 at 5:58 AM Maximilian Michels 
>>> wrote:
>>>
 +1 We should keep up the release cycle even though we are a bit late
 with 2.10.0

 On 08.02.19 02:10, Andrew Pilloud wrote:
 > +1 let's keep things going. Thanks for volunteering!
 >
 > Andrew
 >
 > On Thu, Feb 7, 2019, 4:49 PM Kenneth Knowles wrote:
 > I think this is a good idea. Even though 2.10.0 is still making its
 > way out, there's still 6 weeks of changes between the cut dates,
 > lots of value. It is good to keep the rhythm and follow the calendar.
 >
 > Kenn
 >
 > On Thu, Feb 7, 2019, 15:52 Ahmet Altay wrote:
 > Hi all,
 >
 > Beam 2.11 release branch cut date is 2/13 according to the
 > release calendar [1]. I would like to volunteer myself to do
 > this release. I intend to cut the branch on the planned 2/13 date.
 >
 > If you have release-blocking issues for 2.11, please mark their
 > "Fix Version" as 2.11.0. (I also created a 2.12.0 release in
 > JIRA in case you would like to move any non-blocking issues to
 > that version.)
 >
 > What do you think?
 >
 > Ahmet
 >
 > [1]
 >
 https://calendar.google.com/calendar/embed?src=0p73sl034k80oob7seouanigd0%40group.calendar.google.com&ctz=America%2FLos_Angeles
 >

>>>
>>
>> --
>>
>>
>>
>>
>> Got feedback? tinyurl.com/swegner-feedback
>>
>


Re: Signing off

2019-02-14 Thread Thomas Weise
Hi Scott,

Thank you for the many contributions to Beam and best of luck with the new
endeavor!

Thomas


On Thu, Feb 14, 2019 at 10:37 AM Scott Wegner  wrote:

> I wanted to let you all know that I've decided to pursue a new adventure
> in my career, which will take me away from Apache Beam development.
>
> It's been a fun and fulfilling journey. Apache Beam has been my first
> significant experience working in open source. I'm inspired observing how
> the community has come together to deliver something great.
>
> Thanks for everything. If you're curious what's next: I'll be working on
> Federated Learning at Google:
> https://ai.googleblog.com/2017/04/federated-learning-collaborative.html
>
> Take care,
> Scott
>
>
>
> Got feedback? tinyurl.com/swegner-feedback
>


Re: [SQL] External schema providers

2019-02-14 Thread Rui Wang
Thanks Anton for contributing this! It's great progress that Beam SQL can
connect to the Hive metastore! The HCatalogTableProvider implementation is also
a good reference for people who want to implement a table provider for their
metastore services.

Just adding another design discussion that I am aware of:
figuring out the better way to manage the AutoService table provider
registration approach and the DDL approach in the JDBC driver code path.

-Rui

On Thu, Feb 14, 2019 at 11:42 AM Anton Kedin  wrote:

> Hi dev@,
>
> A quick update about a new Beam SQL feature.
>
> In short, we have wired up support for plugging in table providers
> through the Beam SQL API to allow obtaining table schemas from external sources.
>
> *What does it even mean?*
>
> Previously, in Java pipelines, you could apply a Beam SQL query to
> existing PCollections. We have a special SqlTransform for that; it
> converts a SQL query into an equivalent PTransform that is applied to the
> PCollection of Rows.
>
> One major inconvenience of this approach is that to query something, it
> has to be a PCollection, i.e. you have to read the data from a specific
> source and then convert it to rows. This can mean multiple complications,
> like potentially converting schemas from the source to Beam manually, or having
> completely different logic when changing the source.
>
> The new API allows you to plug a schema provider that can resolve the
> tables and schemas automatically if they already exist somewhere else. This
> way Beam SQL, with the help of the provider, does the table lookup, then IO
> configuration, and then schema conversion if needed.
>
> As an example, here's a query
> [1]
> that joins 2 existing PCollections with a table from Hive using
> HCatalogTableProvider. Hive table lookup is automatic, the table provider
> in this case will resolve the tables by talking to Hive Metastore and will
> read the data by configuring and applying the HCatalogIO, converting the
> records to Rows under the hood.
>
> *What's the status of this?*
>
> This is a working implementation, but the development is still ongoing,
> there are bugs, the API might change, and there are a few more things I can see
> coming related to this after further design discussions:
>
>  * refactor of the underlying table/metadata provider code;
>  * working out the design for supporting creating / updating the tables in
> the metadata provider;
>  * creating a DDL syntax for it;
>  * creating more providers;
>
> [1]
> https://github.com/apache/beam/blob/116600f32013620e748723b8022a7023fa8e2528/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/BeamSqlHiveSchemaTest.java#L175,L190
>


Re: Signing off

2019-02-14 Thread Tim Robertson
What a shame for the project but best of luck for the future Scott.

Thanks for all your contributions - they have been significant!
Tim

On Thu, Feb 14, 2019 at 7:37 PM Scott Wegner  wrote:

> I wanted to let you all know that I've decided to pursue a new adventure
> in my career, which will take me away from Apache Beam development.
>
> It's been a fun and fulfilling journey. Apache Beam has been my first
> significant experience working in open source. I'm inspired observing how
> the community has come together to deliver something great.
>
> Thanks for everything. If you're curious what's next: I'll be working on
> Federated Learning at Google:
> https://ai.googleblog.com/2017/04/federated-learning-collaborative.html
>
> Take care,
> Scott
>
>
>
> Got feedback? tinyurl.com/swegner-feedback
>


[SQL] External schema providers

2019-02-14 Thread Anton Kedin
Hi dev@,

A quick update about a new Beam SQL feature.

In short, we have wired up support for plugging in table providers through
the Beam SQL API to allow obtaining table schemas from external sources.

*What does it even mean?*

Previously, in Java pipelines, you could apply a Beam SQL query to existing
PCollections. We have a special SqlTransform for that; it converts a SQL
query into an equivalent PTransform that is applied to the PCollection of
Rows.

One major inconvenience of this approach is that to query something, it has
to be a PCollection, i.e. you have to read the data from a specific source
and then convert it to rows. This can mean multiple complications, like
potentially converting schemas from the source to Beam manually, or having
completely different logic when changing the source.

The new API allows you to plug a schema provider that can resolve the
tables and schemas automatically if they already exist somewhere else. This
way Beam SQL, with the help of the provider, does the table lookup, then IO
configuration, and then schema conversion if needed.

As an example, here's a query
[1]
that joins 2 existing PCollections with a table from Hive using
HCatalogTableProvider. Hive table lookup is automatic, the table provider
in this case will resolve the tables by talking to Hive Metastore and will
read the data by configuring and applying the HCatalogIO, converting the
records to Rows under the hood.
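To make this more concrete, here is a minimal, hypothetical sketch of what such a
pipeline could look like in Java. The input PCollections, the Hive table, the
metastore URI, and the exact factory/method names (HCatalogTableProvider.create,
SqlTransform.withTableProvider) are assumptions based on the linked test [1] and may
differ between Beam versions:

    import com.google.common.collect.ImmutableMap;
    import java.util.Map;
    import org.apache.beam.sdk.extensions.sql.SqlTransform;
    import org.apache.beam.sdk.extensions.sql.meta.provider.hcatalog.HCatalogTableProvider;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionTuple;
    import org.apache.beam.sdk.values.Row;
    import org.apache.beam.sdk.values.TupleTag;

    // Sketch only: assumes two existing schema-aware PCollection<Row>s and a reachable
    // Hive metastore. The table provider resolves "hive.*" tables via the metastore and
    // reads them with HCatalogIO, converting records to Rows under the hood.
    static PCollection<Row> joinWithHive(PCollection<Row> customers, PCollection<Row> orders) {
      Map<String, String> hiveConfig =
          ImmutableMap.of("hive.metastore.uris", "thrift://example-metastore:9083");  // hypothetical URI

      return PCollectionTuple.of(new TupleTag<>("customers"), customers)
          .and(new TupleTag<>("orders"), orders)
          .apply(
              SqlTransform.query(
                      "SELECT c.name, o.amount, r.region_name "
                          + "FROM customers c "
                          + "JOIN orders o ON o.customer_id = c.id "
                          + "JOIN hive.sales.regions r ON r.id = c.region_id")  // hypothetical Hive table
                  // Method and provider names assumed; see [1] for the working version.
                  .withTableProvider("hive", HCatalogTableProvider.create(hiveConfig)));
    }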

*What's the status of this?*

This is a working implementation, but the development is still ongoing,
there are bugs, the API might change, and there are a few more things I can see
coming related to this after further design discussions:

 * refactor of the underlying table/metadata provider code;
 * working out the design for supporting creating / updating the tables in
the metadata provider;
 * creating a DDL syntax for it;
 * creating more providers;

[1]
https://github.com/apache/beam/blob/116600f32013620e748723b8022a7023fa8e2528/sdks/java/extensions/sql/src/test/java/org/apache/beam/sdk/extensions/sql/BeamSqlHiveSchemaTest.java#L175,L190


Signing off

2019-02-14 Thread Scott Wegner
I wanted to let you all know that I've decided to pursue a new adventure in
my career, which will take me away from Apache Beam development.

It's been a fun and fulfilling journey. Apache Beam has been my first
significant experience working in open source. I'm inspired observing how
the community has come together to deliver something great.

Thanks for everything. If you're curious what's next: I'll be working on
Federated Learning at Google:
https://ai.googleblog.com/2017/04/federated-learning-collaborative.html

Take care,
Scott



Got feedback? tinyurl.com/swegner-feedback


Re: pipeline steps

2019-02-14 Thread Yi Pan
@Chamikara, if adding the metadata interface class is too much of an effort
now, I would accept the solution with some special PTransform method that
adds the metadata to the output data types. What I wonder is whether, if this
kind of PTransform becomes popular across many different Beam IOs, we might
as well extend the existing Beam IO classes to automatically apply this
PTransform before delivering the raw message types from the IO (e.g.
Samza's SystemDescriptor implementation follows this pattern by allowing an
optional InputTransformer). Maybe we can wait to discuss this until the use
case becomes more common.

Best,

-Yi

On Tue, Feb 12, 2019 at 4:52 PM Chamikara Jayalath 
wrote:

> I think introducing a class hierarchy for extracting metadata from IO
> connectors might end up being overkill. I think what we need to do is to
> add new transforms to various IO connectors that would return associated
> metadata along with each record. This will be fine performance-wise as well
> since current transforms will not be affected. Each source will end up
> having its own implementation (as needed), but file-based source transforms
> will end up sharing a bunch of code here since they share underlying code
> for handling files. For example, we could add a
>
> *public class ReadAllViaFileBasedSource<T> extends
> PTransform<PCollection<ReadableFile>, PCollection<KV<String, T>>>*
>
> where *KV<String, T>* represents the filename and the original record.
>
> One caveat for file-based sources though is that we won't be able to
> support Read transforms since we basically feed in a glob and get a bunch
> of records (splitting happens within the source, so the composite transform is
> not aware of the exact files that would produce records unless we implement a
> new source). ReadFiles/ReadAll transforms should be more flexible, and we
> should be able to adapt them to support returning file names (and they'll
> have the full power of Read transforms after SDF).
>
> Thanks,
> Cham
>
>
> On Tue, Feb 12, 2019 at 12:42 PM Yi Pan  wrote:
>
>> The "general way" is what I was hoping to convey the idea (apologize if
>> KafkaIO is not a good example for that). More specifically, if KafkaIO
>> returns metadata in KafkaRecord, and somehow, FileIO returns a file name
>> associated with each record, it seems that it would make sense to define a
>> general interface of metadata for each record across different BeamIO, as
>> an optional envelope information. i.e. IOContext is the interface
>> associated with each record and KafkaIO implements KafkaIOContext, and
>> FileIO implements FileIOContext, etc.
>>
>> Best.
>>
>> On Mon, Feb 11, 2019 at 3:14 AM Alexey Romanenko <
>> aromanenko@gmail.com> wrote:
>>
>>> Talking about KafkaIO, it's already possible to have this since
>>> *"apply(KafkaIO.read())"* returns *"PCollection<KafkaRecord<K, V>>"* where
>>> *KafkaRecord* contains message metadata (topic, partition, etc).
>>> Though, it works _only_ if *"withoutMetadata()"* was not used before -
>>> in this case it will return a simple *KV<K, V>*.
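As a small illustration of the above (a sketch only, not from the original thread;
the broker, topic, and deserializer choices are made up and `p` stands for an
existing Pipeline), keeping the metadata simply means not calling withoutMetadata()
and reading the fields off each KafkaRecord:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.kafka.KafkaIO;
    import org.apache.beam.sdk.io.kafka.KafkaRecord;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.kafka.common.serialization.StringDeserializer;

    static PCollection<String> readWithMetadata(Pipeline p) {
      // Without withoutMetadata(), KafkaIO.read() yields KafkaRecord<K, V> elements that
      // still carry topic, partition and offset alongside the key/value payload.
      PCollection<KafkaRecord<String, String>> records =
          p.apply(
              KafkaIO.<String, String>read()
                  .withBootstrapServers("broker:9092")   // hypothetical broker
                  .withTopic("events")                   // hypothetical topic
                  .withKeyDeserializer(StringDeserializer.class)
                  .withValueDeserializer(StringDeserializer.class));

      return records.apply(
          ParDo.of(
              new DoFn<KafkaRecord<String, String>, String>() {
                @ProcessElement
                public void processElement(ProcessContext c) {
                  KafkaRecord<String, String> r = c.element();
                  // The metadata travels with every element; no extra interface is needed.
                  c.output(r.getTopic() + "/" + r.getPartition() + "@" + r.getOffset()
                      + " -> " + r.getKV().getValue());
                }
              }));
    }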
>>>
>>> At the same time, I agree that it would be useful to have some general
>>> way to obtain meta information about records across all Beam IOs.
>>>
>>> On 7 Feb 2019, at 18:25, Yi Pan  wrote:
>>>
>>> Shouldn't this apply to a more generic scenario for any Beam IO? For
>>> example, I am using KafkaIO and want to get the topic and partition from
>>> which the message was received. Some IOContext associated with each data
>>> unit from a Beam IO may be useful here?
>>>
>>> -Yi
>>>
>>> On Thu, Feb 7, 2019 at 6:29 AM Kenneth Knowles  wrote:
>>>
 This comes up a lot, wanting file names alongside the data that came
 from the file. It is a historical quirk that none of our connectors used to
 have the file names. What is the change needed for FileIO + parse Avro to
 be really easy to use?

 Kenn

 On Thu, Feb 7, 2019 at 6:18 AM Jeff Klukas  wrote:

> I haven't needed to do this with Beam before, but I've definitely had
> similar needs in the past. Spark, for example, provides an input_file_name
> function that can be applied to a dataframe to add the input file as an
> additional column. It's not clear to me how that's implemented, though.
>
> Perhaps others have suggestions, but I'm not aware of a way to do this
> conveniently in Beam today. To my knowledge, today you would have to use
> FileIO.match() and FileIO.readMatches() to get a collection of
> ReadableFile. You'd then have to use FlatMapElements to pull out the metadata
> and the bytes of the file, and you'd be responsible for parsing those bytes
> into Avro records. You'd be able to output something like a KV
> that groups the file name together with the parsed Avro record.
>
> Seems like something worth providing better support for in Beam itself
> if this indeed doesn't already exist.
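Not from the original thread, but a minimal sketch of the approach Jeff describes.
It uses a plain ParDo instead of FlatMapElements, assumes a hypothetical file
pattern, and naively reads whole files into memory; in practice you would also need
to set a coder such as AvroCoder for the GenericRecord values:

    import java.io.ByteArrayInputStream;
    import java.io.IOException;
    import org.apache.avro.file.DataFileStream;
    import org.apache.avro.generic.GenericDatumReader;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.FileIO;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;
    import org.apache.beam.sdk.values.KV;
    import org.apache.beam.sdk.values.PCollection;

    static PCollection<KV<String, GenericRecord>> readAvroWithFileNames(Pipeline p) {
      return p
          .apply(FileIO.match().filepattern("gs://example-bucket/input/*.avro"))  // hypothetical path
          .apply(FileIO.readMatches())
          .apply(
              ParDo.of(
                  new DoFn<FileIO.ReadableFile, KV<String, GenericRecord>>() {
                    @ProcessElement
                    public void processElement(ProcessContext c) throws IOException {
                      FileIO.ReadableFile file = c.element();
                      String fileName = file.getMetadata().resourceId().toString();
                      // Parse the whole file as Avro and emit each record keyed by its file name.
                      try (DataFileStream<GenericRecord> reader =
                          new DataFileStream<>(
                              new ByteArrayInputStream(file.readFullyAsBytes()),
                              new GenericDatumReader<>())) {
                        for (GenericRecord record : reader) {
                          c.output(KV.of(fileName, record));
                        }
                      }
                    }
                  }));
    }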
>
> On Thu, Feb 7, 2019 at 7:29 AM Chaim Turkel  wrote:
>
>> Hi,
>>   I am working on a pipeline that listens to a topic on pubsub to g

[BEAM-6671] Possible dependency issue in 2.9.0 NoSuchFieldError

2019-02-14 Thread Alex Amato
Filed this:
https://issues.apache.org/jira/browse/BEAM-6671

I received a report from a Dataflow user encountering this in Beam 2.9.0
when creating a Spanner instance. I wanted to post this here as this has
been known to be related to dependency conflicts in the past (
https://stackoverflow.com/questions/46684071/error-using-spannerio-in-apache-beam).
Does anyone have an idea of the root cause here? I am trying to get a bit
more information from the user in the meantime, to see if they added any
extra deps of their own. But I wanted to mention it here as well.


java.lang.NoSuchFieldError: internal_static_google_rpc_LocalizedMessage_fieldAccessorTable
        at com.google.rpc.LocalizedMessage.internalGetFieldAccessorTable(LocalizedMessage.java:90)
        at com.google.protobuf.GeneratedMessageV3.getDescriptorForType(GeneratedMessageV3.java:121)
        at io.grpc.protobuf.ProtoUtils.keyForProto(ProtoUtils.java:67)
        at com.google.cloud.spanner.spi.v1.SpannerErrorInterceptor.<init>(SpannerErrorInterceptor.java:47)
        at com.google.cloud.spanner.spi.v1.GrpcSpannerRpc.<init>(GrpcSpannerRpc.java:136)
        at com.google.cloud.spanner.SpannerOptions$DefaultSpannerRpcFactory.create(SpannerOptions.java:73)
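Not part of the report, but one quick way to confirm that this kind of
NoSuchFieldError comes from a dependency conflict (two different versions of the
generated com.google.rpc classes on the classpath) is to check at runtime which jar
the offending class is actually loaded from, e.g. with a throwaway snippet like:

    // Diagnostic sketch: prints the jar that com.google.rpc.LocalizedMessage was loaded
    // from, which usually points at the older or conflicting proto-google-common-protos.
    public class WhichJar {
      public static void main(String[] args) {
        System.out.println(
            com.google.rpc.LocalizedMessage.class
                .getProtectionDomain()
                .getCodeSource()
                .getLocation());
      }
    }

Comparing that location against the project's dependency report (e.g. mvn
dependency:tree) usually shows which artifact needs to be aligned.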


Re: Beam Python streaming pipeline on Flink Runner

2019-02-14 Thread Maximilian Michels
I've revised the document and included your feedback: 
https://s.apache.org/beam-cross-language-io


I think it reads much better now. I moved away from the JSON
configuration in favor of an explicit Proto-based configuration approach,
which leaves it up to the transform what to include in the Proto
payload. Transforms could still use JSON where that is feasible.


Let me know what you think.

-Max

On 08.02.19 16:07, Robert Bradshaw wrote:
On Fri, Feb 8, 2019 at 3:52 PM Maximilian Michels wrote:


Thank you for your comments. They will help to iterate over the ideas.

 > I'd like to point out that though this seems to be specifically
targeting IOs, there's nothing here that is specific to IOs.

I thought about this from an IO perspective but I agree that it amounts
to using any type of cross-language transform. However, there is the IO
specific question of how to parameterize IO transforms. 



On the other hand, I don't think parameterizing IO transforms is
different from parameterizing any other kind of transform.


If we end up
using PipelineOptions for that, there would really be no difference
anymore.


FWIW, I'm not suggesting that we use the pipeline-level PipelineOptions 
(as this makes it impossible to configure the same transform differently 
in different parts of the pipeline). Rather, this is an existing
cross-language configuration object/pattern we could possibly leverage.
(JSON is much easier to provide from Python (and likely other 
languages), but it's harder to know how to read it from the Java side, 
especially if it has UDFs, coders, schemas, etc. embedded.)


@Ismael Will add a link after another review round.

On 08.02.19 09:25, Robert Bradshaw wrote:
 > Thanks for writing this up. I'd like to point out that though
this seems
 > to be specifically targeting IOs, there's nothing here that is
specific
 > to IOs.
 >
 > On Thu, Feb 7, 2019 at 9:30 PM Chamikara Jayalath wrote:
 >
 >     Thanks Max. Added few comments.
 >
 >     - Cham
 >
 >         On Thu, Feb 7, 2019 at 11:18 AM Ismaël Mejía wrote:
 >
 >         Can you please add the link to the design document webpage.
 >
 >             On Thu, Feb 7, 2019 at 19:59, Maximilian Michels wrote:
 >
 >             I've created an initial design document:
 > https://s.apache.org/beam-cross-language-io
 >
 >             It does not contain all the details but perhaps it's
a good
 >             basis for a
 >             discussion on how we proceed.
 >
 >             -Max
 >
 >             On 06.02.19 19:49, Chamikara Jayalath wrote:
 >              >
 >              >
 >              > On Wed, Feb 6, 2019 at 8:38 AM Maximilian Michels wrote:
 >              >
 >              >     Thanks for your replies Robert and Cham.
 >              >
 >              >     What I had in mind was a generic Wrapper that would
 >              >     easily allow users to use IO from Java. Such a wrapper
 >              >     could start as an experimental feature and then,
 >              >     through URN versioning, become stable eventually.
 >              >
 >              >     UDFs are needed, though they are a special case. Most
 >              >     users (including Matthias) just want to specify a few
 >              >     String options which do not require UDFs but something
 >              >     along the lines of what I proposed here.
 >              >
 >              >
 >              > Sounds good, let's start documenting/implementing the
 >              > "easy" case and think a bit more regarding UDFs.
 >              >
 >              >
 >              >
 >              >     Robert wrote:
 >              >      > UDFs that are called from within an IO as
 >              >      > part of its operation is still an open question.
 >              >
 >              >     Exactly. How about we solve the easier case first,
 >              >     unblock users, and then think more about solving the
 >              >     general case?
 >              >
 >              >     Cham wrote:
 >        

Re: Thoughts on a reference runner to invest in?

2019-02-14 Thread Robert Bradshaw
I think it's good to distinguish between direct runners (which would
be good to have in every language, and can grow in sophistication with
the userbase) and a fully universal reference runner. We should of
course continue to grow and maintain the java-runners-core shared
library, possibly as driven by the various production runners which
has been the most productive to date. (The point about community is a
good one. Unfortunately over the past 1.5 years the bigger Java
community has not resulted in a more complete Java ULR (in terms of
number of contributors or features/maturity), and it's unclear what
would change that in the future.)

It would be really great to have (at least) two completely separate
implementations, but (at the moment at least) I see that as lower
value than accelerating the efforts to get existing production runners
onto portability.

On Thu, Feb 14, 2019 at 2:01 PM Ismaël Mejía  wrote:
>
> This is a really interesting and important discussion. Having multiple
> reference runners can have its pros and cons. It is all about
> tradeoffs. From the end user point of view it can feel weird to deal
> with tools and packaging of a different ecosystem, e.g. python devs
> dealing with all the quirkiness of Java packaging, or, vice versa,
> Java developers dealing with pip and friends. So having a reference
> runner per language would be more natural and also help validate the
> portability concept; however, having multiple reference runners sounds
> harder from the maintenance point of view.
>
> Most of the software in the domain of Beam has traditionally been
> written in Java, so there is a BIG advantage of ready-to-use (and
> mature) libraries and reusable components (also, the reference runner
> may profit from the libraries that Thomas and others in the community
> have developed for multiple runners). This is a big win, but more
> importantly, we can have more eyes looking at and contributing improvements
> and fixes that will benefit the reference runner and others.
>
> Having a reference runner per language would be nice but if we must
> choose only one language I prefer it to be Java just because we have a
> bigger community that can contribute and improve it. We may work on
> making the distribution of such a runner easier or friendlier for
> users of different languages.
>
> On Wed, Feb 13, 2019 at 3:47 AM Robert Bradshaw  wrote:
> >
> > I agree, it's useful for runners that are used for tests (including testing 
> > SDKs) to push into the dark corners of what's allowed by the spec. I think 
> > this can be added (where they don't already exist) to existing 
> > non-production runners. (Whether a direct runner should be considered 
> > production or not depends on who you ask...)
> >
> > On Wed, Feb 13, 2019 at 2:49 AM Daniel Oliveira  
> > wrote:
> >>
> >> +1 to Kenn's point. Regardless of whether we go with a Python runner or a 
> >> Java runner, I think we should have at least one portable runner that 
> >> isn't a production runner for the reasons he outlined.
> >>
> >> As for the rest of the discussion, it sounds like people are generally 
> >> supportive of having the Python FnApiRunner as that runner, and using 
> >> Flink as a reference implementation for portability in Java.
> >>
> >> On Tue, Feb 12, 2019 at 4:37 PM Kenneth Knowles  wrote:
> >>>
> >>>
> >>> On Tue, Feb 12, 2019 at 8:59 AM Thomas Weise  wrote:
> 
>  The Java ULR initially provided some value for the portability effort as 
>  Max mentions. It helped to develop the shared library for all Java 
>  runners and the job server functionality.
> 
>  However, I think the same could have been accomplished by developing the 
>  Flink runner instead of the Java ULR from the get go. This is also what 
>  happened later last year when support for state, timers and metrics was 
>  added to the portable Flink runner first and the ULR still does not 
>  support those features [1].
> 
>  Since all (or most) Java based runners that are based on another ASF 
>  project support embedded execution, I think it might make sense to 
>  discontinue separate direct runners for Java and instead focus efforts 
>  on making the runners that folks would also use in production better?
> >>>
> >>>
> >>> Caveat: if people only test using embedded execution of a production 
> >>> runner, they are quite likely to depend on quirks of that runner, such as 
> >>> bundle size, fusion, whether shuffle is also checkpoint, etc. I think 
> >>> there's a lot of value in an antagonistic testing runner, which is 
> >>> something the Java DirectRunner tried to do with GBK random ordering, 
> >>> checking illegal mutations, checking encodability. These were all driven 
> >>> by real user needs and each caught a lot of user bugs. That said, I 
> >>> wouldn't want to maintain an extra runner, but would like to put these 
> >>> into a portable runner, whichever it is.
> >>>
> >>> Kenn
> >>>
> 
> 
> 

Re: Thoughts on a reference runner to invest in?

2019-02-14 Thread Ismaël Mejía
This is a really interesting and important discussion. Having multiple
reference runners can have its pros and cons. It is all about
tradeoffs. From the end user point of view it can feel weird to deal
with tools and packaging of a different ecosystem, e.g. python devs
dealing with all the quirkiness of Java packaging, or, vice versa,
Java developers dealing with pip and friends. So having a reference
runner per language would be more natural and also help validate the
portability concept; however, having multiple reference runners sounds
harder from the maintenance point of view.

Most of the software in the domain of Beam has traditionally been
written in Java, so there is a BIG advantage of ready-to-use (and
mature) libraries and reusable components (also, the reference runner
may profit from the libraries that Thomas and others in the community
have developed for multiple runners). This is a big win, but more
importantly, we can have more eyes looking at and contributing improvements
and fixes that will benefit the reference runner and others.

Having a reference runner per language would be nice but if we must
choose only one language I prefer it to be Java just because we have a
bigger community that can contribute and improve it. We may work on
making the distribution of such a runner easier or friendlier for
users of different languages.

On Wed, Feb 13, 2019 at 3:47 AM Robert Bradshaw  wrote:
>
> I agree, it's useful for runners that are used for tests (including testing 
> SDKs) to push into the dark corners of what's allowed by the spec. I think 
> this can be added (where they don't already exist) to existing non-production 
> runners. (Whether a direct runner should be considered production or not 
> depends on who you ask...)
>
> On Wed, Feb 13, 2019 at 2:49 AM Daniel Oliveira  
> wrote:
>>
>> +1 to Kenn's point. Regardless of whether we go with a Python runner or a 
>> Java runner, I think we should have at least one portable runner that isn't 
>> a production runner for the reasons he outlined.
>>
>> As for the rest of the discussion, it sounds like people are generally 
>> supportive of having the Python FnApiRunner as that runner, and using Flink 
>> as a reference implementation for portability in Java.
>>
>> On Tue, Feb 12, 2019 at 4:37 PM Kenneth Knowles  wrote:
>>>
>>>
>>> On Tue, Feb 12, 2019 at 8:59 AM Thomas Weise  wrote:

 The Java ULR initially provided some value for the portability effort as 
 Max mentions. It helped to develop the shared library for all Java runners 
 and the job server functionality.

 However, I think the same could have been accomplished by developing the 
 Flink runner instead of the Java ULR from the get go. This is also what 
 happened later last year when support for state, timers and metrics was 
 added to the portable Flink runner first and the ULR still does not 
 support those features [1].

 Since all (or most) Java based runners that are based on another ASF 
 project support embedded execution, I think it might make sense to 
 discontinue separate direct runners for Java and instead focus efforts on 
 making the runners that folks would also use in production better?
>>>
>>>
>>> Caveat: if people only test using embedded execution of a production 
>>> runner, they are quite likely to depend on quirks of that runner, such as 
>>> bundle size, fusion, whether shuffle is also checkpoint, etc. I think 
>>> there's a lot of value in an antagonistic testing runner, which is 
>>> something the Java DirectRunner tried to do with GBK random ordering, 
>>> checking illegal mutations, checking encodability. These were all driven by 
>>> real user needs and each caught a lot of user bugs. That said, I wouldn't 
>>> want to maintain an extra runner, but would like to put these into a 
>>> portable runner, whichever it is.
>>>
>>> Kenn
>>>


 As for Python (and hopefully soon Go), it makes a lot of sense to have a 
 simple to use and stable runner that can be used for local development. At 
 the moment, the Py FnApiRunner seems the best candidate to serve as 
 reference for portability.

 On a related note, we should probably also consider making pure Java 
 pipeline execution via portability framework on a Java runner simpler and 
 more efficient. We already use embedded environment for testing. If we 
 also inline/embed the job server and this becomes readily available and 
 easy to use, it might improve chances of other runners migrating to 
 portability sooner.

 Thomas

 [1] https://s.apache.org/apache-beam-portability-support-table



 On Tue, Feb 12, 2019 at 3:34 AM Maximilian Michels  wrote:
>
> Do you consider job submission and artifact staging part of the
> ReferenceRunner? If so, these parts have been reused or served as a
> model for the portable FlinkRunner. So they had some value.
>
> A r