Re: [thread fork] Apache Beam & Google Cloud Dataflow

2016-06-16 Thread Frances Perry
With my Google employee hat on, I'd like to soften that claim a little ;-)

Currently, the Beam SDK runs against Google Cloud Dataflow. But since Beam
isn't itself ready for prime time yet, Google doesn't officially provide
support for running Beam on Cloud Dataflow right now, and Google Cloud
Dataflow customers should still use the original Dataflow Java SDK.

But I, for one, am looking forward to this evolving over the next few
months as Beam stabilizes ;-D


On Thu, Jun 16, 2016 at 9:50 PM, Jean-Baptiste Onofré 
wrote:

> Hi,
>
> as long as you use the Beam Dataflow runner, it should work smoothly.
>
> Regards
> JB
>
>
> On 06/16/2016 10:05 PM, Ismaël Mejía wrote:
>
>> Hello,
>>
>> One additional comment / question. I just noticed that Beam users can
>> already write their Beam pipelines and execute them with the Google
>> Dataflow runner.
>>
>> I just did the test today and I was thrilled to confirm that it worked (as
>> JB told me).
>>
>> You can look at the SDK version in the image:
>> https://imgur.com/k9HnLnv
>>
>> The question is: is this some kind of beta, or is this going to be
>> supported during the transition (before the formal 1.0 release)? I ask
>> this because I suppose many current Google users hesitate to move to Beam
>> for the moment because they don't know that they can already run their
>> pipelines in the Google Cloud Dataflow service. I think it is a good idea
>> to encourage users to move their data processing pipelines to the Beam
>> version.
>>
>> Regards,
>> Ismaël
>>
>>
>>
>>
>> On Wed, Jun 15, 2016 at 11:21 PM, James Malone <
>> jamesmal...@google.com.invalid> wrote:
>>
>>> Hi everyone,
>>>
>>> This is a thread fork from the email thread titled '[dev] Announcing
>>> 0.1.0-incubating release'.
>>>
>>> In that thread, Amir posed a good question:
>>>
>>> Why is "Google Cloud Dataflow" still included in the Beam release if
>>> Beam is indeed an evolution (super-set?) of "Google Cloud Dataflow"?
>>> Thanks + regards, Amir
>>>
>>> Many parts of Apache Beam are based on work from Google Cloud Dataflow,
>>> including the Dataflow (now Beam) model, SDKs (Java and Python), and some
>>> of the runners. This work was combined with awesome contributions from
>>> other groups (data Artisans/Apache Flink, Cloudera & PayPal/Apache Spark,
>>> etc.) to form the basis for Apache Beam[1]. Originally, the Cloud
>>> Dataflow
>>> SDK included machinery so Dataflow pipelines could be executed on Google
>>> Cloud Dataflow.
>>>
>>> An important part of Apache Beam is the ability to execute Beam pipelines
>>> on many runners (see the capability matrix[2] for full details and
>>> support). The Beam project includes a runner for Google Cloud Dataflow,
>>> along with others, such as runners for Apache Flink and Apache Spark.
>>> We're also focused on (and excited about!) supporting and growing new
>>> runners. Because it is a separate runner, the work for supporting
>>> execution on Cloud Dataflow can be kept in that runner, apart from the
>>> larger Apache Beam effort.
>>>
>>> So, to summarize:
>>>
>>> Beam is based on work from Google Cloud Dataflow so it's definitely an
>>> evolution. Additionally, Beam includes a runner (one of many) for
>>> Google's
>>> Cloud Dataflow service.
>>>
>>> Hope that helps!
>>>
>>> James
>>>
>>> [1]: http://wiki.apache.org/incubator/BeamProposal
>>> [2]: http://beam.incubator.apache.org/capability-matrix
>>>
>>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [PROPOSAL] New Beam website design?

2016-06-16 Thread Jean-Baptiste Onofré
Good point. It makes sense to wait until it's actually implemented and
available before putting it on the website.


Thanks !
Regards
JB

On 06/17/2016 07:01 AM, Frances Perry wrote:

Good thoughts.

There'll be a section on IO that comes over as part of the programming
guide being ported from the Cloud Dataflow docs. Hopefully that has the
technical info needed. Once we see how that's structured (Devin is still
playing around with a single page or multiple), then we can decide if we
need to make it more visible. Maybe we should add a summary table to the
overview page too?

As for DSLs, I think it remains to be seen how tightly we choose to
integrate them into Beam. As we've started discussing before, we may decide
that some of them belong elsewhere because they change the user-visible
concepts, but should be discoverable from our documentation. Or others may
more closely align and just expose subsets. But in any case -- totally
agree we should add the right concepts when we cross that bridge.



On Thu, Jun 16, 2016 at 9:53 PM, Jean-Baptiste Onofré 
wrote:


Hi Frances,

great doc !

Maybe in the "Learn" section, we can also add IOs (like SDKs, and
runners), like we do in Camel (http://camel.apache.org/components.html)
For the SDKs, I would also add DSLs in the same section.

WDYT ?

Regards
JB


On 06/17/2016 12:21 AM, Frances Perry wrote:


Good point, JB -- let's redo the page layout as well.

I started with your proposal and tweaked it a bit to add in more details
and divide things a bit more according to use case (end user vs.
runner/sdk
developer):

https://docs.google.com/document/d/1-0jMv7NnYp0Ttt4voulUMwVe_qjBYeNMLm2LusYF3gQ/edit

Let me know what you think, and what part you'd like to drive! I'd suggest
we get the new section layout set this week, so we can parallelize site
design and assorted page content.

On Mon, Jun 6, 2016 at 8:38 AM, Jean-Baptiste Onofré 
wrote:

Hi James,


very good idea !

A couple of months ago, I completely revamped the Karaf website:

http://karaf.apache.org/

It could be a good skeleton in terms of sections/pages.

IMHO, for Beam, at least for the home page, we should have:
1. a clear message about what Beam is from a user perspective: why should
I use Beam and write pipelines, what's the value, etc. Runner writers or
DSL writers will find their resources, but not on the homepage (in a
dedicated section of the website).

In terms of sections, we could propose:
1.1. Overview (with the three perspectives/types of users)
1.2. Libraries: SDKs, DSLs, IOs, Runners
1.3. Documentation: Dev Guide, Samples, Runners Writer guide, ...
1.4. Community: mailing list, contribution guide, ...
1.5. Apache (link to ASF)

2. a look and feel that is clean and professional, at least for the home
page.

I would love to help here !

Regards
JB


On 06/06/2016 05:29 PM, James Malone wrote:

Hello everyone!


The current design of the Apache Beam website[1] is based on a basic
Bootstrap/Jekyll theme. While this made getting an initial site out
quickly pretty easy, the site itself is a little bland (in my opinion :).
I propose we create a new design (layout templates, color schemes, visual
design) for the Beam website.

Since the website is currently using Bootstrap and Jekyll, this should be
a relatively easy process. Getting this done will require a new design and
some CSS/HTML work. Additionally, before a design is put in place, I think
it makes sense to discuss any ideas about a future design first.

So, I think there are two open questions behind this proposal:

1. Is there anyone within the community who would be interested in
creating
a design proposal or two and sharing them with the community?
2. Are there any ideas, opinions, and thoughts around what the design of
the site *should* be?

What does everyone think?

Cheers!

James

[1]: http://beam.incubator.apache.org


--

Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [PROPOSAL] New Beam website design?

2016-06-16 Thread Frances Perry
Good thoughts.

There'll be a section on IO that comes over as part of the programming
guide being ported from the Cloud Dataflow docs. Hopefully that has the
technical info needed. Once we see how that's structured (Devin is still
playing around with a single page or multiple), then we can decide if we
need to make it more visible. Maybe we should add a summary table to the
overview page too?

As for DSLs, I think it remains to be seen how tightly we choose to
integrate them into Beam. As we've started discussing before, we may decide
that some of them belong elsewhere because they change the user-visible
concepts, but should be discoverable from our documentation. Or others may
more closely align and just expose subsets. But in any case -- totally
agree we should add the right concepts when we cross that bridge.



On Thu, Jun 16, 2016 at 9:53 PM, Jean-Baptiste Onofré 
wrote:

> Hi Frances,
>
> great doc !
>
> Maybe in the "Learn" section, we can also add IOs (like SDKs, and
> runners), like we do in Camel (http://camel.apache.org/components.html)
> For the SDKs, I would also add DSLs in the same section.
>
> WDYT ?
>
> Regards
> JB
>
>
> On 06/17/2016 12:21 AM, Frances Perry wrote:
>
>> Good point, JB -- let's redo the page layout as well.
>>
>> I started with your proposal and tweaked it a bit to add in more details
>> and divide things a bit more according to use case (end user vs.
>> runner/sdk
>> developer):
>>
>> https://docs.google.com/document/d/1-0jMv7NnYp0Ttt4voulUMwVe_qjBYeNMLm2LusYF3gQ/edit
>>
>> Let me know what you think, and what part you'd like to drive! I'd suggest
>> we get the new section layout set this week, so we can parallelize site
>> design and assorted page content.
>>
>> On Mon, Jun 6, 2016 at 8:38 AM, Jean-Baptiste Onofré 
>> wrote:
>>
>>> Hi James,
>>>
>>> very good idea !
>>>
>>> A couple of months ago, I completely revamped the Karaf website:
>>>
>>> http://karaf.apache.org/
>>>
>>> It could be a good skeleton in terms of sections/pages.
>>>
>>> IMHO, for Beam, at least for the home page, we should have:
>>> 1. a clear message about what Beam is from a user perspective: why should
>>> I use Beam and write pipelines, what's the value, etc. Runner writers or
>>> DSL writers will find their resources, but not on the homepage (in a
>>> dedicated section of the website).
>>>
>>> In terms of sections, we could propose:
>>> 1.1. Overview (with the three perspectives/types of users)
>>> 1.2. Libraries: SDKs, DSLs, IOs, Runners
>>> 1.3. Documentation: Dev Guide, Samples, Runners Writer guide, ...
>>> 1.4. Community: mailing list, contribution guide, ...
>>> 1.5. Apache (link to ASF)
>>>
>>> 2. a look and feel that is clean and professional, at least for the home
>>> page.
>>>
>>> I would love to help here !
>>>
>>> Regards
>>> JB
>>>
>>>
>>> On 06/06/2016 05:29 PM, James Malone wrote:
>>>
>>>> Hello everyone!
>>>>
>>>> The current design of the Apache Beam website[1] is based on a basic
>>>> Bootstrap/Jekyll theme. While this made getting an initial site out
>>>> quickly pretty easy, the site itself is a little bland (in my opinion :).
>>>> I propose we create a new design (layout templates, color schemes, visual
>>>> design) for the Beam website.
>>>>
>>>> Since the website is currently using Bootstrap and Jekyll, this should be
>>>> a relatively easy process. Getting this done will require a new design and
>>>> some CSS/HTML work. Additionally, before a design is put in place, I think
>>>> it makes sense to discuss any ideas about a future design first.
>>>>
>>>> So, I think there are two open questions behind this proposal:
>>>>
>>>> 1. Is there anyone within the community who would be interested in
>>>> creating a design proposal or two and sharing them with the community?
>>>> 2. Are there any ideas, opinions, and thoughts around what the design of
>>>> the site *should* be?
>>>>
>>>> What does everyone think?
>>>>
>>>> Cheers!
>>>>
>>>> James
>>>>
>>>> [1]: http://beam.incubator.apache.org
>>>>
>>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [PROPOSAL] New Beam website design?

2016-06-16 Thread Jean-Baptiste Onofré

Hi Frances,

great doc !

Maybe in the "Learn" section, we can also add IOs (like SDKs, and 
runners), like we do in Camel (http://camel.apache.org/components.html) 
For the SDKs, I would also add DSLs in the same section.


WDYT ?

Regards
JB

On 06/17/2016 12:21 AM, Frances Perry wrote:

Good point, JB -- let's redo the page layout as well.

I started with your proposal and tweaked it a bit to add in more details
and divide things a bit more according to use case (end user vs. runner/sdk
developer):
https://docs.google.com/document/d/1-0jMv7NnYp0Ttt4voulUMwVe_qjBYeNMLm2LusYF3gQ/edit

Let me know what you think, and what part you'd like to drive! I'd suggest
we get the new section layout set this week, so we can parallelize site
design and assorted page content.

On Mon, Jun 6, 2016 at 8:38 AM, Jean-Baptiste Onofré 
wrote:


Hi James,

very good idea !

A couple of months ago, I completely revamped the Karaf website:

http://karaf.apache.org/

It could be a good skeleton in terms of sections/pages.

IMHO, for Beam, at least for the home page, we should have:
1. a clear message about what Beam is from a user perspective: why should
I use Beam and write pipelines, what's the value, etc. Runner writers or
DSL writers will find their resources, but not on the homepage (in a
dedicated section of the website).

In terms of sections, we could propose:
1.1. Overview (with the three perspectives/types of users)
1.2. Libraries: SDKs, DSLs, IOs, Runners
1.3. Documentation: Dev Guide, Samples, Runners Writer guide, ...
1.4. Community: mailing list, contribution guide, ...
1.5. Apache (link to ASF)

2. a look and feel that is clean and professional, at least for the home
page.

I would love to help here !

Regards
JB


On 06/06/2016 05:29 PM, James Malone wrote:


Hello everyone!

The current design of the Apache Beam website[1] is based on a basic
Bootstrap/Jekyll theme. While this made getting an initial site out
quickly pretty easy, the site itself is a little bland (in my opinion :).
I propose we create a new design (layout templates, color schemes, visual
design) for the Beam website.

Since the website is currently using Bootstrap and Jekyll, this should be
a relatively easy process. Getting this done will require a new design and
some CSS/HTML work. Additionally, before a design is put in place, I think
it makes sense to discuss any ideas about a future design first.

So, I think there are two open questions behind this proposal:

1. Is there anyone within the community who would be interested in
creating
a design proposal or two and sharing them with the community?
2. Are there any ideas, opinions, and thoughts around what the design of
the site *should* be?

What does everyone think?

Cheers!

James

[1]: http://beam.incubator.apache.org



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [thread fork] Apache Beam & Google Cloud Dataflow

2016-06-16 Thread Jean-Baptiste Onofré

Hi,

as long as you use the Beam Dataflow runner, it should work smoothly.

Regards
JB

On 06/16/2016 10:05 PM, Ismaël Mejía wrote:

Hello,

One additional comment / question. I just noticed that Beam users can
already write their Beam pipelines and execute them with the Google
Dataflow runner.

I just did the test today and I was thrilled to confirm that it worked (as
JB told me).

You can look at the SDK version in the image:
https://imgur.com/k9HnLnv

The question is: is this some kind of beta, or is this going to be
supported during the transition (before the formal 1.0 release)? I ask
this because I suppose many current Google users hesitate to move to Beam
for the moment because they don't know that they can already run their
pipelines in the Google Cloud Dataflow service. I think it is a good idea
to encourage users to move their data processing pipelines to the Beam
version.

Regards,
Ismaël




On Wed, Jun 15, 2016 at 11:21 PM, James Malone <
jamesmal...@google.com.invalid> wrote:


Hi everyone,

This is a thread fork from the email thread titled '[dev] Announcing
0.1.0-incubating release'.

In that thread, Amir posed a good question:

Why is "Google Cloud Dataflow" still included in the Beam release if
Beam is indeed an evolution (super-set?) of "Google Cloud Dataflow"?
Thanks + regards, Amir

Many parts of Apache Beam are based on work from Google Cloud Dataflow,
including the Dataflow (now Beam) model, SDKs (Java and Python), and some
of the runners. This work was combined with awesome contributions from
other groups (data Artisans/Apache Flink, Cloudera & PayPal/Apache Spark,
etc.) to form the basis for Apache Beam[1]. Originally, the Cloud Dataflow
SDK included machinery so Dataflow pipelines could be executed on Google
Cloud Dataflow.

An important part of Apache Beam is the ability to execute Beam pipelines
on many runners (see the capability matrix[2] for full details and
support). The Beam project includes a runner for Google Cloud Dataflow,
along with others, such as runners for Apache Flink and Apache Spark. We're
also focused on (and excited about!) supporting and growing new runners.
Because it is a separate runner, the work for supporting execution on Cloud
Dataflow can be kept in that runner, apart from the larger Apache Beam
effort.

So, to summarize:

Beam is based on work from Google Cloud Dataflow so it's definitely an
evolution. Additionally, Beam includes a runner (one of many) for Google's
Cloud Dataflow service.

Hope that helps!

James

[1]: http://wiki.apache.org/incubator/BeamProposal
[2]: http://beam.incubator.apache.org/capability-matrix





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [PROPOSAL] Beam FAQ

2016-06-16 Thread Jean-Baptiste Onofré

+1

Regards
JB

On 06/16/2016 09:41 PM, James Malone wrote:

On Thu, Jun 16, 2016 at 12:37 PM, Ismaël Mejía  wrote:



1. Maybe it is a good idea to put the documentation in an iframe so the
site navigation doesn't get lost.
2. Can we create a reference link to the latest version of the
documentation? Something like
https://beam.apache.org/javadoc/latest/
This makes it easier to refer to the latest version of the docs, and it is
a common practice in other projects.



These are both awesome ideas and they should be (I propose) part of the
Beam site redesign. :)



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: newbie question about beam

2016-06-16 Thread Davor Bonaci
We are in the process of porting Cloud Dataflow documentation to Beam, so I'll
give you a mix of Dataflow and Beam links.

FilesToStage is a pipeline option [1], [2]. Super-easy to use.
Side inputs are a ParDo concept [3].

If you hit any rough edges, please let us know -- I'd be glad to help!

[1]
https://cloud.google.com/dataflow/pipelines/specifying-exec-params#setting-other-cloud-pipeline-options
[2]
https://beam.incubator.apache.org/javadoc/0.1.0-incubating/org/apache/beam/runners/dataflow/options/DataflowPipelineWorkerPoolOptions.html#getFilesToStage--
[3] https://cloud.google.com/dataflow/model/par-do#side-inputs
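
To make the side-input idea in [3] a bit more concrete, here is a minimal,
library-free sketch in plain Python. It deliberately does not call the Beam
or Dataflow SDK: `par_do_with_side_input`, `score`, and the toy `model`
dictionary are hypothetical names chosen for illustration. They stand in for
a ParDo whose function receives each main-input element together with the
same read-only side input (for example, a static, pre-trained model).

```python
from types import MappingProxyType
from typing import Callable, Iterable, List, Mapping


def par_do_with_side_input(
    elements: Iterable[str],
    fn: Callable[[str, Mapping[str, float]], float],
    side_input: Mapping[str, float],
) -> List[float]:
    """Hypothetical stand-in for a ParDo: call fn once per element,
    handing every call the same read-only side input."""
    read_only = MappingProxyType(dict(side_input))
    return [fn(element, read_only) for element in elements]


def score(text: str, model: Mapping[str, float]) -> float:
    """Toy 'model exploitation': sum the weights of the known words."""
    return sum(model.get(word, 0.0) for word in text.split())


# The static data (assumed to be known before the pipeline runs).
model = {"beam": 1.0, "runner": 0.5}

scores = par_do_with_side_input(["beam runner", "spark"], score, model)
print(scores)
```

In a real pipeline the runner decides how the side input reaches each
worker (lookup, caching, and so on); the sketch only shows the shape of the
per-element function.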

On Thu, Jun 16, 2016 at 1:40 AM, Sergio Fernández  wrote:

> Hi Davor,
>
> On Thu, Jun 16, 2016 at 3:04 AM, Davor Bonaci 
> wrote:
>
> > This is a really good question, Sergio. You got right away to the crux of
> > the problem -- how to express such pattern in the Beam model.
> >
> > The answer depends on whether the data is static, e.g., whether it is
> > known at pipeline construction time / computed in the earlier stages of
> > the pipeline, or perhaps evolving during pipeline execution. I'll give a
> > high-level answer -- feel free to share more information about your use
> > case and we can drill into specific details.
> >
>
> Well, as I said, for us it is more interesting to use Beam at processing
> time than for training purposes. In the past we have experimented a bit
> with approaches like TensorSpark, but the critical aspect is the
> exploitation of the models. Therefore we could assume the models are
> static data.
>
>
>
> > In the simplest case, Beam supports a "files to stage" concept if the
> > data is known a priori. In this case, runners will distribute the data to
> > all workers before computation starts, and your logic can depend on the
> > data being available locally on each worker.
> >
>
> Oh, cool. Something like that would be more than enough for now. Can you
> please point me to any documentation or code I could use to play with it?
>
>
> > If this is not sufficient, Beam's side inputs are the right primitive. We
> > support several access patterns for side inputs, including distributed
> > lookup and various types of caching. This can work really well,
> > particularly with a well-optimized runner.
> >
>
> Interesting... any (early) documentation (or code) about such feature?
>
>
>
> > Other alternatives typically include access to shared storage, which is
> > a lower-level approach and often requires more work.
>
>
> Sure, shared storage is always an option, but for many reasons I'd rather
> not resort to such an approach.
>
> Thanks so much for all the ideas and valuable discussions!
>
> Cheers,
>
> --
> Sergio Fernández
> Partner Technology Manager
> Redlink GmbH
> m: +43 6602747925
> e: sergio.fernan...@redlink.co
> w: http://redlink.co
>


Re: [PROPOSAL] New Beam website design?

2016-06-16 Thread Frances Perry
Good point, JB -- let's redo the page layout as well.

I started with your proposal and tweaked it a bit to add in more details
and divide things a bit more according to use case (end user vs. runner/sdk
developer):
https://docs.google.com/document/d/1-0jMv7NnYp0Ttt4voulUMwVe_qjBYeNMLm2LusYF3gQ/edit

Let me know what you think, and what part you'd like to drive! I'd suggest
we get the new section layout set this week, so we can parallelize site
design and assorted page content.

On Mon, Jun 6, 2016 at 8:38 AM, Jean-Baptiste Onofré 
wrote:

> Hi James,
>
> very good idea !
>
> A couple of months ago, I completely revamped the Karaf website:
>
> http://karaf.apache.org/
>
> It could be a good skeleton in terms of sections/pages.
>
> IMHO, for Beam, at least for the home page, we should have:
> 1. a clear message about what Beam is from a user perspective: why should
> I use Beam and write pipelines, what's the value, etc. Runner writers or
> DSL writers will find their resources, but not on the homepage (in a
> dedicated section of the website).
>
> In terms of sections, we could propose:
> 1.1. Overview (with the three perspectives/types of users)
> 1.2. Libraries: SDKs, DSLs, IOs, Runners
> 1.3. Documentation: Dev Guide, Samples, Runners Writer guide, ...
> 1.4. Community: mailing list, contribution guide, ...
> 1.5. Apache (link to ASF)
>
> 2. a look and feel that is clean and professional, at least for the home
> page.
>
> I would love to help here !
>
> Regards
> JB
>
>
> On 06/06/2016 05:29 PM, James Malone wrote:
>
>> Hello everyone!
>>
>> The current design of the Apache Beam website[1] is based on a basic
>> Bootstrap/Jekyll theme. While this made getting an initial site out
>> quickly pretty easy, the site itself is a little bland (in my opinion :).
>> I propose we create a new design (layout templates, color schemes, visual
>> design) for the Beam website.
>>
>> Since the website is currently using Bootstrap and Jekyll, this should be
>> a relatively easy process. Getting this done will require a new design and
>> some CSS/HTML work. Additionally, before a design is put in place, I think
>> it makes sense to discuss any ideas about a future design first.
>>
>> So, I think there are two open questions behind this proposal:
>>
>> 1. Is there anyone within the community who would be interested in
>> creating
>> a design proposal or two and sharing them with the community?
>> 2. Are there any ideas, opinions, and thoughts around what the design of
>> the site *should* be?
>>
>> What does everyone think?
>>
>> Cheers!
>>
>> James
>>
>> [1]: http://beam.incubator.apache.org
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>


Re: [DISCUSS] Beam data plane serialization tech

2016-06-16 Thread Kenneth Knowles
(Apologies for the formatting)

On Thu, Jun 16, 2016 at 12:12 PM, Kenneth Knowles  wrote:

> Hello everyone!
>
> We are busily working on a Runner API (for building and transmitting
> pipelines) and a Fn API (for invoking user-defined functions found within
> pipelines), as outlined in the Beam technical vision [1]. Both of these
> require a language-independent serialization technology for
> interoperability between SDKs and runners.
>
> The Fn API includes a high-bandwidth data plane where bundles are
> transmitted via some serialization/RPC envelope (inside the envelope, the
> stream of elements is encoded with a coder) to transfer bundles between
> the runner and the SDK, so performance is extremely important. There are
> many choices for high-performance serialization, and we would like to
> start the conversation about what serialization technology is best for
> Beam.
>
> The goal of this discussion is to arrive at consensus on the question:
> What serialization technology should we use for the data plane envelope
> of the Fn API?
>
> To facilitate community discussion, we looked at the available
> technologies and tried to narrow the choices based on three criteria:
>
>  - Performance: What is the size of serialized data? How do we expect the
>    technology to affect pipeline speed and cost? etc.
>
>  - Language support: Does the technology support the most widespread
>    languages for data processing? Does it have a vibrant ecosystem of
>    contributed language bindings? etc.
>
>  - Community: What is the adoption of the technology? How mature is it?
>    How active is development? How is the documentation? etc.
>
> Given these criteria, we came up with four technologies that are good
> contenders. All have similar & adequate schema capabilities.
>
>  - Apache Avro: Does not require code gen, but embedding the schema in
>    the data could be an issue. Very popular.
>
>  - Apache Thrift: Probably a bit faster and more compact than Avro. A
>    huge number of languages supported.
>
>  - Protocol Buffers 3: Incorporates the lessons that Google has learned
>    through long-term use of Protocol Buffers.
>
>  - FlatBuffers: Some benchmarks imply great performance from the
>    zero-copy mmap idea. We would need to run representative experiments.
>
> I want to emphasize that this is a community decision, and this thread is
> just the conversation starter for us all to weigh in. We just wanted to do
> some legwork to focus the discussion if we could.
>
> And there's a minor follow-up question: Once we settle here, is that
> technology also suitable for the low-bandwidth Runner API for defining
> pipelines, or does anyone think we need to consider a second technology
> (like JSON) for usability reasons?
>
> [1]
> https://docs.google.com/presentation/d/1E9seGPB_VXtY_KZP4HngDPTbsu5RVZFFaTlwEYa88Zw/present?slide=id.g108d3a202f_0_38
>
>
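
As a rough illustration of the layering described above (an envelope that
carries a bundle whose elements are encoded with a coder), here is a minimal
sketch in plain Python. Everything in it is an assumption made for
illustration: the `Envelope` class, the `bundle_id` field, and the
length-prefixed string coder are hypothetical and are not the Fn API. The
envelope itself would be expressed in whichever technology the thread
settles on; the sketch only shows that the envelope can treat the
coder-encoded element stream as opaque bytes.

```python
import struct
from dataclasses import dataclass
from typing import Iterable, List


def encode_element(element: str) -> bytes:
    """Hypothetical coder: UTF-8 payload with a 4-byte big-endian length prefix."""
    payload = element.encode("utf-8")
    return struct.pack(">I", len(payload)) + payload


def encode_bundle(elements: Iterable[str]) -> bytes:
    """Concatenate coder-encoded elements into one opaque byte stream."""
    return b"".join(encode_element(element) for element in elements)


def decode_bundle(data: bytes) -> List[str]:
    """Invert encode_bundle by reading length-prefixed elements in order."""
    elements = []
    offset = 0
    while offset < len(data):
        (length,) = struct.unpack_from(">I", data, offset)
        offset += 4
        elements.append(data[offset:offset + length].decode("utf-8"))
        offset += length
    return elements


@dataclass(frozen=True)
class Envelope:
    """Hypothetical envelope: metadata plus the opaque, coder-encoded payload."""
    bundle_id: str
    payload: bytes


bundle = ["a", "bé", ""]
envelope = Envelope(bundle_id="bundle-1", payload=encode_bundle(bundle))
round_tripped = decode_bundle(envelope.payload)
print(round_tripped == bundle)
```

The per-element cost here is the 4-byte prefix plus the payload, so the
envelope technology mainly affects framing and metadata overhead, which is
where the performance criteria above would apply.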


Re: [thread fork] Apache Beam & Google Cloud Dataflow

2016-06-16 Thread Ismaël Mejía
Hello,

One additional comment / question. I just noticed that Beam users can
already write their Beam pipelines and execute them with the Google
Dataflow runner.

I just did the test today and I was thrilled to confirm that it worked (as
JB told me).

You can look at the SDK version in the image:
https://imgur.com/k9HnLnv

The question is: is this some kind of beta, or is this going to be
supported during the transition (before the formal 1.0 release)? I ask
this because I suppose many current Google users hesitate to move to Beam
for the moment because they don't know that they can already run their
pipelines in the Google Cloud Dataflow service. I think it is a good idea
to encourage users to move their data processing pipelines to the Beam
version.

Regards,
Ismaël




On Wed, Jun 15, 2016 at 11:21 PM, James Malone <
jamesmal...@google.com.invalid> wrote:

> Hi everyone,
>
> This is a thread fork from the email thread titled '[dev] Announcing
> 0.1.0-incubating release'.
>
> In that thread, Amir posed a good question:
>
>    Why is "Google Cloud Dataflow" still included in the Beam release if
>    Beam is indeed an evolution (super-set?) of "Google Cloud Dataflow"?
>    Thanks + regards, Amir
>
> Many parts of Apache Beam are based on work from Google Cloud Dataflow,
> including the Dataflow (now Beam) model, SDKs (Java and Python), and some
> of the runners. This work was combined with awesome contributions from
> other groups (data Artisans/Apache Flink, Cloudera & PayPal/Apache Spark,
> etc.) to form the basis for Apache Beam[1]. Originally, the Cloud Dataflow
> SDK included machinery so Dataflow pipelines could be executed on Google
> Cloud Dataflow.
>
> An important part of Apache Beam is the ability to execute Beam pipelines
> on many runners (see the capability matrix[2] for full details and
> support). The Beam project includes a runner for Google Cloud Dataflow,
> along with others, such as runners for Apache Flink and Apache Spark. We're
> also focused on (and excited about!) supporting and growing new runners.
> Because it is a separate runner, the work for supporting execution on Cloud
> Dataflow can be kept in that runner, apart from the larger Apache Beam
> effort.
>
> So, to summarize:
>
> Beam is based on work from Google Cloud Dataflow so it's definitely an
> evolution. Additionally, Beam includes a runner (one of many) for Google's
> Cloud Dataflow service.
>
> Hope that helps!
>
> James
>
> [1]: http://wiki.apache.org/incubator/BeamProposal
> [2]: http://beam.incubator.apache.org/capability-matrix
>


Re: [PROPOSAL] Beam FAQ

2016-06-16 Thread James Malone
On Thu, Jun 16, 2016 at 12:37 PM, Ismaël Mejía  wrote:

>
> 1. Maybe it is a good idea to put the documentation in an iframe so the
> site navigation doesn't get lost.
> 2. Can we create a reference link to the latest version of the
> documentation? Something like
> https://beam.apache.org/javadoc/latest/
> This makes it easier to refer to the latest version of the docs, and it is
> a common practice in other projects.
>

These are both awesome ideas and they should be (I propose) part of the
Beam site redesign. :)


Re: [PROPOSAL] Beam FAQ

2016-06-16 Thread Ismaël Mejía
Hello, just for reference, in another thread it was mentioned that the Beam
FAQ idea already had a JIRA.

https://issues.apache.org/jira/browse/BEAM-161

I just saw that the javadoc is now online. Excellent ! You can find it in
the menu:
Technical Documentation -> API Reference

https://beam.apache.org/javadoc/0.1.0-incubating/

Davor (or the others), just two ideas:

1. Maybe it is a good idea to put the documentation in an iframe so the
site navigation doesn't get lost.
2. Can we create a reference link to the latest version of the
documentation? Something like
https://beam.apache.org/javadoc/latest/
This makes it easier to refer to the latest version of the docs, and it is
a common practice in other projects.

Regards,
Ismael


On Wed, Jun 1, 2016 at 2:16 AM, Davor Bonaci 
wrote:

> Javadoc publication should be a part of every release. As soon as the first
> release is complete, Javadoc will be on our website.
>
> On Sun, May 29, 2016 at 10:35 PM, Jean-Baptiste Onofré 
> wrote:
>
> > Thanks Devin,
> >
> > gonna take a look !
> >
> > Regards
> > JB
> >
> >
> > On 05/28/2016 02:20 AM, Devin Donnelly wrote:
> >
> >> The relevant file you're looking for, and the one that's constantly
> >> updated, is:
> >>
> >> /docs/programming-guide.md
> >>
> >> On Fri, May 27, 2016 at 5:20 PM, Devin Donnelly 
> >> wrote:
> >>
> >> Here's the URL of my fork, so you can see what it looks like so far:
> >>>
> >>> https://github.com/devin-donnelly/incubator-beam-site/tree/beam-pg
> >>>
> >>> On Mon, May 23, 2016 at 8:02 AM, Jean-Baptiste Onofré
> >>> wrote:
> >>>
> >>>> Agree, it would be great to have such a user guide, plus a getting
> >>>> started guide for Beam.
> >>>>
> >>>> Regards
> >>>> JB
> >>>>
> >>>> On 05/23/2016 04:41 PM, Jesse Anderson wrote:
> >>>>
> >>>>> I think Josh's Crunch User Guide is a great example of what a user
> >>>>> guide should cover. https://crunch.apache.org/user-guide.html
> >
> >>>>> On Mon, May 23, 2016 at 2:00 AM Ismaël Mejía
> >>>>> wrote:
> >>>>>
> >>>>>> OK, I agree, Davor: for end users a getting started guide is not
> >>>>>> only important but, I would say, critical at this moment. The FAQ
> >>>>>> can be an effort run in parallel. The project is incubating, so the
> >>>>>> FAQ would be in its early state, and ideally we should not need an
> >>>>>> enormous FAQ. However, this project mixes many different
> >>>>>> technologies, and I can easily imagine frequent questions about
> >>>>>> technical details of Sources, Sinks, and Runners. My question on
> >>>>>> how to reuse the context on the Spark runner is a good example: it
> >>>>>> is not general enough to be a default in the runner, and it is not
> >>>>>> simple enough for a getting started guide, but a good number of
> >>>>>> users will have to deal with it once they write tests for their
> >>>>>> pipelines.
> >>>>>>
> >>>>>> Devin, thanks for writing. I am interested in the draft; can you
> >>>>>> please share the URL of your fork, so other people can eventually
> >>>>>> take a look/contribute?
> >>>>>>
> >>>>>> Ismael
> >>
> >>
> >>
> >>>>>> On Fri, May 20, 2016 at 7:35 PM, Devin Donnelly <
> >>>>>> ddonne...@google.com.invalid> wrote:
> >>>>>>
> >>>>>>> FYI: User documentation draft (the Beam Programming Guide) is well
> >>>>>>> underway. I'm regularly pushing stuff out to a fork of the Beam
> >>>>>>> website repo if anyone wants a sneak peek.
> >>>>>>>
> >>>>>>> On May 20, 2016 9:37 AM, "Davor Bonaci" wrote:
> >>>>>>>
> >>>>>>>> We are missing a basic getting started guide along with the rest
> >>>>>>>> of the user documentation. I think we should work on this first.
> >>>>>>>>
> >>>>>>>> FAQ is a great idea for things that aren't or cannot be covered
> >>>>>>>> by those documents -- but we cannot really start that before we
> >>>>>>>> have at least a draft version of the previous.
> >>>>>>>>
> >>>>>>>> Wiki hosting would be owned by Infra, if we choose to go down
> >>>>>>>> that path at some point.
> 
> >>>>>>>> On Fri, May 20, 2016 at 1:34 AM, Jean-Baptiste Onofré <
> >>>>>>>> j...@nanthrax.net> wrote:
> >>>>>>>>
> >>>>>>>>> Hi,
> >>>>>>>>>
> >>>>>>>>> good idea for the FAQ. Not sure about the wiki: I would prefer a
> >>>>>>>>> kind of governance and review using the website.
> >>>>>>>>>
> >>>>>>>>> Regards
> >>>>>>>>> JB
> >>>>>>>>>
> >>>>>>>>> On 05/20/2016 09:24 AM, Ismaël Mejía wrote:
> >>>>>>>>>
> >>>>>>>>>> Hello,
> >>>>>>>>>>
> >>>>>>>>>> I have stumbled on some issues while trying to execute pipelines
> >>>>>>>>>> with all the different runners and I was wondering if we need to
> >>>>>>>>>> crea

[DISCUSS] Beam data plane serialization tech

2016-06-16 Thread Kenneth Knowles
Hello everyone!

We are busily working on a Runner API (for building and transmitting
pipelines) and a Fn API (for invoking user-defined functions found within
pipelines) as outlined in the Beam technical vision [1]. Both of these
require a language-independent serialization technology for interoperability
between SDKs and runners.

The Fn API includes a high-bandwidth data plane where bundles are
transmitted between the runner and the SDK via some serialization/RPC
envelope (inside the envelope, the stream of elements is encoded with a
coder), so performance is extremely important. There are many choices for
high-performance serialization, and we would like to start the conversation
about what serialization technology is best for Beam.

The goal of this discussion is to arrive at consensus on the question: What
serialization technology should we use for the data plane envelope of the Fn
API?

To facilitate community discussion, we looked at the available technologies
and tried to narrow the choices based on three criteria:

 - Performance: What is the size of serialized data? How do we expect the
   technology to affect pipeline speed and cost? etc.

 - Language support: Does the technology support the most widespread
   languages for data processing? Does it have a vibrant ecosystem of
   contributed language bindings? etc.

 - Community: What is the adoption of the technology? How mature is it? How
   active is development? How is the documentation? etc.

Given these criteria, we came up with four technologies that are good
contenders. All have similar & adequate schema capabilities.

 - Apache Avro: Does not require code gen, but embedding the schema in the
   data could be an issue. Very popular.

 - Apache Thrift: Probably a bit faster and more compact than Avro. A huge
   number of languages supported.

 - Protocol Buffers 3: Incorporates the lessons that Google has learned
   through long-term use of Protocol Buffers.

 - FlatBuffers: Some benchmarks imply great performance from the zero-copy
   mmap idea. We would need to run representative experiments.

I want to emphasize that this is a community decision, and this thread is
just the conversation starter for us all to weigh in. We just wanted to do
some legwork to focus the discussion if we could.

And there's a minor follow-up question: Once we settle here, is that
technology also suitable for the low-bandwidth Runner API for defining
pipelines, or does anyone think we need to consider a second technology
(like JSON) for usability reasons?

[1]
https://docs.google.com/presentation/d/1E9seGPB_VXtY_KZP4HngDPTbsu5RVZFFaTlwEYa88Zw/present?slide=id.g108d3a202f_0_38
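
Editorial illustration of the envelope-versus-coder split described in the
message above. This is only a minimal sketch under stated assumptions: it is
not the Fn API and not any of the four candidate technologies. The "coder"
is a hypothetical UTF-8 string coder, the "envelope" is a plain 4-byte
length prefix per element, and every function name here is invented for
the example.

```python
import struct
from typing import Iterable, Iterator


def encode_element(value: str) -> bytes:
    """Hypothetical coder: turn one element into bytes (UTF-8 here)."""
    return value.encode("utf-8")


def decode_element(data: bytes) -> str:
    """Inverse of the hypothetical coder above."""
    return data.decode("utf-8")


def pack_bundle(elements: Iterable[str]) -> bytes:
    """Envelope: write a 4-byte big-endian length before each encoded element."""
    parts = []
    for element in elements:
        payload = encode_element(element)
        parts.append(struct.pack(">I", len(payload)))
        parts.append(payload)
    return b"".join(parts)


def unpack_bundle(data: bytes) -> Iterator[str]:
    """Walk the envelope, handing each framed payload back to the coder."""
    offset = 0
    while offset < len(data):
        (length,) = struct.unpack_from(">I", data, offset)
        offset += 4
        yield decode_element(data[offset:offset + length])
        offset += length
```

The envelope never looks inside a payload, which is why the choice of
envelope technology can be discussed separately from the coders: swapping
the length prefix for an Avro, Thrift, Protocol Buffers, or FlatBuffers
message would leave encode_element and decode_element untouched.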


Re: newbie question about beam

2016-06-16 Thread Sergio Fernández
Hi Davor,

On Thu, Jun 16, 2016 at 3:04 AM, Davor Bonaci 
wrote:

> This is a really good question, Sergio. You got right away to the crux of
> the problem -- how to express such pattern in the Beam model.
>
> The answer depends whether the data is static, e.g., whether it is known at
> pipeline construction time / computed in the earlier stages of the
> pipeline, or perhaps evolving during pipeline execution. I'll give a
> high-level answer -- feel free to share more information about your use
> case and we can drill into specific details.
>

Well, as I said, for us it is more interesting to use Beam at processing
time than for training purposes. In the past we have experimented a bit with
approaches like TensorSpark, but the critical aspect is exploitation of the
models. Therefore we could assume the models are static data.



> In the simplest case, Beam supports a "files to stage" concept if the data
> is known a priori. In this case, runners will distribute the data to all
> workers before computation starts, and your logic can depend on the data
> being available locally on each worker.
>

Oh, cool. Something like that would be more than enough for now. Can you
please point me to any documentation or code I could use to play with it?


> If this is not sufficient, Beam's side inputs are the right primitive. We
> support several access patterns for side inputs, including distributed
> lookup and various types of caching. This can work really well,
> particularly with a well-optimized runner.
>

Interesting... any (early) documentation (or code) about such feature?



> Other alternatives typically include access to a shared storage, which is a
> lower-level approach and often requires more work.


Sure, shared storage is always an option, but for many reasons I'd rather
not resort to such an approach.

Thanks so much for all the ideas and valuable discussions!

Cheers,

-- 
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e: sergio.fernan...@redlink.co
w: http://redlink.co
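
Editorial illustration of the "data available locally on each worker" idea
discussed in the message above. This is a minimal sketch in plain Python
and does not use any Beam API: it assumes the runner has already placed a
model file on the worker's local disk, and the JSON model format, the file
path, and the function names are all hypothetical. The point is only that
per-element logic can load the local file once per worker process and reuse
it afterwards.

```python
import functools
import json


@functools.lru_cache(maxsize=None)
def load_model(path: str) -> dict:
    """Read the locally staged file once per process; later calls hit the cache."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def score(element: str, model_path: str) -> float:
    """Hypothetical per-element function: look the element up in the model."""
    model = load_model(model_path)
    return model.get(element, 0.0)
```

With a model of several GB the same pattern applies, although the load
would go through whatever the model library provides rather than json.load.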


Re: newbie question about beam

2016-06-16 Thread Sergio Fernández
On Wed, Jun 15, 2016 at 11:18 AM, Jean-Baptiste Onofré 
wrote:

> Not the Beam Model for sure (the Beam Model is about the pipeline design).
>
> The Beam Runner API can help there, but the final implementation is up to
> the runner itself.
>

Right. I'll take a look at the Beam Runner API documentation and experiment
a bit with it. Thanks!





> On 06/15/2016 10:18 AM, Sergio Fernández wrote:
>
>> Hi Jean-Baptiste,
>>
>> On Tue, Jun 14, 2016 at 12:45 PM, Jean-Baptiste Onofré 
>> wrote:
>>
>>>
>>> Welcome aboard, and good to discuss with you during ApacheCon.
>>>
>>>
>> It was nice to put faces to you all ;-)
>>
>>
>>> Distribution of the resources is a point related to the runner, and more
>>> specifically to the execution environment of the runner. Each
>>> runner/backend will implement its own logic.
>>>
>>>
>> Yes, I can understand. But I wonder if the Beam Model provides any
>> primitive to deal with such aspects in an abstract way. I guess I'd need
>> to go deeper into Beam to approach you with more concrete questions; so
>> for now it's fine.
>>
>>> Regarding the Python SDK, we discussed that last week: it's on the
>>> way. We should have the Python SDK very soon (we were busy with the first
>>> release).
>>>
>>
>>
>> Yep, I knew that was the plan. It's really cool to have it already in
>> master for the next release :-)
>>
>> Thanks.
>>
>>
>>
>>
>>
>>> On 06/14/2016 12:38 PM, Sergio Fernández wrote:
>>>
>>>> Hi guys,
>>>>
>>>> I'm a newbie in the Beam community, but as someone who has used DataFlow
>>>> in the past I've been following the podling since you came to the ASF.
>>>> I'm very happy to see that 0.1.0-incubating is finally going out,
>>>> congratulations on such a great milestone.
>>>>
>>>> I talked with some of you at the last ApacheCon, and for me it was
>>>> good to know the Python SDK was just a matter of time and should come
>>>> to Beam at some point. So coming back to the original plans
>>>> <http://beam.incubator.apache.org/beam/python/sdk/2016/02/25/python-sdk-now-public.html>,
>>>> do you have any timeline for bringing the Python SDK to Beam?
>>>>
>>>> I'd also like to raise a question about how Beam plans to deal with
>>>> the distribution of resources across all nodes, something I know is
>>>> not really clean with some runners (e.g., Spark). More concretely,
>>>> we're using Keras <http://keras.io/>, a deep learning Python library
>>>> that is capable of running on top of either TensorFlow or Theano.
>>>> Historically I know DataFlow and TensorFlow are not very compatible.
>>>> But I wonder if the project has already discussed how to support
>>>> running Keras (TensorFlow) tasks on Beam. For us it is more for
>>>> querying than for training, so I'd like to know if the Beam Model
>>>> could natively support the distribution of the models (sometimes
>>>> several GB).
>>>>
>>>> Thanks in advance.
>>>>
>>>> Cheers,
>>>>
>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>>
>>
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>



-- 
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e: sergio.fernan...@redlink.co
w: http://redlink.co