[VOTE] Apache Beam release 0.4.0-incubating

2016-12-20 Thread Jean-Baptiste Onofré

Hi everyone,

Please review and vote on the release candidate #3 for Apache Beam,
version 0.4.0-incubating, as follows:

[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org [2],
* all artifacts to be deployed to the Maven Central Repository [3],
* source code tag "v0.4.0-incubating-RC3" [4],
* website pull request listing the release and publishing the API 
reference manual [5].


The Apache Beam community has unanimously approved this release [6].

As customary, the vote will be open for at least 72 hours. It is adopted 
by a majority approval with at least three PMC affirmative votes. If 
approved, we will proceed with the release.


Two questions are likely to be asked, so I thought I’d provide comments 
right away:
* While we are currently in graduation discussions, I think it makes 
sense to make this incubating release anyway. It significantly improves 
the getting-started experience for our users, and we’d like to have it 
released by the time any potential announcement is made.
* I experienced a networking connectivity issue reaching 
repository.apache.org, which is further described and tracked in 
INFRA-13086. I received a tiny bit of help working around this issue 
from a previous release manager, which is visible if you carefully 
examine the signatures.


Thanks!

JB

[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12338590
[2]https://dist.apache.org/repos/dist/dev/incubator/beam/0.4.0-incubating/
[3]
https://repository.apache.org/content/repositories/orgapachebeam-1008/
[4]
https://git-wip-us.apache.org/repos/asf?p=incubator-beam.git;a=tag;h=112e38e4a68b07e6bf4916d1bdcc7ecaca8bbbd4
[5]https://github.com/apache/incubator-beam-site/pull/109
[6]https://lists.apache.org/thread.html/1408f5cd58d139ddc59dad6ed2fb94bbfb3743d33db8132ab76cd718@%3Cdev.beam.apache.org%3E




[RESULT][VOTE] Release 0.4.0-incubating, release candidate #3

2016-12-20 Thread Jean-Baptiste Onofré

Hi,

I'm happy to announce that we have unanimously approved this release.

There are 8 approving votes, 5 of which are binding:
* Kenneth Knowles
* Davor Bonaci
* Jean-Baptiste Onofré
* Dan Halperin
* Aljoscha Krettek

There are no disapproving votes.

Thanks everyone!

Regards
JB

On 12/16/2016 02:06 PM, Jean-Baptiste Onofré wrote:

Hi everyone,

Please review and vote on the release candidate #3 for the version
0.4.0-incubating, as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org [2],
* all artifacts to be deployed to the Maven Central Repository [3],
* source code tag "v0.4.0-incubating-RC3" [4],
* website pull request listing the release and publishing the API reference
manual [5].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PPMC affirmative votes.

Thanks,
Regards
JB

[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12338590

[2] https://dist.apache.org/repos/dist/dev/incubator/beam/0.4.0-incubating/
[3] https://repository.apache.org/content/repositories/orgapachebeam-1008/
[4]
https://git-wip-us.apache.org/repos/asf?p=incubator-beam.git;a=tag;h=112e38e4a68b07e6bf4916d1bdcc7ecaca8bbbd4

[5] https://github.com/apache/incubator-beam-site/pull/109


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [VOTE] Release 0.4.0-incubating, release candidate #3

2016-12-20 Thread Jean-Baptiste Onofré

Thanks for the update Dan.

I think we should move forward with this release as, as you said, we have 
important improvements compared to the 0.3.0-incubating release.
We can do a 0.4.1-incubating pretty soon to address the BigQuery IO 
issues. I volunteer to do that.


Regards
JB

On 12/19/2016 09:21 PM, Dan Halperin wrote:

I vetted the binary artifacts accompanying the release by running several
jobs on the Dataflow and Direct runners. At a high level, the release looks
fine -- I ran some of my favorite jobs and they all worked swimmingly.

There are some severe bugs in BigQueryIO in the release. Specifically, we
broke the ability to write to BigQuery using different tables for every
window. To a large degree, this makes BigQuery useless when working with
unbounded data (streaming pipelines). The bugs have been fixed (and
accompanying tests added) in PRs #1651 and #1400.
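
For context, a minimal sketch of the per-window table routing Dan refers
to, assuming the 0.4.0-era BigQueryIO.Write.to(SerializableFunction)
overload; the project, dataset, and table naming scheme are illustrative:

import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.transforms.SerializableFunction;
import org.apache.beam.sdk.transforms.windowing.BoundedWindow;
import org.apache.beam.sdk.values.PCollection;

class PerWindowBigQueryWrite {
  // Write each window of rows to its own table, named after the
  // window's end timestamp.
  static void apply(PCollection<TableRow> windowedRows, TableSchema schema) {
    windowedRows.apply(BigQueryIO.Write
        .to(new SerializableFunction<BoundedWindow, String>() {
          @Override
          public String apply(BoundedWindow window) {
            return "my-project:logs.events_" + window.maxTimestamp().getMillis();
          }
        })
        .withSchema(schema)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
  }
}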

Conclusion: +0.8

* 0.4.0-incubating RC3 is largely an improvement over 0.3.0-incubating,
especially in the user getting started experience.
* The bugs in BigQueryIO are blockers for BigQuery users, but this is
likely a relatively small fraction of the Beam community. I would not
retract RC3 based on this alone. Unless we plan to cut an RC4 for other
reasons, we should move forward with RC3.

I'd hope that we hear from key users of the Apex, Flink, and Spark runners
before closing the vote, even though it's technically been 72+ hours. I
suggest we wait to ensure they have an opportunity to chime in.

Thanks,
Dan


Appendix: pom.xml changes to use binary releases from Apache Staging:

  <repositories>
    <repository>
      <id>apache.staging</id>
      <name>Apache Development Staging Repository</name>
      <url>https://repository.apache.org/content/repositories/staging/</url>
      <releases>
        <enabled>true</enabled>
      </releases>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </repository>
  </repositories>

On Sun, Dec 18, 2016 at 10:14 PM, Jean-Baptiste Onofré 
wrote:


Hi guys,

The good thing is that my issue accessing the repository.apache.org Nexus is
now fixed.

To update the signature files, we have to drop the Nexus repository to
stage a new one, meaning cancelling the current vote and cutting a new RC4.

I can do that, up to you.

Anyway, regarding the release content, +1 (binding).

Regards
JB


On 12/18/2016 06:56 PM, Davor Bonaci wrote:


Indeed -- I did help JB with the release ever so slightly, due to the
networking connectivity issue reaching repository.apache.org, which JB
further described and is tracked in INFRA-13086 [1]. This is not
Beam-specific.

The current signature shouldn't be a problem at all, but, since others are
asking about it, I think it would be best to simply re-sign the source
.zip archive and continue this vote. JB, what do you think?

Regarding the release itself, I think we need to keep raising the quality
and maturity release-over-release, and test signals are an excellent way
to
demonstrate that. Due to the recent upgrades to Jenkins, usage of the DSL,
etc. (thanks INFRA and Jason Kuster), we can now, for the first time,
formally show that the release candidate clearly passes all Jenkins suites
that we have:
* All unit tests across the project, plus example ITs across all runners
[2], [3].
* All integration tests on the Apex runner [4].
* All integration tests on the Flink runner [5].
* All integration tests on the Spark runner [6].
* All integration tests on the Dataflow runner [7].

That said, I know of a few issues/regressions in the areas that are not
well tested today. I think Dan Halperin has more context, so I'll let him
speak of the details, and quote relevant JIRA issues.

With the known issues in 0.3.0-incubating, such as trouble running
examples
out-of-the-box, I think this release candidate is a clear win. Of course,
that may change if more issues are discovered.

For me, this release candidate is +1 (at this time), contingent upon no
known major issues affecting Apex, Flink and Spark runners.

Davor

[1] https://issues.apache.org/jira/browse/INFRA-13086
[2]
https://builds.apache.org/view/Beam/job/beam_PreCommit_Java_MavenInstall/5994/
[3]
https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_MavenInstall/2116/
[4]
https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_RunnableOnService_Apex/10/
[5]
https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_RunnableOnService_Flink/1120/
[6]
https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_RunnableOnService_Spark/430/
[7]
https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_RunnableOnService_Dataflow/1830/


On Sat, Dec 17, 2016 at 4:13 PM, Kenneth Knowles 
wrote:

+1, as long as it is fine for the release to be signed by a PMC member
other than the release manager. Otherwise we need to replace the .asc file.

Following [Apache release checklist](
http://incubator.apache.org/guides/releasemanagement.html#check-list):

1.1 Verified checksums & signature (Davor's)
2.1 Ran unit tests and integration tests
3.1 DISCLAIMER is correct
3.2 LICENSE & NOTICE are correct
3.3 Files have license header

Re: [VOTE] Release 0.4.0-incubating, release candidate #3

2016-12-18 Thread Jean-Baptiste Onofré

Hi guys,

The good thing is that my issue accessing the repository.apache.org Nexus is now 
fixed.

To update the signature files, we have to drop the Nexus repository to stage a 
new one, meaning cancelling the current vote and cutting a new RC4.

I can do that, up to you.

Anyway, regarding the release content, +1 (binding).

Regards
JB

On 12/18/2016 06:56 PM, Davor Bonaci wrote:

Indeed -- I did help JB with the release ever so slightly, due to the
networking connectivity issue reaching repository.apache.org, which JB
further described and is tracked in INFRA-13086 [1]. This is not
Beam-specific.

The current signature shouldn't be a problem at all, but, since others are
asking about it, I think it would be best to simply re-sign the source
.zip archive and continue this vote. JB, what do you think?

Regarding the release itself, I think we need to keep raising the quality
and maturity release-over-release, and test signals are an excellent way to
demonstrate that. Due to the recent upgrades to Jenkins, usage of the DSL,
etc. (thanks INFRA and Jason Kuster), we can now, for the first time,
formally show that the release candidate clearly passes all Jenkins suites
that we have:
* All unit tests across the project, plus example ITs across all runners
[2], [3].
* All integration tests on the Apex runner [4].
* All integration tests on the Flink runner [5].
* All integration tests on the Spark runner [6].
* All integration tests on the Dataflow runner [7].

That said, I know of a few issues/regressions in the areas that are not
well tested today. I think Dan Halperin has more context, so I'll let him
speak of the details, and quote relevant JIRA issues.

With the known issues in 0.3.0-incubating, such as trouble running examples
out-of-the-box, I think this release candidate is a clear win. Of course,
that may change if more issues are discovered.

For me, this release candidate is +1 (at this time), contingent upon no
known major issues affecting Apex, Flink and Spark runners.

Davor

[1] https://issues.apache.org/jira/browse/INFRA-13086
[2]
https://builds.apache.org/view/Beam/job/beam_PreCommit_Java_MavenInstall/5994/
[3]
https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_MavenInstall/2116/
[4]
https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_RunnableOnService_Apex/10/
[5]
https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_RunnableOnService_Flink/1120/
[6]
https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_RunnableOnService_Spark/430/
[7]
https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_RunnableOnService_Dataflow/1830/


On Sat, Dec 17, 2016 at 4:13 PM, Kenneth Knowles 
wrote:


+1, as long as it is fine for the release to be signed by a PMC member
other than the release manager. Otherwise we need to replace the .asc file.

Following [Apache release checklist](
http://incubator.apache.org/guides/releasemanagement.html#check-list):

1.1 Verified checksums & signature (Davor's)
2.1 Ran unit tests and integration tests
3.1 DISCLAIMER is correct
3.2 LICENSE & NOTICE are correct
3.3 Files have license headers (RAT & checkstyle)
3.4 Provenance is clear
3.5 Dependency licenses are legal (RAT) [2]
3.6 Release contains source code, no binaries

Additionally:

 - Went over the generated javadoc (filed tickets but no release blockers)
 - Went over the generated release notes
 - Sanity checked the Maven Central artifacts
 - Confirmed that the git tag matches
 - Checked the website PR

I heartily agree that the components would give much better context on
tickets. Even with that, our JIRA titles could use a lot of improvement.


On Fri, Dec 16, 2016 at 5:06 AM, Jean-Baptiste Onofré 
wrote:


Hi everyone,

Please review and vote on the release candidate #3 for the version
0.4.0-incubating, as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org [2],
* all artifacts to be deployed to the Maven Central Repository [3],
* source code tag "v0.4.0-incubating-RC3" [4],
* website pull request listing the release and publishing the API reference
manual [5].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PPMC affirmative votes.

Thanks,
Regards
JB

[1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12338590
[2] https://dist.apache.org/repos/dist/dev/incubator/beam/0.4.0-incubating/
[3] https://repository.apache.org/content/repositories/orgapachebeam-1008/
[4] https://git-wip-us.apache.org/repos/asf?p=incubator-beam.git;a=tag;h=112e38e4a68b07e6bf4916d1bdcc7ecaca8bbbd4
[5] https://github.com/apache/incubator-beam-site/pull/109







--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [component] tag in JIRA tickets

2016-12-16 Thread Jean-Baptiste Onofré

Let me check. In any case, I can grant you the required permission.

Regards
JB

On 12/16/2016 03:59 PM, Amit Sela wrote:

Yup, looks like a custom velocity schema is the way to go here.
@JB would I require any additional permissions for this ? I don't mind
playing around with this.

On Thu, Dec 15, 2016 at 10:13 PM Jean-Baptiste Onofré 
wrote:


+1 Dan, it's exactly what I had in mind. It would be great to 
experiment a bit around that.

Regards
JB

On 12/15/2016 05:35 PM, Dan Halperin wrote:

Amit, I think you bring up a wonderful point. Release notes are hard to
grok right now.

I wonder if we can expose the component name (which issues are already
tagged with) in a custom release notes template? https://developer.atlassian.com/jiradev/jira-platform/jira-architecture/jira-templates-and-jsps/creating-a-custom-release-notes-template-containing-release-comments

On Thu, Dec 15, 2016 at 6:15 AM, Jean-Baptiste Onofré 
wrote:


Yes, agree. We had a kind of similar discussion a while ago:
"java-sdk-extension" vs "io" afair ;)

Regards
JB


On 12/15/2016 03:12 PM, Amit Sela wrote:


If we can do this via JIRA component, even better, but then we would need
to work on components.

On Thu, Dec 15, 2016 at 3:45 PM Jean-Baptiste Onofré 
wrote:

Hi Amit,


interesting idea, even if it's redundant with the Jira component.
However, the Jira component is also generic (for instance,
java-sdk-extension is for both extensions and IOs).

If possible, I would rather work on the component (and customize the
Release Notes output to include it).

Regards
JB

On 12/15/2016 02:30 PM, Amit Sela wrote:


I took a look at the release notes for 0.4.0-incubating now and I felt
like it could have been "tagged" in a way that helps people focus on
what's interesting to them.
Currently, all resolved issues simply appear as they are in JIRA, but we
don't have any way to tag them.

What if we were to prefix the issue title with the component, examples:

[runners-spark] fixed-some Spark runner issue
[SDK] added state and timers API
[IO] added HBaseIO support
...

This would be more readable, and allow users to focus by looking for
what's interesting to them in a release (CTRL/CMD + F in the browser..)

Thoughts ?



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


[VOTE] Release 0.4.0-incubating, release candidate #3

2016-12-16 Thread Jean-Baptiste Onofré

Hi everyone,

Please review and vote on the release candidate #3 for the version
0.4.0-incubating, as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org [2],
* all artifacts to be deployed to the Maven Central Repository [3],
* source code tag "v0.4.0-incubating-RC3" [4],
* website pull request listing the release and publishing the API reference
manual [5].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PPMC affirmative votes.

Thanks,
Regards
JB

[1] 
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12338590
[2] https://dist.apache.org/repos/dist/dev/incubator/beam/0.4.0-incubating/
[3] https://repository.apache.org/content/repositories/orgapachebeam-1008/
[4] 
https://git-wip-us.apache.org/repos/asf?p=incubator-beam.git;a=tag;h=112e38e4a68b07e6bf4916d1bdcc7ecaca8bbbd4
[5] https://github.com/apache/incubator-beam-site/pull/109


[CANCEL][VOTE] Release 0.4.0-incubating, release candidate #1

2016-12-15 Thread Jean-Baptiste Onofré

Hi guys,

regarding the issues found (and actually already fixed on the release 
branch), I cancel this vote.


I will submit a RC2 soon.

Thanks,
Regards
JB

On 12/15/2016 01:46 PM, Jean-Baptiste Onofré wrote:

Hi everyone,

Please review and vote on the release candidate #1 for the version
0.4.0-incubating, as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org [2],
* all artifacts to be deployed to the Maven Central Repository [3],
* source code tag "v0.4.0-incubating-RC1" [4],
* website pull request listing the release and publishing the API reference
manual [5].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PPMC affirmative votes.

Thanks,
Regards
JB

[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12338590

[2] https://dist.apache.org/repos/dist/dev/incubator/beam/0.4.0-incubating/
[3] https://repository.apache.org/content/repositories/orgapachebeam-1006/
[4]
https://git-wip-us.apache.org/repos/asf?p=incubator-beam.git;a=tag;h=85d1c8a2f85bbc667c90f55ff0eb27de5c2446a6

[5] https://github.com/apache/incubator-beam-site/pull/109


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [VOTE] Release 0.4.0-incubating, release candidate #1

2016-12-15 Thread Jean-Baptiste Onofré

That's the plan, I will cancel the RC#1 vote and start a RC#2.

Regards
JB

On 12/15/2016 09:27 PM, Dan Halperin wrote:

I think JB and Davor are right, have heard other supporting votes, and I
haven't heard any specific disagreement.

My understanding of rules/procedure is that JB as release manager is free
to cancel the vote right now and begin RC2 when he is ready.

On Thu, Dec 15, 2016 at 11:36 AM, Jean-Baptiste Onofré 
wrote:


-1 (binding)

I would include the fix for metrics, impacting the Dataflow and Flink
runners.

I agree with Davor: I would prefer to cut a RC2.

Regards
JB

On Dec 15, 2016, at 20:06, Kenneth Knowles wrote:

Agreed. I had thought the issue in PR #1620 only affected Dataflow (in
which case we could address it in the service) but it now also affects
the Flink runner, so it should be included in the release.

On Thu, Dec 15, 2016 at 10:46 AM, Eugene Kirpichov <
kirpic...@google.com.invalid> wrote:


There is one more data-loss type error, a fix for which should go into
the release.
https://github.com/apache/incubator-beam/pull/1620

On Thu, Dec 15, 2016 at 10:42 AM Davor Bonaci 

wrote:



I think we should build another RC.

Two issues:
* Metrics issue that JB pointed out earlier. It seems to cause a somewhat
poor user experience for every pipeline executed on the Direct runner.
(Thanks JB for finding this out!)
* Failure of testSideInputsWithMultipleWindows in Jenkins [1].

Both issues seem easy, trivial, non-risky fixes that are already committed
to master. I'd suggest just taking them.

Davor

[1]
https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_RunnableOnService_Dataflow/1819/


On Thu, Dec 15, 2016 at 8:45 AM, Ismaël Mejía 

wrote:



+1 (non-binding)

- verified signatures + checksums
- run mvn clean verify -Prelease, all artifacts+tests run smoothly

The release artifacts are signed with the key with fingerprint 8F0D334F
https://dist.apache.org/repos/dist/release/incubator/beam/KEYS

I just created a JIRA to add the signer/KEYS information in the release
template, I will do a PR for this later on.

Ismaël

On Thu, Dec 15, 2016 at 2:26 PM, Jean-Baptiste Onofré
wrote:


Hi Amit,

thanks for the update.

As you changed the Jira, the Release Notes are now up to date.

Regards
JB


On 12/15/2016 02:20 PM, Amit Sela wrote:


I see three problems in the release notes (related to Spark runner):

Improvement:

[BEAM-757] - The SparkRunner should utilize the SDK's DoFnRunner instead
of writing its own.

[BEAM-807] - [SparkRunner] Replace OldDoFn with DoFn

[BEAM-855] - Remove the need for --streaming option in the spark runner

BEAM-855 is a duplicate and probably shouldn't have had a Fix Version.

The other two are not a part of this release - I was probably too eager
to mark them fixed after merge and I accidentally put 0.4.0 as the Fix
Version.

I made the changes in JIRA now.

Thanks,
Amit

On Thu, Dec 15, 2016 at 3:09 PM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

Reviewing and testing the release, I see:

16/12/15 14:04:47 ERROR MetricsContainer: Unable to update metrics on
the current thread. Most likely caused by using metrics outside the
managed work-execution thread.

It doesn't block the execution of the pipeline, but basically, it means
that metrics don't work anymore.

I'm investigating.

Regards
JB

On 12/15/2016 01:46 PM, Jean-Baptiste Onofré wrote:


Hi everyone,

Please review and vote on the release candidate #1 for the version
0.4.0-incubating, as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org [2],
* all artifacts to be deployed to the Maven Central Repository [3],
* source code tag "v0.4.0-incubating-RC1" [4],
* website pull request listing the release and publishing the API reference
manual [5].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PPMC affirmative votes.

Thanks,
Regards
JB

[1] https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12338590
[2] https://dist.apache.org/repos/dist/dev/incubator/beam/0.4.0-incubating/
[3] https://repository.apache.org/content/repositories/orgapachebeam-1006/
[4] https://git-wip-us.apache.org/repos/asf?p=incubator-beam.git;a=tag;h=85d1c8a2f85bbc667c90f55ff0eb27de5c2446a6
[5] https://github.com/apache/incubator-beam-site/pull/109



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com













--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [component] tag in JIRA tickets

2016-12-15 Thread Jean-Baptiste Onofré
+1 Dan, it's exactly what I had in mind. It would be great to 
experiment a bit around that.


Regards
JB

On 12/15/2016 05:35 PM, Dan Halperin wrote:

Amit, I think you bring up a wonderful point. Release notes are hard to
grok right now.

I wonder if we can expose the component name (which issues are already
tagged with) in a custom release notes template? https://developer.atlassian.com/jiradev/jira-platform/jira-architecture/jira-templates-and-jsps/creating-a-custom-release-notes-template-containing-release-comments

On Thu, Dec 15, 2016 at 6:15 AM, Jean-Baptiste Onofré 
wrote:


Yes, agree. We had a kind of similar discussion a while ago:
"java-sdk-extension" vs "io" afair ;)

Regards
JB


On 12/15/2016 03:12 PM, Amit Sela wrote:


If we can do this via JIRA component, even better, but then we would need
to work on components.

On Thu, Dec 15, 2016 at 3:45 PM Jean-Baptiste Onofré 
wrote:

Hi Amit,


interesting idea, even if it's redundant with the Jira component.
However, the Jira component is also generic (for instance,
java-sdk-extension is for both extensions and IOs).

If possible, I would rather work on the component (and customize the
Release Notes output to include it).

Regards
JB

On 12/15/2016 02:30 PM, Amit Sela wrote:


I took a look at the release notes for 0.4.0-incubating now and I felt like
it could have been "tagged" in a way that helps people focus on what's
interesting to them.
Currently, all resolved issues simply appear as they are in JIRA, but we
don't have any way to tag them.

What if we were to prefix the issue title with the component, examples:
[runners-spark] fixed-some Spark runner issue
[SDK] added state and timers API
[IO] added HBaseIO support
...

This would be more readable, and allow users to focus by looking for what's
interesting to them in a release (CTRL/CMD + F in the browser..)

Thoughts ?



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [VOTE] Release 0.4.0-incubating, release candidate #1

2016-12-15 Thread Jean-Baptiste Onofré
-1 (binding)

I would include the fix for metrics, impacting the Dataflow and Flink runners.

I agree with Davor: I would prefer to cut a RC2.

Regards
JB

On Dec 15, 2016, 20:06, at 20:06, Kenneth Knowles  
wrote:
>Agreed. I had thought the issue in PR #1620 only affected Dataflow (in
>which
>case we could address it in the service) but it now also affects the
>Flink
>runner, so it should be included in the release.
>
>On Thu, Dec 15, 2016 at 10:46 AM, Eugene Kirpichov <
>kirpic...@google.com.invalid> wrote:
>
>> There is one more data-loss type error, a fix for which should go
>into the
>> release.
>> https://github.com/apache/incubator-beam/pull/1620
>>
>> On Thu, Dec 15, 2016 at 10:42 AM Davor Bonaci 
>wrote:
>>
>> > I think we should build another RC.
>> >
>> > Two issues:
>> > * Metrics issue that JB pointed out earlier. It seems to cause a
>somewhat
>> > poor user experience for every pipeline executed on the Direct
>runner.
>> > (Thanks JB for finding this out!)
>> > * Failure of testSideInputsWithMultipleWindows in Jenkins [1].
>> >
>> > Both issues seem easy, trivial, non-risky fixes that are already
>> committed
>> > to master. I'd suggest just taking them.
>> >
>> > Davor
>> >
>> > [1]
>> >
>> > > https://builds.apache.org/view/Beam/job/beam_PostCommit_Java_RunnableOnService_Dataflow/1819/
>> >
>> > On Thu, Dec 15, 2016 at 8:45 AM, Ismaël Mejía 
>wrote:
>> >
>> > > +1 (non-binding)
>> > >
>> > > - verified signatures + checksums
>> > > - run mvn clean verify -Prelease, all artifacts+tests run
>smoothly
>> > >
>> > > The release artifacts are signed with the key with fingerprint
>8F0D334F
>> > > https://dist.apache.org/repos/dist/release/incubator/beam/KEYS
>> > >
>> > > I just created a JIRA to add the signer/KEYS information in the
>release
>> > > template, I will do a PR for this later on.
>> > >
>> > > Ismaël
>> > >
>> > > On Thu, Dec 15, 2016 at 2:26 PM, Jean-Baptiste Onofré
>> >
>> > > wrote:
>> > >
>> > > > Hi Amit,
>> > > >
>> > > > thanks for the update.
>> > > >
>> > > > As you changed the Jira, the Release Notes are now up to date.
>> > > >
>> > > > Regards
>> > > > JB
>> > > >
>> > > >
>> > > > On 12/15/2016 02:20 PM, Amit Sela wrote:
>> > > >
>> > > >> I see three problems in the release notes (related to Spark
>runner):
>> > > >>
>> > > >> Improvement:
>> > > >> 
>> > > >> [BEAM-757] - The SparkRunner should utilize the SDK's
>DoFnRunner
>> > instead
>> > > >> of
>> > > >> writing its own.
>> > > >> 
>> > > >> [BEAM-807] - [SparkRunner] Replace OldDoFn with DoFn
>> > > >> 
>> > > >> [BEAM-855] - Remove the need for --streaming option in the
>spark
>> > runner
>> > > >>
>> > > >> BEAM-855 is a duplicate and probably shouldn't have had a Fix
>Version.
>> > > >>
>> > > >> The other two are not a part of this release - I was probably
>too
>> > eager
>> > > to
>> > > >> mark them fixed after merge and I accidentally put 0.4.0 as
>the Fix
>> > > >> Version.
>> > > >>
>> > > >> I made the changes in JIRA now.
>> > > >>
>> > > >> Thanks,
>> > > >> Amit
>> > > >>
>> > > >> On Thu, Dec 15, 2016 at 3:09 PM Jean-Baptiste Onofré <
>> j...@nanthrax.net
>> > >
>> > > >> wrote:
>> > > >>
>> > > >> Reviewing and testing the release, I see:
>> > > >>>
>> > > >>> 16/12/15 14:04:47 ERROR MetricsContainer: Unable to update
>metrics
>> on
>> > > >>> the current thread. Most likely caused by using metrics
>outside the
>> > > >>> managed work-execution thread.
>> > > >>>
>> > > >>> It doesn't block the execution of the pipeline, but
>basically, it
>> > means
>> > > >>> that metrics don't work anymore.

Re: [component] tag in JIRA tickets

2016-12-15 Thread Jean-Baptiste Onofré
Yes, agree. We had a kind of similar discussion a while ago:
"java-sdk-extension" vs "io" afair ;)


Regards
JB

On 12/15/2016 03:12 PM, Amit Sela wrote:

If we can do this via JIRA component, even better, but then we would need
to work on components.

On Thu, Dec 15, 2016 at 3:45 PM Jean-Baptiste Onofré 
wrote:


Hi Amit,

interesting idea, even if it's redundant with the Jira component.
However, the Jira component is also generic (for instance,
java-sdk-extension is for both extensions and IOs).

If possible, I would rather work on the component (and customize the
Release Notes output to include it).

Regards
JB

On 12/15/2016 02:30 PM, Amit Sela wrote:

I took a look at the release notes for 0.4.0-incubating now and I felt like
it could have been "tagged" in a way that helps people focus on what's
interesting to them
Currently, all resolved issues simply appear as they are in JIRA, but we
don't have any way to tag them.

What if we were to prefix the issue title with the component, examples:
[runners-spark] fixed-some Spark runner issue
[SDK] added state and timers API
[IO] added HBaseIO support
...

This would be more readable, and allow users to focus by looking for what's
interesting to them in a release (CTRL/CMD + F in the browser..)

Thoughts ?



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [component] tag in JIRA tickets

2016-12-15 Thread Jean-Baptiste Onofré

Hi Amit,

interesting idea, even if it's redundant with the Jira component. 
However, the Jira component is also generic (for instance, 
java-sdk-extension is for both extensions and IOs).


If possible, I would rather work on the 
Release Notes output to include it).


Regards
JB

On 12/15/2016 02:30 PM, Amit Sela wrote:

I took a look at the release notes for 0.4.0-incubating now and I felt like
it could have been "tagged" in a way that helps people focus on what's
interesting to them
Currently, all resolved issues simply appear as they are in JIRA, but we
don't have any way to tag them.

What if we were to prefix the issue title with the component, examples:
[runners-spark] fixed-some Spark runner issue
[SDK] added state and timers API
[IO] added HBaseIO support
...

This would be more readable, and allow users to focus by looking for what's
interesting to them in a release (CTRL/CMD + F in the browser..)

Thoughts ?



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [VOTE] Release 0.4.0-incubating, release candidate #1

2016-12-15 Thread Jean-Baptiste Onofré

Hi Amit,

thanks for the update.

As you changed the Jira, the Release Notes are now up to date.

Regards
JB

On 12/15/2016 02:20 PM, Amit Sela wrote:

I see three problems in the release notes (related to Spark runner):

Improvement:

[BEAM-757] - The SparkRunner should utilize the SDK's DoFnRunner instead of
writing its own.

[BEAM-807] - [SparkRunner] Replace OldDoFn with DoFn

[BEAM-855] - Remove the need for --streaming option in the spark runner

BEAM-855 is a duplicate and probably shouldn't have had a Fix Version.

The other two are not a part of this release - I was probably too eager to
mark them fixed after merge and I accidentally put 0.4.0 as the Fix Version.

I made the changes in JIRA now.

Thanks,
Amit

On Thu, Dec 15, 2016 at 3:09 PM Jean-Baptiste Onofré 
wrote:


Reviewing and testing the release, I see:

16/12/15 14:04:47 ERROR MetricsContainer: Unable to update metrics on
the current thread. Most likely caused by using metrics outside the
managed work-execution thread.

It doesn't block the execution of the pipeline, but basically, it means
that metrics don't work anymore.

I'm investigating.

Regards
JB

On 12/15/2016 01:46 PM, Jean-Baptiste Onofré wrote:

Hi everyone,

Please review and vote on the release candidate #1 for the version
0.4.0-incubating, as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org [2],

* all artifacts to be deployed to the Maven Central Repository [3],
* source code tag "v0.4.0-incubating-RC1" [4],
* website pull request listing the release and publishing the API reference
manual [5].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PPMC affirmative votes.

Thanks,
Regards
JB

[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12338590
[2]
https://dist.apache.org/repos/dist/dev/incubator/beam/0.4.0-incubating/
[3]
https://repository.apache.org/content/repositories/orgapachebeam-1006/
[4]
https://git-wip-us.apache.org/repos/asf?p=incubator-beam.git;a=tag;h=85d1c8a2f85bbc667c90f55ff0eb27de5c2446a6
[5] https://github.com/apache/incubator-beam-site/pull/109


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [VOTE] Release 0.4.0-incubating, release candidate #1

2016-12-15 Thread Jean-Baptiste Onofré

Reviewing and testing the release, I see:

16/12/15 14:04:47 ERROR MetricsContainer: Unable to update metrics on 
the current thread. Most likely caused by using metrics outside the 
managed work-execution thread.


It doesn't block the execution of the pipeline, but basically, it means 
that metrics don't work anymore.


I'm investigating.

Regards
JB
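
For readers hitting the same log line: a minimal sketch of user code
exercising the Metrics API as it stood around 0.4.0 (the DoFn and counter
name are illustrative); the error above is logged when such updates run
outside the runner-managed work-execution thread.

import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

// Illustrative DoFn that updates a counter per element. If the update
// runs outside the managed work-execution thread, the MetricsContainer
// error above is logged and the update is dropped.
class CountingFn extends DoFn<String, String> {
  private final Counter elements = Metrics.counter(CountingFn.class, "elements");

  @ProcessElement
  public void processElement(ProcessContext c) {
    elements.inc();
    c.output(c.element());
  }
}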

On 12/15/2016 01:46 PM, Jean-Baptiste Onofré wrote:

Hi everyone,

Please review and vote on the release candidate #1 for the version
0.4.0-incubating, as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org [2],
* all artifacts to be deployed to the Maven Central Repository [3],
* source code tag "v0.4.0-incubating-RC1" [4],
* website pull request listing the release and publishing the API reference
manual [5].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PPMC affirmative votes.

Thanks,
Regards
JB

[1]
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12338590

[2] https://dist.apache.org/repos/dist/dev/incubator/beam/0.4.0-incubating/
[3] https://repository.apache.org/content/repositories/orgapachebeam-1006/
[4]
https://git-wip-us.apache.org/repos/asf?p=incubator-beam.git;a=tag;h=85d1c8a2f85bbc667c90f55ff0eb27de5c2446a6

[5] https://github.com/apache/incubator-beam-site/pull/109


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


[VOTE] Release 0.4.0-incubating, release candidate #1

2016-12-15 Thread Jean-Baptiste Onofré

Hi everyone,

Please review and vote on the release candidate #1 for the version
0.4.0-incubating, as follows:
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

The complete staging area is available for your review, which includes:
* JIRA release notes [1],
* the official Apache source release to be deployed to dist.apache.org [2],
* all artifacts to be deployed to the Maven Central Repository [3],
* source code tag "v0.4.0-incubating-RC1" [4],
* website pull request listing the release and publishing the API reference
manual [5].

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PPMC affirmative votes.

Thanks,
Regards
JB

[1] 
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12338590
[2] https://dist.apache.org/repos/dist/dev/incubator/beam/0.4.0-incubating/
[3] https://repository.apache.org/content/repositories/orgapachebeam-1006/
[4] 
https://git-wip-us.apache.org/repos/asf?p=incubator-beam.git;a=tag;h=85d1c8a2f85bbc667c90f55ff0eb27de5c2446a6
[5] https://github.com/apache/incubator-beam-site/pull/109


Re: HiveIO

2016-12-14 Thread Jean-Baptiste Onofré

Hi Tim,

I pushed the HBaseIO on github, I will do the same later today for 
HiveIO. I will let you know.


Thanks !

Regards
JB

On 12/15/2016 02:39 AM, Tim Taschke wrote:

Great to see that there is progress on this!

Like written on the user mailing list, I would also be interested in
contributing to this.

On Thu, Dec 15, 2016 at 4:44 AM, Ismaël Mejía  wrote:

For ref, I just created a JIRA so people can track the progress/contribute
to the progress of HiveIO.

https://issues.apache.org/jira/browse/BEAM-1158

On Wed, Dec 7, 2016 at 5:39 PM, Jean-Baptiste Onofré 
wrote:


Yes that's the first idea ;)

Regards
JB

On Dec 7, 2016, 17:27, at 17:27, Vinoth Chandar  wrote:

Interesting. So all the planning & execution is done by Hive, and Beam will
process the results of the query?

On Wed, Dec 7, 2016 at 8:24 AM, Jean-Baptiste Onofré 
wrote:


Hi⁣

The HiveIO will directly use the native API and HiveQL. That's the plan on
which we are working right now.

Regards
JB

On Dec 7, 2016, 17:18, at 17:18, Vinoth Chandar 

wrote:

Hi,

I am not looking for a way to actually execute the query on Hive. I would
like to do something similar to Spark SQL/HiveContext, but with Beam. Just
have a HiveIO that reads metadata from the Hive metastore, and then later
use a Spark runner to execute the query. So, HiveJDBC is not an option I
would like to pursue. Thanks for the pointer, though!

And does the HiveIO that is being planned work similarly as above?

Thanks
Vinoth



On Tue, Dec 6, 2016 at 4:55 AM, Ismaël Mejía 

wrote:



Hello,

If you really need to read/write via Hive, remember that you can use the
Hive Jdbc driver, and achieve this with Beam using the JdbcIO (this is
probably less efficient for the streaming case but still a valid solution).

Ismaël
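
A rough sketch of this suggestion, reading Hive through its JDBC driver
with JdbcIO, assuming the JdbcIO builder API of that era; the connection
URL, query, and class names are illustrative.

import java.sql.ResultSet;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.values.PCollection;

class HiveOverJdbc {
  // Read one String column from Hive via the HiveServer2 JDBC driver.
  static PCollection<String> readNames(Pipeline pipeline) {
    return pipeline.apply(JdbcIO.<String>read()
        .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
            "org.apache.hive.jdbc.HiveDriver",       // Hive JDBC driver class
            "jdbc:hive2://localhost:10000/default")) // illustrative URL
        .withQuery("SELECT name FROM users")
        .withRowMapper(new JdbcIO.RowMapper<String>() {
          @Override
          public String mapRow(ResultSet resultSet) throws Exception {
            return resultSet.getString(1);
          }
        })
        .withCoder(StringUtf8Coder.of()));
  }
}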


On Tue, Dec 6, 2016 at 12:04 PM, Vinoth Chandar 

wrote:



Great. Thanks!

Thanks,
Vinoth


On Dec 6, 2016, at 2:06 AM, Jean-Baptiste Onofré



wrote:


Hi,

Ismaël and I started HiveIO.

I have several IOs ready to propose as PRs, but, in order to limit the
number of open PRs, I would like to merge the pending ones.


I will let you know when the branches/PRs will be available.

Regards
JB


On 12/05/2016 11:40 PM, Vinoth Chandar wrote:
Hi guys,

Saw a post around HiveIO on the users list with a PR followup. I am
interested in this too and can pitch in on development and testing.

Who & where is this work happening?

Thanks
VInoth



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com










--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Hosting data stores for IO Transform testing

2016-12-14 Thread Jean-Baptiste Onofré
cluster management software/orchestration software - I want
to make sure we land on the right tool here since choosing the wrong tool
could result in administration of the instances taking more work. I suspect
that's a good place for a follow up discussion, so I'll start a separate
thread on that. I'm happy with whatever tool we choose, but I want to make
sure we take a moment to consider different options and have a reason for
choosing one.

Etienne - thanks for being willing to port your creation/other scripts
over. You might be a good early tester of whether this system works well
for everyone.

Stephen

[1] Reasons for Beam Test Strategy -
https://docs.google.com/document/d/153J9jPQhMCNi_eBzJfhAg-NprQ7vbf1jNVRgdqeEE8I/edit?ts=58349aec#



On Wed, Nov 23, 2016 at 12:48 AM Jean-Baptiste Onofré 
wrote:


I second Etienne there.

We worked together on the ElasticsearchIO and definitely, the high
valuable test we did were integration tests with ES on docker and high
volume.

I think we have to distinguish the two kinds of tests:
1. utests are located in the IO itself and basically they should cover
the core behaviors of the IO
2. itests are located as contrib in the IO (they could be part of the IO
but executed by the integration-test plugin or a specific profile) that
deals with "real" backend and high volumes. The resources required by
the itest can be bootstrapped by Jenkins (for instance using
Mesos/Marathon and docker images as already discussed, and it's what I'm
doing on my own "server").

It's basically what Stephen described.

We have to not rely only on itests: utests are very important and they
validate the core behavior.

My $0.01 ;)

Regards
JB
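
To make JB's utest/itest split concrete, a minimal sketch of a category-1
unit test, direct runner only, no real backend; the transform under test
is a stand-in.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.testing.TestPipeline;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;
import org.junit.Test;

public class MyTransformTest {
  @Test
  public void testCoreBehavior() {
    Pipeline pipeline = TestPipeline.create();

    // Stand-in for the IO/transform under test: assert on the output
    // PCollection without touching any real backend.
    PCollection<String> output = pipeline.apply(Create.of("a", "b", "c"));

    PAssert.that(output).containsInAnyOrder("a", "b", "c");
    pipeline.run();
  }
}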

On 11/23/2016 09:27 AM, Etienne Chauchot wrote:

Hi Stephen,

I like your proposition very much and I also agree that docker + some
orchestration software would be great !

On the elasticsearchIO (PR to be created this week) there is docker
container creation scripts and logstash data ingestion script for IT
environment available in contrib directory alongside with integration
tests themselves. I'll be happy to make them compliant to new IT
environment.

What you say below about the need for an external IT environment is
particularly true. As an example with ES, what came out in the first
implementation was that there were problems starting at some high volume
of data (timeouts, ES windowing overflow...) that could not have been seen
on the embedded ES version. Also there were some particularities of an
external instance, like secondary (replica) shards, that were not visible
on an embedded instance.

Besides, I also favor bringing up instances before the test because it
allows (amongst other things) being sure to start on a fresh dataset for
the test to be deterministic.

Etienne


Le 23/11/2016 à 02:00, Stephen Sisk a écrit :

Hi,

I'm excited we're getting lots of discussion going. There are many threads
of conversation here, we may choose to split some of them off into a
different email thread. I'm also betting I missed some of the questions in
this thread, so apologies ahead of time for that. Also apologies for the
amount of text, I provided some quick summaries at the top of each section.

Amit - thanks for your thoughts. I've responded in detail below.
Ismael - thanks for offering to help. There's plenty of work here to go
around. I'll try and think about how we can divide up some next steps
(probably in a separate thread.) The main next step I see is deciding
between kubernetes/mesos+marathon/docker swarm - I'm working on that, but
having lots of different thoughts on what the advantages/disadvantages of
those are would be helpful (I'm not entirely sure of the protocol for
collaborating on sub-projects like this.)

These issues are all related to what kind of tests we want to write. I
think a kubernetes/mesos/swarm cluster could support all the use cases
we've discussed here (and thus should not block moving forward with this),
but understanding what we want to test will help us understand how the
cluster will be used. I'm working on a proposed user guide for testing IO
Transforms, and I'm going to send out a link to that + a short summary to
the list shortly so folks can get a better sense of where I'm coming from.

Here's my thinking on the questions we've raised here -

Embedded versions of data stores for testing

Summary: yes! But we still need real data stores to test against.

I am a gigantic fan of using embedded versions of the various data stores.
I think we should test everything we possibly can using them, and do the
majority of our correctness testing using embedded versions + the direct
runner. However, it's also important to have at least one test that
actually connects to an actual instance, so we can get coverage for things
like credentials, real connection strings, etc..

Re: Hbase IO preview

2016-12-14 Thread Jean-Baptiste Onofré

Hi Andrew,

I have a protobuf issue on this IO that I would like to address.

Sorry, I didn't have time to work on it this week. I do my best to push 
something work-able asap.


Regards
JB

On 12/14/2016 03:18 PM, Andrew Hoblitzell wrote:

Any update on which branch the preview for HBase IO might be available in?



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Review on Jira for 0.4.0-incubating

2016-12-13 Thread Jean-Baptiste Onofré

Hi,

Either way is fine for me too.

We discussed about the release schedule independently from the 
graduation process, that's why 0.4.0-incubator was planned around today.


Regards
JB

On 12/13/2016 06:02 PM, Daniel Kulp wrote:

Hate to suggest this….

Assuming the Board OK’s the graduation next Wednesday, if we wait till then to 
do the build, we can drop the incubator stuff entirely and it could be a 
“first release” outside of incubation.   We could avoid the extra vote on the 
incubator list, etc….

Would it make sense to delay the week?   Not a big deal either way, but I don’t 
think I’ve ever seen a project do a release between the graduation vote and the 
board vote.   Every project I’ve seen decided to wait to have the “we’ve 
graduated!” release.

Dan




On Dec 13, 2016, at 9:43 AM, Dan Halperin  wrote:

Update: we think we've knocked off all the 0.4.0-incubating blockers,
including postponing some. JB is going to start the release process soon!

On Sat, Dec 3, 2016 at 10:42 PM, Jean-Baptiste Onofré 
wrote:


Very good point Frances.

Definitely something we have to do.

Regards
JB


On 12/04/2016 07:38 AM, Frances Perry wrote:


Sounds great, JB!

The major blocker in my opinion is to finish the polishing pass on the
quickstarts and example archetypes, so that users will have a great
experience trying out 0.4.0-incubating. I know we've made some significant
progress there in the last few weeks, but I don't think we've quite
finished. For example, https://issues.apache.org/jira/browse/BEAM-909 is
unresolved and marked as 0.4.0-incubating.

On Sat, Dec 3, 2016 at 10:26 PM, Jean-Baptiste Onofré 
wrote:

Hi beamers,


We plan a 0.4.0-incubating release pretty soon. I propose to manage this
release.

I started to review the Jira with fix version set to 0.4.0-incubating.

Please, update the fix version in Jira if you are working on specific
Jira
and you want to include in the 0.4.0-incubating release.

Thanks
Regards
JB
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Beam Tuple

2016-12-13 Thread Jean-Baptiste Onofré

Hi Robert,

Agreed; however, which one would the user use? Create his own?

Today, I think Beam is heavily flexible in terms of data format (which is 
great), but the trade-off is that end-users have to write a lot of 
boilerplate code (just to convert from one type to another).

So, basically, the purpose of a Beam Tuple is to have something provided 
out of the box: if the user wants to use another tuple, that's fine.
Generally speaking, the discussion about data format extensions is about 
simplifying the way for users to manipulate popular data formats.
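
To illustrate, a purely hypothetical untagged tuple along the lines being
discussed; nothing like this exists in the SDK today, the class is only a
sketch of what "provided out of the box" could mean.

import java.io.Serializable;

// Hypothetical untagged Tuple3, sketching an out-of-the-box tuple next
// to KV (which is effectively a keyed Tuple2).
class Tuple3<A, B, C> implements Serializable {
  final A first;
  final B second;
  final C third;

  private Tuple3(A first, B second, C third) {
    this.first = first;
    this.second = second;
    this.third = third;
  }

  static <A, B, C> Tuple3<A, B, C> of(A first, B second, C third) {
    return new Tuple3<>(first, second, third);
  }
}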


Regards
JB

On 12/13/2016 05:56 PM, Robert Bradshaw wrote:

The Java language isn't very amenable to Tuple APIs as there are several
(mutually exclusive?) tradeoffs that must be made, each with their pros and
cons. What advantage is there of Beam providing its own tuple API vs.
letting users pick whatever tuple library they want and using that with
Beam?

(I suppose we're already using and encouraging AutoValue which covers a lot
of tuple cases.)

On Tue, Dec 13, 2016 at 8:20 AM, Aparup Banerjee (apbanerj) <
apban...@cisco.com> wrote:


We have created one. An untagged Tuple. Will be happy to contribute it to
the community

Aparup


On Dec 13, 2016, at 5:11 AM, Amit  wrote:

I'll add that I know of Beam's PTuple, but my question is about much
simpler Tuples, untagged.

On Tue, Dec 13, 2016 at 1:56 PM Jean-Baptiste Onofré 
wrote:


Hi Amit,

as discussed together, I think a Tuple abstraction would be good in the
SDK (more than in the data format extension).

Regards
JB


On 12/13/2016 11:06 AM, Amit Sela wrote:
Hi all,

I was wondering why Beam doesn't have tuples as part of the SDK ?
To the best of my knowledge all currently supported (OSS) runners: Spark,
Flink, Apex provide a Tuple abstraction and I was wondering if Beam should
too ?

Consider KV for example; it is a special ("*keyed*" by the first field)
implementation Tuple2.
While KV's importance is far more than being a Tuple2, I'm wondering if

the

SDK would benefit from a proper TupleX support ?

Thanks,
Amit



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com







--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Beam Tuple

2016-12-13 Thread Jean-Baptiste Onofré

Hi Amit,

as discussed together, I think a Tuple abstraction would be good in the 
SDK (more than in the data format extension).


Regards
JB

On 12/13/2016 11:06 AM, Amit Sela wrote:

Hi all,

I was wondering why Beam doesn't have tuples as part of the SDK ?
To the best of my knowledge all currently supported (OSS) runners: Spark,
Flink, Apex provide a Tuple abstraction and I was wondering if Beam should
too ?

Consider KV for example; it is a special ("*keyed*" by the first field)
implementation of Tuple2.
While KV's importance is far more than being a Tuple2, I'm wondering if the
SDK would benefit from a proper TupleX support ?

Thanks,
Amit



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Scio Beam Scala API

2016-12-11 Thread Jean-Baptiste Onofré

Hi guys,

I will share the branch with the 0.4.0-incubating-SNAPSHOT update, 
tested with the Spark runner.


Regards
JB

On 12/11/2016 09:13 PM, Neville Li wrote:

Aviem,

You're right we have a scio/apache-beam
<https://github.com/spotify/scio/tree/apache-beam> branch that works with
Beam 0.2.0-incubating and are working on keeping it up with the latest
releases.

A few ways you can contribute:
- It runs with the Dataflow runner but hasn't been tested with other
runners. You're welcome to give it a try and submit issues/PRs.
- `scio-core` is somewhat coupled with Dataflow runner and GCP IO
dependencies right now but it'd be nice to further decouple them so users
can swap other runner/IO packages easily.
- We also have a master ticket #279
<https://github.com/spotify/scio/issues/279> that keeps track of pending
issues for Beam migration.

Keep in mind that our team of 3 supports 150+ production Scio users within
Spotify so we simply don't have the bandwidth to maintain 2 diverging repo
(spotify/scio vs apache/beam-incubating) right now. We'll probably revisit
this when internal users switch over to Beam sometime next year.


On Sun, Dec 11, 2016 at 4:45 PM Jean-Baptiste Onofré 
wrote:


Hi

I'm working on a feature branch with Neville and his guys. I already
updated it to the latest changes. I would like to propose a feature branch
later this week.

Regards
JB

On Dec 11, 2016, 16:39, at 16:39, Aviem Zur  wrote:

Hi,

I've heard there has been work towards porting Scio Dataflow Scala API
to
Beam.
I was wondering at what stage this is in, where is this happening (Saw
no
branch in BEAM repository, and one in Scio repository that is dependent
on
beam 0.2.0-INCUBATING) and if there is a way to contribute?






--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Scio Beam Scala API

2016-12-11 Thread Jean-Baptiste Onofré
Hi

I'm working on a feature branch with Neville and his guys. I already updated 
it to the latest changes. I would like to propose a feature branch later this 
week.

Regards
JB

On Dec 11, 2016, 16:39, at 16:39, Aviem Zur  wrote:
>Hi,
>
>I've heard there has been work towards porting Scio Dataflow Scala API
>to
>Beam.
>I was wondering at what stage this is in, where is this happening (Saw
>no
>branch in BEAM repository, and one in Scio repository that is dependent
>on
>beam 0.2.0-INCUBATING) and if there is a way to contribute?


Re: Jenkins build is still unstable: beam_PostCommit_Java_RunnableOnService_Dataflow #1787

2016-12-10 Thread Jean-Baptiste Onofré

Hi Jason,

hmmm, I don't think it's possible to retry staging.

I also see on status.apache.org that Jenkins executors are a bit flaky 
since couple of days.


Regards
JB

On 12/11/2016 07:25 AM, Jason Kuster wrote:

Hm, I seem to have spoken too soon on the latest break -- looks like this
has been an ongoing issue. All the failures in the last 2 or 3 days have
been due to staging errors, and scattered ones before that. Is there any
way to retry staging if it fails?

On Sat, Dec 10, 2016 at 2:16 PM, Apache Jenkins Server <
jenk...@builds.apache.org> wrote:


See <https://builds.apache.org/job/beam_PostCommit_Java_
RunnableOnService_Dataflow/changes>







--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Performance Benchmarking Beam

2016-12-10 Thread Jean-Baptiste Onofré

Cool !

Please use the mailing list and Jira to sync effort.

Thanks,
Regards
JB

On 12/10/2016 05:00 PM, Otávio Carvalho wrote:

Awesome, Jason!

I am also interested in contribute to this effort by building/porting
streaming microbenchmarks to Beam.

I will make contact in the following weeks.

Regards,
Otavio.

2016-12-09 15:11 GMT-02:00 Jean-Baptiste Onofré :


Happy to help too ;)

@Jason, as discussed together, I will send my config (Marathon JSON,
Dockerfile, ...), I'm so sorry to be late on this.

Regards
JB


On 12/09/2016 05:59 PM, Amit Sela wrote:


This is great Jason!

Let me know if / how I can assist with Spark, or generally.

Thanks,
Amit

On Thu, Dec 8, 2016 at 9:01 PM Jason Kuster 
wrote:

Hey all,


So as I mentioned on Stephen's IO Testing thread a few days ago I've been
doing a bunch of investigating into performance testing frameworks. I've
put all my thoughts into a doc here and I'd love to hear thoughts about
my
investigation and what I'm proposing going forward.

https://docs.google.com/document/d/18ffP1vYurvNe92Efs_6hFFBDYC2dQEdWw135_GWZ2YU/view

Copying from the earlier mail:
The tl;dr version is that there are a number of tools out there, but that
the best one I was able to find was a tool called PerfKit Benchmarker
(PKB)[1]. As it turns out, they already had the ability to benchmark Spark
(I have a PR out to extend the Spark functionality[2] and a couple more
improvements in the works), and I've put together some additional work in a
branch on my repository[3] to enable proof-of-concept Dataflow Java
benchmarks. I'm pretty excited about it overall.

[1] https://github.com/GoogleCloudPlatform/PerfKitBenchmarker
[2] https://github.com/GoogleCloudPlatform/PerfKitBenchmarker/pull/1214
[3] https://github.com/jasonkuster/PerfKitBenchmarker/tree/beam

Looking forward to moving forward with this.

Jason

--
---
Jason Kuster
Apache Beam (Incubating) / Google Cloud Dataflow





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Performance Benchmarking Beam

2016-12-09 Thread Jean-Baptiste Onofré

Happy to help too ;)

@Jason, as discussed together, I will send my config (Marathon JSON, 
Dockerfile, ...), I'm so sorry to be late on this.


Regards
JB

On 12/09/2016 05:59 PM, Amit Sela wrote:

This is great Jason!

Let me know if / how I can assist with Spark, or generally.

Thanks,
Amit

On Thu, Dec 8, 2016 at 9:01 PM Jason Kuster 
wrote:


Hey all,

So as I mentioned on Stephen's IO Testing thread a few days ago I've been
doing a bunch of investigating into performance testing frameworks. I've
put all my thoughts into a doc here and I'd love to hear thoughts about my
investigation and what I'm proposing going forward.

https://docs.google.com/document/d/18ffP1vYurvNe92Efs_6hFFBDYC2dQEdWw135_GWZ2YU/view

Copying from the earlier mail:
The tl;dr version is that there are a number of tools out there, but that
the best one I was able to find was a tool called PerfKit Benchmarker
(PKB)[1]. As it turns out, they already had the ability to benchmark Spark
(I have a PR out to extend the Spark functionality[2] and a couple more
improvements in the works), and I've put together some additional work in a
branch on my repository[3] to enable proof-of-concept Dataflow Java
benchmarks. I'm pretty excited about it overall.

[1] https://github.com/GoogleCloudPlatform/PerfKitBenchmarker
[2] https://github.com/GoogleCloudPlatform/PerfKitBenchmarker/pull/1214
[3] https://github.com/jasonkuster/PerfKitBenchmarker/tree/beam

Looking forward to moving forward with this.

Jason

--
---
Jason Kuster
Apache Beam (Incubating) / Google Cloud Dataflow





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [DISCUSS] Graduation to a top-level project

2016-12-08 Thread Jean-Baptiste Onofré

Thanks!

Davor

[1] http://community.apache.org/apache-way/apache-project-maturity-model.html
[2] http://beam.incubator.apache.org/contribute/maturity-model/

--
Neelesh Srinivas Salian
Customer Operations Engineer





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [DISCUSS] ExecIO

2016-12-08 Thread Jean-Baptiste Onofré

Hi guys,

I understand your point.

The Exec "IO" can already take input commands from a PCollection, but 
the user has to prepare the commands.
I will improve the ExecFn as you said: be able to construct the shell 
commands using elements in the PCollection (using one element as 
command, the others as arguments).
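A minimal sketch of that construction (a hypothetical helper, not the PR's
actual code): the first entry of the input element is treated as the
executable and the remaining entries as its arguments.

import java.io.IOException;
import java.util.List;

// Hypothetical helper: element.get(0) is the command, the remaining
// entries are its arguments, e.g. ["wc", "-l", "/tmp/input.txt"].
static Process startCommand(List<String> element) throws IOException {
  return new ProcessBuilder(element).start();
}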


I agree with your statement about DoFn: a DoFn in the "middle" of a
pipeline is not an IO. An IO acts as an endpoint in a pipeline: the starting
endpoint for Read, the ending endpoint for Write.


The point is that a DoFn can be a connector (for instance a MySQL database
lookup, as you said), but it can be wrapped as an IO.


If I compare with Apache Camel, a pipeline (aka a route) starts with a
unique consumer endpoint (that's what we call a consumer endpoint on a Camel
component). A producer endpoint can end a route or be used in any middle
step. It provides a convenient way to extend the processing/routing logic.

It's like a DoFn.

Regards
JB

On 12/08/2016 09:37 PM, Ben Chambers wrote:

I think I agree with Robert (unless I'm misunderstanding his point).

I think that the shell commands are going to be the most useful if it is
possible to take the elements in an input PCollection, construct a shell
command depending on those elements, and then execute it. I think doing so
in a fully general manner outside of a DoFn will be difficult. If instead
we made it easier to declare a DoFn as having requirements on the
environment (these programs must be available in the shell) and easier to
execute shell commands within a DoFn, I think that covers many more use
cases.

On Thu, Dec 8, 2016 at 12:23 PM Robert Bradshaw 
wrote:


On Wed, Dec 7, 2016 at 1:32 AM, Jean-Baptiste Onofré 
wrote:

By the way, just to elaborate a bit why I provided as an IO:

1. From a user experience perspective, I think we have to provide a
convenient way to write pipelines. Any syntax simplifying this is valuable.

I think it's easier to write:

pipeline.apply(ExecIO.read().withCommand("foo"))

than:

pipeline.apply(Create.of("foo")).apply(ParDo.of(new ExecFn()));


Slightly. Still, when I see

pipeline.apply(ExecIO.read().withCommand("foo"))

I am surprised to get a PCollection with a single element...


2. For me (maybe I'm wrong ;)), an IO is an extension dedicated to
"connectors": reading/writing from/to a data source. So, even without the IO
"wrapping" (by wrapping, I mean the Read and Write), I think the Exec
extension should be in IO as it's a source/sink of data.


To clarify, if you wrote a DoFn that, say, did lookups against a MySQL
database, you would consider this an IO? For me, IO denotes
input/output, i.e. the roots and leaves of a pipeline.


Regards
JB

On 12/07/2016 08:37 AM, Robert Bradshaw wrote:


I don't mean to derail the tricky environment questions, but I'm not
seeing why this is bundled as an IO rather than a plain DoFn (which
can be applied to a PCollection of one or more commands, yielding
their outputs). Especially for the case of a Read, which in this case
is not splittable (initially or dynamically) and always produces a
single element--feels much more like a Map to me.

On Tue, Dec 6, 2016 at 3:26 PM, Eugene Kirpichov
 wrote:


Ben - the issues of "things aren't hung, there is a shell command running",
aren't they general to all DoFn's? i.e. I don't see why the runner would
need to know that a shell command is running, but not that, say, a heavy
monolithic computation is running. What's the benefit to the runner in
knowing that the DoFn contains a shell command?

By saying "making sure that all shell commands finish", I suppose you're
referring to the possibility of leaks if the user initiates a shell command
and forgets to wait for it? I think that should be solvable again without
Beam intervention, by making a utility class for running shell commands
which implements AutoCloseable, and document that you have to use it that
way.

Ken - I think the question here is: are we ok with a situation where the
runner doesn't check or care whether the shell command can run, and the
user accepts this risk and studies what commands will be available on the
worker environment provided by the runner they use in production, before
productionizing a pipeline with those commands.

Upon some thought I think it's ok. Of course, this carries an obligation
for runners to document their worker environment and its changes across
versions. Though for many runners such documentation may be trivial:
"whatever your YARN cluster has, the runner doesn't change it in any way"
and it may be good enough for users. And for other runners, like Dataflow,
such documentation may also be trivial: "no guarantees whatsoever, only
what you stage in --filesToStage is available".

I can also see Beam develop to a point where we'd want a

Re: [DISCUSS] [BEAM-438] Rename one of PTransform.apply or PInput.apply

2016-12-07 Thread Jean-Baptiste Onofré

+1

Regards
JB

On 12/07/2016 10:37 PM, Kenneth Knowles wrote:

Hi all,

I want to bring up another major backwards-incompatible change before it is
too late, to resolve [BEAM-438].

Summary: Leave PInput.apply the same but rename PTransform.apply to
PTransform.expand. I have opened [PR #1538] just for reference (it took 30
seconds using IDE automated refactor)

This change affects *PTransform authors* but does *not* affect pipeline
authors.

This issue was filed a long time ago. It has been a problem many times with
actual users since before Beam started incubating. This is what goes wrong
(often):

   PCollection<Foo> input = ...
   PTransform<PCollection<Foo>, ...> transform = ...

   transform.apply(input)

This type checks and even looks perfectly normal. Do you see the error?

... what we need the user to write is:

input.apply(transform)

What a confusing difference! After all, the first one type-checks and the
first one is how you apply a Function or Predicate or SerializableFunction,
etc. But it is broken. With transform.apply(input) the transform is not
registered with the pipeline at all.

We obviously can't (and don't want to) change the most core way that
pipeline authors use Beam, so PInput.apply (aka PCollection.apply) must
remain the same. But we do need a way to make it impossible to mix these up.

The simplest way I can think of is to choose a new name for the other
method involved. Users probably won't write transform.expand(input) since
they will never have seen it in any examples, etc. This will just make
PTransform authors need to do a global rename, and the type system will
direct them to all cases so there is no silent failure possible.
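For illustration, here is roughly what a PTransform author's code looks like
after the rename (a sketch with a made-up class and logic; the point is the
expand override):

import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class WordLengths extends PTransform<PCollection<String>, PCollection<Integer>> {
  @Override
  public PCollection<Integer> expand(PCollection<String> input) {
    // Pipeline authors still write input.apply(new WordLengths());
    // only the method that transform authors override is renamed.
    return input.apply(ParDo.of(new DoFn<String, Integer>() {
      @ProcessElement
      public void processElement(ProcessContext c) {
        c.output(c.element().length());
      }
    }));
  }
}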

What do you think?

Kenn

[BEAM-438] https://issues.apache.org/jira/browse/BEAM-438
[PR #1538] https://github.com/apache/incubator-beam/pull/1538

p.s. there is a really amusing and confusing call chain: PCollection.apply
-> Pipeline.applyTransform -> Pipeline.applyInternal ->
PipelineRunner.apply -> PTransform.apply

After this change and work to get the runner out of the loop, it becomes
PCollection.apply -> Pipeline.applyTransform -> PTransform.expand



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Naming and API for executing shell commands

2016-12-07 Thread Jean-Baptiste Onofré

Hi Eugene,

I like your ShellCommands.execute().withCommand("foo") !

And you listed valid points and usages, especially around the 
input/output of the command.


My question is: where do we put such a ShellCommands extension? As a
module under IO? As a new extensions module?


Regards
JB

On 12/07/2016 06:24 PM, Eugene Kirpichov wrote:

Branched off into a separate thread.

How about ShellCommands.execute().withCommand("foo")? This is what it is -
it executes shell commands :)

Say, if I want to just execute a command for the sake of its side effect,
but I'm not interested in its output - it would feel odd to describe that
as either "reading" from the command or "writing" to it. Likewise, when I
execute commands in bash, I'm not thinking of it as reading or writing to
them.

Though, there are various modes of interaction with shell commands; some of
them could be called "reading" or "writing" I guess - or both!
- The command itself can be specified at pipeline construction time, or
fully dynamic (elements of a PCollection are themselves commands), or be
constructed from a fixed command and a variable set of arguments coming
from the PCollection (one-by-one or xargs-style, cramming as many arguments
as fit into the command line limit).
- We may also be writing elements of the PCollection to standard input of
the command - one-by-one, or in arbitrarily sized batches.
- We may be reading the command's stdout, its stderr, and its error code.

I think these options call for a more flexible naming and set of APIs than
read and write. And more flexible than a single DoFn, too (which is
something I hadn't thought of before - this connector definitely has room
for doing some interesting things).
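As a strawman only (every name below is invented, just to make the option
space concrete; args is assumed to be a PCollection<String> of argument
values), such a flexible API might look like:

// Purely hypothetical API shape - none of these methods exist yet;
// this is a sketch for discussion, not a committed design.
PCollection<String> stdout = args.apply(
    ShellCommands.execute()
        .withCommand("grep")        // fixed command...
        .withArgumentsFromInput()   // ...arguments taken from each element
        .outputtingStdout());       // could equally expose stderr / exit code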

On Wed, Dec 7, 2016 at 1:32 AM Jean-Baptiste Onofré  wrote:


By the way, just to elaborate a bit why I provided as an IO:

1. From a user experience perspective, I think we have to provide a
convenient way to write pipelines. Any syntax simplifying this is valuable.
I think it's easier to write:

pipeline.apply(ExecIO.read().withCommand("foo"))

than:

pipeline.apply(Create.of("foo")).apply(ParDo.of(new ExecFn()));

2. For me (maybe I'm wrong ;)), an IO is an extension dedicated to
"connectors": reading/writing from/to a data source. So, even without the
IO "wrapping" (by wrapping, I mean the Read and Write), I think the Exec
extension should be in IO as it's a source/sink of data.

Regards
JB

On 12/07/2016 08:37 AM, Robert Bradshaw wrote:

I don't mean to derail the tricky environment questions, but I'm not
seeing why this is bundled as an IO rather than a plain DoFn (which
can be applied to a PCollection of one or more commands, yielding
their outputs). Especially for the case of a Read, which in this case
is not splittable (initially or dynamically) and always produces a
single element--feels much more like a Map to me.

On Tue, Dec 6, 2016 at 3:26 PM, Eugene Kirpichov
 wrote:

Ben - the issues of "things aren't hung, there is a shell command running",
aren't they general to all DoFn's? i.e. I don't see why the runner would
need to know that a shell command is running, but not that, say, a heavy
monolithic computation is running. What's the benefit to the runner in
knowing that the DoFn contains a shell command?

By saying "making sure that all shell commands finish", I suppose you're
referring to the possibility of leaks if the user initiates a shell command
and forgets to wait for it? I think that should be solvable again without
Beam intervention, by making a utility class for running shell commands
which implements AutoCloseable, and document that you have to use it that
way.

Ken - I think the question here is: are we ok with a situation where the
runner doesn't check or care whether the shell command can run, and the
user accepts this risk and studies what commands will be available on the
worker environment provided by the runner they use in production, before
productionizing a pipeline with those commands.

Upon some thought I think it's ok. Of course, this carries an obligation
for runners to document their worker environment and its changes across
versions. Though for many runners such documentation may be trivial:
"whatever your YARN cluster has, the runner doesn't change it in any

way"

and it may be good enough for users. And for other runners, like

Dataflow,

such documentation may also be trivial: "no guarantees whatsoever, only
what you stage in --filesToStage is available".

I can also see Beam develop to a point where we'd want all runners to be
able to run your DoFn in a user-specified Docker container, and manage
those intelligently - but I think that's quite a while away and it doesn't
have to block work on a utility for executing shell commands

Re: HiveIO

2016-12-07 Thread Jean-Baptiste Onofré
Yes that's the first idea ;)

Regards
JB

On Dec 7, 2016, at 17:27, Vinoth Chandar wrote:
>Interesting. So all the planning & execution is done by Hive, and Beam
>will process the results of the query?
>
>On Wed, Dec 7, 2016 at 8:24 AM, Jean-Baptiste Onofré 
>wrote:
>
>> Hi
>>
>> The HiveIO will directly use the native API and HiveQL. That's the plan
>> on which we are working right now.
>>
>> Regards
>> JB
>>
>> On Dec 7, 2016, at 17:18, Vinoth Chandar wrote:
>> >Hi,
>> >
>> >I am not looking for a way to actually execute the query on Hive. I
>> >would like to do something similar to Spark SQL/HiveContext, but with
>> >Beam. Just have a HiveIO that reads metadata from the Hive metastore,
>> >and then later use a Spark runner to execute the query. So, HiveJDBC is
>> >not an option I would like to pursue. Thanks for the pointer, though!
>> >
>> >And does the HiveIO that is being planned, work similarly as above?
>> >
>> >
>> >Thanks
>> >Vinoth
>> >
>> >
>> >
>> >On Tue, Dec 6, 2016 at 4:55 AM, Ismaël Mejía wrote:
>> >
>> >> Hello,
>> >>
>> >> If you really need to read/write via Hive, remember that you can use
>> >> the Hive Jdbc driver, and achieve this with Beam using the JdbcIO (this
>> >> is probably less efficient for the streaming case but still a valid
>> >> solution).
>> >>
>> >> Ismaël
>> >>
>> >>
>> >> On Tue, Dec 6, 2016 at 12:04 PM, Vinoth Chandar 
>> >wrote:
>> >>
>> >> > Great. Thanks!
>> >> >
>> >> > Thanks,
>> >> > Vinoth
>> >> >
>> >> > > On Dec 6, 2016, at 2:06 AM, Jean-Baptiste Onofré
>> >
>> >> > wrote:
>> >> > >
>> >> > > Hi,
>> >> > >
>> >> > > Ismaël and I started HiveIO.
>> >> > >
>> >> > > I have several IOs ready to propose as PR, but, in order to limit
>> >> > > the number of open PRs, I would like to merge the pending ones.
>> >> > >
>> >> > > I will let you know when the branches/PRs will be available.
>> >> > >
>> >> > > Regards
>> >> > > JB
>> >> > >
>> >> > >> On 12/05/2016 11:40 PM, Vinoth Chandar wrote:
>> >> > >> Hi guys,
>> >> > >>
>> >> > >> Saw a post around HiveIO on the users list with a PR followup. I
>> >> > >> am interested in this too and can pitch in on development and
>> >> > >> testing.
>> >> > >>
>> >> > >> Who & where is this work happening?
>> >> > >>
>> >> > >> Thanks
>> >> > >> VInoth
>> >> > >>
>> >> > >
>> >> > > --
>> >> > > Jean-Baptiste Onofré
>> >> > > jbono...@apache.org
>> >> > > http://blog.nanthrax.net
>> >> > > Talend - http://www.talend.com
>> >> >
>> >>
>>
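To make Ismaël's JdbcIO suggestion above concrete, a read through the Hive
JDBC driver could look roughly like the sketch below (the connection URL and
query are placeholders, and treat the exact builder methods as approximate
for this era of the SDK):

import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.io.jdbc.JdbcIO;
import org.apache.beam.sdk.values.PCollection;

// Sketch: read rows from Hive via its JDBC driver using JdbcIO.
PCollection<String> names = pipeline.apply(JdbcIO.<String>read()
    .withDataSourceConfiguration(JdbcIO.DataSourceConfiguration.create(
        "org.apache.hive.jdbc.HiveDriver",       // Hive JDBC driver class
        "jdbc:hive2://localhost:10000/default")) // placeholder connection URL
    .withQuery("SELECT name FROM my_table")
    .withRowMapper(resultSet -> resultSet.getString(1))
    .withCoder(StringUtf8Coder.of()));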


Re: HiveIO

2016-12-07 Thread Jean-Baptiste Onofré
Hi

The HiveIO will directly use the native API and HiveQL. That's the plan on 
which we are working right now.

Regards
JB

On Dec 7, 2016, at 17:18, Vinoth Chandar wrote:
>Hi,
>
>I am not looking for a way to actually execute the query on Hive. I would
>like to do something similar to Spark SQL/HiveContext, but with Beam. Just
>have a HiveIO that reads metadata from the Hive metastore, and then later
>use a Spark runner to execute the query. So, HiveJDBC is not an option I
>would like to pursue. Thanks for the pointer, though!
>
>And does the HiveIO that is being planned, work similarly as above?
>
>
>Thanks
>Vinoth
>
>
>
>On Tue, Dec 6, 2016 at 4:55 AM, Ismaël Mejía  wrote:
>
>> Hello,
>>
>> If you really need to read/write via Hive, remember that you can use the
>> Hive Jdbc driver, and achieve this with Beam using the JdbcIO (this is
>> probably less efficient for the streaming case but still a valid
>> solution).
>>
>> Ismaël
>>
>>
>> On Tue, Dec 6, 2016 at 12:04 PM, Vinoth Chandar 
>wrote:
>>
>> > Great. Thanks!
>> >
>> > Thanks,
>> > Vinoth
>> >
>> > > On Dec 6, 2016, at 2:06 AM, Jean-Baptiste Onofré
>
>> > wrote:
>> > >
>> > > Hi,
>> > >
>> > > Ismaël and I started HiveIO.
>> > >
>> > > I have several IOs ready to propose as PR, but, in order to limit
>> > > the number of open PRs, I would like to merge the pending ones.
>> > >
>> > > I will let you know when the branches/PRs will be available.
>> > >
>> > > Regards
>> > > JB
>> > >
>> > >> On 12/05/2016 11:40 PM, Vinoth Chandar wrote:
>> > >> Hi guys,
>> > >>
>> > >> Saw a post around HiveIO on the users list with a PR followup. I am
>> > >> interested in this too and can pitch in on development and testing.
>> > >>
>> > >> Who & where is this work happening?
>> > >>
>> > >> Thanks
>> > >> VInoth
>> > >>
>> > >
>> > > --
>> > > Jean-Baptiste Onofré
>> > > jbono...@apache.org
>> > > http://blog.nanthrax.net
>> > > Talend - http://www.talend.com
>> >
>>


Re: DataCamp II Salzburg

2016-12-07 Thread Jean-Baptiste Onofré

Hi Sergio,

thanks again for sharing !

Great work !

Regards
JB

On 12/07/2016 04:21 PM, Sergio Fernández wrote:

The slides I used are available at
http://www.slideshare.net/Wikier/introduction-to-apache-beam-incubating-datacamp-salzburg-7-dec-2016

People really like it ;-)

On Fri, Dec 2, 2016 at 8:21 AM, Davor Bonaci 
wrote:


This is great! (Please share any recording after the event if available).

On Thu, Dec 1, 2016 at 6:04 AM, Sergio Fernández 
wrote:


Hi folks,

next week we have a DataCamp in Salzburg, a meetup about Big Data:

https://www.meetup.com/Salzburg-Big-Data-Meetup/events/231844168/

where I'm going to give a session introducing Apache Beam.

I'll share the material here afterwards; it's very introductory anyway.

Cheers,

--
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e: sergio.fernan...@redlink.co
w: http://redlink.co









--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [DISCUSS] ExecIO

2016-12-07 Thread Jean-Baptiste Onofré

By the way, just to elaborate a bit why I provided as an IO:

1. From a user experience perspective, I think we have to provide a
convenient way to write pipelines. Any syntax simplifying this is valuable.

I think it's easier to write:

pipeline.apply(ExecIO.read().withCommand("foo"))

than:

pipeline.apply(Create.of("foo")).apply(ParDo.of(new ExecFn()));

2. For me (maybe I'm wrong ;)), an IO is an extension dedicated to
"connectors": reading/writing from/to a data source. So, even without the
IO "wrapping" (by wrapping, I mean the Read and Write), I think the Exec
extension should be in IO as it's a source/sink of data.


Regards
JB

On 12/07/2016 08:37 AM, Robert Bradshaw wrote:

I don't mean to derail the tricky environment questions, but I'm not
seeing why this is bundled as an IO rather than a plain DoFn (which
can be applied to a PCollection of one or more commands, yielding
their outputs). Especially for the case of a Read, which in this case
is not splittable (initially or dynamically) and always produces a
single element--feels much more like a Map to me.

On Tue, Dec 6, 2016 at 3:26 PM, Eugene Kirpichov
 wrote:

Ben - the issues of "things aren't hung, there is a shell command running",
aren't they general to all DoFn's? i.e. I don't see why the runner would
need to know that a shell command is running, but not that, say, a heavy
monolithic computation is running. What's the benefit to the runner in
knowing that the DoFn contains a shell command?

By saying "making sure that all shell commands finish", I suppose you're
referring to the possibility of leaks if the user initiates a shell command
and forgets to wait for it? I think that should be solvable again without
Beam intervention, by making a utility class for running shell commands
which implements AutoCloseable, and document that you have to use it that
way.
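A minimal sketch of such a utility (a hypothetical class, not something from
the PR): used with try-with-resources, the subprocess is reaped even if the
DoFn body throws, so a forgotten command cannot leak past the block.

import java.io.IOException;

// Hypothetical AutoCloseable wrapper around a shell command.
public class ShellCommand implements AutoCloseable {
  private final Process process;

  public ShellCommand(String command) throws IOException {
    this.process = new ProcessBuilder("sh", "-c", command).start();
  }

  public int waitFor() throws InterruptedException {
    return process.waitFor();
  }

  @Override
  public void close() {
    process.destroy(); // no-op if the process already exited
  }
}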

Ken - I think the question here is: are we ok with a situation where the
runner doesn't check or care whether the shell command can run, and the
user accepts this risk and studies what commands will be available on the
worker environment provided by the runner they use in production, before
productionizing a pipeline with those commands.

Upon some thought I think it's ok. Of course, this carries an obligation
for runners to document their worker environment and its changes across
versions. Though for many runners such documentation may be trivial:
"whatever your YARN cluster has, the runner doesn't change it in any way"
and it may be good enough for users. And for other runners, like Dataflow,
such documentation may also be trivial: "no guarantees whatsoever, only
what you stage in --filesToStage is available".

I can also see Beam develop to a point where we'd want all runners to be
able to run your DoFn in a user-specified Docker container, and manage
those intelligently - but I think that's quite a while away and it doesn't
have to block work on a utility for executing shell commands. Though it'd
be nice if the utility was forward-compatible with that future world.

On Tue, Dec 6, 2016 at 2:16 AM Jean-Baptiste Onofré  wrote:


Hi Eugene,

thanks for the extended questions.

I think we have two levels of expectations here:
- end-user responsibility
- worker/runner responsibility

1/ From an end-user perspective, the end-user has to know that using a
system command (via ExecIO), and more generally speaking anything which
relies on worker resources (for instance a local filesystem directory
available only on a worker), can fail if the expected resource is not
present on all workers. So, basically, all workers should have the same
topology. That's what I'm assuming for the PR.
For example, I have my Spark cluster, using the same Mesos/Docker setup,
then the user knows that all nodes in the cluster will have the same
setup and so resources (it could be provided by DevOps for instance).
On the other hand, running on Dataflow is different because I don't
"control" the nodes (bootstrapping or resources), but in that case, the
user knows it (he knows the runner he's using).

2/ As you said, we can expect that runner can deal with some
requirements (expressed depending of the pipeline and the runner), and
the runner can know the workers which provide capabilities matching
those requirements.
Then, the end user is no longer responsible: the runner will try to
determine whether the pipeline can be executed, and where a DoFn has to run
(on which worker).

For me, it's two different levels where 2 is smarter but 1 can also make
sense.

WDYT ?

Regards
JB

On 12/05/2016 08:51 PM, Eugene Kirpichov wrote:

Hi JB,

Thanks for bringing this to the mailing list. I also think that this is
useful in general (and that use cases for Beam are more than just classic
bigdata), and that there are interesting questions here at different levels
about how to do it right.

Re: [DISCUSS] ExecIO

2016-12-06 Thread Jean-Baptiste Onofré

Hi Robert,

The "wrapping" as IO is more for convenience for end users. The 
Read/Write can be replaced by documentation/javadoc.


But you are right, the key part is the ExecFn.

Regards
JB

On 12/07/2016 08:37 AM, Robert Bradshaw wrote:

I don't mean to derail the tricky environment questions, but I'm not
seeing why this is bundled as an IO rather than a plain DoFn (which
can be applied to a PCollection of one or more commands, yielding
their outputs). Especially for the case of a Read, which in this case
is not splittable (initially or dynamically) and always produces a
single element--feels much more like a Map to me.

On Tue, Dec 6, 2016 at 3:26 PM, Eugene Kirpichov
 wrote:

Ben - the issues of "things aren't hung, there is a shell command running",
aren't they general to all DoFn's? i.e. I don't see why the runner would
need to know that a shell command is running, but not that, say, a heavy
monolithic computation is running. What's the benefit to the runner in
knowing that the DoFn contains a shell command?

By saying "making sure that all shell commands finish", I suppose you're
referring to the possibility of leaks if the user initiates a shell command
and forgets to wait for it? I think that should be solvable again without
Beam intervention, by making a utility class for running shell commands
which implements AutoCloseable, and document that you have to use it that
way.

Ken - I think the question here is: are we ok with a situation where the
runner doesn't check or care whether the shell command can run, and the
user accepts this risk and studies what commands will be available on the
worker environment provided by the runner they use in production, before
productionizing a pipeline with those commands.

Upon some thought I think it's ok. Of course, this carries an obligation
for runners to document their worker environment and its changes across
versions. Though for many runners such documentation may be trivial:
"whatever your YARN cluster has, the runner doesn't change it in any way"
and it may be good enough for users. And for other runners, like Dataflow,
such documentation may also be trivial: "no guarantees whatsoever, only
what you stage in --filesToStage is available".

I can also see Beam develop to a point where we'd want all runners to be
able to run your DoFn in a user-specified Docker container, and manage
those intelligently - but I think that's quite a while away and it doesn't
have to block work on a utility for executing shell commands. Though it'd
be nice if the utility was forward-compatible with that future world.

On Tue, Dec 6, 2016 at 2:16 AM Jean-Baptiste Onofré  wrote:


Hi Eugene,

thanks for the extended questions.

I think we have two levels of expectations here:
- end-user responsibility
- worker/runner responsibility

1/ From an end-user perspective, the end-user has to know that using a
system command (via ExecIO), and more generally speaking anything which
relies on worker resources (for instance a local filesystem directory
available only on a worker), can fail if the expected resource is not
present on all workers. So, basically, all workers should have the same
topology. That's what I'm assuming for the PR.
For example, I have my Spark cluster, using the same Mesos/Docker setup,
then the user knows that all nodes in the cluster will have the same
setup and so resources (it could be provided by DevOps for instance).
On the other hand, running on Dataflow is different because I don't
"control" the nodes (bootstrapping or resources), but in that case, the
user knows it (he knows the runner he's using).

2/ As you said, we can expect that runner can deal with some
requirements (expressed depending of the pipeline and the runner), and
the runner can know the workers which provide capabilities matching
those requirements.
Then, the end user is no longer responsible: the runner will try to
determine whether the pipeline can be executed, and where a DoFn has to run
(on which worker).

For me, it's two different levels where 2 is smarter but 1 can also make
sense.

WDYT ?

Regards
JB

On 12/05/2016 08:51 PM, Eugene Kirpichov wrote:

Hi JB,

Thanks for bringing this to the mailing list. I also think that this is
useful in general (and that use cases for Beam are more than just classic
bigdata), and that there are interesting questions here at different

levels

about how to do it right.

I suggest to start with the highest-level question [and discuss the
particular API only after agreeing on this, possibly in a separate

thread]:

how to deal with the fact that Beam gives no guarantees about the
environment on workers, e.g. which commands are available, which shell or
even OS is being used, etc. Particularly:

- Obviously different runners will have a different environment, e.g.
Dataflow workers are not going to have

Re: [DISCUSS] ExecIO

2016-12-06 Thread Jean-Baptiste Onofré

Hi Eugene,

thanks for the extended questions.

I think we have two levels of expectations here:
- end-user responsibility
- worker/runner responsibility

1/ From an end-user perspective, the end-user has to know that using a
system command (via ExecIO), and more generally speaking anything which
relies on worker resources (for instance a local filesystem directory
available only on a worker), can fail if the expected resource is not
present on all workers. So, basically, all workers should have the same
topology. That's what I'm assuming for the PR.
For example, I have my Spark cluster, using the same Mesos/Docker setup, 
then the user knows that all nodes in the cluster will have the same 
setup and so resources (it could be provided by DevOps for instance).
On the other hand, running on Dataflow is different because I don't 
"control" the nodes (bootstrapping or resources), but in that case, the 
user knows it (he knows the runner he's using).


2/ As you said, we can expect that runner can deal with some 
requirements (expressed depending of the pipeline and the runner), and 
the runner can know the workers which provide capabilities matching 
those requirements.
Then, the end user is no longer responsible: the runner will try to
determine whether the pipeline can be executed, and where a DoFn has to run
(on which worker).


For me, it's two different levels where 2 is smarter but 1 can also make 
sense.


WDYT ?

Regards
JB

On 12/05/2016 08:51 PM, Eugene Kirpichov wrote:

Hi JB,

Thanks for bringing this to the mailing list. I also think that this is
useful in general (and that use cases for Beam are more than just classic
bigdata), and that there are interesting questions here at different levels
about how to do it right.

I suggest to start with the highest-level question [and discuss the
particular API only after agreeing on this, possibly in a separate thread]:
how to deal with the fact that Beam gives no guarantees about the
environment on workers, e.g. which commands are available, which shell or
even OS is being used, etc. Particularly:

- Obviously different runners will have a different environment, e.g.
Dataflow workers are not going to have Hadoop commands available because
they are not running on a Hadoop cluster. So, pipelines and transforms
developed using this connector will be necessarily non-portable between
different runners. Maybe this is ok? But we need to give users a clear
expectation about this. How do we phrase this expectation and where do we
put it in the docs?

- I'm concerned that this puts additional compatibility requirements on
runners - it becomes necessary for a runner to document the environment of
its workers (OS, shell, privileges, guaranteed-installed packages, access
to other things on the host machine e.g. whether or not the worker runs in
its own container, etc.) and to keep it stable - otherwise transforms and
pipelines with this connector will be non-portable between runner versions
either.

Another way to deal with this is to give up and say "the environment on the
workers is outside the scope of Beam; consult your runner's documentation
or use your best judgment as to what the environment will be, and use this
at your own risk".

What do others think?

On Mon, Dec 5, 2016 at 5:09 AM Jean-Baptiste Onofré  wrote:

Hi beamers,

Today, Beam is mainly focused on data processing.
Since the beginning of the project, we have been discussing extending
the use case coverage via DSLs and extensions (like for machine
learning), or via IO.

Especially for the IO, we can see Beam used for data integration and data
ingestion.

In this area, I'm proposing a first IO: ExecIO:

https://issues.apache.org/jira/browse/BEAM-1059
https://github.com/apache/incubator-beam/pull/1451

Actually, this IO is mainly an ExecFn that executes system commands
(again, keep in mind we are discussing data integration/ingestion
and not data processing).

For convenience, this ExecFn is wrapped in Read and Write (as a regular IO).

Clearly, this IO/Fn depends on the worker where it runs. But that's under
the user's responsibility.

During the review, Eugene and I discussed about:
- is it an IO or just a fn ?
- is it OK to have worker specific IO ?

IMHO, an IO makes a lot of sense to me and it's very convenient for end
users. They can do something like:

PCollection<String> output =
pipeline.apply(ExecIO.read().withCommand("/path/to/myscript.sh"));

The pipeline will execute myscript, and the output PCollection will contain
the command execution stdout/stderr.

On the other hand, they can do:

pcollection.apply(ExecIO.write());

where PCollection contains the commands to execute.

Generally speaking, end users can call ExecFn wherever they want in the
pipeline steps:

PCollection<String> output = pipeline.apply(ParDo.of(new ExecIO.ExecFn()));

The input collection contains the commands to execute, and the output
collection contains the command execution results (stdout/stderr).

Re: HiveIO

2016-12-06 Thread Jean-Baptiste Onofré

Hi,

Ismaël and I started HiveIO.

I have several IOs ready to propose as PR, but, in order to limit the 
number of open PRs, I would like to merge the pending ones.


I will let you know when the branches/PRs will be available.

Regards
JB

On 12/05/2016 11:40 PM, Vinoth Chandar wrote:

Hi guys,

Saw a post around HiveIO on the users list with a PR followup. I am
interested in this too and can pitch in on developement and testing..

Who & where is this work happening?

Thanks
VInoth



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


[INFO] Spark runner build is failing

2016-12-05 Thread Jean-Baptiste Onofré

Hi guys,

The latest commit on the Spark runner broke its build:

commit 158378f0f682b80462b917002b895ddbf782d06d
Date:   Sat Dec 3 00:47:39 2016 +0200

This commit introduced a failed test:

Failed tests:
  ResumeFromCheckpointStreamingTest.testRun:131->runAgain:142->run:169 
Success aggregator should be greater than zero.

Expected: not <0>
 but: was <0>

I'm fixing that asap.

Sorry about that.

Regards
JB
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


[DISCUSS] ExecIO

2016-12-05 Thread Jean-Baptiste Onofré

Hi beamers,

Today, Beam is mainly focused on data processing.
Since the beginning of the project, we have been discussing extending
the use case coverage via DSLs and extensions (like for machine
learning), or via IO.


Especially for the IO, we can see Beam used for data integration and data
ingestion.


In this area, I'm proposing a first IO: ExecIO:

https://issues.apache.org/jira/browse/BEAM-1059
https://github.com/apache/incubator-beam/pull/1451

Actually, this IO is mainly an ExecFn that executes system commands
(again, keep in mind we are discussing data integration/ingestion
and not data processing).


For convenience, this ExecFn is wrapped in Read and Write (as a regular IO).

Clearly, this IO/Fn depends on the worker where it runs. But that's under
the user's responsibility.


During the review, Eugene and I discussed about:
- is it an IO or just a fn ?
- is it OK to have worker specific IO ?

IMHO, an IO makes a lot of sense to me and it's very convenient for end
users. They can do something like:

PCollection<String> output =
pipeline.apply(ExecIO.read().withCommand("/path/to/myscript.sh"));


The pipeline will execute myscript, and the output PCollection will contain
the command execution stdout/stderr.


On the other hand, they can do:

pcollection.apply(ExecIO.write());

where PCollection contains the commands to execute.

Generally speaking, end users can call ExecFn wherever they want in the 
pipeline steps:


PCollection<String> output = pipeline.apply(ParDo.of(new ExecIO.ExecFn()));

The input collection contains the commands to execute, and the output
collection contains the command execution results (stdout/stderr).
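As an illustration only (a sketch of the idea, not the code in the PR), the
heart of such an ExecFn could look like this:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.beam.sdk.transforms.DoFn;

// Sketch: each input element is a shell command; the output element is
// the command's combined stdout/stderr.
public class ExecFn extends DoFn<String, String> {
  @ProcessElement
  public void processElement(ProcessContext c) throws IOException, InterruptedException {
    Process process = new ProcessBuilder("sh", "-c", c.element())
        .redirectErrorStream(true) // merge stderr into stdout
        .start();
    StringBuilder output = new StringBuilder();
    try (BufferedReader reader = new BufferedReader(
        new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        output.append(line).append('\n');
      }
    }
    int exitCode = process.waitFor();
    if (exitCode != 0) {
      throw new IOException("Command exited with code " + exitCode);
    }
    c.output(output.toString());
  }
}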


Generally speaking, I'm preparing several IOs focused more on the data
integration/ingestion area than on "pure" classic big data processing. I
think it would give a new "dimension" to Beam.


Thoughts ?

Regards
JB
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: HBase IO

2016-12-04 Thread Jean-Baptiste Onofré

By the way, I will share a branch with a preview of the HBase IO.

Regards
JB

On 12/05/2016 08:10 AM, Jean-Baptiste Onofré wrote:

Hi,

Ismaël started to experiment and PoC a HBaseIO.

As a workaround while waiting for the IO (even if it won't provide all
features), you can use your own DoFn.

Regards
JB

On 11/28/2016 08:18 AM, 钱爽(子颢) wrote:

Hello, I'm using Beam in my program; is the HBase IO underway? Thank
you!





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: HBase IO

2016-12-04 Thread Jean-Baptiste Onofré

Hi,

Ismaël started to experiment and PoC a HBaseIO.

As a workaround while waiting for the IO (even if it won't provide all
features), you can use your own DoFn.


Regards
JB

On 11/28/2016 08:18 AM, 钱爽(子颢) wrote:

Hello, I'm using Beam in my program; is the HBase IO underway? Thank you!



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Introduction + new contributions

2016-12-04 Thread Jean-Baptiste Onofré

I will do the review.

For BEAM-51, I'm not sure, as it could be related to the dataformat
discussion. Maybe you can participate in that discussion.


Regards
JB

On 12/04/2016 10:56 AM, Vladisav Jelisavcic wrote:

Thanks!

Could someone please review the PR
and assign me the next one: BEAM-51?

Regards,
Vladisav



On Sun, Dec 4, 2016 at 6:33 AM, Jean-Baptiste Onofré 
wrote:


Welcome aboard !

Regards
JB


On 12/03/2016 03:08 PM, Vladisav Jelisavcic wrote:


Hi,

my name is Vladisav, and I would like to get involved in Apache Beam.
For starters, I'll do something simple, e.g.: BEAM-961

Best regards,
Vladisav



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Review on Jira for 0.4.0-incubating

2016-12-03 Thread Jean-Baptiste Onofré

Very good point Frances.

Definitely something we have to do.

Regards
JB

On 12/04/2016 07:38 AM, Frances Perry wrote:

Sounds great, JB!

The major blocker in my opinion is to finish the polishing pass on the
quickstarts and example archetypes, so that users will have a great
experience trying out 0.4.0-incubating. I know we've made some significant
progress there in the last few weeks, but I don't think we've quite
finished. For example, https://issues.apache.org/jira/browse/BEAM-909 is
unresolved and marked as 0.4.0-incubating.

On Sat, Dec 3, 2016 at 10:26 PM, Jean-Baptiste Onofré 
wrote:


Hi beamers,

We plan a 0.4.0-incubating release pretty soon. I propose to manage this
release.

I started to review the Jira with fix version set to 0.4.0-incubating.

Please update the fix version in Jira if you are working on a specific Jira
and you want it included in the 0.4.0-incubating release.

Thanks
Regards
JB
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Review on Jira for 0.4.0-incubating

2016-12-03 Thread Jean-Baptiste Onofré

Hi beamers,

We plan a 0.4.0-incubating release pretty soon. I propose to manage this 
release.


I started to review the Jira with fix version set to 0.4.0-incubating.

Please update the fix version in Jira if you are working on a specific
Jira and you want it included in the 0.4.0-incubating release.


Thanks
Regards
JB
--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Introduction + new contributions

2016-12-03 Thread Jean-Baptiste Onofré

Welcome aboard !

Regards
JB

On 12/03/2016 03:08 PM, Vladisav Jelisavcic wrote:

Hi,

my name is Vladisav, and I would like to get involved in Apache Beam.
For starters, I'll do something simple, e.g.: BEAM-961

Best regards,
Vladisav



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Unable to clone beam site

2016-12-01 Thread Jean-Baptiste Onofré

Hi Sandeep,

It works fine for me:

git clone https://github.com/apache/incubator-beam-site.git

I have the warning, then go into the incubator-beam-site folder and do:

git checkout asf-site

Regards
JB

On 12/01/2016 10:14 AM, Sandeep Deshmukh wrote:

Hi,

I am trying to clone beam site but getting following error:

git clone https://github.com/apache/incubator-beam-site.git
Cloning into 'incubator-beam-site'...
remote: Counting objects: 10184, done.
remote: Compressing objects: 100% (116/116), done.
remote: Total 10184 (delta 60), reused 0 (delta 0), pack-reused 10011
Receiving objects: 100% (10184/10184), 28.05 MiB | 265.00 KiB/s, done.
Resolving deltas: 100% (7529/7529), done.
Checking connectivity... done.
warning: remote HEAD refers to nonexistent ref, unable to checkout.

I attempted it a couple of times with the same error.

I could clone my fork though.

Regards,
Sandeep



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Add DistributedLog IO

2016-11-30 Thread Jean-Baptiste Onofré

Thanks !

I will do the review today.

Regards
JB

On 11/30/2016 01:04 PM, Khurrum Nasim wrote:

 I just sent a pull request for adding a bounded source to Beam for reading
distributedlog streams - https://github.com/apache/incubator-beam/pull/1464

Appreciate any review comments.

- KN

On Wed, Aug 31, 2016 at 2:10 AM, Jean-Baptiste Onofré 
wrote:


Hi Khurrum,

I already replied in the Jira this morning.

To write the IO, the first question is bounded or unbounded and which
features you want to provide.

An IO could be a wrapper around a simple DoFn.

If you want to provide advanced features like:
- watermark/skew management for unbounded source
- estimated size and split for bounded source
then you can use the Source API.

You can take a look on the existing IO:
- JMS, Kafka, PubSub for unbounded
- Bigtable, MongoDB for bounded

We are preparing some documentation on the Beam website about that.

In the mean time, you can take a look on the Dataflow Custom IO
documentation:

https://cloud.google.com/dataflow/model/custom-io-java

It's basically the same as in Beam.

Anyway, please, let me know, I would be more than happy to help you on
this !

I'm looking forward working with you on this !

Regards
JB


On 08/31/2016 11:02 AM, Khurrum Nasim wrote:


Hello beam folks,

We are evaluating a new solution to unify our streaming and batching data
pipeline, from storage and computing engine to programming model. The idea is
basically to implement the Kappa architecture, using DistributedLog as a
unified stream store for both streaming and batching, using Flink or Spark
(still debating) as the processing engine, and using Beam as the programming
model.

We'd like to contribute an IO connector to DistributedLog (both bounded
source/sink and unbounded source/sink).

Are there any special instructions or best practices for adding a new IO
connector? Any suggestions are very appreciated.

The jira is here: https://issues.apache.org/jira/browse/BEAM-607

Also, /cc the DistributedLog team for any help.

KN



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Meet up at Strata+Hadoop World in Singapore

2016-11-30 Thread Jean-Baptiste Onofré

Lucky you guys ;)

Unfortunately, I won't be there.

See you soon.

Regards
JB

On 11/30/2016 08:07 AM, Aljoscha Krettek wrote:

Hi,
I'll also be there to give a talk (and also at the Beam tutorial).

Cheers,
Aljoscha

On Wed, Nov 30, 2016, 00:51 Dan Halperin  wrote:


Hey folks,

Who will be attending Strata+Hadoop World next week in Singapore? Tyler and
I will be there, giving a Beam tutorial [0] and some talks [2,3].

I'd love to sync in person with anyone who wants to talk Beam. Please reach
out to me directly if you'd like to meet.

Thanks!
Dan

[0]

http://conferences.oreilly.com/strata/hadoop-big-data-sg/public/schedule/detail/54331
[1]

http://conferences.oreilly.com/strata/hadoop-big-data-sg/public/schedule/detail/54343
[2]

http://conferences.oreilly.com/strata/hadoop-big-data-sg/public/schedule/detail/54325
(Slava Chernyak, our Google colleague)





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: PCollection to PCollection Conversion

2016-11-29 Thread Jean-Baptiste Onofré

Hi Jesse,

yes, I started something there (using JAXB and Jackson). Let me polish 
and push.


Regards
JB

On 11/29/2016 10:00 PM, Jesse Anderson wrote:

I went through the string conversions. Do you have an example of writing
out XML/JSON/etc too?

On Tue, Nov 29, 2016 at 3:46 PM Jean-Baptiste Onofré 
wrote:


Hi Jesse,


https://github.com/jbonofre/incubator-beam/tree/DATAFORMAT/sdks/java/extensions/dataformat

it's very simple and stupid and of course not complete at all (I have
other commits, not yet merged, as they need some polishing), but as I
said, it's a basis for discussion.

Regards
JB

On 11/29/2016 09:23 PM, Jesse Anderson wrote:

@jb Sounds good. Just let us know once you've pushed.

On Tue, Nov 29, 2016 at 2:54 PM Jean-Baptiste Onofré 
wrote:


Good point Eugene.

Right now, it's a DoFn collection to experiment a bit (a pure
extension). It's pretty stupid ;)

But, you are right: depending on the direction of such an extension, it could
cover more use cases (even if it's not my first intention ;)).

Let me push the branch (pretty small) as an illustration, and in the
mean time, I'm preparing a document (more focused on the use cases).

WDYT ?

Regards
JB

On 11/29/2016 08:47 PM, Eugene Kirpichov wrote:

Hi JB,
Depending on the scope of what you want to ultimately accomplish with this
extension, I think it may make sense to write a proposal document and
discuss it.
If it's just a collection of utility DoFn's for various well-defined
source/target format pairs, then that's probably not needed, but if it's
anything more, then I think it is.
That will help avoid a lot of churn if people propose reasonable
significant changes.

On Tue, Nov 29, 2016 at 11:15 AM Jean-Baptiste Onofré 


wrote:


By the way Jesse, I'm gonna push my DATAFORMAT branch to my github and I
will post on the dev mailing list when done.

Regards
JB

On 11/29/2016 07:01 PM, Jesse Anderson wrote:

I want to bring this thread back up since we've had time to think about it
more and make a plan.

I think a format-specific converter will be a more time-consuming task than
we originally thought. It'd have to be a writer that takes another writer
as a parameter.

I think a string converter can be done as a simple transform.

I think we should start with a simple string converter and plan for a
format-specific writer.

What are your thoughts?

Thanks,

Jesse

On Thu, Nov 10, 2016 at 10:33 AM Jesse Anderson <je...@smokinghand.com> wrote:

I was thinking about what the outputs would look like last night. I
realized that more complex formats like JSON and XML may or may not output
the data in a valid format.

Doing a direct conversion on unbounded collections would work just fine.
They're self-contained. For writing out bounded collections, that's where
we'll hit the issues. This changes the uber conversion transform into a
transform that needs to be a writer.

If a transform executes a JSON conversion on a per-element basis, we'd get
this:
{
"key": "value"
}, {
"key": "value"
},

That isn't valid JSON.

The conversion transform would need to do several things when writing
out a file. It would need to add brackets for an array. Now we have:
[
{
"key": "value"
}, {
"key": "value"
},
]

We still don't have valid JSON. We have to remove the last comma or have
the uber transform start putting in the commas, except for the last
element.


[
{
"key": "value"
}, {
"key": "value"
}
]

Only by doing this do we have valid JSON.

I'd argue we'd have a similar issue with XML. Some parsers require a root
element for everything. The uber transform would have to put the root
element tags at the beginning and end of the file.

On Wed, Nov 9, 2016 at 11:36 PM Manu Zhang wrote:


I would love to see a lean core and abundant Transforms at the same time.


Maybe we can look at what Confluent <https://github.com/confluentinc> does
for kafka-connect. They have official extensions support for JDBC, HDFS and
ElasticSearch under https://github.com/confluentinc. They put them along
with other community extensions on
https://www.confluent.io/product/connectors/ for visibility.

Although not a commercial company, can we have a GitHub user like
beam-community to host projects we build around beam but not suitable for
https://github.com/apache/incubator-beam. In the future, we may have
beam-algebra like http://github.com/twitter/algebird for algebra operations
and beam-ml / beam-dl for machine learning / deep learning. Also, there
will be beam related projects elsewhere maintained by other
communities. We can put all of them on the beam-website or like spark
packages as mentioned by Amit.

My $0.02
Manu



On Thu, Nov 10, 2016 at 2:59 AM Kenneth Knowles wrote:


On this point from Amit and Ismaël, I agree: we could benefit from a place
for miscellaneous non-core helper transformations.

Re: PCollection to PCollection Conversion

2016-11-29 Thread Jean-Baptiste Onofré

Hi Jesse,

https://github.com/jbonofre/incubator-beam/tree/DATAFORMAT/sdks/java/extensions/dataformat

it's very simple and stupid and of course not complete at all (I have 
other commits but not merged as they need some polishing), but as I 
said, it's a base of discussion.


Regards
JB

On 11/29/2016 09:23 PM, Jesse Anderson wrote:

@jb Sounds good. Just let us know once you've pushed.

On Tue, Nov 29, 2016 at 2:54 PM Jean-Baptiste Onofré 
wrote:


Good point Eugene.

Right now, it's a DoFn collection to experiment a bit (a pure
extension). It's pretty stupid ;)

But, you are right: depending on the direction of such an extension, it could
cover more use cases (even if it's not my first intention ;)).

Let me push the branch (pretty small) as an illustration, and in the
mean time, I'm preparing a document (more focused on the use cases).

WDYT ?

Regards
JB

On 11/29/2016 08:47 PM, Eugene Kirpichov wrote:

Hi JB,
Depending on the scope of what you want to ultimately accomplish with this
extension, I think it may make sense to write a proposal document and
discuss it.
If it's just a collection of utility DoFn's for various well-defined
source/target format pairs, then that's probably not needed, but if it's
anything more, then I think it is.
That will help avoid a lot of churn if people propose reasonable
significant changes.

On Tue, Nov 29, 2016 at 11:15 AM Jean-Baptiste Onofré 
wrote:


By the way Jesse, I'm gonna push my DATAFORMAT branch to my github and I
will post on the dev mailing list when done.

Regards
JB

On 11/29/2016 07:01 PM, Jesse Anderson wrote:

I want to bring this thread back up since we've had time to think about it
more and make a plan.

I think a format-specific converter will be a more time-consuming task than
we originally thought. It'd have to be a writer that takes another writer
as a parameter.

I think a string converter can be done as a simple transform.

I think we should start with a simple string converter and plan for a
format-specific writer.

What are your thoughts?

Thanks,

Jesse

On Thu, Nov 10, 2016 at 10:33 AM Jesse Anderson wrote:

I was thinking about what the outputs would look like last night. I
realized that more complex formats like JSON and XML may or may not output
the data in a valid format.

Doing a direct conversion on unbounded collections would work just fine.
They're self-contained. For writing out bounded collections, that's where
we'll hit the issues. This changes the uber conversion transform into a
transform that needs to be a writer.

If a transform executes a JSON conversion on a per-element basis, we'd get
this:
{
"key": "value"
}, {
"key": "value"
},

That isn't valid JSON.

The conversion transform would need to do several things when writing
out a file. It would need to add brackets for an array. Now we have:
[
{
"key": "value"
}, {
"key": "value"
},
]

We still don't have valid JSON. We have to remove the last comma or have
the uber transform start putting in the commas, except for the last
element.


[
{
"key": "value"
}, {
"key": "value"
}
]

Only by doing this do we have valid JSON.

I'd argue we'd have a similar issue with XML. Some parsers require a root
element for everything. The uber transform would have to put the root
element tags at the beginning and end of the file.

On Wed, Nov 9, 2016 at 11:36 PM Manu Zhang wrote:


I would love to see a lean core and abundant Transforms at the same time.

Maybe we can look at what Confluent <https://github.com/confluentinc> does
for kafka-connect. They have official extensions support for JDBC, HDFS and
ElasticSearch under https://github.com/confluentinc. They put them along
with other community extensions on
https://www.confluent.io/product/connectors/ for visibility.

Although not a commercial company, can we have a GitHub user like
beam-community to host projects we build around beam but not suitable for
https://github.com/apache/incubator-beam. In the future, we may have
beam-algebra like http://github.com/twitter/algebird for algebra operations
and beam-ml / beam-dl for machine learning / deep learning. Also, there
will be beam related projects elsewhere maintained by other
communities. We can put all of them on the beam-website or like spark
packages as mentioned by Amit.

My $0.02
Manu



On Thu, Nov 10, 2016 at 2:59 AM Kenneth Knowles wrote:


On this point from Amit and Ismaël, I agree: we could benefit from a place
for miscellaneous non-core helper transformations.

We have sdks/java/extensions but it is organized as separate artifacts. I
think that is fine, considering the nature of Join and SortValues. But for
simpler transforms, importing one artifact per tiny transform is too much
overhead. It also

Re: PCollection to PCollection Conversion

2016-11-29 Thread Jean-Baptiste Onofré

Good point Eugene.

Right now, it's a DoFn collection to experiment a bit (a pure 
extension). It's pretty stupid ;)


But, you are right, depending the direction of such extension, it could 
cover more use cases (even if it's not my first intention ;)).


Let me push the branch (pretty small) as an illustration, and in the 
mean time, I'm preparing a document (more focused on the use cases).


WDYT ?

Regards
JB

On 11/29/2016 08:47 PM, Eugene Kirpichov wrote:

Hi JB,
Depending on the scope of what you want to ultimately accomplish with this
extension, I think it may make sense to write a proposal document and
discuss it.
If it's just a collection of utility DoFn's for various well-defined
source/target format pairs, then that's probably not needed, but if it's
anything more, then I think it is.
That will help avoid a lot of churn if people propose reasonably
significant changes.

On Tue, Nov 29, 2016 at 11:15 AM Jean-Baptiste Onofré 
wrote:


By the way Jesse, I'm going to push my DATAFORMAT branch to my GitHub and I
will post on the dev mailing list when done.

Regards
JB

On 11/29/2016 07:01 PM, Jesse Anderson wrote:

I want to bring this thread back up since we've had time to think about it
more and make a plan.

I think a format-specific converter will be a more time-consuming task than
we originally thought. It'd have to be a writer that takes another writer
as a parameter.

I think a string converter can be done as a simple transform.

I think we should start with a simple string converter and plan for a
format-specific writer.

What are your thoughts?

Thanks,

Jesse

On Thu, Nov 10, 2016 at 10:33 AM Jesse Anderson 
wrote:

I was thinking about what the outputs would look like last night. I
realized that more complex formats like JSON and XML may or may not output
the data in a valid format.

Doing a direct conversion on unbounded collections would work just fine.
They're self-contained. For writing out bounded collections, that's where
we'll hit the issues. This changes the uber conversion transform into a
transform that needs to be a writer.

If a transform executes a JSON conversion on a per-element basis, we'd get
this:
{
"key": "value"
}, {
"key": "value"
},

That isn't valid JSON.

The conversion transform would need to do several things when writing
out a file. It would need to add brackets for an array. Now we have:
[
{
"key": "value"
}, {
"key": "value"
},
]

We still don't have valid JSON. We have to remove the last comma or have
the uber transform start putting in the commas, except for the last
element.


[
{
"key": "value"
}, {
"key": "value"
}
]

Only by doing this do we have valid JSON.

I'd argue we'd have a similar issue with XML. Some parsers require a root
element for everything. The uber transform would have to put the root
element tags at the beginning and end of the file.
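
To make the bounded-collection case concrete, here is a minimal sketch of
the joining such an "uber" writer would have to do (this is not code from
the thread; Element and its toJson() method are hypothetical stand-ins):

import java.util.Iterator;

// Hedged sketch: join per-element JSON fragments into one valid JSON array,
// adding the surrounding brackets and putting commas between elements but
// not after the last one. "Element" and toJson() are hypothetical.
static String toJsonArray(Iterable<Element> elements) {
  StringBuilder out = new StringBuilder("[\n");
  Iterator<Element> it = elements.iterator();
  while (it.hasNext()) {
    out.append(it.next().toJson());
    if (it.hasNext()) {
      out.append(",");
    }
    out.append("\n");
  }
  out.append("]");
  return out.toString();
}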

On Wed, Nov 9, 2016 at 11:36 PM Manu Zhang 

wrote:


I would love to see a lean core and abundant Transforms at the same time.

Maybe we can look at what Confluent <https://github.com/confluentinc> does
for kafka-connect. They have official extensions support for JDBC, HDFS and
ElasticSearch under https://github.com/confluentinc. They put them along
with other community extensions on
https://www.confluent.io/product/connectors/ for visibility.

Although not a commercial company, can we have a GitHub user like
beam-community to host projects we build around Beam but not suitable for
https://github.com/apache/incubator-beam. In the future, we may have
beam-algebra like http://github.com/twitter/algebird for algebra operations
and beam-ml / beam-dl for machine learning / deep learning. Also, there
will be Beam-related projects elsewhere maintained by other communities. We
can put all of them on the Beam website, or list them like Spark packages
as mentioned by Amit.

My $0.02
Manu



On Thu, Nov 10, 2016 at 2:59 AM Kenneth Knowles 
wrote:


On this point from Amit and Ismaël, I agree: we could benefit from a place
for miscellaneous non-core helper transformations.

We have sdks/java/extensions but it is organized as separate artifacts. I
think that is fine, considering the nature of Join and SortValues. But for
simpler transforms, importing one artifact per tiny transform is too much
overhead. It also seems unlikely that we will have enough commonality among
the transforms to call the artifact anything other than [some synonym for]
"miscellaneous".

I wouldn't want to take this too far - even though the SDK has many
transforms that are not required for the model [1], I like that the SDK
artifact has everything a user might need in their "getting started" phase
of use. This user-friendliness (the user doesn't care that ParDo is core
and Sum is


Re: PCollection to PCollection Conversion

2016-11-29 Thread Jean-Baptiste Onofré
at this moment it would be better for these transforms to reside in the
Beam repository at least for visibility reasons.

One additional question is if these transforms represent a different DSL or
if those could be grouped with the current extensions (e.g. Join and
SortValues) into something more general that we as a community could
maintain, but well even if it is not the case, it would be really nice to
start working on something like this.

Ismaël Mejía


On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste Onofré 
wrote:


Related to spark-package, we also have Apache Bahir to host
connectors/transforms for Spark and Flink.

IMHO, right now, Beam should host this, not sure if it makes sense
directly in the core.

It reminds me the "Integration" DSL we discussed in the technical

vision

document.

Regards
JB


On 11/09/2016 11:17 AM, Amit Sela wrote:


I think Jesse has a very good point on one hand, while Luke's and
Kenneth's worries about committing users to specific implementations are
also in place.


The Spark community has a 3rd party repository for useful libraries that
for various reasons are not a part of the Apache Spark project:
https://spark-packages.org/.

Maybe a "common-transformations" package would serve both users quick
ramp-up and ease-of-use while keeping Beam more "enabling" ?

On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles




wrote:

It seems useful for small-scale debugging / demoing to have
Dump.toString(). I think it should be named to clearly indicate its limited
scope. Maybe other stuff could go in the Dump namespace, but
"Dump.toJson()" would be for humans to read - so it should be pretty
printed, not treated as a machine-to-machine wire format.

The broader question of representing data in JSON or XML, etc, is already
the subject of many mature libraries which are already easy to use with
Beam.

The more esoteric practice of implicit or semi-implicit coercions seems
like it is also already addressed in many ways elsewhere.
Transform.via(TypeConverter) is basically the same as
MapElements.via() and also easy to use with Beam.
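
For instance, a Dump.toString()-style helper is only a few lines on top of
MapElements. A minimal sketch - "Dump" and kvToString() are hypothetical
names, not an existing Beam API:

import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;
import org.apache.beam.sdk.values.KV;

// Hedged sketch of a small debugging/demoing helper.
public class Dump {
  public static <K, V> MapElements<KV<K, V>, String> kvToString() {
    return MapElements.via(new SimpleFunction<KV<K, V>, String>() {
      @Override
      public String apply(KV<K, V> kv) {
        // Human-readable output, not a machine-to-machine wire format.
        return kv.getKey() + ": " + kv.getValue();
      }
    });
  }
}

A pipeline would then use it as collection.apply(Dump.kvToString()).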

In both of the last cases, there are many reasonable approaches, and

we

shouldn't commit our users to one of them.

On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik




wrote:

The suggestions you give seem good except for the XML cases.


Might want to have the XML be a document per line similar to the JSON
examples you have been giving.

On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson <

je...@smokinghand.com>

wrote:

@lukasz Agreed there would have to be KV handling. I was more thinking that
whatever the addition, it shouldn't just handle KV. It should handle
Iterables, Lists, Sets, and KVs.

For JSON and XML, I wonder if we'd be able to give someone something
general-purpose enough that you would just end up writing your own code to
handle it anyway.

Here are some ideas on what it could look like with a method and the
resulting string output:
*Stringify.toJSON()*

With KV:
{"key": "value"}

With Iterables:
["one", "two", "three"]

*Stringify.toXML("rootelement")*

With KV:


With Iterables:

  one
  two
  three


*Stringify.toDelimited(",")*

With KV:
key,value

With Iterables:
one,two,three

Do you think that would strike a good balance between reusable code and
writing your own for more difficult formatting?

Thanks,

Jesse

On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik




wrote:

Jesse, I believe if one format gets special treatment in TextIO, people
will then ask why JSON, XML, ... aren't also supported.

Also, the example that you provide is using the fact that the input format
is an Iterable. You had posted a question about using KV with
TextIO.Write which wouldn't align with the proposed input format and would
still require writing a type conversion function, this time from KV to
Iterable instead of KV to string.

On Tue, Nov 8, 2016 at 9:50 AM, Jesse Anderson <

je...@smokinghand.com>

wrote:

Lukasz,

I don't think you'd need complicated logic for TextIO.Write. For CSV the
call would look like:

Stringify.to("", ",", "\n");

Where the arguments would be Stringify.to(prefix, delimiter, suffix).


The code would be something like:

// Join the items with the delimiter between them (but not after the last
// one); the prefix starts the line and the suffix ends it.
StringBuilder buffer = new StringBuilder(prefix);
Iterator<Item> it = list.iterator();
while (it.hasNext()) {
  buffer.append(it.next().toString());
  if (it.hasNext()) {
    buffer.append(delimiter);
  }
}
buffer.append(suffix);

c.output(buffer.toString());

That would allow you to do the basic CSV, TSV, and other formats without
complicated logic. The same sort of thing could be done for TextIO.Write.





Thanks,

Jesse

On Tue, Nov 8, 2016 at 10:30 AM Lukasz Cwik






wrote:


The conversion from object to string will have uses outside of just
TextIO.Write so it seems logical that we would 

Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

2016-11-29 Thread Jean-Baptiste Onofré

Hi Pei,

rethinking about that, I understand that the purpose of the Beam
filesystem is to avoid bringing a bunch of dependencies into the core.
That makes perfect sense.


So, I agree that a Beam filesystem abstract is fine.

My point is that we should provide a HadoopFilesystem extension/plugin
for the Beam filesystem asap: that would help us support a good range of
filesystems quickly.


Just my $0.01 ;)

Regards
JB

On 11/17/2016 08:18 PM, Pei He wrote:

Hi JB,
My proposals are based on the current IOChannelFactory, and how they are
used in FileBasedSink.

Let me spend more time to investigate the Hadoop FileSystem interface.
--
Pei

On Thu, Nov 17, 2016 at 1:21 AM, Jean-Baptiste Onofré 
wrote:


By the way, Pei, for the record: why introduce BeamFileSystem instead of
using the Hadoop FileSystem interface ?

Thanks
Regards
JB

On 11/17/2016 01:09 AM, Pei He wrote:


Hi,

I am working on BEAM-59
<https://issues.apache.org/jira/browse/BEAM-59> "IOChannelFactory
redesign". The goals are:

1. Support file-based IOs (TextIO, AvroIO) with user-defined file systems.

2. Support configuring any user-defined file system.

And, I drafted the design proposal in two parts to address them in order:

Part 1: IOChannelFactory Redesign
<https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJ
sVG3qel2lhdKTknmZ_7M/edit#>

Summary:

Old API: WritableByteChannel create(String spec, String mimeType);

New API: WritableByteChannel create(URI uri, CreateOptions options);

Noticeable proposed changes:

1. Include the options parameter in most methods to specify behaviors.
2. Replace String with URI to include the scheme for file/directory
   locations.
3. Require file systems to provide a SeekableByteChannel for reads.
4. Additional methods, such as getMetadata(), rename(), etc.


Part 2: Configurable BeamFileSystem
<https://docs.google.com/document/d/1-7vo9nLRsEEzDGnb562PuL4
q9mUiq_ZVpCAiyyJw8p8/edit#heading=h.p3gc3colc2cs>

Summary:

Old API: IOChannelUtils.getFactory(glob).match(glob);

New API: BeamFileSystems.getFileSystem(glob, config).match(glob);
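
To make the summary concrete, here is a rough sketch of the shape such an
interface could take. It is inferred only from the proposed changes listed
above; CreateOptions, Metadata, and the exact method set are assumptions,
not the proposal text:

import java.io.IOException;
import java.net.URI;
import java.nio.channels.SeekableByteChannel;
import java.nio.channels.WritableByteChannel;

// Hedged sketch inferred from the summary; CreateOptions and Metadata are
// placeholders for whatever the design documents actually define.
public interface BeamFileSystem {
  interface CreateOptions {}
  interface Metadata {}

  // New create API: the URI carries the scheme, the options specify behaviors.
  WritableByteChannel create(URI uri, CreateOptions options) throws IOException;

  // File systems are required to provide a seekable channel for reads.
  SeekableByteChannel open(URI uri) throws IOException;

  // Additional methods mentioned in the proposal.
  Metadata getMetadata(URI uri) throws IOException;
  void rename(URI source, URI destination) throws IOException;
}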


Looking for comments and feedback.

Thanks

--

Pei



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: PCollection to PCollection Conversion

2016-11-29 Thread Jean-Baptiste Onofré
transforms that are not core enough to be part of the SDK, but that we all
end up re-writing somehow.

This is a needed improvement to be more developer friendly, but also as a
reference of good practices of Beam development, and for this reason I
agree with JB that at this moment it would be better for these transforms
to reside in the Beam repository at least for visibility reasons.

One additional question is if these transforms represent a different DSL or
if those could be grouped with the current extensions (e.g. Join and
SortValues) into something more general that we as a community could
maintain, but well even if it is not the case, it would be really nice to
start working on something like this.

Ismaël Mejía


On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste Onofré 
wrote:


Related to spark-package, we also have Apache Bahir to host
connectors/transforms for Spark and Flink.

IMHO, right now, Beam should host this, not sure if it makes sense
directly in the core.

It reminds me the "Integration" DSL we discussed in the technical

vision

document.

Regards
JB


On 11/09/2016 11:17 AM, Amit Sela wrote:


I think Jesse has a very good point on one hand, while Luke's and
Kenneth's worries about committing users to specific implementations are
also in place.


The Spark community has a 3rd party repository for useful libraries that
for various reasons are not a part of the Apache Spark project:
https://spark-packages.org/.

Maybe a "common-transformations" package would serve both users quick
ramp-up and ease-of-use while keeping Beam more "enabling" ?

On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles




wrote:

It seems useful for small-scale debugging / demoing to have
Dump.toString(). I think it should be named to clearly indicate its limited
scope. Maybe other stuff could go in the Dump namespace, but
"Dump.toJson()" would be for humans to read - so it should be pretty
printed, not treated as a machine-to-machine wire format.

The broader question of representing data in JSON or XML, etc, is already
the subject of many mature libraries which are already easy to use with
Beam.

The more esoteric practice of implicit or semi-implicit coercions seems
like it is also already addressed in many ways elsewhere.
Transform.via(TypeConverter) is basically the same as
MapElements.via() and also easy to use with Beam.

In both of the last cases, there are many reasonable approaches, and

we

shouldn't commit our users to one of them.

On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik




wrote:

The suggestions you give seem good except for the XML cases.


Might want to have the XML be a document per line similar to the JSON
examples you have been giving.

On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson <

je...@smokinghand.com>

wrote:

@lukasz Agreed there would have to be KV handling. I was more thinking that
whatever the addition, it shouldn't just handle KV. It should handle
Iterables, Lists, Sets, and KVs.

For JSON and XML, I wonder if we'd be able to give someone something
general-purpose enough that you would just end up writing your own code to
handle it anyway.

Here are some ideas on what it could look like with a method and the
resulting string output:
*Stringify.toJSON()*

With KV:
{"key": "value"}

With Iterables:
["one", "two", "three"]

*Stringify.toXML("rootelement")*

With KV:


With Iterables:

  one
  two
  three


*Stringify.toDelimited(",")*

With KV:
key,value

With Iterables:
one,two,three

Do you think that would strike a good balance between reusable code and
writing your own for more difficult formatting?

Thanks,

Jesse

On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik




wrote:

Jesse, I believe if one format gets special treatment in TextIO, people
will then ask why JSON, XML, ... aren't also supported.

Also, the example that you provide is using the fact that the input format
is an Iterable. You had posted a question about using KV with
TextIO.Write which wouldn't align with the proposed input format and would
still require writing a type conversion function, this time from KV to
Iterable instead of KV to string.

On Tue, Nov 8, 2016 at 9:50 AM, Jesse Anderson <

je...@smokinghand.com>

wrote:

Lukasz,

I don't think you'd need complicated logic for TextIO.Write. For CSV the
call would look like:

Stringify.to("", ",", "\n");

Where the arguments would be Stringify.to(prefix, delimiter, suffix).


The code would be something like:

// Join the items with the delimiter between them (but not after the last
// one); the prefix starts the line and the suffix ends it.
StringBuilder buffer = new StringBuilder(prefix);
Iterator<Item> it = list.iterator();
while (it.hasNext()) {
  buffer.append(it.next().toString());
  if (it.hasNext()) {
    buffer.append(delimiter);
  }
}
buffer.append(suffix);

c.output(buffer.toString());

That would allow you to do the basic CSV, TSV, and other formats without
complicated logic.

Re: DoFn relying on Microservices

2016-11-25 Thread Jean-Baptiste Onofré

Hi Sergio,

By the way, you can also use TensorFrames, allowing you to use TensorFlow
directly with Spark DataFrames, with more direct access. I discussed that
with Tim Hunter from Databricks, who's working on TensorFrames.


Back on Beam, what you could do:

1. you expose the service on a microservice container (for instance 
Apache Karaf ;))

In your pipeline, you have two options:
2.a. in your Beam pipeline, in a DoFn, in the @Setup you can create the 
REST client (using CXF, or whatever), and in the @ProcessElement you can 
use the service (hosted by Karaf)
2.b. I also have a RestIO (source and sink) that can request a REST
endpoint. However, for now, this IO acts as a pipeline endpoint (a source
PTransform or a sink PTransform). In your case, if the service called is a
step of your pipeline, ParDo(your DoFn) would be easier.
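
A minimal sketch of option 2.a - RestClient, its call()/close() methods,
and the endpoint URL are hypothetical placeholders for whatever CXF or
another client library provides:

import org.apache.beam.sdk.transforms.DoFn;

// Hedged sketch: create the client once per DoFn instance in @Setup,
// call the microservice per element, close the client in @Teardown.
class CallServiceFn extends DoFn<String, String> {
  private transient RestClient client;  // hypothetical client type

  @Setup
  public void setup() {
    client = new RestClient("http://karaf-host:8181/service");  // made-up endpoint
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    c.output(client.call(c.element()));
  }

  @Teardown
  public void teardown() {
    client.close();
  }
}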


Is it what you mean by microservice ?

Regards
JB

On 11/25/2016 01:18 PM, Sergio Fernández wrote:

Hi JB,

On Tue, Nov 22, 2016 at 11:14 AM, Jean-Baptiste Onofré 
wrote:


DoFn will execute per element (with eventually a hook on StartBundle,
FinishBundle, and Teardown). It's basically the way it works in IO WriteFn: we
create the connection in StartBundle and send each element (with a batch)
to external resource.

PTransform is maybe more flexible in case of interact with "outside"
resources.



Probably PTransform would be a better place. I'm still pretty new to some
of the Beam terms and APIs.

Do you have use case to be sure I understand ?


Yes, Well, it's far more complex, but this question I can simplify it:

We have a TensorFlow-based classifier. In our pipeline one step performs
that classification of the data. Currently it's implemented as a Spark
Function, because TensorFlow models can directly be embedded within
pipelines using PySpark.

Therefore I'm looking for the best option to move such classification
process one level up in the abstraction with Beam, so I could make it
portable. The first idea I'm exploring is relying on a external function
(i.e., microservice) that I'd need to scale up and down independently of
the pipeline. So I'm more than welcome to discuss ideas ;-)

Thanks.

Cheers,




On 11/22/2016 10:39 AM, Sergio Fernández wrote:


Hi,

I'd like to resume the idea to have TensorFlow-based tasks running in a Beam
Pipeline. So far the cleaner approach I can imagine would be to have it
running outside (Functions in GCP, Lambdas in AWS, Microservices generally
speaking).

Therefore, does the current Beam model provide the sense of a DoFn which
actually runs externally?

Thanks in advance for the feedback.

Cheers,



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com

--
<http://www.talend.com>
<http://www.talend.com>
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e:  <http://www.talend.com>sergio.fernan...@redlink.co
w: http://redlink.co





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Hosting data stores for IO Transform testing

2016-11-23 Thread Jean-Baptiste Onofré
Ismael


On Tue, Nov 22, 2016 at 11:00 AM, Amit Sela 
wrote:


Hi Stephen,

I was wondering about how we plan to use the data stores across

executions.

Clearly, it's best to set up a new instance (container) for every test,
running a "standalone" store (say HBase/Cassandra for example), and once
the test is done, tear down the instance. It should also be agnostic to the
runtime environment (e.g., Docker on Kubernetes).
I'm wondering though what's the overhead of managing such a deployment
which could become heavy and complicated as more IOs are supported and more
test cases introduced.

Another way to go would be to have small clusters of different data stores
and run against new "namespaces" (while lazily evicting old ones), but I
think this is less likely as maintaining a distributed instance (even a
small one) for each data store sounds even more complex.

A third approach would be to to simply have an "embedded" in-memory
instance of a data store as part of a test that runs against it
(such as

an

embedded Kafka, though not a data store).
This is probably the simplest solution in terms of orchestration,
but it
depends on having a proper "embedded" implementation for an IO.

Does this make sense to you ? have you considered it ?

Thanks,
Amit

On Tue, Nov 22, 2016 at 8:20 AM Jean-Baptiste Onofré 
wrote:


Hi Stephen,

as already discussed a bit together, it sounds great ! Especially I

like

it as a both integration test platform and good coverage for IOs.

I'm very late on this but, as said, I will share with you my Marathon
JSON and Mesos docker images.

By the way, I started to experiment a bit with Kubernetes and Swarm but
it's not yet complete. I will share what I have on the same GitHub repo.

Thanks !
Regards
JB

On 11/16/2016 11:36 PM, Stephen Sisk wrote:

Hi everyone!

Currently we have a good set of unit tests for our IO Transforms - those
tend to run against in-memory versions of the data stores. However, we'd
like to further increase our test coverage to include running them against
real instances of the data stores that the IO Transforms work against (e.g.
cassandra, mongodb, kafka, etc…), which means we'll need to have real
instances of various data stores.

Additionally, if we want to do performance regression detection, it's
important to have instances of the services that behave realistically,
which isn't true of in-memory or dev versions of the services.


Proposed solution
-
If we accept this proposal, we would create an infrastructure for running
real instances of data stores inside of containers, using container
management software like mesos/marathon, kubernetes, docker swarm, etc… to
manage the instances.

This would enable us to build integration tests that run against those real
instances and performance tests that run against those real instances (like
those that Jason Kuster is proposing elsewhere.)


Why do we need one centralized set of instances vs just having various
people host their own instances?
-
Reducing flakiness of tests is key. By not having dependencies from the
core project on external services/instances of data stores we have
guaranteed access to the services and the group can fix issues that arise.

An exception would be something that has an ops team supporting it (eg,
AWS, Google Cloud or other professionally managed service) - those we trust
will be stable.


There may be a lot of different data stores needed - how will we maintain
them?
-
It will take work above and beyond that of a normal set of unit tests to
build and maintain integration/performance tests & their data store
instances.

Setup & maintenance of the data store containers and data store instances
on it must be automated. It also has to be as simple of a setup as
possible, and we should avoid hand tweaking the containers - expecting
checked in scripts/dockerfiles is key.

Aligned with the community ownership approach of Apache, as members of the
community are excited to contribute & maintain those tests and the
integration/performance tests, people will be able to step up and do that.
If there is no longer support for maintaining a particular set of
integration & performance tests and their data store instances, then we can
disable those tests. We may document on the website what IO Transforms have
current integration/performance tests so users know what level of testing
the various IO Transforms have.


What about requirements for the container management software itself?
-
* We should have the data store instances themselves in Docker. Docker
allows new instances to be spun up in a quick, reproducible way and is
fairly platform independent. It has wide support from a variety of
different container management services.
* As little admin work required as possible.

Re: Hosting data stores for IO Transform testing

2016-11-22 Thread Jean-Baptiste Onofré

Hi Ismaël,

FYI, we also test the IOs on spark and flink small clusters (not yet 
apex): it's where I'm using Mesos/Marathon.


It's not a large cluster, but the integration tests are performed (by 
hand) on clusters.


We already discussed with Stephan and Jason to use Marathon JSON and 
Mesos docker images bootstrapped by Jenkins for the itests.


Regards
JB

On 11/22/2016 04:58 PM, Ismaël Mejía wrote:

​Hello,

@Stephen Thanks for your proposal, it is really interesting, I would really
like to help with this. I have never played with Kubernetes but this seems
a really nice chance to do something useful with it.

We (at Talend) are testing most of the IOs using simple container images
and in some particular cases ‘clusters’ of containers using docker-compose
(a little bit like Amit’s (2) proposal). It would be really nice to have
this at the Beam level, in particular to try to test more complex
semantics, I don’t know how programmable kubernetes is to achieve this for
example:

Let’s think we have a cluster of Cassandra or Kafka nodes, I would like to
have programmatic tests to simulate failure (e.g. kill a node), or simulate
a really slow node, to ensure that the IO behaves as expected in the Beam
pipeline for the given runner.

Another related idea is to improve IO consistency: Today the different IOs
have small differences in their failure behavior, I really would like to be
able to predict with more precision what will happen in case of errors,
e.g. what is the correct behavior if I am writing to a Kafka node and there
is a network partition, does the Kafka sink retries or no ? and what if it
is the JdbcIO ?, will it work the same e.g. assuming checkpointing? Or do
we guarantee exactly once writes somehow?, today I am not sure about what
happens (or if the expected behavior depends on the runner), but well maybe
it is just that I don’t know and we have tests to ensure this.

Of course both are really hard problems, but I think with your proposal we
can try to tackle them, as well as the performance ones. And apart from the
data stores, I think it will be also really nice to be able to test the
runners in a distributed manner.

So what is the next step? How do you imagine such integration tests? Who
can provide the test machines so we can mount the cluster?

Maybe my ideas are a bit too far away for an initial setup, but it will be
really nice to start working on this.

Ismael​


On Tue, Nov 22, 2016 at 11:00 AM, Amit Sela  wrote:


Hi Stephen,

I was wondering about how we plan to use the data stores across executions.

Clearly, it's best to setup a new instance (container) for every test,
running a "standalone" store (say HBase/Cassandra for example), and once
the test is done, teardown the instance. It should also be agnostic to the
runtime environment (e.g., Docker on Kubernetes).
I'm wondering though what's the overhead of managing such a deployment
which could become heavy and complicated as more IOs are supported and more
test cases introduced.

Another way to go would be to have small clusters of different data stores
and run against new "namespaces" (while lazily evicting old ones), but I
think this is less likely as maintaining a distributed instance (even a
small one) for each data store sounds even more complex.

A third approach would be to to simply have an "embedded" in-memory
instance of a data store as part of a test that runs against it (such as an
embedded Kafka, though not a data store).
This is probably the simplest solution in terms of orchestration, but it
depends on having a proper "embedded" implementation for an IO.

Does this make sense to you ? have you considered it ?

Thanks,
Amit

On Tue, Nov 22, 2016 at 8:20 AM Jean-Baptiste Onofré 
wrote:


Hi Stephen,

as already discussed a bit together, it sounds great ! Especially I like
it as a both integration test platform and good coverage for IOs.

I'm very late on this but, as said, I will share with you my Marathon
JSON and Mesos docker images.

By the way, I started to experiment a bit with Kubernetes and Swarm but it's
not yet complete. I will share what I have on the same GitHub repo.

Thanks !
Regards
JB

On 11/16/2016 11:36 PM, Stephen Sisk wrote:

Hi everyone!

Currently we have a good set of unit tests for our IO Transforms - those
tend to run against in-memory versions of the data stores. However, we'd
like to further increase our test coverage to include running them against
real instances of the data stores that the IO Transforms work against (e.g.
cassandra, mongodb, kafka, etc…), which means we'll need to have real
instances of various data stores.

Additionally, if we want to do performance regression detection, it's
important to have instances of the services that behave realistically,
which isn't true of in-memory or dev versions of the services.


Proposed solution
-
If we accept th

Re: Hosting data stores for IO Transform testing

2016-11-22 Thread Jean-Baptiste Onofré

Hi Sourabh,

We raised the IO versioning point a couple of months ago on the mailing list.

Basically, we have two options:

1. Same modules (for example sdks/java/io/kafka) with one branch per
version (kafka-0.8, kafka-0.10)

2. Several modules: sdks/java/io/kafka-0.8 sdks/java/io/kafka-0.10

My preference is option 2:
Pros:
- the IO can still be part of the main Beam release
- it's more visible for contribution
Cons:
- we might have code duplication

Regards
JB

On 11/22/2016 08:12 PM, Sourabh Bajaj wrote:

Hi,

One tangential question I had around the proposal was how do we currently
deal with versioning in IO sources/sinks.

For example Cassandra 1.2 vs 2.1 have some differences between them, so the
checked-in sources and sinks probably support a particular version right
now. If yes, follow-up questions would be around how we handle updating,
deprecating and documenting the supported versions.

I can move this to a new thread if this seems like a different discussion.
Also if this has already been answered please feel free to direct me to a
doc or past thread.

Thanks
Sourabh

On Tue, Nov 22, 2016 at 7:59 AM Ismaël Mejía  wrote:


​Hello,

@Stephen Thanks for your proposal, it is really interesting, I would really
like to help with this. I have never played with Kubernetes but this seems
a really nice chance to do something useful with it.

We (at Talend) are testing most of the IOs using simple container images
and in some particular cases ‘clusters’ of containers using docker-compose
(a little bit like Amit’s (2) proposal). It would be really nice to have
this at the Beam level, in particular to try to test more complex
semantics, I don’t know how programmable kubernetes is to achieve this for
example:

Let’s think we have a cluster of Cassandra or Kafka nodes, I would like to
have programmatic tests to simulate failure (e.g. kill a node), or simulate
a really slow node, to ensure that the IO behaves as expected in the Beam
pipeline for the given runner.

Another related idea is to improve IO consistency: Today the different IOs
have small differences in their failure behavior, I really would like to be
able to predict with more precision what will happen in case of errors,
e.g. what is the correct behavior if I am writing to a Kafka node and there
is a network partition, does the Kafka sink retry or not ? and what if it
is the JdbcIO ?, will it work the same e.g. assuming checkpointing? Or do
we guarantee exactly once writes somehow?, today I am not sure about what
happens (or if the expected behavior depends on the runner), but well maybe
it is just that I don’t know and we have tests to ensure this.

Of course both are really hard problems, but I think with your proposal we
can try to tackle them, as well as the performance ones. And apart from the
data stores, I think it will be also really nice to be able to test the
runners in a distributed manner.

So what is the next step? How do you imagine such integration tests? Who
can provide the test machines so we can mount the cluster?

Maybe my ideas are a bit too far away for an initial setup, but it will be
really nice to start working on this.

Ismael​


On Tue, Nov 22, 2016 at 11:00 AM, Amit Sela  wrote:


Hi Stephen,

I was wondering about how we plan to use the data stores across executions.


Clearly, it's best to set up a new instance (container) for every test,
running a "standalone" store (say HBase/Cassandra for example), and once
the test is done, tear down the instance. It should also be agnostic to the
runtime environment (e.g., Docker on Kubernetes).
I'm wondering though what's the overhead of managing such a deployment
which could become heavy and complicated as more IOs are supported and more
test cases introduced.

Another way to go would be to have small clusters of different data stores
and run against new "namespaces" (while lazily evicting old ones), but I
think this is less likely as maintaining a distributed instance (even a
small one) for each data store sounds even more complex.

A third approach would be to to simply have an "embedded" in-memory
instance of a data store as part of a test that runs against it (such as

an

embedded Kafka, though not a data store).
This is probably the simplest solution in terms of orchestration, but it
depends on having a proper "embedded" implementation for an IO.

Does this make sense to you ? have you considered it ?

Thanks,
Amit

On Tue, Nov 22, 2016 at 8:20 AM Jean-Baptiste Onofré 
wrote:


Hi Stephen,

as already discussed a bit together, it sounds great ! Especially I

like

it as a both integration test platform and good coverage for IOs.

I'm very late on this but, as said, I will share with you my Marathon
JSON and Mesos docker images.

By the way, I started to experiment a bit with Kubernetes and Swarm but it's
not yet complete. I will share what I have on the same GitHub repo.

Thanks !
Regards
JB

On 11/16

Re: [DISCUSS] Graduation to a top-level project

2016-11-22 Thread Jean-Baptiste Onofré

Hi,

First of all, I would like to thank the whole team, and especially Davor 
for the great work and commitment to Apache and the community.


Of course, a big +1 to move forward on graduation !

Regards
JB

On 11/22/2016 07:19 PM, Davor Bonaci wrote:

Hi everyone,
With all the progress we’ve had recently in Apache Beam, I think it is time
we start the discussion about graduation as a new top-level project at the
Apache Software Foundation.

Graduation means we are a self-sustaining and self-governing community, and
ready to be a full participant in the Apache Software Foundation. It does
not imply that our community growth is complete or that a particular level
of technical maturity has been reached, rather that we are on a solid
trajectory in those areas. After graduation, we will still periodically
report to, and be overseen by, the ASF Board to ensure continued growth of
a healthy community.

Graduation is an important milestone for the project. It is also key to
further grow the user community: many users (incorrectly) see incubation as
a sign of instability and are much less likely to consider us for a
production use.

A way to think about graduation readiness is through the Apache Maturity
Model [1]. I think we clearly satisfy all the requirements [2]. It is
probably worth emphasizing the recent community growth: over each of the
past three months, no single organization contributing to Beam has had more
than ~50% of the unique contributors per month [2, see assumptions]. That’s
a great statistic that shows how much we’ve grown our diversity!

Process-wise, graduation consists of drafting a board resolution, which
needs to identify the full Project Management Committee, and getting it
approved by the community, the Incubator, and the Board. Within the Beam
community, most of these discussions and votes have to be on the private@
mailing list, but, as usual, we’ll try to keep dev@ updated as much as
possible.

With that in mind, let’s use this discussion on dev@ for two things:
* Collect additional data points on our progress that we may want to
present to the Incubator as a part of the proposal to accept our graduation.
* Determine whether the community supports graduation. Please reply +1/-1
with any additional comments, as appropriate. I’d encourage everyone to
participate -- regardless whether you are an occasional visitor or have a
specific role in the project -- we’d love to hear your perspective.

Data points so far:
* Project’s maturity self-assessment [2].
* 1500 pull requests in incubation, which makes us one of the most active
projects across all of the ASF on this metric.
* 3 releases, each driven by a different release manager.
* 120+ individual contributors.
* 3 new committers added, 2 of which aren’t from the largest organization.
* 1027 issues created, 515 resolved.
* 442 dev@ emails in October alone, sent by 51 individuals.
* 50 user@ emails in the last 30 days, sent by 22 individuals.

Thanks!

Davor

[1] http://community.apache.org/apache-way/apache-project-
maturity-model.html
[2] http://beam.incubator.apache.org/contribute/maturity-model/



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: DoFn relying on Microservices

2016-11-22 Thread Jean-Baptiste Onofré

Hi Sergio,

DoFn will execute per element (with eventually a hook on StartBundle,
FinishBundle, and Teardown). It's basically the way it works in IO WriteFn:
we create the connection in StartBundle and send each element (with a
batch) to an external resource.


PTransform is maybe more flexible in case of interact with "outside" 
resources.


Do you have use case to be sure I understand ?

Thanks !
Regards
JB

On 11/22/2016 10:39 AM, Sergio Fernández wrote:

Hi,

I'd like to resume the idea to have TensorFlow-based tasks running in a Beam
Pipeline. So far the cleaner approach I can imagine would be to have it
running outside (Functions in GCP, Lambdas in AWS, Microservices generally
speaking).

Therefore, does the current Beam model provide the sense of a DoFn which
actually runs externally?

Thanks in advance for the feedback.

Cheers,



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Hosting data stores for IO Transform testing

2016-11-21 Thread Jean-Baptiste Onofré

Hi Stephen,

as already discussed a bit together, it sounds great ! Especially I like 
it as a both integration test platform and good coverage for IOs.


I'm very late on this but, as said, I will share with you my Marathon 
JSON and Mesos docker images.


By the way, I started to experiment a bit with Kubernetes and Swarm but it's
not yet complete. I will share what I have on the same GitHub repo.


Thanks !
Regards
JB

On 11/16/2016 11:36 PM, Stephen Sisk wrote:

Hi everyone!

Currently we have a good set of unit tests for our IO Transforms - those
tend to run against in-memory versions of the data stores. However, we'd
like to further increase our test coverage to include running them against
real instances of the data stores that the IO Transforms work against (e.g.
cassandra, mongodb, kafka, etc…), which means we'll need to have real
instances of various data stores.

Additionally, if we want to do performance regression detection, it's
important to have instances of the services that behave realistically,
which isn't true of in-memory or dev versions of the services.


Proposed solution
-
If we accept this proposal, we would create an infrastructure for running
real instances of data stores inside of containers, using container
management software like mesos/marathon, kubernetes, docker swarm, etc… to
manage the instances.

This would enable us to build integration tests that run against those real
instances and performance tests that run against those real instances (like
those that Jason Kuster is proposing elsewhere.)


Why do we need one centralized set of instances vs just having various
people host their own instances?
-
Reducing flakiness of tests is key. By not having dependencies from the
core project on external services/instances of data stores we have
guaranteed access to the services and the group can fix issues that arise.

An exception would be something that has an ops team supporting it (eg,
AWS, Google Cloud or other professionally managed service) - those we trust
will be stable.


There may be a lot of different data stores needed - how will we maintain
them?
-
It will take work above and beyond that of a normal set of unit tests to
build and maintain integration/performance tests & their data store
instances.

Setup & maintenance of the data store containers and data store instances
on it must be automated. It also has to be as simple of a setup as
possible, and we should avoid hand tweaking the containers - expecting
checked in scripts/dockerfiles is key.

Aligned with the community ownership approach of Apache, as members of the
community are excited to contribute & maintain those tests and the
integration/performance tests, people will be able to step up and do that.
If there is no longer support for maintaining a particular set of
integration & performance tests and their data store instances, then we can
disable those tests. We may document on the website what IO Transforms have
current integration/performance tests so users know what level of testing
the various IO Transforms have.


What about requirements for the container management software itself?
-
* We should have the data store instances themselves in Docker. Docker
allows new instances to be spun up in a quick, reproducible way and is
fairly platform independent. It has wide support from a variety of
different container management services.
* As little admin work required as possible. Crashing instances should be
restarted, setup should be simple, everything possible should be
scripted/scriptable.
* Logs and test output should be on a publicly available website, without
needing to log into test execution machine. Centralized capture of
monitoring info/logs from instances running in the containers would support
this. Ideally, this would just be supported by the container software out
of the box.
* It'd be useful to have good persistent volume in the container management
software so that databases don't have to reload large data sets every time.
* The containers may be a place to execute runners themselves if we need
larger runner instances, so it should play well with Spark, Flink, etc…

As I discussed earlier on the mailing list, it looks like hosting docker
containers on kubernetes, docker swarm or mesos+marathon would be a good
solution.

Thanks,
Stephen Sisk



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

2016-11-18 Thread Jean-Baptiste Onofré

Hi Pei,

Reading the documents, for part 1, I think that using the Hadoop filesystem:

https://hadoop.apache.org/docs/r2.7.3/api/org/apache/hadoop/fs/FileSystem.html

would make more sense than introducing the BeamFileSystem interface.

It would allow us to directly support HDFS, FTP, Azure, S3 out of the
box (as Hadoop FileSystem provides sub-classes for those providers).


We could provide a GsFileSystem as sub-class of Hadoop Filesystem.

The part 2 is OK in terms of configuration.

Let me know if I can work with you on this (in terms of implementation).

Regards
JB

On 11/17/2016 01:09 AM, Pei He wrote:

Hi,

I am working on BEAM-59
<https://issues.apache.org/jira/browse/BEAM-59> "IOChannelFactory
redesign". The goals are:

1. Support file-based IOs (TextIO, AvroIO) with user-defined file systems.

2. Support configuring any user-defined file system.

And, I drafted the design proposal in two parts to address them in order:

Part 1: IOChannelFactory Redesign
<https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit#>

Summary:

Old API: WritableByteChannel create(String spec, String mimeType);

New API: WritableByteChannel create(URI uri, CreateOptions options);

Noticeable proposed changes:

1. Include the options parameter in most methods to specify behaviors.
2. Replace String with URI to include the scheme for file/directory
   locations.
3. Require file systems to provide a SeekableByteChannel for reads.
4. Additional methods, such as getMetadata(), rename(), etc.


Part 2: Configurable BeamFileSystem
<https://docs.google.com/document/d/1-7vo9nLRsEEzDGnb562PuL4q9mUiq_ZVpCAiyyJw8p8/edit#heading=h.p3gc3colc2cs>

Summary:

Old API: IOChannelUtils.getFactory(glob).match(glob);

New API: BeamFileSystems.getFileSystem(glob, config).match(glob);


Looking for comments and feedback.

Thanks

--

Pei



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Apache repo connection errors eat up a lot of build time

2016-11-17 Thread Jean-Baptiste Onofré
I checked on status.apache.org: I can see some "Connection refused" 
error on November 17 and November 16. It's maybe related.


Regards
JB

On 11/18/2016 12:52 AM, Eugene Kirpichov wrote:

E.g. this job: https://travis-ci.org/apache/incubator-beam/jobs/176795223
 or https://travis-ci.org/apache/incubator-beam/jobs/176795401
Search for "Connect to repository.apache.org:443" - log messages around
these take up several minutes every time it happens, and it happens
multiple times per build.

Is this a failure that Apache must take care of? Are we ourselves not
caching Maven artifacts efficiently?



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Apache repo connection errors eat up a lot of build time

2016-11-17 Thread Jean-Baptiste Onofré
Hi Eugene

Thanks for the report. Let me check the Nexus state.

Regards
JB

On Nov 18, 2016, at 00:53, Eugene Kirpichov wrote:
>E.g. this job:
>https://travis-ci.org/apache/incubator-beam/jobs/176795223
> or https://travis-ci.org/apache/incubator-beam/jobs/176795401
>Search for "Connect to repository.apache.org:443" - log messages around
>these take up several minutes every time it happens, and it happens
>multiple times per build.
>
>Is this a failure that Apache must take care of? Are we ourselves not
>caching Maven artifacts efficiently?


Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

2016-11-17 Thread Jean-Baptiste Onofré
By the way, Pei, for the record: why introduce BeamFileSystem instead of
using the Hadoop FileSystem interface ?


Thanks
Regards
JB

On 11/17/2016 01:09 AM, Pei He wrote:

Hi,

I am working on BEAM-59
<https://issues.apache.org/jira/browse/BEAM-59> "IOChannelFactory
redesign". The goals are:

1. Support file-based IOs (TextIO, AvroIO) with user-defined file systems.

2. Support configuring any user-defined file system.

And, I drafted the design proposal in two parts to address them in order:

Part 1: IOChannelFactory Redesign
<https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit#>

Summary:

Old API: WritableByteChannel create(String spec, String mimeType);

New API: WritableByteChannel create(URI uri, CreateOptions options);

Noticeable proposed changes:

1. Include the options parameter in most methods to specify behaviors.
2. Replace String with URI to include the scheme for file/directory
   locations.
3. Require file systems to provide a SeekableByteChannel for reads.
4. Additional methods, such as getMetadata(), rename(), etc.


Part 2: Configurable BeamFileSystem
<https://docs.google.com/document/d/1-7vo9nLRsEEzDGnb562PuL4q9mUiq_ZVpCAiyyJw8p8/edit#heading=h.p3gc3colc2cs>

Summary:

Old API: IOChannelUtils.getFactory(glob).match(glob);

New API: BeamFileSystems.getFileSystem(glob, config).match(glob);


Looking for comments and feedback.

Thanks

--

Pei



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [PROPOSAL] "IOChannelFactory" Redesign and Make it Configurable

2016-11-17 Thread Jean-Baptiste Onofré

Hi Pei,

Thanks for sharing.

For the goals, I fully agree with you: as already discussed, the purpose
is to have "pluggable" filesystems that will allow us to easily work with
local, gs, hdfs, s3 filesystems (and even more).


After a quick first glance, it looks good to me. I will try to evaluate 
the impact later today.


IMHO, once this change is done, the HdfsIO (in the sdk/java/io) should 
be flagged as deprecated.


Regards
JB

On 11/17/2016 01:09 AM, Pei He wrote:

Hi,

I am working on BEAM-59
<https://issues.apache.org/jira/browse/BEAM-59> "IOChannelFactory
redesign". The goals are:

1. Support file-based IOs (TextIO, AvroIO) with user-defined file systems.

2. Support configuring any user-defined file system.

And, I drafted the design proposal in two parts to address them in order:

Part 1: IOChannelFactory Redesign
<https://docs.google.com/document/d/11TdPyZ9_zmjokhNWM3Id-XJsVG3qel2lhdKTknmZ_7M/edit#>

Summary:

Old API: WritableByteChannel create(String spec, String mimeType);

New API: WritableByteChannel create(URI uri, CreateOptions options);

Noticeable proposed changes:

1. Include the options parameter in most methods to specify behaviors.
2. Replace String with URI to include the scheme for file/directory
   locations.
3. Require file systems to provide a SeekableByteChannel for reads.
4. Additional methods, such as getMetadata(), rename(), etc.


Part 2: Configurable BeamFileSystem
<https://docs.google.com/document/d/1-7vo9nLRsEEzDGnb562PuL4q9mUiq_ZVpCAiyyJw8p8/edit#heading=h.p3gc3colc2cs>

Summary:

Old API: IOChannelUtils.getFactory(glob).match(glob);

New API: BeamFileSystems.getFileSystem(glob, config).match(glob);


Looking for comments and feedback.

Thanks

--

Pei



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Including Apex runner in Beam tutorial at Strata - Singapore

2016-11-15 Thread Jean-Baptiste Onofré

Hi Sandeep,

Great news !

Yes, you can definitely do a demo using the Apex runner. It's what Dan 
and I are also planning during ApacheCon this week: same Wordcount 
example running on different execution engines.


Maybe this blog could help you to prepare the demo: 
http://blog.nanthrax.net/2016/08/apache-beam-in-action-same-code-several-execution-engines/


By the way, I will propose a PR to "merge" those blog to Beam website.

Regards
JB

On 11/15/2016 04:00 PM, Sandeep Deshmukh wrote:

Dear Beam Community,

There is a Beam tutorial at Strata-Singapore. I would like to explore the
possibility of including the Apex runner as part of that tutorial. As
Apex runner is recently merged into master branch of Beam, it would be of
interest to many people.

Please let us know if we can do so. I can accordingly work on the same.

Regards,
Sandeep



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Configuring Jenkins

2016-11-15 Thread Jean-Baptiste Onofré

Fantastic Davor !

I like this approach; I'm going to take a deeper look.

Thanks !

Regards
JB

On 11/15/2016 10:01 AM, Davor Bonaci wrote:

Hi everybody,
As I'm sure everybody knows, we use Apache's Jenkins instance for all our
testing, including pre-commit, post-commit, nightly snapshot, etc. (Travis
CI is a backup system and recommended for individual forks only.)

Managing Jenkins projects has been a big pain point so far. Among other
reasons, only a few of us have access to configure it, way too few of us
have visibility into what those jobs do, and nobody has any visibility into
changes being made or an opportunity to comment on them.

Well, not any more! I was playing a little bit with Jenkins DSL plugin and
was able to move our configuration out of Jenkins and into the git
repository. I've done it as a proof of concept for the website repository
only [1], but Jason is planning on extending that work to the main
repository. Look for a PR shortly!

Going forward, anyone can see what our Jenkins jobs are doing, and anyone
can add new jobs or improve existing ones by simply proposing a pull
request to change the configuration. Finally, the project maintains a
history in source repository, instead of direct changes without much
accountability.

How this works? There's a "seed" job that periodically applies
configuration specified in the source repository into Jenkins. Currently,
this happens once per day. If you modify the configuration in the source
repository, it will be applied within 24 hours. If you, however, modify the
configuration in Jenkins directly, it will revert back to whatever is
specified in the code repository also within 24 hours.

How to understand Jenkins DSL? There are many resources available; I've
found Jenkins Job DSL API [2] particularly helpful.

I hope you are excited to have this feature available to us! If you have
any thoughts on improving this further, please comment. Thanks!

Davor

[1] https://github.com/apache/incubator-beam-site/pull/80
[2] https://jenkinsci.github.io/job-dsl-plugin/



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


ApacheCon: Apache Beam BoF and Beam dinner

2016-11-14 Thread Jean-Baptiste Onofré

Following Sergio's idea, we added a BoF space: Apache Beam and You!

Please, if you want to discuss Beam or get some details, mostly
community-oriented, don't hesitate to join !


See you tonight.

On the other hand, we plan to have an informal Apache Beam dinner
tomorrow evening. If you want to share beers and tapas while discussing
Beam, ping Dan or me !


Regards
JB

On 11/14/2016 12:48 PM, Sergio Fernández wrote:

I've just discussed it with Ismael: maybe we can organize a Beam BoF:
https://apachebigdataeu2016.sched.org/event/8giP
What do you think?

On Sun, Nov 13, 2016 at 10:00 PM, Neelesh Salian 
wrote:


Anyone here today?

On Nov 11, 2016 12:01 PM, "Stephan Ewen"  wrote:


I'll also be in Sevilla Monday and Tuesday morning and happy to meet.

Stephan


On Fri, Nov 11, 2016 at 11:55 AM, Jean-Baptiste Onofré 
wrote:


Cool !!

See you there !

Regards
JB


On 11/11/2016 11:42 AM, Neelesh Salian wrote:


I'm getting there on Sunday and will be there all week.
I have a Spark talk on Thursday.
See you folks there. :)

On Nov 11, 2016 11:25 AM, "Jean-Baptiste Onofré" 

wrote:


Hi Sergio,


"Going Under The Hood with Apache Beam" was a mistake, the schedule
should
be updated. Sorry about that.

So, we have two talks about Beam: Introduction and Scio.

See you in Sevilla !

Regards
JB

On 11/11/2016 09:54 AM, Sergio Fernández wrote:

Hi guys,


I'll be there for the whole Apache Big Data, so it'll be great to meet you
there!

As far as I can see, there are three Beam-focused talks:

* Introduction to Apache Beam - Jean-Baptiste Onofré, Apache Software
Foundation & Dan Halperin, Google (16th at 11h)
* Going Under the Hood with Apache Beam - Siobhan Lyons (16th at 12h)
* Scio, a Scala DSL for Apache Beam - Robert Gruener, Spotify (16th at 13h)

Cheers,



On Fri, Nov 11, 2016 at 8:11 AM, Jean-Baptiste Onofré <

j...@nanthrax.net>

wrote:

Hi guys,



thanks Dan for the e-mail !

On my side, I will be in Sevilla from Monday to Thursday.

Let's meet all together and share some beers ;)

Regards
JB


On 11/11/2016 07:47 AM, Dan Halperin wrote:

Hey folks,



Who will be attending Apache Big Data / ApacheCon next week in Sevilla? JB
and I will be there to give a Beam talk Wednesday morning; I'm around all
week.

I'd love to sync in person with anyone who wants to talk Beam.

Please

reach
out to me directly if you'd like to meet.

Thanks!
Dan


--


Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com






--

Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com











--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Meet up at ApacheCon Seville

2016-11-14 Thread Jean-Baptiste Onofré

Hi Sergio,

good idea ! And it seems you already registered: great !

There are already some Beamers today ;)

Thanks !
Regards
JB

On 11/14/2016 12:48 PM, Sergio Fernández wrote:

I've just discussed it with Ismael: maybe we can organize a Beam BoF:
https://apachebigdataeu2016.sched.org/event/8giP
What do you think?

On Sun, Nov 13, 2016 at 10:00 PM, Neelesh Salian 
wrote:


Anyone here today?

On Nov 11, 2016 12:01 PM, "Stephan Ewen"  wrote:


I'll also be in Sevilla Monday and Tuesday morning and happy to meet.

Stephan


On Fri, Nov 11, 2016 at 11:55 AM, Jean-Baptiste Onofré 
wrote:


Cool !!

See you there !

Regards
JB


On 11/11/2016 11:42 AM, Neelesh Salian wrote:


I'm getting there on Sunday and will be there all week.
I have a Spark talk on Thursday.
See you folks there. :)

On Nov 11, 2016 11:25 AM, "Jean-Baptiste Onofré" 

wrote:


Hi Sergio,


"Going Under The Hood with Apache Beam" was a mistake, the schedule
should
be updated. Sorry about that.

So, we have two talks about Beam: Introduction and Scio.

See you in Sevilla !

Regards
JB

On 11/11/2016 09:54 AM, Sergio Fernández wrote:

Hi guys,


I'll be there for the whole Apache Big Data, so it'll be great to

meet

you
there!

As far as I can see, there are three Beam-focused talks:

* Introduction to Apache Beam - Jean-Baptiste Onofré, Apache

Software

Foundation & Dan Halperin, Google  (16th at 11h)

* Going Under the Hood with Apache Beam - Siobhan Lyons (16th at

12h)


* Scio, a Scala DSL for Apache Beam - Robert Gruener, Spotify (16th

at

13h)

Cheers,



On Fri, Nov 11, 2016 at 8:11 AM, Jean-Baptiste Onofré <

j...@nanthrax.net>

wrote:

Hi guys,



thanks Dan for the e-mail !

On my side, I will be in Sevilla from Monday to Thursday.

Let's meet all together and share some beers ;)

Regards
JB


On 11/11/2016 07:47 AM, Dan Halperin wrote:

Hey folks,



Who will be attending Apache Big Data / ApacheCon next week in
Sevilla?
JB
and I will be there to give a Beam talk Wednesday morning; I'm

around

all
week.

I'd love to sync in person with anyone who wants to talk Beam.

Please

reach
out to me directly if you'd like to meet.

Thanks!
Dan


--


Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com






--

Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com











--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Introduction + contributing to docs

2016-11-11 Thread Jean-Baptiste Onofré

Hi Melissa,

welcome aboard !!

Regards
JB

On 11/11/2016 08:11 PM, Melissa Pashniak wrote:

Hello!


My name is Melissa. I’ve previously been involved with Dataflow
documentation, and I’m excited to start contributing to the Beam project
and documentation.


I’ve written up some text for Beam’s direct runner and Cloud Dataflow
runner pages, currently available in pull requests [1][2]. I am also
working on the unfinished parts of the programming guide [3]. Let me know
if you have any thoughts or feedback.

I look forward to working with everyone in the community!

Melissa


[1] https://github.com/apache/incubator-beam-site/pull/76
[2] https://github.com/apache/incubator-beam-site/pull/77
[3] https://issues.apache.org/jira/browse/BEAM-193



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Meet up at ApacheCon Seville

2016-11-11 Thread Jean-Baptiste Onofré

Cool !!

See you there !

Regards
JB

On 11/11/2016 11:42 AM, Neelesh Salian wrote:

I'm getting there on Sunday and will be there all week.
I have a Spark talk on Thursday.
See you folks there. :)

On Nov 11, 2016 11:25 AM, "Jean-Baptiste Onofré"  wrote:


Hi Sergio,

"Going Under The Hood with Apache Beam" was a mistake, the schedule should
be updated. Sorry about that.

So, we have two talks about Beam: Introduction and Scio.

See you in Sevilla !

Regards
JB

On 11/11/2016 09:54 AM, Sergio Fernández wrote:


Hi guys,

I'll be there for the whole Apache Big Data, so it'll be great to meet you
there!

As far as I can see, there are three Beam-focused talks:

* Introduction to Apache Beam - Jean-Baptiste Onofré, Apache Software
Foundation & Dan Halperin, Google  (16th at 11h)

* Going Under the Hood with Apache Beam - Siobhan Lyons (16th at 12h)

* Scio, a Scala DSL for Apache Beam - Robert Gruener, Spotify (16th at
13h)

Cheers,



On Fri, Nov 11, 2016 at 8:11 AM, Jean-Baptiste Onofré 
wrote:

Hi guys,


thanks Dan for the e-mail !

On my side, I will be in Sevilla from Monday to Thursday.

Let's meet all together and share some beers ;)

Regards
JB


On 11/11/2016 07:47 AM, Dan Halperin wrote:

Hey folks,


Who will be attending Apache Big Data / ApacheCon next week in Sevilla?
JB
and I will be there to give a Beam talk Wednesday morning; I'm around
all
week.

I'd love to sync in person with anyone who wants to talk Beam. Please
reach
out to me directly if you'd like to meet.

Thanks!
Dan


--

Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com







--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Meet up at ApacheCon Seville

2016-11-11 Thread Jean-Baptiste Onofré

Hi Sergio,

"Going Under The Hood with Apache Beam" was a mistake, the schedule 
should be updated. Sorry about that.


So, we have two talks about Beam: Introduction and Scio.

See you in Sevilla !

Regards
JB

On 11/11/2016 09:54 AM, Sergio Fernández wrote:

Hi guys,

I'll be there for the whole Apache Big Data, so it'll be great to meet you
there!

As far as I can see, there are three Beam-focused talks:

* Introduction to Apache Beam - Jean-Baptiste Onofré, Apache Software
Foundation & Dan Halperin, Google  (16th at 11h)

* Going Under the Hood with Apache Beam - Siobhan Lyons (16th at 12h)

* Scio, a Scala DSL for Apache Beam - Robert Gruener, Spotify (16th at 13h)

Cheers,



On Fri, Nov 11, 2016 at 8:11 AM, Jean-Baptiste Onofré 
wrote:


Hi guys,

thanks Dan for the e-mail !

On my side, I will be in Sevilla from Monday to Thursday.

Let's meet all together and share some beers ;)

Regards
JB


On 11/11/2016 07:47 AM, Dan Halperin wrote:


Hey folks,

Who will be attending Apache Big Data / ApacheCon next week in Sevilla? JB
and I will be there to give a Beam talk Wednesday morning; I'm around all
week.

I'd love to sync in person with anyone who wants to talk Beam. Please
reach
out to me directly if you'd like to meet.

Thanks!
Dan



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com







--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Meet up at ApacheCon Seville

2016-11-10 Thread Jean-Baptiste Onofré

Hi guys,

thanks Dan for the e-mail !

On my side, I will be in Sevilla from Monday to Thursday.

Let's meet all together and share some beers ;)

Regards
JB

On 11/11/2016 07:47 AM, Dan Halperin wrote:

Hey folks,

Who will be attending Apache Big Data / ApacheCon next week in Sevilla? JB
and I will be there to give a Beam talk Wednesday morning; I'm around all
week.

I'd love to sync in person with anyone who wants to talk Beam. Please reach
out to me directly if you'd like to meet.

Thanks!
Dan



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [DISCUSS] Change "RunnableOnService" To A More Intuitive Name

2016-11-09 Thread Jean-Baptiste Onofré

Hi Mark,

Generally speaking, I agree.

As RunnableOnService extends NeedsRunner, @TestsWithRunner or 
@RunOnRunner sound clearer.
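
For illustration, the Java SDK implements these markers as JUnit category
interfaces, so a rename would look roughly like the sketch below
(TestsWithRunner is just one candidate name from this thread; nothing here
is settled API):

import org.junit.Test;
import org.junit.experimental.categories.Category;

// Hypothetical category interface replacing RunnableOnService; the name
// is only a candidate from this discussion.
interface TestsWithRunner {}

class GroupByKeyRunnerTest {
    @Test
    @Category(TestsWithRunner.class)
    public void testGroupByKeyOnRunner() {
        // build a lightweight pipeline here and run it on the runner under test
    }
}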


Regards
JB

On 11/09/2016 09:00 PM, Mark Liu wrote:

Hi all,

I'm working on building RunnableOnService in the Python SDK. After having
discussions with folks, "RunnableOnService" doesn't look like a very intuitive
name for those unit tests that require runners and build lightweight
pipelines to test specific components. In particular, they don't have to run
on a service.

So I want to raise this idea to the community and see if anyone has
similar thoughts. Maybe we can come up with a name that is tied to runners.
Currently, I have two names in my head:

- TestsWithRunners
- RunnerExecutable

Any thoughts?

Thanks,
Mark



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: SBT/ivy dependency issues

2016-11-09 Thread Jean-Baptiste Onofré

Hi Abbass,

As discussed together, it could be related to some changes we did in the 
Maven profiles and build.


Let me investigate.

I keep you posted.

Thanks !
Regards
JB

On 11/09/2016 03:03 PM, amarouni wrote:

Hi guys,

I'm facing a weird issue with a Scala project (using SBT/ivy) that uses
*beam-runners-spark:0.3.0-incubating *which depends on
*beam-sdks-java-core *& *beam-runners-core-java*.

Until recently everything worked as expected, i.e. I had to declare a
single dependency on *beam-runners-spark:0.3.0-incubating* which brought
with it *beam-sdks-java-core* & *beam-runners-core-java*, but a couple
of weeks ago I started having issues where the only workaround was to
explicitly declare, in addition to *beam-runners-spark:0.3.0-incubating*,
its direct Beam dependencies: *beam-sdks-java-core* &
*beam-runners-core-java*.

I verified that *beam-runners-spark*'s pom contains both of the
*beam-sdks-java-core* & *beam-runners-core-java* dependencies but still
had to declare them explicitly. I'm not sure if this is an issue with
SBT/Ivy, because Maven can correctly fetch the required Beam dependencies,
and this issue appears only with Beam dependencies.

Did anyone using SBT/Ivy encounter this issue?

Thanks,

Abbass,





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: PCollection to PCollection<String> Conversion

2016-11-09 Thread Jean-Baptiste Onofré

Hi Ismaël,

you are right: it's not necessarily a DSL on its own (even if I think it
could make sense, as we could provide convenient notation like .marshal()
or .unmarshal() for instance); it could be an "extension" jar providing
those transforms.


I think the SDKs should be low level, and new "extensions" (for now in
Beam) can provide convenient transforms or DSLs (I'm thinking about a
machine learning extension too, for instance).


Clearly, it extends the scope of the project, and I think that's
a great thing ;) It will allow new contributors to work on different
parts of the project.


Just my $0.01 ;)

Regards
JB

On 11/09/2016 03:03 PM, Ismaël Mejía wrote:

Nice discussion, and thanks Jesse for bringing this subject back.

I agree 100% with Amit and the idea of having a home for those transforms
that are not core enough to be part of the sdk, but that we all end up
re-writing somehow.

This is a needed improvement to be more developer friendly, but also as a
reference of good practices of Beam development, and for this reason I
agree with JB that at this moment it would be better for these transforms
to reside in the Beam repository at least for visibility reasons.

One additional question is if these transforms represent a different DSL or
if those could be grouped with the current extensions (e.g. Join and
SortValues) into something more general that we as a community could
maintain, but well even if it is not the case, it would be really nice to
start working on something like this.

Ismaël Mejía


On Wed, Nov 9, 2016 at 11:59 AM, Jean-Baptiste Onofré 
wrote:


Related to spark-package, we also have Apache Bahir to host
connectors/transforms for Spark and Flink.

IMHO, right now, Beam should host this, not sure if it makes sense
directly in the core.

It reminds me the "Integration" DSL we discussed in the technical vision
document.

Regards
JB


On 11/09/2016 11:17 AM, Amit Sela wrote:


I think Jesse has a very good point on one hand, while Luke's and
Kenneth's
worries about committing users to specific implementations is in place.

The Spark community has a 3rd party repository for useful libraries that
for various reasons are not a part of the Apache Spark project:
https://spark-packages.org/.

Maybe a "common-transformations" package would serve both users quick
ramp-up and ease-of-use while keeping Beam more "enabling" ?

On Tue, Nov 8, 2016 at 9:03 PM Kenneth Knowles 
wrote:

It seems useful for small scale debugging / demoing to have

Dump.toString(). I think it should be named to clearly indicate its
limited
scope. Maybe other stuff could go in the Dump namespace, but
"Dump.toJson()" would be for humans to read - so it should be pretty
printed, not treated as a machine-to-machine wire format.

The broader question of representing data in JSON or XML, etc, is already
the subject of many mature libraries which are already easy to use with
Beam.

The more esoteric practice of implicit or semi-implicit coercions seems
like it is also already addressed in many ways elsewhere.
Transform.via(TypeConverter) is basically the same as
MapElements.via() and also easy to use with Beam.

In both of the last cases, there are many reasonable approaches, and we
shouldn't commit our users to one of them.

On Tue, Nov 8, 2016 at 10:15 AM, Lukasz Cwik 
wrote:

The suggestions you give seem good except for the XML cases.

Might want to have the XML be a document per line, similar to the JSON
examples you have been giving.

On Tue, Nov 8, 2016 at 12:00 PM, Jesse Anderson 
wrote:

@lukasz Agreed there would have to be KV handling. I was more thinking that
whatever the addition, it shouldn't just handle KV. It should handle
Iterables, Lists, Sets, and KVs.

For JSON and XML, I wonder if we'd be able to give someone something
general purpose enough, or whether you would just end up writing your own
code to handle it anyway.

Here are some ideas on what it could look like with a method and the
resulting string output:
*Stringify.toJSON()*

With KV:
{"key": "value"}

With Iterables:
["one", "two", "three"]

*Stringify.toXML("rootelement")*

With KV:
<rootelement><key>value</key></rootelement>

With Iterables:
<rootelement>
  <element>one</element>
  <element>two</element>
  <element>three</element>
</rootelement>

*Stringify.toDelimited(",")*

With KV:
key,value

With Iterables:
one,two,three

Do you think that would strike a good balance between reusable code and
writing your own for more difficult formatting?

Thanks,

Jesse

On Tue, Nov 8, 2016 at 11:01 AM Lukasz Cwik 
wrote:

Jesse, I believe if one format gets special treatment in TextIO, people
will then ask why JSON, XML, ... are not also supported.

Also, the example that you provide is using the fact that the input format
is an Iterable. You had posted a question about using KV with
TextIO.Write, which wouldn't align with the proposed input format and
still would require to 

Re: PCollection to PCollection<String> Conversion

2016-11-09 Thread Jean-Baptiste Onofré

To do a minimal WordCount, you have to manually convert the
KV<String, Long> to a String:

p
.apply(TextIO.Read.from("playing_cards.tsv"))
.apply(Regex.split("\\W+"))
.apply(Count.perElement())
.apply(MapElements.via((KV<String, Long> count) ->
        count.getKey() + ":" + count.getValue()
    ).withOutputType(TypeDescriptors.strings()))
.apply(TextIO.Write.to("output/stringcounts"));

This code really should be something like:

p
.apply(TextIO.Read.from("playing_cards.tsv"))
.apply(Regex.split("\\W+"))
.apply(Count.perElement())
.apply(ToString.stringify())
.apply(TextIO.Write.to("output/stringcounts"));

To summarize the discussion:

   - JA: Add a method to StringDelegateCoder to output any KV or list
   - JA and DH: Add a SimpleFunction that takes any type and runs
     toString() on it:

     class ToStringFn<InputT> extends SimpleFunction<InputT, String> {
         public String apply(InputT input) {
             return input.toString();
         }
     }

   - JB: Add a general purpose type converter like in Apache Camel.
   - JA: Add Object support to TextIO.Write that would write out the
     toString of any Object.

My thoughts:

Is converting to a PCollection<String> mostly needed when you're using
TextIO.Write? Will a general purpose transform only work in certain cases,
so that you'll normally have to write custom code to format the strings
the way you want them?

IMHO, it's yes to both. I'd prefer to add Object support to TextIO.Write
or a SimpleFunction that takes a delimiter as an argument. Making a
SimpleFunction that's able to specify a delimiter (and perhaps a prefix
and suffix) should cover the majority of formats and cases.

Thanks,

Jesse
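
For reference, here is a minimal sketch of what the proposed helper could
look like; ToString and stringify() are only the names suggested in this
thread, not an existing Beam API:

import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.SimpleFunction;

// Hypothetical ToString utility built on MapElements/SimpleFunction.
public final class ToString {
    private ToString() {}

    // Returns a transform converting any element to its toString() form.
    public static <T> MapElements<T, String> stringify() {
        return MapElements.via(new SimpleFunction<T, String>() {
            @Override
            public String apply(T input) {
                return input.toString();
            }
        });
    }
}

A pipeline could then end with
.apply(ToString.<KV<String, Long>>stringify()) right before TextIO.Write,
as in the second snippet above.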















--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: PCollection to PCollection<String> Conversion

2016-11-08 Thread Jean-Baptiste Onofré
Agree. That's why I think it could be interesting to provide ready-to-use
type converter ParDos for popular data formats (String, but more than that:
JSON, XML, ...). That's what I meant by type converters.

Regards
JB 
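
For illustration, a ready-to-use JSON converter along those lines could be
as small as the sketch below, assuming Jackson's ObjectMapper is on the
classpath (ToJsonFn is a hypothetical name, not an existing Beam class):

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.beam.sdk.transforms.DoFn;

// Hypothetical JSON type-converter DoFn; Jackson does the marshalling.
public class ToJsonFn<T> extends DoFn<T, String> {
    private transient ObjectMapper mapper;

    @Setup
    public void setup() {
        mapper = new ObjectMapper();
    }

    @ProcessElement
    public void processElement(ProcessContext c) throws Exception {
        c.output(mapper.writeValueAsString(c.element()));
    }
}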


On Nov 8, 2016, 16:31, at 16:31, Lukasz Cwik  wrote:
>The conversion from object to string will have uses outside of just
>TextIO.Write so it seems logical that we would want to have a ParDo do
>the
>conversion.
>
>Text file formats have a lot of variance, even if you consider the
>subset
>of CSV like formats where it could have fixed width fields, or escaping
>and
>quoting around other fields, or headers that should be placed at the
>top.
>
>Having all these format conversions within TextIO.Write seems like a
>lot of
>logic to contain in that transform which should just focus on writing
>to
>files.
>
>On Tue, Nov 8, 2016 at 8:15 AM, Jesse Anderson 
>wrote:
>
>> This is a thread moved over from the user mailing list.
>>
>> I think there needs to be a way to convert a PCollection to a
>> PCollection<String>.
>>
>> To do a minimal WordCount, you have to manually convert the
>> KV<String, Long> to a String:
>> p
>> .apply(TextIO.Read.from("playing_cards.tsv"))
>> .apply(Regex.split("\\W+"))
>> .apply(Count.perElement())
>> .apply(MapElements.via((KV<String, Long> count) ->
>>     count.getKey() + ":" + count.getValue()
>> ).withOutputType(TypeDescriptors.strings()))
>> .apply(TextIO.Write.to("output/stringcounts"));
>>
>> This code really should be something like:
>> p
>> .apply(TextIO.Read.from("playing_cards.tsv"))
>> .apply(Regex.split("\\W+"))
>> .apply(Count.perElement())
>> *.apply(ToString.stringify())*
>> .apply(TextIO.Write.to("output/stringcounts"));
>>
>> To summarize the discussion:
>>
>>    - JA: Add a method to StringDelegateCoder to output any KV or list
>>    - JA and DH: Add a SimpleFunction that takes any type and runs
>>      toString() on it:
>>      class ToStringFn<InputT> extends SimpleFunction<InputT, String> {
>>          public String apply(InputT input) {
>>              return input.toString();
>>          }
>>      }
>>- JB: Add a general purpose type converter like in Apache Camel.
>>- JA: Add Object support to TextIO.Write that would write out the
>>toString of any Object.
>>
>> My thoughts:
>>
>> Is converting to a PCollection<String> mostly needed when you're using
>> TextIO.Write? Will a general purpose transform only work in certain cases,
>> so that you'll normally have to write custom code to format the strings
>> the way you want them?
>>
>> IMHO, it's yes to both. I'd prefer to add Object support to
>TextIO.Write or
>> a SimpleFunction that takes a delimiter as an argument. Making a
>> SimpleFunction that's able to specify a delimiter (and perhaps a
>prefix and
>> suffix) should cover the majority of formats and cases.
>>
>> Thanks,
>>
>> Jesse
>>


Re: [PROPOSAL] Merge apex-runner to master branch

2016-11-08 Thread Jean-Baptiste Onofré
+1

Great work Thomas !!

Regards
JB


On Nov 8, 2016, 14:54, at 14:54, Thomas Weise  wrote:
>Hi,
>
>As per previous discussion [1], I would like to propose to merge the
>apex-runner branch into master. The runner satisfies the criteria
>outlined
>in [2] and merging it to master will give more visibility to other
>contributors and users.
>
>Specifically the Apex runner addresses:
>
>   - Have at least 2 contributors interested in maintaining it, and 1
>  committer interested in supporting it:  *I'm going to sign up for the
>support and there are more folks interested. Some have already
>contributed
>and helped with PR reviews, others from the Apex community have
>expressed
>   interest [3].*
>- Provide both end-user and developer-facing documentation:  *Runner
>has
> README, capability matrix, Javadoc. Planning to add it to the tutorial
>   later.*
>   - Have at least a basic level of unit test coverage:  *Has 30 runner
>   specific tests and passes all Beam RunnableOnService tests.*
>   - Run all existing applicable integration tests with other Beam
>components and create additional tests as appropriate: * Enabled runner
>   for examples integration tests in the same way as other runners.*
>- Be able to handle a subset of the model that address a significant
>set of
>   use cases (aka. ‘traditional batch’ or ‘processing time
>streaming’):  *Passes
>   RunnableOnService without exclusions and example IT.*
>   - Update the capability matrix with the current status:  *Done.*
>- Add a webpage under learn/runners: *Same "TODO" page as other runners
>   added to site.*
>
>The PR for the merge:
>https://github.com/apache/incubator-beam/pull/1305
>
>(There are intermittent test failures in individual Travis runs that
>are
>unrelated to the runner.)
>
>Thanks,
>Thomas
>
>[1]
>https://lists.apache.org/thread.html/2b420a35f05e47561f27c19e8ec6484f595553f32da88fe593ad931d@%3Cdev.beam.apache.org%3E
>
>[2]
>http://beam.apache.org/contribute/contribution-guide/#feature-branches
>
>[3]
>https://lists.apache.org/thread.html/6e7618768cdcde81c28aa9883a1fcf4d3d4e41de4249547130691d52@%3Cdev.apex.apache.org%3E
>


Re: Timer and Window behavior

2016-11-06 Thread Jean-Baptiste Onofré

Hi Demin,

I remember seeing an improvement about watermarks in KafkaIO
(BEAM-591).

I advise you to take a look there.
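
Roughly, that change lets you attach timestamp and watermark functions to
the source, as in the sketch below. The method names come from that
watermark work and may differ in your Beam version, so verify them against
the KafkaIO javadoc; parseEventTime is a hypothetical helper returning an
org.joda.time.Instant for a record.

import com.google.common.collect.ImmutableList;
import org.apache.beam.sdk.io.kafka.KafkaIO;

p.apply(KafkaIO.read()
    .withBootstrapServers("broker:9092")       // hypothetical broker address
    .withTopics(ImmutableList.of("events"))    // hypothetical topic
    // stamp each record with its event time...
    .withTimestampFn(kv -> parseEventTime(kv.getValue()))
    // ...and let the watermark advance with each record instead of stalling
    .withWatermarkFn(kv -> parseEventTime(kv.getValue()))
    .withoutMetadata());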

Regards
JB

On 11/06/2016 01:31 PM, Demin Alexey wrote:

Hi

I read an unbounded stream (from Kafka) and group by value,
but on low-throughput streams I see strange behavior:

stream.apply(Window.into(FixedWindows.of(Duration.millis(10))))
      .apply(GroupByKey.create())

or

stream.apply(
        Window.into(FixedWindows.of(Duration.millis(10)))
            .triggering(Repeatedly.forever(AfterWatermark.pastEndOfWindow())))
      .apply(GroupByKey.create())

1ms  event1
3ms  event2
11ms event3 - (trigger window)
12ms event4
13ms event5
21ms event6 - (trigger window)
22ms event7
23ms event8



5m00ms  event9

As a result, event7 and event8 stay in the window without being processed
for the next 5 minutes; the window and GroupBy output are only produced on
event9.

The behavior can be reproduced on the DirectRunner and the FlinkRunner.

Is this a bug, or incorrect use of the API on my side?



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Tutorials

2016-11-04 Thread Jean-Baptiste Onofré

Hi Jesse,

Not a full day. I did a kind of half-day tutorial for my team.

Regards
JB

On 11/04/2016 07:23 PM, Jesse Anderson wrote:

Has anyone done a full day (~6 hours) tutorial on Beam yet?

Thanks,

Jesse



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Contributing to Beam docs

2016-11-04 Thread Jean-Baptiste Onofré

Thanks Hadar,

it looks better to me!

Just a small comment: I think we need to give more visibility to what is 
provided by Beam: SDKs/DSLs, I/Os, Runners.


It's one of my intentions in the new skin proposal PR
(https://github.com/apache/incubator-beam-site/pull/64), in the Learn and
Overview sections.


Something like what we have in Apache Camel for the components (likely 
equivalent to Beam I/Os) would help users IMHO.


Thanks again and welcome aboard !

Regards
JB

On 11/04/2016 03:42 AM, Hadar Hod wrote:

Hi Beamers!

I'm Hadar. I've worked on Dataflow documentation in the past, and have
recently started contributing to the Beam docs. I'm excited to work with
all of you and to be a part of this project. :)

I believe the current structure of the website can be improved, so I'd like
to propose a slightly different one. On a high level, I propose changing
the tabs in the top menu and reorganizing some of the docs. Instead of the
current "Use", "Learn", "Contribute", "Blog", and "Project" tabs, we could
have "Get Started", "Documentation", "Contribute", and "Blog".

I applied this new structure in a pull request
<https://github.com/apache/incubator-beam-site/pull/62>, which is staged
here
<http://apache-beam-website-pull-requests.storage.googleapis.com/62/index.html>.
If you've worked on the website before, you've probably run into this -
note that you'll have to append "/index.html" to each URL.

Thoughts? Thanks!

Hadar



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Podling Report Reminder - November 2016

2016-11-02 Thread Jean-Baptiste Onofré

Hi guys,

I just updated the wiki page with the Beam podling report. You can still 
review it there:


https://wiki.apache.org/incubator/November2016

Regards
JB

On 11/02/2016 07:30 AM, James Malone wrote:

Beam

Apache Beam is an open source, unified model and set of language-specific
SDKs for defining and executing data processing workflows, and also data
ingestion and integration flows, supporting Enterprise Integration Patterns
(EIPs) and Domain Specific Languages (DSLs). Beam pipelines simplify the
mechanics of large-scale batch and streaming data processing and can run on
a number of runtimes such as Apache Flink, Apache Gearpump, Apache Apex,
Apache Spark, and Google Cloud Dataflow. Beam also brings SDKs in different
languages, allowing users to easily implement their data integration
processes.

Beam has been incubating since 2016-02-01.

The most important issue to address in the move towards graduation:
 1. Make it easier for the Beam community to learn, use, and grow by
expanding and improving the Beam documentation, code samples, and the
website

Any issues that the Incubator PMC (IPMC) or ASF Board wish/need to be aware
of?
None.

How has the community developed since the last report?
 * 441 closed/merged pull requests
 * High engagement on dev and user mailing lists (742 / 179 messages)
 * Several public talks, articles, and videos including:
- @Scale San Jose (“No shard left behind: APIs for massive parallel
efficiency in Apache Beam”)
- Strata + Hadoop World NYC (“Learn stream processing with Apache Beam”)
- Paris Spark Meetup (“Introduction to Apache Beam”)
- Hadoop Summit Melbourne (“Stream/Batch processing portable across
on-prem (Spark, Flink) and Cloud with Apache Beam”)
- Hadoop User Group Taipei (“Stream Processing with Beam and Google
Cloud Dataflow”)
- Data Science Lab London (“Apache Beam: Stream and Batch Processing;
Unified and Portable!”)

How has the project developed since the last report?
Major developments on the project since last report include the following:
* Second and third incubating release (0.2.0 and 0.3.0) and a release guide
[1]
* New DirectRunner support for testing streaming pipelines[2]
* Continued improvements to the Flink, Spark, and Dataflow runners
* Added support for new IO connectors, including MongoDB, Kinesis, and JDBC,
with Cassandra and MQTT support pending in pull requests
* Addition of the Apache Apex runner on a feature branch, and continued
work on the Apache Gearpump runner and Python SDK feature branches. [3]
* Continued reorganization and refactoring of the project
* Continued improvements to documentation and testing

[1]: http://beam.incubator.apache.org/contribute/release-guide/
[2]: http://beam.incubator.apache.org/blog/2016/10/20/test-stream.html
[3]: http://beam.incubator.apache.org/contribute/work-in-
progress/#feature-branches

Dates of last releases:
 * 2016/08/07 - 0.2.0-incubating
 * 2016/10/31 - 0.3.0-incubating

When were the last committers or PMC members elected?
The following committers were elected on 2016/10/20:
 * Thomas Weise
 * Jesse Anderson
 * Thomas Groh

Signed-off-by:
 [ ](beam) Jean-Baptiste Onofre
 [ ](beam) Venkatesh Seetharam
 [ ](beam) Ted Dunning

Shepherd/Mentor notes:


--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: PAssert.GroupedGlobally defaults to a single empty Iterable.

2016-11-02 Thread Jean-Baptiste Onofré

Agree, this element should be removed.

Regards
JB

On 11/02/2016 10:53 AM, Amit Sela wrote:

Hi all,

I've been looking at PAssert and I've noticed that PAssert.GroupedGlobally
points
<https://github.com/apache/incubator-beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/testing/PAssert.java#L825>
out that it will result in a single empty iterable even if the input
PCollection is empty.
This is a weird behaviour, as it may cause subsequent assertions to fail.

Wouldn't it be more correct to remove (filter out ?) this element ?

Thanks,
Amit
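
To make the concern concrete, here is a minimal sketch of an assertion
this behavior affects (p is assumed to be a TestPipeline):

import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.testing.PAssert;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.values.PCollection;

// Even for an empty input, the global grouping inside PAssert produces one
// (empty) iterable, so assertions over the grouped contents must allow for it.
PCollection<String> empty = p.apply(Create.empty(StringUtf8Coder.of()));
PAssert.that(empty).empty();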



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Podling Report Reminder - November 2016

2016-11-01 Thread Jean-Baptiste Onofré

Hi James,

it looks good to me.

Just a minor thing regarding the releases: we just have to mention the
ones we did in the last quarter (no need to mention 0.1.0-incubating).


Regards
JB

On 11/02/2016 12:12 AM, James Malone wrote:

Howdy,

Sorry for being delayed; here is a proposal for our podling report!

James

---

Beam

Apache Beam is an open source, unified model and set of language-specific
SDKs for defining and executing data processing workflows, and also data
ingestion and integration flows, supporting Enterprise Integration Patterns
(EIPs) and Domain Specific Languages (DSLs). Beam pipelines simplify the
mechanics of large-scale batch and streaming data processing and can run on
a number of runtimes such as Apache Flink, Apache Gearpump, Apache Apex,
Apache Spark, and Google Cloud Dataflow (a cloud service). Beam also brings
SDKs in different languages, allowing users to easily implement their data
integration processes.

Beam has been incubating since 2016-02-01.

The most important issue to address in the move towards graduation:

 1. Make it easier for the Beam community to learn, use, and grow by
expanding and improving the Beam documentation, code samples, and the
website

Any issues that the Incubator PMC (IPMC) or ASF Board wish/need to be
aware of?
None.

How has the community developed since the last report?
 * 441 closed/merged pull requests
 * High engagement on dev and user mailing lists (742 / 179 messages)
 * Several public talks, articles, and videos including:
- @Scale San Jose (“No shard left behind: APIs for massive parallel
efficiency”)
- Strata + Hadoop World NYC (“Learn stream processing with Apache Beam”)
- Paris Spark Meetup (“Introduction to Apache Beam”)
- Hadoop Summit Melbourne (“Stream/Batch processing portable across
on-prem (Spark, Flink) and Cloud with Apache Beam”)
- Hadoop User Group Taipei (“Stream Processing with Beam and Google
Cloud Dataflow”)
- Data Science Lab London (“Apache Beam: Stream and Batch Processing;
Unified and Portable!”)

How has the project developed since the last report?
Major developments on the project since last report include the following:
* Second and third incubating release (0.2.0 and 0.3.0) and a release guide
[1]
* New DirectRunner support for testing streaming pipelines[2]
* Continued improvements to the Flink, Spark, and Dataflow runners
* Added support for new IO connectors, including MongoDB, Kinesis, and JDBC,
with Cassandra and MQTT support pending in pull requests
* Addition of the Apache Apex runner on a feature branch, and continued
work on the Apache Gearpump runner and Python SDK feature branches. [3]
* Continued reorganization and refactoring of the project
* Continued improvements to documentation and testing

[1]: http://beam.incubator.apache.org/contribute/release-guide/
[2]: http://beam.incubator.apache.org/blog/2016/10/20/test-stream.html
[3]:
http://beam.incubator.apache.org/contribute/work-in-progress/#feature-branches


Dates of last releases:
 * 2016/06/15 - 0.1.0-incubating
 * 2016/08/07 - 0.2.0-incubating
 * 2016/10/31 - 0.3.0-incubating

When were the last committers or PMC members elected?
The following committers were elected on 2016/10/20:
 * Thomas Weise
 * Jesse Anderson
 * Thomas Groh

Signed-off-by:
 [ ](beam) Jean-Baptiste Onofre
 [ ](beam) Venkatesh Seetharam
 [ ](beam) Ted Dunning

Shepherd/Mentor notes:

On Mon, Oct 31, 2016 at 10:55 PM, Jean-Baptiste Onofré 
wrote:


Hi James,

Sorry to bother you again: do you have any update about the podling

report (I checked on the incubator wiki and it's still empty) ?


We would need a little time to review and sign.

Please, let me know if I can help you on this.

Thanks !
Regards
JB

On 10/27/2016 01:05 AM, James Malone wrote:


Hello everyone!

Unless anyone disagrees or wants to do it, I am happy to volunteer to

draft

this podling report for review before we submit it. I can get it done

for a

review this Friday (US-Pacific) if that works.

Cheers!

James

On Wed, Oct 26, 2016 at 4:01 PM,  wrote:


Dear podling,

This email was sent by an automated system on behalf of the Apache
Incubator PMC. It is an initial reminder to give you plenty of time to
prepare your quarterly board report.

The board meeting is scheduled for Wed, 16 November 2016, 10:30 am PDT.
The report for your podling will form a part of the Incubator PMC
report. The Incubator PMC requires your report to be submitted 2 weeks
before the board meeting, to allow sufficient time for review and
submission (Wed, November 02).

Please submit your report with sufficient time to allow the Incubator
PMC, and subsequently board members to review and digest. Again, the
very latest you should submit your report is 2 weeks prior to the board
meeting.

Thanks,

The Apache Incubator PMC

Submitting your Report

--

Your report should contain the following:

*   Your project name
*   A brief description of your project, which assumes no knowledge of
the project or necessarily of its field

Re: Podling Report Reminder - November 2016

2016-10-31 Thread Jean-Baptiste Onofré

Hi James,

Sorry to bother you again: do you have any update about the podling 
report (I checked on the incubator wiki and it's still empty) ?


We would need a little time to review and sign.

Please, let me know if I can help you on this.

Thanks !
Regards
JB

On 10/27/2016 01:05 AM, James Malone wrote:

Hello everyone!

Unless anyone disagrees or wants to do it, I am happy to volunteer to draft
this podling report for review before we submit it. I can get it done for a
review this Friday (US-Pacific) if that works.

Cheers!

James

On Wed, Oct 26, 2016 at 4:01 PM,  wrote:


Dear podling,

This email was sent by an automated system on behalf of the Apache
Incubator PMC. It is an initial reminder to give you plenty of time to
prepare your quarterly board report.

The board meeting is scheduled for Wed, 16 November 2016, 10:30 am PDT.
The report for your podling will form a part of the Incubator PMC
report. The Incubator PMC requires your report to be submitted 2 weeks
before the board meeting, to allow sufficient time for review and
submission (Wed, November 02).

Please submit your report with sufficient time to allow the Incubator
PMC, and subsequently board members to review and digest. Again, the
very latest you should submit your report is 2 weeks prior to the board
meeting.

Thanks,

The Apache Incubator PMC

Submitting your Report

--

Your report should contain the following:

*   Your project name
*   A brief description of your project, which assumes no knowledge of
the project or necessarily of its field
*   A list of the three most important issues to address in the move
towards graduation.
*   Any issues that the Incubator PMC or ASF Board might wish/need to be
aware of
*   How has the community developed since the last report
*   How has the project developed since the last report.

This should be appended to the Incubator Wiki page at:

http://wiki.apache.org/incubator/November2016

Note: This is manually populated. You may need to wait a little before
this page is created from a template.

Mentors
---

Mentors should review reports for their project(s) and sign them off on
the Incubator wiki page. Signing off reports shows that you are
following the project - projects that are not signed may raise alarms
for the Incubator PMC.

Incubator PMC





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: [ANNOUNCE] Beam 0.3.0-incubating Released

2016-10-31 Thread Jean-Baptiste Onofré

Awesome, great !!

Thanks Aljoscha for this release !
Great job team !

Regards
JB

On 10/31/2016 05:36 PM, Aljoscha Krettek wrote:

Congratulations, team! I just finalised everything for the most recent
release. The artefacts are on Maven, the website is updated and the source
release should slowly propagate through the Apache servers.

I'll also send an email to the user list to highlight some of the new
features.

Cheers,
Aljoscha



--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Podling Report Reminder - November 2016

2016-10-30 Thread Jean-Baptiste Onofré

Hi James,

any update about the podling report ?
Can I help you on this ?

Thanks !
Regards
JB

On 10/27/2016 01:05 AM, James Malone wrote:

Hello everyone!

Unless anyone disagrees or wants to do it, I am happy to volunteer to draft
this podling report for review before we submit it. I can get it done for a
review this Friday (US-Pacific) if that works.

Cheers!

James

On Wed, Oct 26, 2016 at 4:01 PM,  wrote:


Dear podling,

This email was sent by an automated system on behalf of the Apache
Incubator PMC. It is an initial reminder to give you plenty of time to
prepare your quarterly board report.

The board meeting is scheduled for Wed, 16 November 2016, 10:30 am PDT.
The report for your podling will form a part of the Incubator PMC
report. The Incubator PMC requires your report to be submitted 2 weeks
before the board meeting, to allow sufficient time for review and
submission (Wed, November 02).

Please submit your report with sufficient time to allow the Incubator
PMC, and subsequently board members to review and digest. Again, the
very latest you should submit your report is 2 weeks prior to the board
meeting.

Thanks,

The Apache Incubator PMC

Submitting your Report

--

Your report should contain the following:

*   Your project name
*   A brief description of your project, which assumes no knowledge of
the project or necessarily of its field
*   A list of the three most important issues to address in the move
towards graduation.
*   Any issues that the Incubator PMC or ASF Board might wish/need to be
aware of
*   How has the community developed since the last report
*   How has the project developed since the last report.

This should be appended to the Incubator Wiki page at:

http://wiki.apache.org/incubator/November2016

Note: This is manually populated. You may need to wait a little before
this page is created from a template.

Mentors
---

Mentors should review reports for their project(s) and sign them off on
the Incubator wiki page. Signing off reports shows that you are
following the project - projects that are not signed may raise alarms
for the Incubator PMC.

Incubator PMC





--
Jean-Baptiste Onofré
jbono...@apache.org
http://blog.nanthrax.net
Talend - http://www.talend.com


Re: Contributing for improvement of the Beam website

2016-10-29 Thread Jean-Baptiste Onofré
Please let me know if you need any help.

Regards
JB


On Oct 29, 2016, 13:10, at 13:10, Minudika Malshan  
wrote:
>Hi
>
>Thanks a lot! I will go through it.
>
>BR
>
>On Sat, Oct 29, 2016 at 4:27 PM, Jean-Baptiste Onofré 
>wrote:
>
>> Hi
>>
>> You can submit PR on incubator-beam-website.
>>
>> Take a look on site README about the Jekyll use etc.
>>
>> Regards
>> JB
>>
>> ⁣​
>>
>> On Oct 29, 2016, 12:54, at 12:54, Minudika Malshan
>
>> wrote:
>> >Hi all,
>> >
>> >I would like to know how to submit patches for Apache Beam website?
>> >I went through this[1] documentation. But beam has not been listed
>in
>> >CMS.
>> >Could someone please point out how to do the modifications and
>submit
>> >patches for the website?
>> >
>> >[1] http://apache.org/dev/contributors.html#websites
>> >
>> >Thanks!
>> >
>> >--
>> >*Minudika Malshan*
>>
>
>
>
>-- 
>*Minudika Malshan*


Re: Contributing for improvement of the Beam website

2016-10-29 Thread Jean-Baptiste Onofré
Hi

You can submit PR on incubator-beam-website.

Take a look on site README about the Jekyll use etc.

Regards
JB


On Oct 29, 2016, 12:54, at 12:54, Minudika Malshan  
wrote:
>Hi all,
>
>I would like to know how to submit patches for Apache Beam website?
>I went through this[1] documentation. But beam has not been listed in
>CMS.
>Could someone please point out how to do the modifications and submit
>patches for the website?
>
>[1] http://apache.org/dev/contributors.html#websites
>
>Thanks!
>
>-- 
>*Minudika Malshan*


Re: [VOTE] Apache Beam release 0.3.0-incubating

2016-10-28 Thread Jean-Baptiste Onofré
+1 (binding)

Regards
JB


On Oct 28, 2016, 10:58, at 10:58, Aljoscha Krettek  wrote:
>Hi everyone,
>Please review and vote on the release candidate #1 for the Apache Beam
>version 0.3.0-incubating, as follows:
>[ ] +1, Approve the release
>[ ] -1, Do not approve the release (please provide specific comments)
>
>
>The complete staging area is available for your review, which includes:
>* JIRA release notes [1],
>* the official Apache source release to be deployed to dist.apache.org
>[2],
>* all artifacts to be deployed to the Maven Central Repository [3],
>* source code tag "v0.3.0-incubating-RC1" [4],
>* website pull request listing the release and publishing the API
>reference
>manual [5].
>
>The Apache Beam community has unanimously approved this release [6].
>
>As customary, the vote will be open for at least 72 hours. It is
>adopted by
>a majority approval with at least three PMC affirmative votes. If
>approved,
>we will proceed with the release.
>
>Thanks!
>
>[1]
>https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12338051
>[2]
>https://dist.apache.org/repos/dist/dev/incubator/beam/0.3.0-incubating/
>[3]
>https://repository.apache.org/content/repositories/staging/org/apache/beam/
>[4]
>https://git-wip-us.apache.org/repos/asf?p=incubator-beam.git;a=tag;h=5d86ff7f04862444c266142b0d5acecb5a6b7144
>[5] https://github.com/apache/incubator-beam-site/pull/52
>[6]
>https://lists.apache.org/thread.html/b3736acb5edcea247a5a6a64c09ecacab794461bf1ea628152faeb82@%3Cdev.beam.apache.org%3E


Re: [VOTE] Apache Beam release 0.3.0-incubating

2016-10-28 Thread Jean-Baptiste Onofré
Hi John

Rat is supposed to run with the release profile. We are going to check that,
and why the DEPENDENCIES file has not been checked.

Regarding Kinesis, the dependency should not be embedded in any Beam jar or
distribution. The user has to explicitly define the dependency to be able to
use the IO, so it should not be an issue. Let me check that the scope is
actually "provided" there.

Thanks
Regards
JB


On Oct 29, 2016, 02:05, at 02:05, "John D. Ament"  wrote:
>Hi,
>
>mvn apache-rat:check fails on your release due to the DEPENDENCIES file
>not
>having a header.  If you don't need this file, please remove it.  I
>would
>also recommend leaving apache-rat running all the time to avoid newly
>introduced issues.
>
>In addition, I notice that your build output includes dependencies on
>aws-kinesis-client, which is Amazon Software Licensed.  Have you
>received
>clarification on whether you can include or not?
>
>John
>
>
>
>On Fri, Oct 28, 2016 at 4:49 AM Aljoscha Krettek 
>wrote:
>
>> Hi everyone,
>> Please review and vote on the release candidate #1 for the Apache
>Beam
>> version 0.3.0-incubating, as follows:
>> [ ] +1, Approve the release
>> [ ] -1, Do not approve the release (please provide specific comments)
>>
>>
>> The complete staging area is available for your review, which
>includes:
>> * JIRA release notes [1],
>> * the official Apache source release to be deployed to
>dist.apache.org
>> [2],
>> * all artifacts to be deployed to the Maven Central Repository [3],
>> * source code tag "v0.3.0-incubating-RC1" [4],
>> * website pull request listing the release and publishing the API
>reference
>> manual [5].
>>
>> The Apache Beam community has unanimously approved this release [6].
>>
>> As customary, the vote will be open for at least 72 hours. It is
>adopted by
>> a majority approval with at least three PMC affirmative votes. If
>approved,
>> we will proceed with the release.
>>
>> Thanks!
>>
>> [1]
>>
>>
>https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12338051
>> [2]
>>
>https://dist.apache.org/repos/dist/dev/incubator/beam/0.3.0-incubating/
>> [3]
>>
>https://repository.apache.org/content/repositories/staging/org/apache/beam/
>> [4]
>>
>>
>https://git-wip-us.apache.org/repos/asf?p=incubator-beam.git;a=tag;h=5d86ff7f04862444c266142b0d5acecb5a6b7144
>> [5] https://github.com/apache/incubator-beam-site/pull/52
>> [6]
>>
>>
>https://lists.apache.org/thread.html/b3736acb5edcea247a5a6a64c09ecacab794461bf1ea628152faeb82@%3Cdev.beam.apache.org%3E
>>


Re: Podling Report Reminder - November 2016

2016-10-27 Thread Jean-Baptiste Onofré
Perfect.

Thanks James !

Regards
JB


On Oct 27, 2016, 01:05, at 01:05, James Malone  
wrote:
>Hello everyone!
>
>Unless anyone disagrees or wants to do it, I am happy to volunteer to
>draft
>this podling report for review before we submit it. I can get it done
>for a
>review this Friday (US-Pacific) if that works.
>
>Cheers!
>
>James
>
>On Wed, Oct 26, 2016 at 4:01 PM,  wrote:
>
>> Dear podling,
>>
>> This email was sent by an automated system on behalf of the Apache
>> Incubator PMC. It is an initial reminder to give you plenty of time
>to
>> prepare your quarterly board report.
>>
>> The board meeting is scheduled for Wed, 16 November 2016, 10:30 am
>PDT.
>> The report for your podling will form a part of the Incubator PMC
>> report. The Incubator PMC requires your report to be submitted 2
>weeks
>> before the board meeting, to allow sufficient time for review and
>> submission (Wed, November 02).
>>
>> Please submit your report with sufficient time to allow the Incubator
>> PMC, and subsequently board members to review and digest. Again, the
>> very latest you should submit your report is 2 weeks prior to the
>board
>> meeting.
>>
>> Thanks,
>>
>> The Apache Incubator PMC
>>
>> Submitting your Report
>>
>> --
>>
>> Your report should contain the following:
>>
>> *   Your project name
>> *   A brief description of your project, which assumes no knowledge
>of
>> the project or necessarily of its field
>> *   A list of the three most important issues to address in the move
>> towards graduation.
>> *   Any issues that the Incubator PMC or ASF Board might wish/need to
>be
>> aware of
>> *   How has the community developed since the last report
>> *   How has the project developed since the last report.
>>
>> This should be appended to the Incubator Wiki page at:
>>
>> http://wiki.apache.org/incubator/November2016
>>
>> Note: This is manually populated. You may need to wait a little
>before
>> this page is created from a template.
>>
>> Mentors
>> ---
>>
>> Mentors should review reports for their project(s) and sign them off
>on
>> the Incubator wiki page. Signing off reports shows that you are
>> following the project - projects that are not signed may raise alarms
>> for the Incubator PMC.
>>
>> Incubator PMC
>>


Re: Can we have more quick start examples ?

2016-10-27 Thread Jean-Baptiste Onofré
Yes, it sounds good to me. I would love to see this as part of the examples.

Ismaël and I also started the beam-samples
(http://github.com/jbonofre/beam-samples), which could become part of the examples.
The purpose is to have more real use case implementations with real data.

Regards
JB


On Oct 27, 2016, 17:17, at 17:17, Jesse Anderson  wrote:
>


Re: [DISCUSS] Using Verbs for Transforms

2016-10-27 Thread Jean-Baptiste Onofré
You did well! It's an interesting discussion we're having, and it's great to
have it on the mailing list (better than in JIRA or PR comments IMHO).

Thanks !

Regards
JB


On Oct 27, 2016, 20:39, at 20:39, Robert Bradshaw  
wrote:
>+1 to all Dan says.
>
>I only brought this up because it seemed new contributors (yay)
>jumping in and renaming a core transform based on "Something to
>consider" deserved a couple more more eyeballs, but didn't intend for
>it to become a big deal.
>
>On Thu, Oct 27, 2016 at 11:03 AM, Dan Halperin
> wrote:
>> Folks, I don't think this needs to be a "vote". This is just not that
>big a
>> deal :). It is important to be transparent and have these discussions
>on
>> the list, which is why we brought it here from GitHub/JIRA, but at
>the end
>> of the day I hope that a small group of committers and developers can
>> assess "good enough" consensus for these minor issues.
>>
>> Here's my assessment:
>> * We don't really have any rules about naming transforms. "Should be
>a
>> verb" is a sort of guiding principle inherited from the Google Flume
>> project from which Dataflow evolved, but honestly we violate this
>rule for
>> clarity all over the place. ("Values", for example).
>> * The "Big Data" community is significantly more familiar with the
>concept
>> of Distinct -- Jesse, who filed the original JIRA, is a good example
>here.
>> * Finally, nobody feels very strongly. We could argue minor points of
>each
>> solution, but at the end of the day I don't think anyone wants to
>block a
>> change.
>>
>> Let's go with Distinct. It's important to align Beam with the open
>source
>> big data community. (And thanks Jesse, our newest (*tied) committer,
>for
>> pushing us in the right direction!)
>>
>> Jesse, can you please take charge of wrapping up the PR and merging
>it?
>>
>> Thanks!
>> Dan
>>
>> On Wed, Oct 26, 2016 at 11:12 PM, Jean-Baptiste Onofré
>
>> wrote:
>>
>>> Just to clarify. Davor is right for a code modification change: -1
>means a
>>> veto.
>>> I meant that -1 is not a veto for a release vote.
>>>
>>> Anyway, even if it's not a formal code, we can have a discussion
>with
>>> "options" a,b and c.
>>>
>>> Regards
>>> JB
>>>
>>> ⁣
>>>
>>> On Oct 27, 2016, 06:48, at 06:48, Davor Bonaci
>
>>> wrote:
>>> >In terms of reaching a decision on any code or design changes,
>>> >including
>>> >this one, I'd suggest going without formal votes. Voting process
>for
>>> >code
>>> >modifications between choices A and B doesn't necessarily end with
>a
>>> >decision A or B -- a single (qualified) -1 vote is a veto and
>cannot be
>>> >overridden [1]. Said differently, the guideline is that code
>changes
>>> >should
>>> >be made by consensus; not by one group outvoting another. I'd like
>to
>>> >avoid
>>> >setting such precedent; we should try to drive consensus, as
>opposed to
>>> >attempting to outvote another part of the community.
>>> >
>>> >In this particular case, we have had a great discussion. Many
>>> >contributors
>>> >brought different perspectives. Consequently, some opinions have
>been
>>> >likely changed. At this point, someone should summarize the
>arguments,
>>> >try
>>> >to critique them from a neutral standpoint, and suggest a refined
>>> >proposal
>>> >that takes these perspectives into account. If nobody objects in a
>>> >short
>>> >time, we should consider this decided. [ I can certainly help here,
>but
>>> >I'd
>>> >love to see somebody else do it! ]
>>> >
>>> >[1] http://www.apache.org/foundation/voting.html
>>> >
>>> >On Wed, Oct 26, 2016 at 7:35 AM, Ben Chambers
>>> >
>>> >wrote:
>>> >
>>> >> I also like Distinct since it doesn't make it sound like it
>modifies
>>> >any
>>> >> underlying collection. RemoveDuplicates makes it sound like the
>>> >duplicates
>>> >> are removed, rather than a new PCollection without duplicates
>being
>>> >> returned.
>>> >>
>>> >> On Wed, Oct 26, 2016, 7:36 AM Jean-Baptiste Onofré
>
>>> >> wr

Re: [DISCUSS] Using Verbs for Transforms

2016-10-27 Thread Jean-Baptiste Onofré
It sounds good to me.

So basically you did a kind of vote by proposing a solution ;)

Regards
JB


On Oct 27, 2016, 20:04, at 20:04, Dan Halperin  
wrote:
>Folks, I don't think this needs to be a "vote". This is just not that
>big a
>deal :). It is important to be transparent and have these discussions
>on
>the list, which is why we brought it here from GitHub/JIRA, but at the
>end
>of the day I hope that a small group of committers and developers can
>assess "good enough" consensus for these minor issues.
>
>Here's my assessment:
>* We don't really have any rules about naming transforms. "Should be a
>verb" is a sort of guiding principle inherited from the Google Flume
>project from which Dataflow evolved, but honestly we violate this rule
>for
>clarity all over the place. ("Values", for example).
>* The "Big Data" community is significantly more familiar with the
>concept
>of Distinct -- Jesse, who filed the original JIRA, is a good example
>here.
>* Finally, nobody feels very strongly. We could argue minor points of
>each
>solution, but at the end of the day I don't think anyone wants to block
>a
>change.
>
>Let's go with Distinct. It's important to align Beam with the open
>source
>big data community. (And thanks Jesse, our newest (*tied) committer,
>for
>pushing us in the right direction!)
>
>Jesse, can you please take charge of wrapping up the PR and merging it?
>
>Thanks!
>Dan
>
>On Wed, Oct 26, 2016 at 11:12 PM, Jean-Baptiste Onofré
>
>wrote:
>
>> Just to clarify. Davor is right for a code modification change: -1
>means a
>> veto.
>> I meant that -1 is not a veto for a release vote.
>>
>> Anyway, even if it's not a formal code, we can have a discussion with
>> "options" a,b and c.
>>
>> Regards
>> JB
>>
>> ⁣​
>>
>> On Oct 27, 2016, 06:48, at 06:48, Davor Bonaci
>
>> wrote:
>> >In terms of reaching a decision on any code or design changes,
>> >including
>> >this one, I'd suggest going without formal votes. Voting process for
>> >code
>> >modifications between choices A and B doesn't necessarily end with a
>> >decision A or B -- a single (qualified) -1 vote is a veto and cannot
>be
>> >overridden [1]. Said differently, the guideline is that code changes
>> >should
>> >be made by consensus; not by one group outvoting another. I'd like
>to
>> >avoid
>> >setting such precedent; we should try to drive consensus, as opposed
>to
>> >attempting to outvote another part of the community.
>> >
>> >In this particular case, we have had a great discussion. Many
>> >contributors
>> >brought different perspectives. Consequently, some opinions have
>been
>> >likely changed. At this point, someone should summarize the
>arguments,
>> >try
>> >to critique them from a neutral standpoint, and suggest a refined
>> >proposal
>> >that takes these perspectives into account. If nobody objects in a
>> >short
>> >time, we should consider this decided. [ I can certainly help here,
>but
>> >I'd
>> >love to see somebody else do it! ]
>> >
>> >[1] http://www.apache.org/foundation/voting.html
>> >
>> >On Wed, Oct 26, 2016 at 7:35 AM, Ben Chambers
>> >
>> >wrote:
>> >
>> >> I also like Distinct since it doesn't make it sound like it
>modifies
>> >any
>> >> underlying collection. RemoveDuplicates makes it sound like the
>> >duplicates
>> >> are removed, rather than a new PCollection without duplicates
>being
>> >> returned.
>> >>
>> >> On Wed, Oct 26, 2016, 7:36 AM Jean-Baptiste Onofré
>
>> >> wrote:
>> >>
>> >> > Agree. It was more a transition proposal.
>> >> >
>> >> > Regards
>> >> > JB
>> >> >
>> >> > ⁣​
>> >> >
>> >> > On Oct 26, 2016, 08:31, at 08:31, Robert Bradshaw
>> >> >  wrote:
>> >> > >On Mon, Oct 24, 2016 at 11:02 PM, Jean-Baptiste Onofré
>> >> > > wrote:
>> >> > >> And what about use RemoveDuplicates and create an alias
>Distinct
>> >?
>> >> > >
>> >> > >I'd really like to avoid (long term) aliases--you end up having to
>> >> > >document (and maintain) them both, and it adds confusion as to which
>> >> > >one to use.
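
For readers skimming the API point in this thread: below is a minimal
sketch, in the Java SDK, of the transform under its new name. It assumes
Distinct keeps the same create() factory shape that RemoveDuplicates had;
the final signature is not quoted anywhere in the thread.

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.transforms.Create;
    import org.apache.beam.sdk.transforms.Distinct;
    import org.apache.beam.sdk.values.PCollection;

    public class DistinctExample {
      public static void main(String[] args) {
        // Requires a runner (e.g. the DirectRunner) on the classpath.
        Pipeline p = Pipeline.create();
        PCollection<String> words = p.apply(Create.of("a", "b", "a", "c"));
        // Distinct returns a *new* PCollection without duplicates; the
        // input collection is untouched, which is exactly why the name
        // reads better than RemoveDuplicates, per Ben's point above.
        PCollection<String> distinct = words.apply(Distinct.<String>create());
        p.run().waitUntilFinish();
      }
    }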

Re: [PROPOSAL] New Beam website design?

2016-10-27 Thread Jean-Baptiste Onofré
Great!! Thanks.

You can take a look at BEAM-500 and BEAM-501, and also at the PR I did last
week.

I plan to submit new PRs during the weekend, so please let me know how we
can sync.

Thanks
Regards
JB


On Oct 27, 2016, 14:04, at 14:04, Minudika Malshan  
wrote:
>Hi all,
>
>I would like to join the development of the new site.
>Is there any issue tracking method for this? (Are there any JIRA
>issues?)
>
>Thank you!
>
>
>
>On Thu, Oct 27, 2016 at 4:01 PM, Jean-Baptiste Onofré 
>wrote:
>
>> Hi
>>
>> You can propose a PR on this Jira.
>>
>> We will be more than happy to review it.
>>
>> Thanks
>> Regards
>> JB
>>
>>
>> On Oct 27, 2016, 11:26, at 11:26, Abdullah Bashir
>
>> wrote:
>> >Thank you very much for taking the time to respond, Davor :)
>> >
>> >Regarding BEAM-752, I can work on that; I have already built some
>> >Dataflow pipelines on Google Cloud in Python.
>> >
>> >Also, can you tell me where to start for BEAM-752? I am new to ASF
>> >contribution, so the onboarding steps are kind of a black box to me :).
>> >
>> >On Thu, Oct 27, 2016 at 11:34 AM, Davor Bonaci 
>> >wrote:
>> >
>> >> Absolutely!
>> >>
>> >> I'm currently reviewing JB's PR #51, and that should go in shortly.
>> >> Within a day or so, I should have a better idea about future work in
>> >> this specific area; please stay tuned.
>> >>
>> >> There are also separate things that are ready to be started at any
>> >> time. BEAM-752 comes to mind first. Is this something you'd be
>> >> interested in?
>> >>
>> >> On Wed, Oct 26, 2016 at 11:17 PM, Abdullah Bashir
>> >
>> >> wrote:
>> >>
>> >>> Hi Davor,
>> >>>
>> >>> I am done with my local setup to start contributing; I have forked
>> >>> and merged pull request (pull/51) into my local repo. Then I read
>> >>> the Google Docs; there are two tasks mentioned in them, [BEAM-500]
>> >>> and [BEAM-501]. I found out that [BEAM-500] is closed in JIRA and
>> >>> [BEAM-501] is assigned to Jean-Baptiste Onofré. Is there any task
>> >>> that you can assign to me?
>> >>>
>> >>> Thanks.
>> >>>
>> >>> Regards,
>> >>> Abdullah Bashir
>> >>>
>> >>>
>> >>> On Tue, Oct 25, 2016 at 1:50 AM, Davor Bonaci 
>> >wrote:
>> >>>
>> >>> > Abdullah, welcome!
>> >>> >
>> >>> > I think it's rather clear we've been struggling with the website,
>> >>> > so any help is very welcome. It is a little bit messy right now --
>> >>> > there are a few outstanding pull requests and forked branches. I'm
>> >>> > trying to get all this into one place, so anybody can contribute
>> >>> > and make progress.
>> >>> >
>> >>> > Also, the general website organization has been discussed before;
>> >>> > see this thread [1] and the attached document for details.
>> >>> >
>> >>> > Davor
>> >>> >
>> >>> > [1]
>> >>> > https://mail-archives.apache.org/mod_mbox/beam-dev/201606.
>> >>> > mbox/%3CCAAzyFAwu992x+xcxN6Ha-avKZZbF-RK00mUg1-vezYCmtOm4Ww@
>> >>> > mail.gmail.com%3E
>> >>> >
>> >>> > On Sun, Oct 23, 2016 at 12:34 AM, Jean-Baptiste Onofré
>> >> >>> >
>> >>> > wrote:
>> >>> >
>> >>> > > Hi
>> >>> > >
>> >>> > > You can take a look at the PR I created last Friday. It
>> >>> > > contains a CSS/skin proposal.
>> >>> > >
>> >>> > > The mock-up is there: http://maven.nanthrax.net/beam
>> >>> > >
>> >>> > > Regards
>> >>> > > JB
>> >>> > >
>> >>> > >
>> >>> > > On Oct 23, 2016, 09:27, at 09:27, Abdullah Bashir <
>> >>> > mabdullah...@gmail.com>
>> >>> > > wrote:
>> >>> > > >Hi,
>> >>> > > >
>> >>> > > >Is there any help I can do on the website design?
>> >>> > > >I am good at HTML5, CSS3, and JavaScript.
>> >>> > > >
>> >>> > > >Regards,
>> >>> > > >Abdullah Bashir
>> >>> > >
>> >>> >
>> >>>
>> >>
>> >>
>>
>
>
>
>-- 
>*Minudika Malshan*
>Undergraduate
>Department of Computer Science and Engineering
>University of Moratuwa
>Sri Lanka.


Re: [VOTE] Release 0.3.0-incubating, release candidate #1

2016-10-27 Thread Jean-Baptiste Onofré
No problem for the vote.

For graduation, we are already thinking about it, yes.

Regards
JB


On Oct 27, 2016, 08:54, at 08:54, "Sergio Fernández"  wrote:
>Hi JB,
>
>On Tue, Oct 25, 2016 at 12:00 PM, Jean-Baptiste Onofré
>
>wrote:
>
>> Thanks Sergio ;)
>>
>
>You are welcome.
>
>
>> Just tried to explain to the others what a binding vote is ;)
>>
>
>It's a common mistake in many podlings that PPMC members think they
>have binding votes over developers who are not part of the project.
>But during incubation, only IPMC votes are binding. I hope that's clear.
>
>In theory it's simple, so sorry if I've made some noise with that. I'll
>repeat my vote later at general@incubator if you prefer it that way.
>
>Cheers,
>
>P.S.: after 0.3.0-incubating, are you thinking about graduation? I
>think
>you should ;-)
>
>
>
>On Oct 25, 2016, 11:53, at 11:53, "Sergio Fernández"
>
>> wrote:
>> >On Tue, Oct 25, 2016 at 11:36 AM, Jean-Baptiste Onofré
>> >
>> >wrote:
>> >
>> >> By the way, your vote is not binding from a podling perspective
>> >> (you are not PPMC). Your vote is binding from an IPMC perspective
>> >> (so when you vote on the incubator mailing list).
>> >>
>> >
>> >Well, PPMC votes are never binding; only IPMC votes actually are.
>> >That I'm not part of the PPMC is not very relevant. Therefore I think
>> >my vote is still a valid binding one, but I can vote again on
>> >general@incubator, no problem.
>> >
>> >Sorry for jumping in too early. Besides an IPMC member, I'm also a
>> >developer interested in Beam ;-)
>> >
>> >Cheers,
>> >
>> >
>> >
>> >
>> >> On Oct 25, 2016, 11:33, at 11:33, "Sergio Fernández"
>> >
>> >> wrote:
>> >> >+1 (binding)
>> >> >
>> >> >So far I've successfully checked:
>> >> >* signatures and digests
>> >> >* source release file layouts
>> >> >* matched git tags and commit ids
>> >> >* incubator suffix and disclaimer
>> >> >* NOTICE and LICENSE files
>> >> >* license headers
>> >> >* clean build (Java 1.8.0_91, Scala 2.11.7, SBT 0.13.9, Debian
>> >> >amd64)
>> >> >
>> >> >
>> >> >A couple of minor issues I've seen that it'd be great to have
>> >> >fixed in upcoming releases (two sketches follow below):
>> >> >* MongoDbIOTest fails (addr already in use) when a Mongo service is
>> >> >running locally. I'd say the port should be random in the test
>> >> >suite.
>> >> >* How did you generate the checksums? Neither SHA1 nor MD5 can be
>> >> >checked automatically: "no properly formatted SHA1/MD5 checksum
>> >> >lines found".
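
Two quick sketches for the issues above; class names and file names here
are hypothetical, not taken from the release. First, letting the OS pick a
free ephemeral port for an embedded Mongo instead of hard-coding the
default, so the suite cannot collide with a locally running mongod:

    import java.io.IOException;
    import java.net.ServerSocket;

    public class PortPicker {
      // Bind to port 0 and let the OS assign a free ephemeral port.
      // (There is a small race between closing the socket and the test
      // reusing the port, but it avoids fixed-port collisions.)
      public static int findFreePort() throws IOException {
        try (ServerSocket socket = new ServerSocket(0)) {
          return socket.getLocalPort();
        }
      }

      public static void main(String[] args) throws IOException {
        System.out.println("embedded Mongo could bind to " + findFreePort());
      }
    }

Second, the kind of digest check that tools could not run automatically
here: it compares a locally computed SHA-1 against the published .sha1
file, which tools like sha1sum -c can only do when that file follows the
"<hex>  <filename>" layout:

    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.security.MessageDigest;

    public class ChecksumCheck {
      public static void main(String[] args) throws Exception {
        // Hypothetical artifact name; any staged release file works.
        Path artifact =
            Paths.get("apache-beam-0.3.0-incubating-source-release.zip");
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        md.update(Files.readAllBytes(artifact));
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
          hex.append(String.format("%02x", b));
        }
        // Accept either a bare digest or "<hex>  <filename>" lines.
        String published = new String(
            Files.readAllBytes(Paths.get(artifact + ".sha1")),
            StandardCharsets.UTF_8).trim();
        System.out.println(
            published.startsWith(hex.toString()) ? "OK" : "MISMATCH");
      }
    }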
>> >> >
>> >> >Great to see the project moving forward at this speed :-)
>> >> >
>> >> >Cheers,
>> >> >
>> >> >
>> >> >
>> >> >On Mon, Oct 24, 2016 at 11:30 PM, Aljoscha Krettek
>> >> >
>> >> >wrote:
>> >> >
>> >> >> Hi Team!
>> >> >>
>> >> >> Please review and vote at your leisure on release candidate #1
>> >> >> for version
>> >> >> 0.3.0-incubating, as follows:
>> >> >> [ ] +1, Approve the release
>> >> >> [ ] -1, Do not approve the release (please provide specific
>> >> >> comments)
>> >> >>
>> >> >> The complete staging area is available for your review, which
>> >> >> includes:
>> >> >> * JIRA release notes [1],
>> >> >> * the official Apache source release to be deployed to
>> >> >> dist.apache.org [2],
>> >> >> * all artifacts to be deployed to the Maven Central Repository
>> >> >> [3],
>> >> >> * source code tag "v0.3.0-incubating-RC1" [4],
>> >> >> * website pull request listing the release and publishing the
>> >> >> API reference manual [5].
>> >> >>
>> >> >> Please keep in mind that this release is not focused on providing
>> >> >> new functionality. We want to refine the release process and make
>> >> >> stable source and binary artefacts available to our users.
>> >> >>
>> >> >> The vote will be open for at least 72 hours. It is adopted by
>> >> >> majority approval, with at least 3 PPMC affirmative votes.
>> >> >>
>> >> >> Cheers,
>> >> >> Aljoscha
>> >> >>
>> >> >> [1]
>> >> >> https://issues.apache.org/jira/secure/ReleaseNote.jspa?
>> >> >> projectId=12319527&version=12338051
>> >> >> [2]
>> >> >> https://dist.apache.org/repos/dist/dev/incubator/beam/0.3.0-
>> >> >> incubating/
>> >> >> [3]
>> >> >> https://repository.apache.org/content/repositories/staging/
>> >> >> org/apache/beam/
>> >> >> [4]
>> >> >> https://git-wip-us.apache.org/repos/asf?p=incubator-beam.git;a=tag;h=
>> >> >> 5d86ff7f04862444c266142b0d5acecb5a6b7144
>> >> >> [5] https://github.com/apache/incubator-beam-site/pull/52
>> >> >>
>> >> >
>> >> >
>> >> >
>> >> >--
>> >> >Sergio Fernández
>> >> >Partner Technology Manager
>> >> >Redlink GmbH
>> >> >m: +43 6602747925
>> >> >e: sergio.fernan...@redlink.co
>> >> >w: http://redlink.co
>> >>
>> >
>> >
>> >
>> >--
>> >Sergio Fernández
>> >Partner Technology Manager
>> >Redlink GmbH
>> >m: +43 6602747925
>> >e: sergio.fernan...@redlink.co
>> >w: http://redlink.co
>>
>
>
>
>-- 
>Sergio Fernández
>Partner Technology Manager
>Redlink GmbH
>m: +43 6602747925
>e: sergio.fernan...@redlink.co
>w: http://redlink.co

