Build failed in Jenkins: beam_Release_NightlySnapshot #330

2017-02-15 Thread Apache Jenkins Server
See 

Changes:

[altay] Make examples blocking as command line invoked

[ansela] [BEAM-774] Implement Metrics support for Spark runner

[ansela] Register beam metrics with a MetricSource in Spark

[ansela] Remove duplicate classes from spark runner marking sdk classes

[ansela] Throw UnsupportedOperationException for committed metrics results in

[ansela] Recover metrics values from checkpoint

[altay] Add unsigned 64 bit int read/write methods to cythonized stream

[klk] Upgrade bytebuddy to 1.6.8 to jump past asm 5.0

[peihe] [BEAM-59] Beam GcsFileSystem: port expand() from GcsUtil for glob

[tgroh] Add runners/core-construction-java

[klk] Fix some DoFn javadoc literals

--
[...truncated 7320 lines...]
2017-02-16T07:18:30.218 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/apache/httpcomponents/httpcore-nio/4.4.5/httpcore-nio-4.4.5.jar
2017-02-16T07:18:30.218 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/apache/httpcomponents/httpcore/4.4.5/httpcore-4.4.5.jar
2017-02-16T07:18:30.246 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/elasticsearch/client/rest/5.0.0/rest-5.0.0.jar
 (32 KB at 1004.3 KB/sec)
2017-02-16T07:18:30.246 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/apache/httpcomponents/httpasyncclient/4.1.2/httpasyncclient-4.1.2.jar
2017-02-16T07:18:30.254 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/com/google/code/gson/gson/2.6.2/gson-2.6.2.jar
 (225 KB at 5606.7 KB/sec)
2017-02-16T07:18:30.254 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/elasticsearch/elasticsearch/2.4.1/elasticsearch-2.4.1.jar
2017-02-16T07:18:30.261 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/commons-codec/commons-codec/1.10/commons-codec-1.10.jar
 (278 KB at 6033.1 KB/sec)
2017-02-16T07:18:30.261 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/apache/lucene/lucene-core/5.5.2/lucene-core-5.5.2.jar
2017-02-16T07:18:30.263 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/apache/httpcomponents/httpcore/4.4.5/httpcore-4.4.5.jar
 (320 KB at 6950.0 KB/sec)
2017-02-16T07:18:30.263 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/apache/lucene/lucene-backward-codecs/5.5.2/lucene-backward-codecs-5.5.2.jar
2017-02-16T07:18:30.264 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/apache/httpcomponents/httpcore-nio/4.4.5/httpcore-nio-4.4.5.jar
 (348 KB at 7401.1 KB/sec)
2017-02-16T07:18:30.264 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/apache/lucene/lucene-analyzers-common/5.5.2/lucene-analyzers-common-5.5.2.jar
2017-02-16T07:18:30.278 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/apache/httpcomponents/httpasyncclient/4.1.2/httpasyncclient-4.1.2.jar
 (173 KB at 2882.7 KB/sec)
2017-02-16T07:18:30.278 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/apache/lucene/lucene-queries/5.5.2/lucene-queries-5.5.2.jar
2017-02-16T07:18:30.333 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/apache/lucene/lucene-queries/5.5.2/lucene-queries-5.5.2.jar
 (246 KB at 2134.9 KB/sec)
2017-02-16T07:18:30.333 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/apache/lucene/lucene-memory/5.5.2/lucene-memory-5.5.2.jar
2017-02-16T07:18:30.351 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/apache/lucene/lucene-backward-codecs/5.5.2/lucene-backward-codecs-5.5.2.jar
 (421 KB at 3161.7 KB/sec)
2017-02-16T07:18:30.351 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/apache/lucene/lucene-highlighter/5.5.2/lucene-highlighter-5.5.2.jar
2017-02-16T07:18:30.360 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/apache/lucene/lucene-memory/5.5.2/lucene-memory-5.5.2.jar
 (34 KB at 232.5 KB/sec)
2017-02-16T07:18:30.360 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/apache/lucene/lucene-queryparser/5.5.2/lucene-queryparser-5.5.2.jar
2017-02-16T07:18:30.386 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/apache/lucene/lucene-analyzers-common/5.5.2/lucene-analyzers-common-5.5.2.jar
 (1540 KB at 9164.5 KB/sec)
2017-02-16T07:18:30.386 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/apache/lucene/lucene-sandbox/5.5.2/lucene-sandbox-5.5.2.jar
2017-02-16T07:18:30.386 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/apache/lucene/lucene-highlighter/5.5.2/lucene-highlighter-5.5.2.jar
 (142 KB at 840.5 KB/sec)
2017-02-16T07:18:30.386 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/apache/lucene/lucene-suggest/5.5.2/lucene-suggest-5.5.2.jar
2017-02-16T07:18:30.423 [INFO] Downloaded: 
https://repo.maven.apache.org/maven2/org/apache/lucene/lucene-queryparser/5.5.2/lucene-queryparser-5.5.2.jar
 (393 KB at 1916.4 KB/sec)
2017-02-16T07:18:30.423 [INFO] Downloading: 
https://repo.maven.apache.org/maven2/org/apache/lucene/lucene-misc/5.5.2/lucene-misc-5.5.2.jar
[...log truncated...]

Pipeline Surgery and an interception-free future

2017-02-15 Thread Thomas Groh
As of Github PR #1998 (https://github.com/apache/beam/pull/1998), the new
Pipeline Surgery API is ready and available. There are a couple of
refinements coming in PR #2006, but in general pipelines can now, post
construction, have PTransforms swapped out for whatever the runner desires
(standard "behavior-maintaining" caveats apply).

Moving forward, this will enable pipelines to be run on multiple runners
without having to reconstruct the graph via repeated applications of
PTransforms to the pipeline (this also includes being able to, for example,
read a pipeline from a serialized representation and execute the result
on an arbitrary runner).
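
To make that concrete, here is a minimal sketch of what runner-side
replacement could look like. The names (replaceAll, PTransformOverride,
PTransformMatchers) follow the API as it later stabilized in the SDK and
should be read as assumptions rather than the exact surface of PR #1998;
SomePrimitive and SomeOverrideFactory are hypothetical:

  // A runner declares which transforms to replace after construction,
  // instead of intercepting Pipeline.apply():
  pipeline.replaceAll(ImmutableList.of(
      PTransformOverride.of(
          // match every application of a given transform class...
          PTransformMatchers.classEqualTo(SomePrimitive.class),
          // ...and build the runner-specific replacement, mapping the
          // matched transform's outputs onto the replacement's outputs
          new SomeOverrideFactory())));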

Those of you who are runner authors (at least, those who I can easily
identify as such) should expect a Pull Request from me sometime next week
porting you off of intercepting calls to apply and onto the new surgery API.
You are, of course, welcome to beat me to the punch.

Thanks,

Thomas


Re: Merge HadoopInputFormatIO and HDFSIO in a single module

2017-02-15 Thread Raghu Angadi
Dipti,

Also how about calling it just HadoopIO?

On Wed, Feb 15, 2017 at 11:13 AM, Raghu Angadi  wrote:

> I skimmed through HdfsIO and I think it is essentially HadoopInputFormatIO
> with FileInputFormat. I would pretty much move most of the code to
> HadoopInputFormatIO (just make HdfsIO a specific instance of HIF_IO).
>
> On Wed, Feb 15, 2017 at 9:15 AM, Dipti Kulkarni <
> dipti_dkulka...@persistent.com> wrote:
>
>> Hello there!
>> I am working on writing a Read IO for Hadoop InputFormat. This will
>> enable reading from any data source which supports Hadoop InputFormat, i.e.
>> provides a source to read from an InputFormat for integration with Hadoop.
>> It makes sense for the HadoopInputFormatIO to share some code with the
>> HdfsIO - WritableCoder in particular, but also some helper classes like
>> SerializableSplit etc. I was wondering if we could move HDFS and
>> HadoopInputFormat into a shared module for Hadoop IO in general instead of
>> maintaining them separately.
>> Do let me know what you think, and please let me know if you can think of
>> any other ideas too.
>>
>> Thanks,
>> Dipti
>>
>


Re: Merge HadoopInputFormatIO and HDFSIO in a single module

2017-02-15 Thread Stephen Sisk
Hi Dipti!

It sounds like there are two possible implementation options:
1. HdfsIO that is implemented using HadoopInputFormatIO
2. HdfsIO that is implemented using IOChannelFactory (I think
BeamFileSystem is the new name?)

Either way, I agree that it makes sense to have one module that contains
the IO transforms that rely on Hadoop, so it sounds like merging them is a
good path forward.

What I'm not sure about is whether we agree that Dipti should submit the
change without having to refactor HdfsIO (beyond some obvious, easy
refactoring to e.g. use WritableCoder/SerializableSplit). I don't want to
create too much additional work, but if the correct implementation is #1
(HdfsIO uses HIFIO), then it seems like the right time to do that would
probably be now given how much code is shared there. If the correct answer
is #2, then I don't think we should do that refactoring now.

S

On Wed, Feb 15, 2017 at 11:27 AM Jean-Baptiste Onofré 
wrote:

> Hi
>
> I guess you saw my comment in the PR. Basically I was waiting for the
> refactoring of IOChannelFactory to refactor the HDFS IO as a Hadoop file
> format on top of IOChannelFactory. I would wait a bit, and I would be more
> than happy to help you on the PR.
>
> Regards
> JB
>
> On Feb 15, 2017, at 14:55, Dipti Kulkarni <
> dipti_dkulka...@persistent.com> wrote:
> >Hello there!
> >I am working on writing a Read IO for Hadoop InputFormat. This will
> >enable reading from any data source which supports Hadoop InputFormat,
> >i.e. provides a source to read from an InputFormat for integration with
> >Hadoop.
> >It makes sense for the HadoopInputFormatIO to share some code with the
> >HdfsIO - WritableCoder in particular, but also some helper classes like
> >SerializableSplit etc. I was wondering if we could move HDFS and
> >HadoopInputFormat into a shared module for Hadoop IO in general instead
> >of maintaining them separately.
> >Do let me know what you think, and please let me know if you can think
> >of any other ideas too.
> >
> >Thanks,
> >Dipti
> >
>


Re: Merge HadoopInputFormatIO and HDFSIO in a single module

2017-02-15 Thread Jean-Baptiste Onofré
Hi

I guess you saw my comment in the PR. Basically I was waiting for the
refactoring of IOChannelFactory to refactor the HDFS IO as a Hadoop file
format on top of IOChannelFactory. I would wait a bit, and I would be more
than happy to help you on the PR.

Regards
JB

On Feb 15, 2017, at 14:55, Dipti Kulkarni
 wrote:
>Hello there!
>I am working on writing a Read IO for Hadoop InputFormat. This will
>enable reading from any data source which supports Hadoop InputFormat,
>i.e. provides a source to read from an InputFormat for integration with
>Hadoop.
>It makes sense for the HadoopInputFormatIO to share some code with the
>HdfsIO - WritableCoder in particular, but also some helper classes like
>SerializableSplit etc. I was wondering if we could move HDFS and
>HadoopInputFormat into a shared module for Hadoop IO in general instead
>of maintaining them separately.
>Do let me know what you think, and please let me know if you can think
>of any other ideas too.
>
>Thanks,
>Dipti
>


Re: Merge HadoopInputFormatIO and HDFSIO in a single module

2017-02-15 Thread Jean-Baptiste Onofré
Hi

That's what I said in the Hadoop file format PR.
When I discussed the refactoring of the IOChannelFactory with Davor and
Pei, I proposed to refactor the HDFS IO to deal with the Hadoop file
format on top of the file IO.

Regards
JB

On Feb 15, 2017, at 15:13, Raghu Angadi 
wrote:
>I skimmed through HdfsIO and I think it is essentially
>HadoopInputFormatIO with FileInputFormat. I would pretty much move most
>of the code to HadoopInputFormatIO (just make HdfsIO a specific instance
>of HIF_IO).
>
>On Wed, Feb 15, 2017 at 9:15 AM, Dipti Kulkarni <
>dipti_dkulka...@persistent.com> wrote:
>
>> Hello there!
>> I am working on writing a Read IO for Hadoop InputFormat. This will
>> enable reading from any data source which supports Hadoop InputFormat,
>> i.e. provides a source to read from an InputFormat for integration with
>> Hadoop.
>> It makes sense for the HadoopInputFormatIO to share some code with the
>> HdfsIO - WritableCoder in particular, but also some helper classes like
>> SerializableSplit etc. I was wondering if we could move HDFS and
>> HadoopInputFormat into a shared module for Hadoop IO in general instead
>> of maintaining them separately.
>> Do let me know what you think, and please let me know if you can think
>> of any other ideas too.
>>
>> Thanks,
>> Dipti
>>


Re: Merge HadoopInputFormatIO and HDFSIO in a single module

2017-02-15 Thread Raghu Angadi
I skimmed through HdfsIO and I think it is essentially HadoopInputFormatIO
with FileInputFormat. I would pretty much move most of the code to
HadoopInputFormatIO (just make HdfsIO a specific instance of HIF_IO).

On Wed, Feb 15, 2017 at 9:15 AM, Dipti Kulkarni <
dipti_dkulka...@persistent.com> wrote:

> Hello there!
> I am working on writing a Read IO for Hadoop InputFormat. This will enable
> reading from any data source which supports Hadoop InputFormat, i.e.
> provides a source to read from an InputFormat for integration with Hadoop.
> It makes sense for the HadoopInputFormatIO to share some code with the
> HdfsIO - WritableCoder in particular, but also some helper classes like
> SerializableSplit etc. I was wondering if we could move HDFS and
> HadoopInputFormat into a shared module for Hadoop IO in general instead of
> maintaining them separately.
> Do let me know what you think, and please let me know if you can think of
> any other ideas too.
>
> Thanks,
> Dipti
>


Merge HadoopInputFormatIO and HDFSIO in a single module

2017-02-15 Thread Dipti Kulkarni
Hello there!
I am working on writing a Read IO for Hadoop InputFormat. This will enable
reading from any data source which supports Hadoop InputFormat, i.e.
provides a source to read from an InputFormat for integration with Hadoop.
It makes sense for the HadoopInputFormatIO to share some code with the HdfsIO -
WritableCoder in particular, but also some helper classes like
SerializableSplit etc. I was wondering if we could move HDFS and
HadoopInputFormat into a shared module for Hadoop IO in general instead of
maintaining them separately.
Do let me know what you think, and please let me know if you can think of
any other ideas too.

Thanks,
Dipti
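
For context, a minimal sketch of what reading through a Hadoop InputFormat
could look like. The builder names follow the HadoopInputFormatIO that
eventually shipped in Beam, so treat them as assumptions at this point in
the discussion; TextInputFormat is just an arbitrary example format:

  // Configure which InputFormat to use and its key/value types.
  Configuration conf = new Configuration();
  conf.setClass("mapreduce.job.inputformat.class",
      TextInputFormat.class, InputFormat.class);
  conf.setClass("key.class", LongWritable.class, Object.class);
  conf.setClass("value.class", Text.class, Object.class);

  // Read the KV pairs produced by the InputFormat into a PCollection.
  PCollection<KV<LongWritable, Text>> records = p.apply(
      HadoopInputFormatIO.<LongWritable, Text>read()
          .withConfiguration(conf));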





Re: Better developer instructions for using Maven?

2017-02-15 Thread Jean-Baptiste Onofré
On Jenkins it's possible to run several jobs at the same time but on
different executors. That's the easiest way.

Regards
JB

On Feb 15, 2017, at 10:15, "Ismaël Mejía"  wrote:
>This question got lost in the discussion, but there is a small
>improvement that we can do:
>
>> Just to check, are we doing parallel builds?
>
>We are on Jenkins, not on Travis; there is an ongoing PR to fix this.
>
>What we can improve is to check if we can run some of the test suites in
>parallel to gain some extra time. For example, most of the IOs and some
>runners don't execute tests in parallel.
>
>Ismael
>
>(Slightly related: is there a way to change the timeout of Travis jobs?
>It is failing most of the time now because of this, and it is quite
>noisy to have so many false positives.)
>
>
>
>
>On Fri, Feb 10, 2017 at 8:00 PM, Robert Bradshaw <
>rober...@google.com.invalid> wrote:
>
>> On Fri, Feb 10, 2017 at 8:45 AM, Dan Halperin 
>> wrote:
>>
>> > On Fri, Feb 10, 2017 at 7:42 AM, Kenneth Knowles 
>> > wrote:
>> >
>> > > On Feb 10, 2017 07:36, "Dan Halperin" 
>> > > wrote:
>> > >
>> > > Before we added checkstyle it was under a minute. Now it's over
>> > > five? That's awful IMO
>> > >
>> > >
>> > > Checkstyle didn't cause all that, did it?
>> > >
>> >
>> > The "5 minutes" was going with Aviem's numbers after this change. But
>> > yes, Checkstyle alone substantially increased (>+50%) the time from
>> > what it was previously when adding it back to the default build.
>>
>>
>> Just to check, are we doing parallel builds?
>>
>>
>> >
>> > > Noting that findbugs takes quite a lot more time. Javadoc and jar
>> > > are the other two slow ones.
>> > >
>> > > RAT is fast. But it has very poor error messages, so we wouldn't
>> > > want a new contributor trying to figure out what is going on
>> > > without our help.
>> > >
>> >
>> > There is a larger philosophical issue here: is there a point of
>> > Jenkins precommit testing? Why not just make `mvn install` run
>> > everything that Jenkins does? For that matter, why don't committers
>> > just push directly to master? Wouldn't that make everyone's life
>> > easier?
>> >
>> > I'd argue that's not true.
>> >
>> > 1. Developer productivity -- Jenkins should run many more checks than
>> > developers do. Especially time-, resource-, or setup-intensive tasks.
>> > 2. Automated enforcement -- Jenkins is better at running the right
>> > commands than we are.
>> > 3. Lower the barrier to entry -- individual developers need not have
>> > a running Spark/Flink/Apex/Dataflow setup in order to contribute code.
>> > 4. Focus on the user -- someone checking out the code and using it
>> > for the first time does not care whether the code passes style checks
>> > or has the right licenses -- that should have been enforced by the
>> > Beam team before committing.
>> >
>> > We should be *very* choosy about what we enforce on every developer
>> > every time they go to compile. I probably compile Beam 50x-100x a
>> > day. Literally, the extra minutes you want to add here will cost me
>> > an hour daily.
>> >
>>
>> By the same token of having a different bar for the Jenkins presubmit
>> vs. what's run locally, I think it makes a lot of sense to run a
>> different command for iterative development than you run before
>> creating a pull request. E.g. during development I'll often run only
>> one test rather than the entire suite, but do run the entire suite
>> occasionally (often before commit, especially before pushing).
>>
>> The contributor's guide should give a suggested command to run before
>> creating a PR, right in the docs of how to create a PR, which may be
>> more expensive than what you run during development. IMHO, this should
>> be fairly comprehensive (certainly tests and checkstyle, maybe javadoc
>> and findbugs). This should be the "default" command that the one-time
>> contributor should know. For those compiling 50x or more a day, I think
>> the burden of learning a second (or more) cheaper command is not high,
>> and we could even put such a thing in the docs (and hopefully a common
>> maven convention like "mvn test").
>>
>> > I've listed the fraction of commits I think will break one of the
>> > following if that property is not tested:
>> >
>> > * compiling (100%)
>> > * tests (100%)
>> > * checkstyle (90%)
>> > * javadoc (30%)
>> > * findbugs (5%)
>> > * rat (1%)
>> >
>> > So you can see where I stand and why. I'm sorry that 1/20 PRs has
>> > Apache RAT catch a licensing issue or Findbugs catch a threading
>> > issue -- you can always get a larger set of the precommit checks
>> > using -Prelease, though of course the integration tests and
>> > runnableonservice tests may catch more issues still. But I want my
>> > developer minutes back for the 95%+ of the rest.
>> >
>> > Dan
>> >
>>


Re: Better developer instructions for using Maven?

2017-02-15 Thread Ismaël Mejía
This question got lost in the discussion, but there is a small improvement
that we can do:

> Just to check, are we doing parallel builds?

We are on Jenkins, not on Travis; there is an ongoing PR to fix this.

What we can improve is to check if we can run some of the test suites in
parallel to gain some extra time. For example, most of the IOs and some
runners don't execute tests in parallel.

Ismael

(Slightly related: is there a way to change the timeout of Travis jobs? It
is failing most of the time now because of this, and it is quite noisy to
have so many false positives.)




On Fri, Feb 10, 2017 at 8:00 PM, Robert Bradshaw <
rober...@google.com.invalid> wrote:

> On Fri, Feb 10, 2017 at 8:45 AM, Dan Halperin 
> wrote:
>
> > On Fri, Feb 10, 2017 at 7:42 AM, Kenneth Knowles 
> > wrote:
> >
> > > On Feb 10, 2017 07:36, "Dan Halperin" 
> > > wrote:
> > >
> > > Before we added checkstyle it was under a minute. Now it's over five?
> > > That's awful IMO
> > >
> > >
> > > Checkstyle didn't cause all that, did it?
> > >
> >
> > The "5 minutes" was going with Aviem's numbers after this change. But
> > yes, Checkstyle alone substantially increased (>+50%) the time from
> > what it was previously when adding it back to the default build.
>
>
> Just to check, are we doing parallel builds?
>
>
> >
> > > Noting that findbugs takes quite a lot more time. Javadoc and jar
> > > are the other two slow ones.
> > >
> > > RAT is fast. But it has very poor error messages, so we wouldn't
> > > want a new contributor trying to figure out what is going on without
> > > our help.
> > >
> >
> > There is a larger philosophical issue here: is there a point of Jenkins
> > precommit testing? Why not just make `mvn install` run everything that
> > Jenkins does? For that matter, why don't committers just push directly
> > to master? Wouldn't that make everyone's life easier?
> >
> > I'd argue that's not true.
> >
> > 1. Developer productivity -- Jenkins should run many more checks than
> > developers do. Especially time-, resource-, or setup-intensive tasks.
> > 2. Automated enforcement -- Jenkins is better at running the right
> > commands than we are.
> > 3. Lower the barrier to entry -- individual developers need not have a
> > running Spark/Flink/Apex/Dataflow setup in order to contribute code.
> > 4. Focus on the user -- someone checking out the code and using it for
> > the first time does not care whether the code passes style checks or
> > has the right licenses -- that should have been enforced by the Beam
> > team before committing.
> >
> > We should be *very* choosy about what we enforce on every developer
> > every time they go to compile. I probably compile Beam 50x-100x a day.
> > Literally, the extra minutes you want to add here will cost me an hour
> > daily.
> >
>
> By the same token of having a different bar for the Jenkins presubmit vs.
> what's run locally, I think it makes a lot of sense to run a different
> command for iterative development than you run before creating a pull
> request. E.g. during development I'll often run only one test rather than
> the entire suite, but do run the entire suite occasionally (often before
> commit, especially before pushing).
>
> The contributor's guide should give a suggested command to run before
> creating a PR, right in the docs of how to create a PR, which may be more
> expensive than what you run during development. IMHO, this should be
> fairly comprehensive (certainly tests and checkstyle, maybe javadoc and
> findbugs). This should be the "default" command that the one-time
> contributor should know. For those compiling 50x or more a day, I think
> the burden of learning a second (or more) cheaper command is not high,
> and we could even put such a thing in the docs (and hopefully a common
> maven convention like "mvn test").
>
> > I've listed the fraction of commits I think will break one of the
> > following if that property is not tested:
> >
> > * compiling (100%)
> > * tests (100%)
> > * checkstyle (90%)
> > * javadoc (30%)
> > * findbugs (5%)
> > * rat (1%)
> >
> > So you can see where I stand and why. I'm sorry that 1/20 PRs has
> > Apache RAT catch a licensing issue or Findbugs catch a threading issue
> > -- you can always get a larger set of the precommit checks using
> > -Prelease, though of course the integration tests and runnableonservice
> > tests may catch more issues still. But I want my developer minutes back
> > for the 95%+ of the rest.
> >
> > Dan
> >
>
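
As a concrete illustration of the two-command split suggested above, the
contributor docs could pair a cheap inner-loop command with a fuller pre-PR
check. These invocations are a sketch only (the module and test names are
placeholders; -Prelease is the only profile actually named in this thread):

  # Fast inner loop: parallel build, one module, one test.
  mvn -T 1C compile
  mvn test -pl sdks/java/core -Dtest=SomeTestClass

  # Fuller pre-PR check: tests plus the heavier static analysis.
  mvn clean install -Prelease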


Re: Stream and Batch Use Case

2017-02-15 Thread Amit Sela
Composites are very much supported, the guide is in progress ;-)
You can see, for example, the CountWords composite.

Not sure what you mean by "performance" - could you elaborate on that,
please?

The Spark runner currently supports batch, streaming support is being
rolled-out these days so you should keep track.

On Wed, Feb 15, 2017 at 10:42 AM ankit beohar 
wrote:

> Amit
>
> Thanks for your fast response - now I got it: my use case will be solved
> using composite transforms (the guide for which is in progress, I guess).
> But if I twist my logic and put it the way you mentioned, to just use
> different I/O and run on top of Spark, then I guess Beam will handle the
> batch and streaming performance issues, right?
>
> Best Regards,
> ANKIT BEOHAR
>
>
> On Wed, Feb 15, 2017 at 2:03 PM, Amit Sela  wrote:
>
> > Oh, missed your question on which one is better - it really depends on
> > your use case.
> > If the data is homogenous, and you want to write to the same IO, I don't
> > see a reason not to Flatten them into one PCollection.
> > If you want to write files-to-files and Kafka-to-Kafka you might be
> > better off with two separate pipelines, batch and streaming. And to make
> > things even more elegant you could "compact" your (common) series of
> > transformations into a single composite transform such that you end up
> > with something like:
> >
> > lines.apply(MyComposite)
> > moreLines.apply(MyComposite)
> >
> > Composite transforms programming guide is still under construction;
> > should be available here once ready:
> > https://beam.apache.org/documentation/programming-guide/#transforms-composite
> >
> >
> > On Wed, Feb 15, 2017 at 10:28 AM Amit Sela  wrote:
> >
> > > You can write one pipeline and simply replace the IO, for example:
> > >
> > > To read from (text) files you can use:
> > > PCollection<String> lines =
> > > p.apply(TextIO.Read.from("file://some/inputData.txt"));
> > >
> > > and from Kafka (I'm adding a generic key here because Kafka messages
> > > are keyed):
> > > PCollection<KV<K, String>> moreLines = p.apply(
> > > KafkaIO.read()
> > > .withBootstrapServers("brokers.list")
> > > .withTopics("topic-list")
> > > .withKeyCoder(Coder)
> > > .withValueCoder(StringUtf8Coder.of()));
> > >
> > > Now you can apply the same code to both PCollections, or (as you
> > > mentioned) you can Flatten them together into one PCollection (after
> > > removing the keys from the Kafka-read PCollection) and apply the
> > > transformations you want.
> > >
> > > You might find the IO section in the programming guide useful:
> > > https://beam.apache.org/documentation/programming-guide/#io
> > >
> > >
> > > On Wed, Feb 15, 2017 at 10:13 AM ankit beohar  >
> > > wrote:
> > >
> > > Hi All,
> > >
> > > I have a use case where I have Kafka and flat files, so can I write
> > > one code path and run it for both, or do I have to create two
> > > different pipelines, or use a pipeline join in one pipeline?
> > >
> > > Which one is better?
> > >
> > > Best Regards,
> > > ANKIT BEOHAR
> > >
> > >
> >
>


Re: Stream and Batch Use Case

2017-02-15 Thread ankit beohar
Amit

Thanks for your fast response - now I got it: my use case will be solved
using composite transforms (the guide for which is in progress, I guess).
But if I twist my logic and put it the way you mentioned, to just use
different I/O and run on top of Spark, then I guess Beam will handle the
batch and streaming performance issues, right?

Best Regards,
ANKIT BEOHAR


On Wed, Feb 15, 2017 at 2:03 PM, Amit Sela  wrote:

> Oh, missed your question on which one is better - it really depends on
> your use case.
> If the data is homogenous, and you want to write to the same IO, I don't
> see a reason not to Flatten them into one PCollection.
> If you want to write files-to-files and Kafka-to-Kafka you might be better
> off with two separate pipelines, batch and streaming. And to make things
> even more elegant you could "compact" your (common) series of
> transformations into a single composite transform such that you end up with
> something like:
>
> lines.apply(MyComposite)
> moreLines.apply(MyComposite)
>
> Composite transforms programming guide is still under construction;
> should be available here once ready:
> https://beam.apache.org/documentation/programming-guide/#transforms-composite
>
>
> On Wed, Feb 15, 2017 at 10:28 AM Amit Sela  wrote:
>
> > You can write one pipeline and simply replace the IO, for example:
> >
> > To read from (text) files you can use:
> > PCollection<String> lines =
> > p.apply(TextIO.Read.from("file://some/inputData.txt"));
> >
> > and from Kafka (I'm adding a generic key here because Kafka messages are
> > keyed):
> > PCollection<KV<K, String>> moreLines = p.apply(
> > KafkaIO.read()
> > .withBootstrapServers("brokers.list")
> > .withTopics("topic-list")
> > .withKeyCoder(Coder)
> > .withValueCoder(StringUtf8Coder.of()));
> >
> > Now you can apply the same code to both PCollections, or (as you
> > mentioned) you can Flatten them together into one PCollection (after
> > removing the keys from the Kafka-read PCollection) and apply the
> > transformations you want.
> >
> > You might find the IO section in the programming guide useful:
> > https://beam.apache.org/documentation/programming-guide/#io
> >
> >
> > On Wed, Feb 15, 2017 at 10:13 AM ankit beohar 
> > wrote:
> >
> > Hi All,
> >
> > I have a use case where I have Kafka and flat files, so can I write one
> > code path and run it for both, or do I have to create two different
> > pipelines, or use a pipeline join in one pipeline?
> >
> > Which one is better?
> >
> > Best Regards,
> > ANKIT BEOHAR
> >
> >
>


Re: Stream and Batch Use Case

2017-02-15 Thread Amit Sela
Oh, missed your question on which one is better - it really depends on
your use case.
If the data is homogenous, and you want to write to the same IO, I don't
see a reason not to Flatten them into one PCollection.
If you want to write files-to-files and Kafka-to-Kafka you might be better
off with two separate pipelines, batch and streaming. And to make things
even more elegant you could "compact" your (common) series of
transformations into a single composite transform such that you end up with
something like:

lines.apply(MyComposite)
moreLines.apply(MyComposite)

Composite transforms programming guide is still under construction; should
be available here once ready:
https://beam.apache.org/documentation/programming-guide/#transforms-composite
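
A minimal sketch of such a composite, under the PTransform contract in the
current SDK (the transform body and the names here are illustrative only):

  // Packages the common series of transformations so it can be applied
  // to both the file-based and the Kafka-based PCollection.
  static class MyComposite
      extends PTransform<PCollection<String>, PCollection<String>> {
    @Override
    public PCollection<String> expand(PCollection<String> input) {
      return input
          .apply("Trim", MapElements.into(TypeDescriptors.strings())
              .via((String line) -> line.trim()))
          .apply("DropEmpty", Filter.by((String line) -> !line.isEmpty()));
    }
  }

Both inputs can then share it, e.g. lines.apply(new MyComposite()).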


On Wed, Feb 15, 2017 at 10:28 AM Amit Sela  wrote:

> You can write one pipeline and simply replace the IO, for example:
>
> To read from (text) files you can use:
> PCollection<String> lines =
> p.apply(TextIO.Read.from("file://some/inputData.txt"));
>
> and from Kafka (I'm adding a generic key here because Kafka messages are
> keyed):
> PCollection<KV<K, String>> moreLines = p.apply(
> KafkaIO.read()
> .withBootstrapServers("brokers.list")
> .withTopics("topic-list")
> .withKeyCoder(Coder)
> .withValueCoder(StringUtf8Coder.of()));
>
> Now you can apply the same code to both PCollections, or (as you
> mentioned) you can Flatten them together into one PCollection (after
> removing the keys from the Kafka-read PCollection) and apply the
> transformations you want.
>
> You might find the IO section in the programming guide useful:
> https://beam.apache.org/documentation/programming-guide/#io
>
>
> On Wed, Feb 15, 2017 at 10:13 AM ankit beohar 
> wrote:
>
> Hi All,
>
> I have a use case where I have Kafka and flat files, so can I write one
> code path and run it for both, or do I have to create two different
> pipelines, or use a pipeline join in one pipeline?
>
> Which one is better?
>
> Best Regards,
> ANKIT BEOHAR
>
>


Re: Stream and Batch Use Case

2017-02-15 Thread Amit Sela
You can write one pipeline and simply replace the IO, for example:

To read from (text) files you can use:
PCollection<String> lines =
p.apply(TextIO.Read.from("file://some/inputData.txt"));

and from Kafka (I'm adding a generic key here because Kafka messages are
keyed):
PCollection<KV<K, String>> moreLines = p.apply(
KafkaIO.read()
.withBootstrapServers("brokers.list")
.withTopics("topic-list")
.withKeyCoder(Coder)
.withValueCoder(StringUtf8Coder.of()));

Now you can apply the same code to both PCollections, or (as you mentioned)
you can Flatten them together into one PCollection (after removing the keys
from the Kafka-read PCollection) and apply the transformations you want.

You might find the IO section in the programming guide useful:
https://beam.apache.org/documentation/programming-guide/#io
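
Concretely, the flattening mentioned above could look like the sketch
below, under the types assumed in the snippet above (Values.create() drops
the Kafka keys so both collections share an element type):

  // Drop the Kafka keys to get a PCollection<String>.
  PCollection<String> kafkaLines = moreLines.apply(Values.<String>create());

  // Flatten both inputs into one PCollection and apply the shared
  // transformations once.
  PCollection<String> allLines =
      PCollectionList.of(lines).and(kafkaLines)
          .apply(Flatten.<String>pCollections());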


On Wed, Feb 15, 2017 at 10:13 AM ankit beohar 
wrote:

> Hi All,
>
> I have a use case where I have Kafka and flat files, so can I write one
> code path and run it for both, or do I have to create two different
> pipelines, or use a pipeline join in one pipeline?
>
> Which one is better?
>
> Best Regards,
> ANKIT BEOHAR
>


Re: Metrics for Beam IOs.

2017-02-15 Thread Stas Levin
+1 to making the IO metrics (e.g. producers, consumers) available as part
of the Beam pipeline metrics tree for debugging and visibility.

As it has already been mentioned, many IO clients have a metrics mechanism
in place, so in these cases I think it could be beneficial to mirror their
metrics under the relevant subtree of the Beam metrics tree.
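
For instance, a minimal sketch of mirroring a client-level counter into the
Beam metrics tree from inside an IO's DoFn, using the SDK's Metrics API
(org.apache.beam.sdk.metrics.Metrics); ReadRecordsFn and the byte-array
element type are hypothetical:

  // Counts bytes read, reported under the ReadRecordsFn namespace.
  class ReadRecordsFn extends DoFn<byte[], byte[]> {
    private final Counter bytesRead =
        Metrics.counter(ReadRecordsFn.class, "bytesRead");

    @ProcessElement
    public void processElement(ProcessContext c) {
      bytesRead.inc(c.element().length);
      c.output(c.element());
    }
  }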

On Wed, Feb 15, 2017 at 12:04 AM Amit Sela  wrote:

> I think this is a great discussion and I'd like to relate to some of the
> points raised here, and raise some of my own.
>
> First of all, I think we should be careful here not to cross boundaries.
> IOs naturally have many metrics, and Beam should avoid "taking over"
> those. IO metrics should focus on what's relevant to the pipeline:
> input/output rate and backlog (for UnboundedSources the backlog exists in
> bytes, though for monitoring purposes we might want to consider
> #messages).
>
> I don't agree that we should not invest in doing this in Sources/Sinks,
> going directly to SplittableDoFn instead, because the IO API is familiar
> and known, and as long as we keep it, it should be treated as a
> first-class citizen.
>
> As for enable/disable - if IOs consider focusing on pipeline-related
> metrics I think we should be fine, though this could also change between
> runners as well.
>
> Finally, considering "split-metrics" is interesting because on one hand
> it affects the pipeline directly (unbalanced partitions in Kafka may
> cause backlog), but this sits on that fine line of responsibilities
> (Kafka monitoring would probably be able to tell you that partitions are
> not balanced).
>
> My 2 cents, cheers!
>
> On Tue, Feb 14, 2017 at 8:46 PM Raghu Angadi 
> wrote:
>
> > On Tue, Feb 14, 2017 at 9:21 AM, Ben Chambers 
> > wrote:
> >
> > >
> > > > * I also think there are data source specific metrics that a given
> > > > IO will want to expose (i.e. things like Kafka backlog for a topic.)
> >
> >
> > UnboundedSource has an API for backlog. It is better for Beam/runners
> > to handle backlog as well.
> > Of course there will be some source-specific metrics too (errors, I/O
> > ops, etc.).
> >
>
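
On the backlog point: UnboundedSource.UnboundedReader exposes
getSplitBacklogBytes(), which a runner can poll and surface as a
pipeline-level metric. A sketch, with the runner-side wiring (backlogGauge)
assumed:

  long backlogBytes = reader.getSplitBacklogBytes();
  if (backlogBytes != UnboundedSource.UnboundedReader.BACKLOG_UNKNOWN) {
    backlogGauge.set(backlogBytes);  // assumed runner-side metric
  }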


Stream and Batch Use Case

2017-02-15 Thread ankit beohar
Hi All,

I have a use case where I have Kafka and flat files, so can I write one
code path and run it for both, or do I have to create two different
pipelines, or use a pipeline join in one pipeline?

Which one is better?

Best Regards,
ANKIT BEOHAR