Re: Jenkins build is back to normal : beam_Release_NightlySnapshot #131

2016-08-10 Thread Jean-Baptiste Onofré
Good catch.

We can exclude heapdump files in rat config.

Regards
JB
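
For reference, the exclusion could look roughly like this in the apache-rat-plugin configuration. This is a sketch: the `*.hprof` patterns match the usual JVM heap-dump file names, and the exact place in the pom is an assumption, not taken from the thread.

```xml
<plugin>
  <groupId>org.apache.rat</groupId>
  <artifactId>apache-rat-plugin</artifactId>
  <configuration>
    <excludes>
      <!-- JVM heap dumps left behind by crashed/OOM builds -->
      <exclude>**/*.hprof</exclude>
      <exclude>**/java_pid*.hprof</exclude>
    </excludes>
  </configuration>
</plugin>
```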



On Aug 11, 2016, at 06:14, Dan Halperin wrote:
>*sigh*
>
>All I did was clear the workspace and kick the job -- it seems some
>intermediate build created a heap dump and then that caused all future
>builds to fail on Apache RAT.
>
>It would be nice to be able to prevent this type of persistent failure
>from
>happening in the future.
>
>On Wed, Aug 10, 2016 at 8:03 PM, Apache Jenkins Server <
>jenk...@builds.apache.org> wrote:
>
>> See > NightlySnapshot/131/changes>
>>
>>


Re: Jenkins build is back to normal : beam_Release_NightlySnapshot #131

2016-08-10 Thread Dan Halperin
*sigh*

All I did was clear the workspace and kick the job -- it seems some
intermediate build created a heap dump and then that caused all future
builds to fail on Apache RAT.

It would be nice to be able to prevent this type of persistent failure from
happening in the future.

On Wed, Aug 10, 2016 at 8:03 PM, Apache Jenkins Server <
jenk...@builds.apache.org> wrote:

> See  NightlySnapshot/131/changes>
>
>


Jenkins build is back to normal : beam_Release_NightlySnapshot #131

2016-08-10 Thread Apache Jenkins Server
See 



Re: Proposal: Dynamic PipelineOptions

2016-08-10 Thread Sam McVeety
We can probably build a "real" case around the TextIO boilerplate -- say
that a user wants to regularly run a Beam job with a different input path
according to the day.  TextIO would be modified to support a dynamic value:

TextIO.Read.withFilepattern(ValueSupplier);

... and then the pipeline author would supply this via their own option:

interface MyPipelineOptions extends PipelineOptions {
  @Default.RuntimeValueSupplier("gs://bar")
  RuntimeValueSupplier<String> getInputPath();
  void setInputPath(RuntimeValueSupplier<String> value);
}

At this point, the same job graph could be reused with different values for
--inputPath.
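
The indirection Sam describes can be sketched as a tiny interface with a construction-time and a runtime-resolved implementation. The names (`ValueSupplier`, `ofStatic`, `ofRuntime`) follow the proposal's vocabulary but are assumptions, not a shipped Beam API; the runtime-options map stands in for whatever mechanism a runner would use to inject `--inputPath` at execution time.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of the proposed ValueSupplier indirection (hypothetical API). */
interface ValueSupplier<T> {
    T get();

    /** A value fixed at graph construction time. */
    static <T> ValueSupplier<T> ofStatic(T value) {
        return () -> value;
    }

    /** A value looked up only when the job actually runs. */
    static <T> ValueSupplier<T> ofRuntime(
            String optionName, Map<String, T> runtimeOptions, T defaultValue) {
        return () -> runtimeOptions.getOrDefault(optionName, defaultValue);
    }
}

class ValueSupplierDemo {
    public static void main(String[] args) {
        Map<String, String> runtimeOptions = new ConcurrentHashMap<>();
        ValueSupplier<String> inputPath =
            ValueSupplier.ofRuntime("inputPath", runtimeOptions, "gs://bar");

        // Before submission the annotated default applies; once the runner
        // injects --inputPath, the same graph reads the new value.
        System.out.println(inputPath.get());  // gs://bar
        runtimeOptions.put("inputPath", "gs://input/2016-08-10/*");
        System.out.println(inputPath.get());  // gs://input/2016-08-10/*
    }
}
```

The point of the extra interface is exactly the one raised later in the thread: PTransforms can depend on `ValueSupplier` without taking a dependency on PipelineOptions itself.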


Cheers,
Sam

On Wed, Aug 10, 2016 at 12:17 PM, Ismaël Mejía  wrote:

> +1 It sounds really nice, (4) is by far the most consistent with the
> current Options implementation.
> One extra thing, maybe it is a good idea to sketch a 'real' use case to
> make the concepts/need more evident.
>
> Ismaël
>
> On Tue, Aug 9, 2016 at 8:49 PM, Sam McVeety 
> wrote:
>
> > Thanks, Amit and JB.  Amit, to your question: the intention with
> > availability to PTransforms is to provide the ValueProvider abstraction
> (which
> > may be implemented on top of PipelineOptions) so that they do not take a
> > dependency on PipelineOptions.
> >
> > Cheers,
> > Sam
> >
> > On Mon, Aug 8, 2016 at 12:26 AM, Jean-Baptiste Onofré 
> > wrote:
> >
> > > +1
> > >
> > > Thanks Sam, it sounds interesting.
> > >
> > > Regards
> > > JB
> > >
> > >
> > > On 07/29/2016 09:14 PM, Sam McVeety wrote:
> > >
> > >> During the graph construction phase, the given SDK generates an
> initial
> > >> execution graph for the program.  At execution time, this graph is
> > >> executed, either locally or by a service.  Currently, Beam only
> supports
> > >> parameterization at graph construction time.  Both Flink and Spark
> > supply
> > >> functionality that allows a pre-compiled job to be run without SDK
> > >> interaction with updated runtime parameters.
> > >>
> > >> In its current incarnation, Dataflow can read values of
> PipelineOptions
> > at
> > >> job submission time, but this requires the presence of an SDK to
> > properly
> > >> encode these values into the job.  We would like to build a common
> layer
> > >> into the Beam model so that these dynamic options can be properly
> > provided
> > >> to jobs.
> > >>
> > >> Please see
> > >> https://docs.google.com/document/d/1I-iIgWDYasb7ZmXbGBHdok_I
> > >> K1r1YAJ90JG5Fz0_28o/edit
> > >> for the high-level model, and
> > >> https://docs.google.com/document/d/17I7HeNQmiIfOJi0aI70tgGMM
> > >> kOSgGi8ZUH-MOnFatZ8/edit
> > >> for
> > >> the specific API proposal.
> > >>
> > >> Cheers,
> > >> Sam
> > >>
> > >>
> > > --
> > > Jean-Baptiste Onofré
> > > jbono...@apache.org
> > > http://blog.nanthrax.net
> > > Talend - http://www.talend.com
> > >
> >
>


Re: Proposal: Dynamic PipelineOptions

2016-08-10 Thread Ismaël Mejía
+1 It sounds really nice, (4) is by far the most consistent with the
current Options implementation.
One extra thing, maybe it is a good idea to sketch a 'real' use case to
make the concepts/need more evident.

Ismaël

On Tue, Aug 9, 2016 at 8:49 PM, Sam McVeety  wrote:

> Thanks, Amit and JB.  Amit, to your question: the intention with
> availability to PTransforms is to provide the ValueProvider abstraction (which
> may be implemented on top of PipelineOptions) so that they do not take a
> dependency on PipelineOptions.
>
> Cheers,
> Sam
>
> On Mon, Aug 8, 2016 at 12:26 AM, Jean-Baptiste Onofré 
> wrote:
>
> > +1
> >
> > Thanks Sam, it sounds interesting.
> >
> > Regards
> > JB
> >
> >
> > On 07/29/2016 09:14 PM, Sam McVeety wrote:
> >
> >> During the graph construction phase, the given SDK generates an initial
> >> execution graph for the program.  At execution time, this graph is
> >> executed, either locally or by a service.  Currently, Beam only supports
> >> parameterization at graph construction time.  Both Flink and Spark
> supply
> >> functionality that allows a pre-compiled job to be run without SDK
> >> interaction with updated runtime parameters.
> >>
> >> In its current incarnation, Dataflow can read values of PipelineOptions
> at
> >> job submission time, but this requires the presence of an SDK to
> properly
> >> encode these values into the job.  We would like to build a common layer
> >> into the Beam model so that these dynamic options can be properly
> provided
> >> to jobs.
> >>
> >> Please see
> >> https://docs.google.com/document/d/1I-iIgWDYasb7ZmXbGBHdok_I
> >> K1r1YAJ90JG5Fz0_28o/edit
> >> for the high-level model, and
> >> https://docs.google.com/document/d/17I7HeNQmiIfOJi0aI70tgGMM
> >> kOSgGi8ZUH-MOnFatZ8/edit
> >> for
> >> the specific API proposal.
> >>
> >> Cheers,
> >> Sam
> >>
> >>
> > --
> > Jean-Baptiste Onofré
> > jbono...@apache.org
> > http://blog.nanthrax.net
> > Talend - http://www.talend.com
> >
>


Build failed in Jenkins: beam_Release_NightlySnapshot #130

2016-08-10 Thread Apache Jenkins Server
See 

Changes:

[klk] Cache .m2 directory on Travis-CI

[klk] Make StreamingPCollectionViewWriterFn and its data public

[fjp] [BEAM-534] Fix dead links in README.md

--
Started by timer
[EnvInject] - Loading node environment variables.
Building remotely on beam2 (beam) in workspace 

 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/incubator-beam.git # 
 > timeout=10
Fetching upstream changes from https://github.com/apache/incubator-beam.git
 > git --version # timeout=10
 > git -c core.askpass=true fetch --tags --progress 
 > https://github.com/apache/incubator-beam.git 
 > +refs/heads/*:refs/remotes/origin/*
 > git rev-parse refs/remotes/origin/master^{commit} # timeout=10
 > git rev-parse refs/remotes/origin/origin/master^{commit} # timeout=10
Checking out Revision 063ff2f42290654cefe6c8bc4ea066d94f9aeff6 
(refs/remotes/origin/master)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 063ff2f42290654cefe6c8bc4ea066d94f9aeff6
 > git rev-list 5049011a2602acc1ce0e1997b467ddee38c66c10 # timeout=10
[EnvInject] - Executing scripts and injecting environment variables after the 
SCM step.
[EnvInject] - Injecting as environment variables the properties content 
SPARK_LOCAL_IP=127.0.0.1

[EnvInject] - Variables injected successfully.
Parsing POMs
Modules changed, recalculating dependency graph
Established TCP socket on 56194
maven32-agent.jar already up to date
maven32-interceptor.jar already up to date
maven3-interceptor-commons.jar already up to date
[beam_Release_NightlySnapshot] $ 
/home/jenkins/jenkins-slave/tools/hudson.model.JDK/jdk1.8.0_66/bin/java -Xmx2g 
-Xms256m -XX:MaxPermSize=512m -cp 
/home/jenkins/jenkins-slave/maven32-agent.jar:/home/jenkins/jenkins-slave/tools/hudson.tasks.Maven_MavenInstallation/maven-3.3.3/boot/plexus-classworlds-2.5.2.jar:/home/jenkins/jenkins-slave/tools/hudson.tasks.Maven_MavenInstallation/maven-3.3.3/conf/logging
 jenkins.maven3.agent.Maven32Main 
/home/jenkins/jenkins-slave/tools/hudson.tasks.Maven_MavenInstallation/maven-3.3.3
 /home/jenkins/jenkins-slave/slave.jar 
/home/jenkins/jenkins-slave/maven32-interceptor.jar 
/home/jenkins/jenkins-slave/maven3-interceptor-commons.jar 56194
Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=512m; 
support was removed in 8.0
<===[JENKINS REMOTING CAPACITY]===>   channel started
Executing Maven:  -B -f 
 
-Dmaven.repo.local=
 -B -e clean deploy -P release -DskipITs=false 
-DintegrationTestPipelineOptions=[ "--project=apache-beam-testing", 
"--tempRoot=gs://temp-storage-for-end-to-end-tests", 
"--runner=org.apache.beam.runners.dataflow.testing.TestDataflowRunner" ]
[INFO] Error stacktraces are turned on.
[INFO] Scanning for projects...
[INFO] 
[INFO] Reactor Build Order:
[INFO] 
[INFO] Apache Beam :: Parent
[INFO] Apache Beam :: SDKs
[INFO] Apache Beam :: SDKs :: Java
[INFO] Apache Beam :: SDKs :: Java :: Build Tools
[INFO] Apache Beam :: SDKs :: Java :: Core
[INFO] Apache Beam :: Runners
[INFO] Apache Beam :: Runners :: Core Java
[INFO] Apache Beam :: Runners :: Direct Java
[INFO] Apache Beam :: Runners :: Google Cloud Dataflow
[INFO] Apache Beam :: SDKs :: Java :: IO
[INFO] Apache Beam :: SDKs :: Java :: IO :: Google Cloud Platform
[INFO] Apache Beam :: SDKs :: Java :: IO :: HDFS
[INFO] Apache Beam :: SDKs :: Java :: IO :: JMS
[INFO] Apache Beam :: SDKs :: Java :: IO :: Kafka
[INFO] Apache Beam :: SDKs :: Java :: Extensions
[INFO] Apache Beam :: SDKs :: Java :: Extensions :: Join library
[INFO] Apache Beam :: SDKs :: Java :: Microbenchmarks
[INFO] Apache Beam :: SDKs :: Java :: Java 8 Tests
[INFO] Apache Beam :: Runners :: Flink
[INFO] Apache Beam :: Runners :: Flink :: Core
[INFO] Apache Beam :: Runners :: Flink :: Examples
[INFO] Apache Beam :: Runners :: Spark
[INFO] Apache Beam :: SDKs :: Java :: Maven Archetypes
[INFO] Apache Beam :: SDKs :: Java :: Maven Archetypes :: Starter
[INFO] Apache Beam :: SDKs :: Java :: Maven Archetypes :: Examples
[INFO] Apache Beam :: Examples
[INFO] Apache Beam :: Examples :: Java
[INFO] Apache Beam :: Examples :: Java 8
[INFO] 
[INFO] 
[INFO] Building Apache Beam :: Parent 0.3.0-incubating-SNAPSHOT
[INFO] 
[INFO] 
[INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ beam-parent ---
[INFO] Deleting 


Re: [Proposal] Pipelines and their executions naming changes.

2016-08-10 Thread Aljoscha Krettek
Hi,
Flink itself lets the user specify a String name when creating a job; this
name is visible in the web dashboard and possibly some other places, and
would roughly correspond to the proposed PipelineOptions.pipelineName. A
running job does not have a human-readable name, only an ID that must be
used when referring to the job and communicating with the master node to
manage it.

I think the proposed changes are very good. However, it might not be
immediately possible to refer to a running pipeline by its jobName, due to
implementation specifics in the runners.

Cheers,
Aljoscha

On Tue, 9 Aug 2016 at 21:57 Amit Sela  wrote:

> Currently, the Spark runner extends ApplicationNameOptions, PipelineOptions
> and StreamingOptions. Any unification of naming conventions is great IMO,
> and the runner will inherit them as it is.
> As for appName/pipelineName - appName is the same as Spark's app name, but
> I can live happily with pipelineName ;-)
> Considering jobName - that's usually for the resource manager (I use YARN),
> and the proposal sounds great here as well, though I'd have to see how I use
> it programmatically because usually I use the submit script.
>
> +1 and thanks Pei!
>
> Sorry for my late response,
> Amit
>
> On Fri, Aug 5, 2016 at 10:55 PM Pei He  wrote:
>
> > Hi all,
> > I have a proposal about how we name pipelines and their executions.
> > The purpose is to clarify the differences between the two, have
> > consensus between runners, and unify the implementation.
> >
> > Current state:
> >  * PipelineOptions.appName defaults to mainClass name
> >  * DataflowPipelineOptions.jobName defaults to appName+user+datetime
> >  * FlinkPipelineOptions.jobName defaults to appName+user+datetime
> >
> > Proposal:
> > 1. Replace PipelineOptions.appName with PipelineOptions.pipelineName.
> > *  It is the user-visible name for a specific graph.
> > *  defaults to mainClass name.
> > *  Use cases: Find all executions of a pipeline
> > 2. Add jobName to top level PipelineOptions.
> > *  It is the unique name for an execution
> > *  defaults to pipelineName + user + datetime + random Integer
> > *  Use cases:
> > -- Finding all executions by USER_A between TIME_X and TIME_Y
> > -- Naming resources created by the execution. for example:
> > Writing temp files to folder TMP_DIR/jobName/, Writing to default
> > output file jobName.output, Creating temp /subscriptions/jobName
> >
> > Please let me know what you think.
> >
> > Thanks
> > --
> > Pei
> >
>
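
Pei's proposed default for jobName (pipelineName + user + datetime + random integer) could be generated along these lines. The exact format (separator, lowercasing, timestamp pattern, nonce width) is an illustrative assumption; the proposal does not pin it down.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.concurrent.ThreadLocalRandom;

class JobNames {
    private static final DateTimeFormatter STAMP =
        DateTimeFormatter.ofPattern("yyyyMMdd-HHmmss");

    /** Builds a unique execution name: pipelineName + user + datetime + nonce. */
    static String defaultJobName(String pipelineName, String user, LocalDateTime now) {
        int nonce = ThreadLocalRandom.current().nextInt(1_000_000);
        return String.format("%s-%s-%s-%06d",
            pipelineName.toLowerCase(), user.toLowerCase(), now.format(STAMP), nonce);
    }

    public static void main(String[] args) {
        System.out.println(defaultJobName(
            "WordCount", System.getProperty("user.name"), LocalDateTime.now()));
        // e.g. wordcount-alice-20160810-215700-004217
    }
}
```

A name shaped like this supports both use cases in the proposal: the pipelineName prefix makes "find all executions of a pipeline" a prefix match, and the full string is safe to reuse for temp folders, output files, and subscriptions.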


Re: [PROPOSAL] Website page or Jira to host all current proposal discussion and docs

2016-08-10 Thread Jean-Baptiste Onofré
Great summary.

And good idea for a meeting (even if only mailing list counts ;)).

Regards
JB



On Aug 10, 2016, at 06:09, Frances Perry wrote:
>So to summarize where I think this thread is at -- we'd like to more
>clearly lay out the expectations for larger proposals.
>- Explain what the design doc / proposal should include (like is done
>in
>https://cwiki.apache.org/confluence/display/KAFKA/
>Kafka+Improvement+Proposals)
>- Clearly track the open proposals (potentially in JIRA with a known
>label
>and incrementing proposal IDs).
>- Set expectations around the timelines for proposals -- both to ensure
>enough feedback is gathered and to archive proposals that go inactive.
>
>Another suggestion: How about if we try resurrecting the (virtual)
>community meetings? Anything that's a deep model change or potentially
>contentious can be presented there. Often a 15 minute overview of these
>topics can be helpful context when reading the detailed proposal.
>
>On Tue, Aug 9, 2016 at 10:18 AM, Kenneth Knowles
>
>wrote:
>
>> I didn't have a specific rubric, but here are some factors:
>>
>>  - Impact on users
>>  - Impact on other devs (while we are incubating, this is possibly a
>big
>> deal)
>>  - Backwards compatibility (not that important until stable release
>if it
>> is minor)
>>  - Amount of detail needed to understand the proposal
>>  - Whether the proposal needs multiple re-readings to understand
>thoroughly
>>  - Whether the proposal will take a while to implement, or is
>basically a
>> one-PR thing
>>
>> I think any of these is enough to consider a BIP. I'm sure others
>will
>> think of other considerations.
>>
>> All my "no" answers are pretty mild on all categories IMO. Most of
>the
>> "yes" answers are heavy in more than one.
>>
>> So actually I didn't specifically consider whether it was a model
>change,
>> but only the impact on users and backwards compatibility. For your
>example
>> of PipelineResult#waitToFinish, if we had a stable release then I
>would
>> have said "yes" for these reasons.
>>
>> The "maybe" answers were all testing infrastructure, because they
>take a
>> while to complete and have high impact on development processes. But
>now
>> that I write these criteria down, I would change the "maybe" answers
>to
>> "no".
>>
>> Thoughts?
>>
>> Kenn
>>
>> On Tue, Aug 9, 2016 at 1:15 AM, Ismaël Mejía 
>wrote:
>>
>> > Kenn, just to start the discussion, what were your criteria for deciding
>> > which proposals are worth being a BIP?
>> >
>> > I can clearly spot the most common case to create a BIP:  Changes
>to the
>> > model / SDK (this covers most of the 'yes' in your list, with the
>> exception
>> > of Pipeline#waitToFinish).
>> >
>> > Do you guys have ideas for other criteria? (e.g. are new runners and
>> > DSLs worth a BIP? Do infrastructure issues deserve a BIP?)
>> >
>> > Ismael
>> >
>> >
>> > On Mon, Aug 8, 2016 at 10:05 PM, Kenneth Knowles
>> >
>> > wrote:
>> >
>> > > +1 to the overall idea, though I would limit it to large and/or
>> long-term
>> > > proposals.
>> > >
>> > > I like:
>> > >
>> > >  - JIRA for tracking: that's what it does best.
>> > >  - Google Docs for detailed commenting and revision - basically a
>wiki
>> > with
>> > > easier commenting
>> > >  - Beam site page for process description and list of current
>"BIPs",
>> > just
>> > > a one liner and a link to JIRA. A proposal to dev@beam could
>include a
>> > > link
>> > > to a PR against the asf-site to add the BIP. However, I would
>agree
>> with
>> > > the counter-argument that this could just be a JIRA component or
>tag.
>> > > Either one works for me. Or a page with the process that links to
>a
>> JIRA
>> > > saved search. The more formal list mostly just makes it even more
>> > visible,
>> > > right?
>> > >
>> > > I think that the number can be small. Here are examples scraped
>from
>> the
>> > > mailing list archives (in random order) and whether I would use a
>> "BIP":
>> > >
>> > >  - Runner API: yes
>> > >  - Serialization tech: no
>> > >  - Dynamic parameters: yes
>> > >  - Splittable DoFn: yes
>> > >  - Scio: yes
>> > >  - Pipeline#waitToFinish(), etc: no
>> > >  - DoFn setup / teardown: yes
>> > >  - State & Timers: yes
>> > >  - Pipeline job naming changes: no
>> > >  - CoGBK as primitive: yes
>> > >  - New website design: no
>> > >  - new DoFn: yes
>> > >  - Cluster infrastructure for tests: maybe
>> > >  - Beam recipes: no
>> > >  - Two spark runners: no
>> > >  - Nightly builds by Jenkins: maybe
>> > >
>> > > When I write them all down it really is a lot :-)
>> > >
>> > > Of course, the first thing that could be discussed in a
>[PROPOSAL]
>> thread
>> > > would be whether to file a "BIP".
>> > >
>> > > Kenn
>> > >
>> > > On Mon, Aug 8, 2016 at 8:29 AM, Lukasz Cwik
>
>> > > wrote:
>> > >
>> > > > +1 for the cwiki approach that Aljoshca and Ismael gave
>examples of.
>> > > >
>> > > > On Mon, Aug 8, 2016 at 2:57 AM, Ismaël Mejía
>
>> > wrote:
>> > > >
>> > > > > +1 for a more formal "Impr