Re: [Fwd: [Apache Beam] Custom DataSourceV2 instantiation: parameters passing and Encoders]

2018-12-19 Thread Etienne Chauchot
Thanks Kenn for taking the time to take a look
On Tuesday, December 18, 2018 at 11:39 -0500, Kenneth Knowles wrote:
> I don't know DataSourceV2 well, but I am reading around to try to help. I see 
> the problem with the SparkSession API.
> Is there no other way to instantiate a DataSourceV2 and read the data from it?
=> No this is exactly what I'm looking for :)
> Other thoughts:
> 
>  - Maybe start from Splittable DoFn since it is a new translator?
=> Yes, but I still need to translate BoundedSource and UnboundedSource for 
compatibility with IOs that have not migrated to SDF
>  - I wonder if the reason for this API is that the class name and options are 
> what is shipped to workers, so the
> limited API makes serialization easy for them?
=> Yes, that and because DataSource is the entry point of the spark pipeline so 
it should not need to receive more than
user input conf, hence the String only support. But we are not users but DAG 
translators hence our need to pass more
complex objects than Strings.
>  - As a total hack, you could serialize the Beam objects (maybe to portable 
> protos) and pass that as a single
> "primitive type" option.
=> Yes, sure, it could work. Another hack would be to use ASM or ByteBuddy to 
"enhance" Spark classes but it is weak and
risky :)
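To make that serialization hack concrete, something like this could work (just a sketch; the codec class and the option key are made up for illustration, not code from the POC):

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;
    import java.util.Base64;

    // Serialize the Beam source to a Base64 String so it fits through
    // DataSourceV2's String-only options, and decode it inside the source.
    class SourceOptionCodec {
      static String encode(Serializable beamSource) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
          out.writeObject(beamSource);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
      }

      static Object decode(String option) throws IOException, ClassNotFoundException {
        byte[] raw = Base64.getDecoder().decode(option);
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw))) {
          return in.readObject();
        }
      }
    }

The translator would pass the encoded string through .option("beamSource", ...) when building the read, and the DataSourceV2 implementation would decode it back from its options on the other side.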
> You definitely need someone from Spark more than someone from Beam for this 
> issue. At this point, I've read the
> scaladocs enough that I think I'd dig into Spark's code to see what is going 
> on and if there is a way that is more
> obviously right.
=> Yes, this is what I tried, but I got no answer on the public Spark MLs. Luckily 
I asked Ryan Blue of the Spark community directly. He kindly answered. I'm digging 
into Catalog and Spark plans to get a different instantiation mechanism.
Etienne
> Kenn
> On Tue, Dec 18, 2018 at 11:09 AM Etienne Chauchot  
> wrote:
> > Hi everyone, 
> > Does anyone have comments on this question?
> > Thanks
> > Etienne
> > On Friday, December 14, 2018 at 10:37 +0100, Etienne Chauchot wrote:
> > > Hi guys,
> > > I'm currently coding a POC of a new Spark runner based on structured streaming and the new DataSourceV2 API, and I have a question. Having found no pointers on the internet, I've asked the Spark community with no luck. If any of you have knowledge of the new Spark DataSourceV2 API, can you share your thoughts?
> > > Also, I did not mention it in the email, but I did not find any way to get a reference to the automatically created DataSourceV2 instance, so I cannot lazy-init the source either.
> > > Thanks
> > > Etienne
> > > -------- Forwarded Message --------
> > > From: Etienne Chauchot
> > > To: dev@spark.apache.org
> > > Subject: [Apache Beam] Custom DataSourceV2 instantiation: parameters passing and Encoders
> > > Date: Tue, 11 Dec 2018 19:02:23 +0100
> > > Hi Spark guys,
> > > I'm Etienne Chauchot and I'm a committer on the Apache Beam project. 
> > > We have what we call runners. They are pieces of software that translate 
> > > pipelines written using Beam API into
> > > pipelines that use native execution engine API. Currently, the Spark 
> > > runner uses old RDD / DStream APIs. I'm
> > > writing a new runner that will use structured streaming (but not 
> > > continuous processing, and also no schema for
> > > now).
> > > I am just starting. I'm currently trying to map our sources to yours. I'm targeting the new DataSourceV2 API. It maps pretty well to Beam sources, but I have a problem with instantiation of the custom source. I searched for an answer on Stack Overflow and the user ML with no luck. I guess it is too specific a question:
> > > When visiting the Beam DAG I have access to Beam objects such as Source and Reader that I need to map to MicroBatchReader and InputPartitionReader. As far as I understand, a custom DataSourceV2 is instantiated automatically by Spark thanks to sparkSession.readStream().format(providerClassName) or similar code. The problem is that I can only pass options of primitive types + String, so I cannot pass the Beam Source to DataSourceV2.
> > > => Is there a way to do so?
> > > 
> > > Also, I get a Dataset<Row> as an output. The Row contains an instance of Beam's WindowedValue<T>, where T is the type parameter of the Source. I do a map on the Dataset to transform it to a Dataset<WindowedValue<T>>. I have a question related to the Encoder:
> > > => How do I properly create an Encoder for the generic type WindowedValue<T> to use in the map?
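> > > One idea I had is to fall back to a binary encoder such as Kryo, which works for any T but stays opaque to Catalyst. A minimal sketch of what I mean (not sure it is the right approach; the helper class name is just illustrative):
> > >
> > >     import org.apache.beam.sdk.util.WindowedValue;
> > >     import org.apache.spark.sql.Encoder;
> > >     import org.apache.spark.sql.Encoders;
> > >
> > >     class WindowedValueEncoders {
> > >       @SuppressWarnings("unchecked")
> > >       static <T> Encoder<WindowedValue<T>> windowedValueEncoder() {
> > >         // Kryo serializes the whole value as an opaque blob, so it works for any T,
> > >         // at the cost of Catalyst not being able to optimize on the contents.
> > >         return Encoders.kryo((Class<WindowedValue<T>>) (Class<?>) WindowedValue.class);
> > >       }
> > >     }
> > >
> > > The encoder would then be passed as the second argument of Dataset.map(mapFunction, WindowedValueEncoders.windowedValueEncoder()).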
> > > Here is the code: https://github.com/apache/beam/tree/spark-runner_structured-streaming
> > > And more specifically:
> > > https://github.com/apache/beam/blob/spark-runner_structured-streaming/runners/spark-structured-streaming/src/main/java/org/apache/beam/runners/spark/structuredstreaming/translation/batch/ReadSourceTranslatorBatch.java
> > > https://github.com/apache/beam/blob/spark-runner_structured-streaming/runners/spark-stru

Re: [RFC] I made a new tabbed Beam view in Jenkins

2018-12-19 Thread Maximilian Michels

Thanks Kenn. Very useful. +1 for making it the default view!

-Max

On 18.12.18 23:12, Mark Liu wrote:

That looks great. Thank you!

Mark

On Tue, Dec 18, 2018 at 1:53 PM Jason Kuster wrote:


Oh, a good fact! Thanks for the info. :)

On Tue, Dec 18, 2018 at 1:48 PM Alan Myrvold wrote:

The _Cron variants of pre-commits are run post-commit, to make it easier
to tell if the pre-commit is flaky or broken.

On Tue, Dec 18, 2018 at 1:40 PM Jason Kuster wrote:

This looks great! (also it looks like some precommits snuck into the
postcommit view)

On Tue, Dec 18, 2018 at 1:25 PM Alan Myrvold wrote:

It does look much better!

On Tue, Dec 18, 2018 at 1:10 PM Ahmet Altay wrote:

I like this version, it looks cleaner than the current
combined view.

On Tue, Dec 18, 2018 at 12:53 PM Scott Wegner wrote:

Very cool. I also didn't realize we had control over the
Jenkins "views".

We currently lack a decent dashboard to monitor the
build health across Beam Jenkins jobs and triage
failures; this is a step in the right direction.

I haven't played with Jenkins views before, but it
appears they can be managed via the Job DSL similar to
our job definitions [1]:

 > The DSL execution engine exposes several methods to
create Jenkins jobs, views, folders and config files. 
[..]

It would be cool to integrate this into our job config
in such a way that we could automatically keep the views
up-to-date as jobs are added or renamed.

[1]

https://github.com/jenkinsci/job-dsl-plugin/wiki/Job-DSL-Commands

On Tue, Dec 18, 2018 at 12:35 PM Anton Kedin wrote:

This is really helpful, didn't realize it was
possible. Categories and contents look reasonable. I
think something like this definitely should be the
top-level Beam view.

Regards,
Anton

On Tue, Dec 18, 2018 at 12:05 PM Kenneth Knowles wrote:

Hi all,

I made a new view to split Beam builds into
tabs:

https://builds.apache.org/view/A-D/view/Beam%20Nested/

  - PostCommit tab includes PostCommit and
"PreCommit_.*_Cron" because these are actually
post-commit jobs; it is a feature not a bug.
  - PreCommit tab includes jobs that have no
meaningful history because they are just against
PRs, commits, phrase triggering
  - Inventory self-explanatory
  - PerformanceTests self-explanatory
  - All; I didn't want to keep making categories
but just send this for feedback

WDYT about making this the top-level Beam view?
(vs 
https://builds.apache.org/view/A-D/view/Beam/)

After that, maybe we could clean the categories
so they fit into the tabs more easily with fewer
regexes (to make sure things don't get missed).
I have read also that if you use / instead of _
as a separator in a name then Jenkins will
display jobs as nested in folders automatically.
Not sure it actually results in a better view;
haven't tried it.

Kenn



-- 
Got feedback? tinyurl.com/swegner-feedback

-- 
---

Jason Kuster
Apache Beam / Google Cloud Dataflow

See something? Say something. go/jasonkuster-feedback


Re: [Fwd: [Apache Beam] Custom DataSourceV2 instantiation: parameters passing and Encoders]

2018-12-19 Thread Manu Zhang
The Spark community has been holding a weekly sync meeting on DataSourceV2 and 
sharing notes back to their dev list: 
https://lists.apache.org/list.html?d...@spark.apache.org:lte=3M:DataSourceV2%20sync.
At this time, there are still some moving pieces on Spark’s side. Is it too 
early to target DataSourceV2?

Thanks,
Manu Zhang
On Dec 19, 2018, 6:40 PM +0800, Etienne Chauchot , wrote:
> Thanks Kenn for taking the time to take a look
>
> On Tuesday, December 18, 2018 at 11:39 -0500, Kenneth Knowles wrote:
> > I don't know DataSourceV2 well, but I am reading around to try to help. I 
> > see the problem with the SparkSession API. Is there no other way to 
> > instantiate a DataSourceV2 and read the data from it?
> => No this is exactly what I'm looking for :)
> >
> > Other thoughts:
> >
> >  - Maybe start from Splittable DoFn since it is a new translator?
> => Yes, but I still need to translate BoundedSource and UnboundedSource for 
> compatibility with IOs that have not migrated to SDF
>
> >  - I wonder if the reason for this API is that the class name and options 
> > are what is shipped to workers, so the limited API makes serialization easy 
> > for them?
> => Yes, that and because DataSource is the entry point of the spark pipeline 
> so it should not need to receive more than user input conf, hence the String 
> only support. But we are not users but DAG translators hence our need to pass 
> more complex objects than Strings.
>
> >  - As a total hack, you could serialize the Beam objects (maybe to portable 
> > protos) and pass that as a single "primitive type" option.
> => Yes, sure, it could work. Another hack would be to use ASM or ByteBuddy to 
> "enhance" Spark classes but it is weak and risky :)
> >
> > You definitely need someone from Spark more than someone from Beam for this 
> > issue. At this point, I've read the scaladocs enough that I think I'd dig 
> > into Spark's code to see what is going on and if there is a way that is 
> > more obviously right.
> => Yes, this is what I tried, but I got no answer on the public Spark MLs. 
> Luckily I asked Ryan Blue of the Spark community directly. He kindly 
> answered. I'm digging into Catalog and Spark plans to get a different 
> instantiation mechanism.
>
> Etienne
> >
> > Kenn
> >
> > > On Tue, Dec 18, 2018 at 11:09 AM Etienne Chauchot  
> > > wrote:
> > > > Hi everyone,
> > > >
> > > > Does anyone have comments on this question?
> > > >
> > > > Thanks
> > > > Etienne
> > > >
> > > > > On Friday, December 14, 2018 at 10:37 +0100, Etienne Chauchot wrote:
> > > > > Hi guys,
> > > > > I'm currently coding a POC of a new Spark runner based on structured 
> > > > > streaming and the new DataSourceV2 API, and I have a question. 
> > > > > Having found no pointers on the internet, I've asked the Spark 
> > > > > community with no luck. If any of you have knowledge of the new 
> > > > > Spark DataSourceV2 API, can you share your thoughts?
> > > > >
> > > > > Also, I did not mention it in the email, but I did not find any way to 
> > > > > get a reference to the automatically created DataSourceV2 instance, so I 
> > > > > cannot lazy-init the source either.
> > > > >
> > > > > Thanks
> > > > >
> > > > > Etienne
> > > > >
> > > > > -------- Forwarded Message --------
> > > > > From: Etienne Chauchot
> > > > > To: d...@spark.apache.org
> > > > > Subject: [Apache Beam] Custom DataSourceV2 instantiation: parameters 
> > > > > passing and Encoders
> > > > > Date: Tue, 11 Dec 2018 19:02:23 +0100
> > > > >
> > > > > Hi Spark guys,
> > > > >
> > > > > I'm Etienne Chauchot and I'm a committer on the Apache Beam project.
> > > > >
> > > > > We have what we call runners. They are pieces of software that 
> > > > > translate pipelines written using Beam API into pipelines that use 
> > > > > native execution engine API. Currently, the Spark runner uses old RDD 
> > > > > / DStream APIs.
> > > > > I'm writing a new runner that will use structured streaming (but not 
> > > > > continuous processing, and also no schema for now).
> > > > >
> > > > > I am just starting. I'm currently trying to map our sources to yours. 
> > > > > I'm targeting the new DataSourceV2 API. It maps pretty well to Beam 
> > > > > sources, but I have a problem with instantiation of the custom source. 
> > > > > I searched for an answer on Stack Overflow and the user ML with no luck. 
> > > > > I guess it is too specific a question:
> > > > >
> > > > > When visiting the Beam DAG I have access to Beam objects such as Source 
> > > > > and Reader that I need to map to MicroBatchReader and 
> > > > > InputPartitionReader.
> > > > > As far as I understand, a custom DataSourceV2 is instantiated 
> > > > > automatically by Spark thanks to 
> > > > > sparkSession.readStream().format(providerClassName) or similar code. 
> > > > > The problem is that I can only pass options of primitive types + 
> > > > > String, so I cannot pass the Beam Source to DataSourceV2.
> > > > > => Is there a way to do so?
> > > > >
> > > > >
> > > > >

Re: [Fwd: [Apache Beam] Custom DataSourceV2 instantiation: parameters passing and Encoders]

2018-12-19 Thread Etienne Chauchot
Yes, it is thanks to these Spark community meetings that I got Ryan's name. And, 
indeed, when I saw the design sync meetings, I realized how recent the 
DataSourceV2 API is. I think you are right: I should wait for it to be finished 
and use V1 in the meantime.
Etienne
On Wednesday, December 19, 2018 at 23:27 +0800, Manu Zhang wrote:
> The Spark community has been holding a weekly sync meeting on DataSourceV2 
> and sharing notes back to their dev list: 
> https://lists.apache.org/list.html?d...@spark.apache.org:lte=3M:DataSourceV2%20sync.
> At this time, there are still some moving pieces on Spark’s side. Is it too 
> early to target DataSourceV2?
> 
> 
> Thanks,
> 
> Manu Zhang
> On Dec 19, 2018, 6:40 PM +0800, Etienne Chauchot , 
> wrote:
> 
> > Thanks Kenn for taking the time to take a look
> > 
> > 
> > On Tuesday, December 18, 2018 at 11:39 -0500, Kenneth Knowles wrote:
> > > I don't know DataSourceV2 well, but I am reading around to try to help. I 
> > > see the problem with the SparkSession
> > > API. Is there no other way to instantiate a DataSourceV2 and read the 
> > > data from it?
> > 
> > => No this is exactly what I'm looking for :)
> > > 
> > > 
> > > Other thoughts:
> > > 
> > > 
> > >  - Maybe start from Splittable DoFn since it is a new translator?
> > > 
> > > 
> > 
> > => Yes, but I still need to translate BoundedSource and UnboundedSource for 
> > compatibility with IOs that have not migrated to SDF
> > 
> > 
> > >  - I wonder if the reason for this API is that the class name and options 
> > > are what is shipped to workers, so the
> > > limited API makes serialization easy for them?
> > > 
> > 
> > => Yes, that and because DataSource is the entry point of the spark 
> > pipeline so it should not need to receive more
> > than user input conf, hence the String only support. But we are not users 
> > but DAG translators hence our need to pass
> > more complex objects than Strings.
> > 
> > 
> > > 
> > >  - As a total hack, you could serialize the Beam objects (maybe to 
> > > portable protos) and pass that as a single
> > > "primitive type" option.
> > > 
> > > 
> > 
> > => Yes, sure, it could work. Another hack would be to use ASM or ByteBuddy 
> > to "enhance" Spark classes but it is weak
> > and risky :)
> > > 
> > > 
> > > 
> > > You definitely need someone from Spark more than someone from Beam for 
> > > this issue. At this point, I've read the
> > > scaladocs enough that I think I'd dig into Spark's code to see what is 
> > > going on and if there is a way that is more
> > > obviously right.
> > > 
> > > 
> > 
> > => Yes, this is what I tried, but I got no answer on the public Spark MLs. 
> > Luckily I asked Ryan Blue of the Spark community directly. He kindly 
> > answered. I'm digging into Catalog and Spark plans to get a different 
> > instantiation mechanism.
> > 
> > 
> > Etienne
> > > 
> > > 
> > > 
> > > 
> > > Kenn
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > On Tue, Dec 18, 2018 at 11:09 AM Etienne Chauchot  
> > > wrote:
> > > 
> > > > Hi everyone, 
> > > > 
> > > > 
> > > > Does anyone have comments on this question?
> > > > 
> > > > 
> > > > Thanks
> > > > Etienne
> > > > 
> > > > 
> > > > > On Friday, December 14, 2018 at 10:37 +0100, Etienne Chauchot wrote:
> > > > > Hi guys,
> > > > > I'm currently coding a POC of a new Spark runner based on structured 
> > > > > streaming and the new DataSourceV2 API, and I have a question. 
> > > > > Having found no pointers on the internet, I've asked the Spark 
> > > > > community with no luck. If any of you have knowledge of the new 
> > > > > Spark DataSourceV2 API, can you share your thoughts?
> > > > > 
> > > > > 
> > > > > Also, I did not mention it in the email, but I did not find any way to 
> > > > > get a reference to the automatically created DataSourceV2 instance, so I 
> > > > > cannot lazy-init the source either.
> > > > > 
> > > > > 
> > > > > Thanks
> > > > > 
> > > > > 
> > > > > Etienne
> > > > > 
> > > > > 
> > > > > -------- Forwarded Message --------
> > > > > From: Etienne Chauchot
> > > > > To: d...@spark.apache.org
> > > > > Subject: [Apache Beam] Custom DataSourceV2 instantiation: parameters 
> > > > > passing and Encoders
> > > > > Date: Tue, 11 Dec 2018 19:02:23 +0100
> > > > > 
> > > > > 
> > > > > Hi Spark guys,
> > > > > 
> > > > > 
> > > > > I'm Etienne Chauchot and I'm a committer on the Apache Beam project.
> > > > > 
> > > > > 
> > > > > We have what we call runners. They are pieces of software that 
> > > > > translate pipelines written using Beam API into
> > > > > pipelines that use native execution engine API. Currently, the Spark 
> > > > > runner uses old RDD / DStream APIs.
> > > > > I'm writing a new runner that will use structured streaming (but not 
> > > > > continuous processing, and also no schema
> > > > > for now).
> > > > > 
> > > > > 
> > > > > I am just starting. I'm currently trying to map our sources to yours. 
> > > > > I'm targeting new DataSourceV2 API. It
> > >

Beam Contribution

2018-12-19 Thread Theodore Siu
To whom it may concern,

My name is Theodore Siu. I am a customer facing data engineer at Google. I
would like to be added as a contributor to Beam. My Jira user name is tsiu.
Thank you for your time.

-Theo


Re: Beam Contribution

2018-12-19 Thread Kenneth Knowles
Hi Theo, and welcome!

I have added you to the "Contributors" role on Beam Jira, so you should be
able to assign tickets to yourself if you want.

Kenn

On Wed, Dec 19, 2018 at 12:00 PM Theodore Siu  wrote:

> To whom it may concern,
>
> My name is Theodore Siu. I am a customer facing data engineer at Google. I
> would like to be added as a contributor to Beam. My Jira user name is tsiu.
> Thank you for your time.
>
> -Theo
>


excessive java precommit logging

2018-12-19 Thread Udi Meiri
Hi all,
I'd like to reduce precommit log sizes on Jenkins. For example:
https://builds.apache.org/job/beam_PreCommit_Java_Commit/3181/consoleFull
is 79M, which makes Chrome sluggish to use on it (tab is constantly using a
whole cpu core).

I know this might be controversial, but I'd like to propose to remove the
--info flag from the gradlew command line.




Re: excessive java precommit logging

2018-12-19 Thread Scott Wegner
I'm not sure what we lose by dropping the --info flag, but I generally
worry about reducing log output since logs are the main resource for
diagnosing Jenkins build errors.

It seems the issue is that Chrome doesn't scale well to large log files. A
few alternative solutions:

1. Use the produced Build Scan (example: [1]) instead of the raw console
log. The build scan is quite useful at pointing to what actually failed,
and filtering log output for only that task.
2. Instead of consoleFull, use consoleText ("View as plain text" link in
Jenkins), which seems to be much easier on Chrome
3. Download the consoleText output locally and use your favorite log viewer
that can scale to large files.

[1] https://gradle.com/s/ckhjrjdexpuzm

On Wed, Dec 19, 2018 at 10:42 AM Udi Meiri  wrote:

> Hi all,
> I'd like to reduce precommit log sizes on Jenkins. For example:
> https://builds.apache.org/job/beam_PreCommit_Java_Commit/3181/consoleFull
> is 79M, which makes Chrome sluggish to use on it (tab is constantly using
> a whole cpu core).
>
> I know this might be controversial, but I'd like to propose to remove the
> --info flag from the gradlew command line.
>
>

-- 
Got feedback? tinyurl.com/swegner-feedback


Re: excessive java precommit logging

2018-12-19 Thread Thomas Weise
I usually follow the download procedure outlined by Scott to look at the
logs.

These logs are big, but when there is a problem it is sometimes essential
to have the extra output, especially for less frequent flakes.

Reducing logs would then require the author to add extra logging to the PR
(and attempt to reproduce), which is also not nice.

Thomas


On Wed, Dec 19, 2018 at 11:47 AM Scott Wegner  wrote:

> I'm not sure what we lose by dropping the --info flag, but I generally
> worry about reducing log output since logs are the main resource for
> diagnosing Jenkins build errors.
>
> It seems the issue is that Chrome doesn't scale well to large log files. A
> few alternative solutions:
>
> 1. Use the produced Build Scan (example: [1]) instead of the raw console
> log. The build scan is quite useful at pointing to what actually failed,
> and filtering log output for only that task.
> 2. Instead of consoleFull, use consoleText ("View as plain text" link in
> Jenkins), which seems to be much easier on Chrome
> 3. Download the consoleText output locally and use your favorite log
> viewer that can scale to large files.
>
> [1] https://gradle.com/s/ckhjrjdexpuzm
>
> On Wed, Dec 19, 2018 at 10:42 AM Udi Meiri  wrote:
>
>> Hi all,
>> I'd like to reduce precommit log sizes on Jenkins. For example:
>> https://builds.apache.org/job/beam_PreCommit_Java_Commit/3181/consoleFull
>> is 79M, which makes Chrome sluggish to use on it (tab is constantly using
>> a whole cpu core).
>>
>> I know this might be controversial, but I'd like to propose to remove the
>> --info flag from the gradlew command line.
>>
>>
>
> --
> Got feedback? tinyurl.com/swegner-feedback
>


(fyi) PR#7324 updates how projects consume vendored guava

2018-12-19 Thread Scott Wegner
Just a heads up: PR#7324 [1] just got merged which fixes an issue with our
vendored guava artifact and JNI: see BEAM-6056 [2] for details. Important
changes for you:

a. IntelliJ doesn't understand how our vendored guava is built and will
give red squigglies. This is annoying but temporary while we release a new
vendored artifact.
b. The guava vendoring prefix has changed. If you have work-in-progress,
you'll hit this when you merge into master. Update any new guava imports
to: org.apache.beam.vendor.grpc.v1p3p1.[..]
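For example, an illustrative guava import that previously read

    import org.apache.beam.vendor.grpc.v1_13_1.com.google.common.collect.ImmutableList;

would now read

    import org.apache.beam.vendor.grpc.v1p3p1.com.google.common.collect.ImmutableList;

(ImmutableList is just an example; the same applies to any vendored guava class.)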

Read on for details..

The fix for BEAM-6056 included two changes:
1. Updated package prefix for vendored guava symbols:
o.a.b.vendor.grpc.v1_13_1 -> o.a.b.vendor.grpc.v1p3p1
2. Move project dependencies to consume vendored guava
from :beam-vendor-grpc-1_13_1 project rather than the published Maven
artifact.

Consuming from the integrated project artifact (#2) was necessary in order
to make the changes and validate them inline without needing to first
publish a Maven release. However, this has some downsides: notably,
IntelliJ struggles to understand shaded dependencies [3]. I believe we
should move back to consuming pre-published vendor artifacts. The first
step is to publish a new vendored beam-vendor-grpc-1_13_1 artifact; I can
start that process.


[1] https://github.com/apache/beam/pull/7324
[2] https://issues.apache.org/jira/browse/BEAM-6056
[3]
https://lists.apache.org/thread.html/4c12db35b40a6d56e170cd6fc8bb0ac4c43a99aa3cb7dbae54176815@%3Cdev.beam.apache.org%3E

Got feedback? tinyurl.com/swegner-feedback


Re: (fyi) PR#7324 updates how projects consume vendored guava

2018-12-19 Thread Thomas Weise
Hi Scott,

Can testing of the artifact prior to release not be done against the
staging repo?

See here for an example: https://github.com/apache/beam/pull/7322/files

It is unfortunate that we have to change the imports on every patch. Is it
not sufficient to just update the artifact version?

Thomas



On Wed, Dec 19, 2018 at 1:49 PM Scott Wegner  wrote:

> Just a heads up: PR#7324 [1] just got merged which fixes an issue with
> our vendored guava artifact and JNI: see BEAM-6056 [2] for details.
> Important changes for you:
>
> a. IntelliJ doesn't understand how our vendored guava is built and will
> give red squigglies. This is annoying but temporary while we release a new
> vendored artifact.
> b. The guava vendoring prefix has changed. If you have work-in-progress,
> you'll hit this when you merge into master. Update any new guava imports
> to: org.apache.beam.vendor.grpc.v1p3p1.[..]
>
> Read on for details..
>
> The fix for BEAM-6056 included two changes:
> 1. Updated package prefix for vendored guava symbols:
> o.a.b.vendor.grpc.v1_13_1 -> o.a.b.vendor.grpc.v1p3p1
> 2. Move project dependencies to consume vendored guava
> from :beam-vendor-grpc-1_13_1 project rather than the published Maven
> artifact.
>
> Consuming from the integrated project artifact (#2) was necessary in order
> to make the changes and validate them inline without needing to first
> publish a Maven release. However, this has some downsides: notably,
> IntelliJ struggles to understand shaded dependencies [3]. I believe we
> should move back to consuming pre-published vendor artifacts. The first
> step is to publish a new vendored beam-vendor-grpc-1_13_1 artifact; I can
> start that process.
>
>
> [1] https://github.com/apache/beam/pull/7324
> [2] https://issues.apache.org/jira/browse/BEAM-6056
> [3]
> https://lists.apache.org/thread.html/4c12db35b40a6d56e170cd6fc8bb0ac4c43a99aa3cb7dbae54176815@%3Cdev.beam.apache.org%3E
>
> Got feedback? tinyurl.com/swegner-feedback
>


Re: excessive java precommit logging

2018-12-19 Thread Udi Meiri
The gradle scan doesn't pinpoint the error message, and it doesn't contain
all the lines: https://scans.gradle.com/s/ckhjrjdexpuzm/console-log

The logs might be useful, but usually not from passing tests. Doesn't
gradle log output from failed tests by default?

On Wed, Dec 19, 2018 at 1:22 PM Thomas Weise  wrote:

> I usually follow the download procedure outlined by Scott to look at the
> logs.
>
> These logs are big, but when there is a problem it is sometimes essential
> to have the extra output, especially for less frequent flakes.
>
> Reducing logs would then require the author to add extra logging to the PR
> (and attempt to reproduce), which is also not nice.
>
> Thomas
>
>
> On Wed, Dec 19, 2018 at 11:47 AM Scott Wegner  wrote:
>
>> I'm not sure what we lose by dropping the --info flag, but I generally
>> worry about reducing log output since logs are the main resource for
>> diagnosing Jenkins build errors.
>>
>> It seems the issue is that Chrome doesn't scale well to large log files.
>> A few alternative solutions:
>>
>> 1. Use the produced Build Scan (example: [1]) instead of the raw console
>> log. The build scan is quite useful at pointing to what actually failed,
>> and filtering log output for only that task.
>> 2. Instead of consoleFull, use consoleText ("View as plain text" link in
>> Jenkins), which seems to be much easier on Chrome
>> 3. Download the consoleText output locally and use your favorite log
>> viewer that can scale to large files.
>>
>> [1] https://gradle.com/s/ckhjrjdexpuzm
>>
>> On Wed, Dec 19, 2018 at 10:42 AM Udi Meiri  wrote:
>>
>>> Hi all,
>>> I'd like to reduce precommit log sizes on Jenkins. For example:
>>> https://builds.apache.org/job/beam_PreCommit_Java_Commit/3181/consoleFull
>>> is 79M, which makes Chrome sluggish to use on it (tab is constantly
>>> using a whole cpu core).
>>>
>>> I know this might be controversial, but I'd like to propose to remove
>>> the --info flag from the gradlew command line.
>>>
>>>
>>
>> --
>> Got feedback? tinyurl.com/swegner-feedback
>>
>




Re: (fyi) PR#7324 updates how projects consume vendored guava

2018-12-19 Thread Scott Wegner
> Can testing of the artifact prior to release not be done against the
staging repo?

I suspect it could be. The process for making these updates has not yet
been defined, and I agree that having this intermediate state on every
update is not ideal.

(also, I mistakenly wrote Guava in the above email; this fix was actually
for gRPC)

Here's a proposal for a path forward:

1. Release a new vendored guava artifact
2. Switch all Beam projects to consume the vendored artifact from Maven
Central
  2b. Add validation to ensure all projects reference guava consistently
from Maven Central and not the build.

And then define the process for future vendoring updates:
3. Increment the vendored artifact version number, and make changes to the
vendored source project
4. Push the new vendored artifact to a staging repository
5. In a separate commit, add the staging repository as a dependency source
6. Increment the vendored artifact dependency to the new version and update
consuming code as necessary
7. Open a PR with all changes for validation / code review, but don't merge
yet
8. Open a separate PR with just the vendored artifact changes (up to step 4)
9. Once reviewed/merged, kick off a release of the vendored artifact
10. Once released, remove staging repository for pending PR (undo step 5),
and merge after reviewed.


This process would be more tedious / time-consuming, but less disrupting
for other contributors.

On Wed, Dec 19, 2018 at 2:06 PM Thomas Weise  wrote:

> Hi Scott,
>
> Can testing of the artifact prior to release not be done against the
> staging repo?
>
> See here for an example: https://github.com/apache/beam/pull/7322/files
>
> It is unfortunate that we have to change the imports on every patch. Is it
> not sufficient to just update the artifact version?
>
> Thomas
>
>
>
> On Wed, Dec 19, 2018 at 1:49 PM Scott Wegner  wrote:
>
>> Just a heads up: PR#7324 [1] just got merged which fixes an issue with
>> our vendored guava artifact and JNI: see BEAM-6056 [2] for details.
>> Important changes for you:
>>
>> a. IntelliJ doesn't understand how our vendored guava is built and will
>> give red squigglies. This is annoying but temporary while we release a new
>> vendored artifact.
>> b. The guava vendoring prefix has changed. If you have work-in-progress,
>> you'll hit this when you merge into master. Update any new guava imports
>> to: org.apache.beam.vendor.grpc.v1p3p1.[..]
>>
>> Read on for details..
>>
>> The fix for BEAM-6056 included two changes:
>> 1. Updated package prefix for vendored guava symbols:
>> o.a.b.vendor.grpc.v1_13_1 -> o.a.b.vendor.grpc.v1p3p1
>> 2. Move project dependencies to consume vendored guava
>> from :beam-vendor-grpc-1_13_1 project rather than the published Maven
>> artifact.
>>
>> Consuming from the integrated project artifact (#2) was necessary in
>> order to make the changes and validate them inline without needing to first
>> publish a Maven release. However, this has some downsides: notably,
>> IntelliJ struggles to understand shaded dependencies [3]. I believe we
>> should move back to consuming pre-published vendor artifacts. The first
>> step is to publish a new vendored beam-vendor-grpc-1_13_1 artifact; I can
>> start that process.
>>
>>
>> [1] https://github.com/apache/beam/pull/7324
>> [2] https://issues.apache.org/jira/browse/BEAM-6056
>> [3]
>> https://lists.apache.org/thread.html/4c12db35b40a6d56e170cd6fc8bb0ac4c43a99aa3cb7dbae54176815@%3Cdev.beam.apache.org%3E
>>
>> Got feedback? tinyurl.com/swegner-feedback
>>
>

-- 
Got feedback? tinyurl.com/swegner-feedback


Beam Summits!

2018-12-19 Thread Austin Bennett
Hi All,

I really enjoyed Beam Summit in London (Thanks Matthias!), and there was
much enthusiasm for continuations.  We had selected that location in a
large part due to the growing community there, and we have users in a
variety of locations.  In our 2019 calendar,
https://docs.google.com/spreadsheets/d/1CloF63FOKSPM6YIuu8eExjhX6xrIiOp5j4zPbSg3Apo/
shared in the past weeks, 3 Summits are tentatively slotted for this year.
Wanting to start running this by the group to get input.

* Beam Summit NA, in San Francisco, approx 3 April 2019 (following Flink
Forward).  I can organize.
* Beam Summit Europe, in Stockholm, this was the runner up in voting
falling behind London.  Or perhaps Berlin?  October-ish 2019
* Beam Summit Asia, in Tokyo ??

What are general thoughts on locations/dates?

Looking forward to convening in person soon.

Cheers,
Austin


Re: Beam Summits!

2018-12-19 Thread Suneel Marthi
How about Beam Summit in Berlin on Sep 6, immediately following Flink
Forward Berlin on the previous 2 days?

The same may be done for Asia, following Flink Forward Asia wherever and
whenever it happens.

On Wed, Dec 19, 2018 at 6:06 PM Austin Bennett 
wrote:

> Hi All,
>
> I really enjoyed Beam Summit in London (Thanks Matthias!), and there was
> much enthusiasm for continuations.  We had selected that location in a
> large part due to the growing community there, and we have users in a
> variety of locations.  In our 2019 calendar,
> https://docs.google.com/spreadsheets/d/1CloF63FOKSPM6YIuu8eExjhX6xrIiOp5j4zPbSg3Apo/
> shared in the past weeks, 3 Summits are tentatively slotted for this year.
> Wanting to start running this by the group to get input.
>
> * Beam Summit NA, in San Francisco, approx 3 April 2019 (following Flink
> Forward).  I can organize.
> * Beam Summit Europe, in Stockholm, this was the runner up in voting
> falling behind London.  Or perhaps Berlin?  October-ish 2019
> * Beam Summit Asia, in Tokyo ??
>
> What are general thoughts on locations/dates?
>
> Looking forward to convening in person soon.
>
> Cheers,
> Austin
>


[VOTE] Release Vendored gRPC 1.13.1 v0.2, release candidate #1

2018-12-19 Thread Scott Wegner
Please review and vote on the release candidate #1 for the vendored
artifact gRPC 1.13.1 v0.2
[ ] +1, Approve the release
[ ] -1, Do not approve the release (please provide specific comments)

This is a follow-up to the previous thread about vendoring updates [1]

The complete staging area is available for your review, which includes:
* all artifacts to be deployed to the Maven Central Repository [2],
* commit hash "3b8abca3ca3352e6bf20e059f17324049a2eae0a" [3],
* artifacts which are signed with the key with fingerprint
5F47BD54C52008007288FF4D3593BA6C25ABF71F [4]

The vote will be open for at least 72 hours. It is adopted by majority
approval, with at least 3 PMC affirmative votes.

Thanks,
Scott

[1]
https://lists.apache.org/thread.html/9a55d12000cb3b1b61620b7dc4009d1351e6b8c70951f70aeb358583@%3Cdev.beam.apache.org%3E
[2] https://repository.apache.org/content/repositories/orgapachebeam-1055/
[3] https://github.com/apache/beam/pull/7328
[4] https://dist.apache.org/repos/dist/release/beam/KEYS
-- 
Got feedback? tinyurl.com/swegner-feedback


Re: [VOTE] Release Vendored gRPC 1.13.1 v0.2, release candidate #1

2018-12-19 Thread Kenneth Knowles
+1

 - sigs good
 - `jar tf` looks good

On Wed, Dec 19, 2018 at 7:54 PM Scott Wegner  wrote:

> Please review and vote on the release candidate #1 for the vendored
> artifact gRPC 1.13.1 v0.2
> [ ] +1, Approve the release
> [ ] -1, Do not approve the release (please provide specific comments)
>
> This is a follow-up to the previous thread about vendoring updates [1]
>
> The complete staging area is available for your review, which includes:
> * all artifacts to be deployed to the Maven Central Repository [2],
> * commit hash "3b8abca3ca3352e6bf20e059f17324049a2eae0a" [3],
> * artifacts which are signed with the key with fingerprint
> 5F47BD54C52008007288FF4D3593BA6C25ABF71F [4]
>
> The vote will be open for at least 72 hours. It is adopted by majority
> approval, with at least 3 PMC affirmative votes.
>
> Thanks,
> Scott
>
> [1]
> https://lists.apache.org/thread.html/9a55d12000cb3b1b61620b7dc4009d1351e6b8c70951f70aeb358583@%3Cdev.beam.apache.org%3E
> [2] https://repository.apache.org/content/repositories/orgapachebeam-1055/
> [3] https://github.com/apache/beam/pull/7328
> [4] https://dist.apache.org/repos/dist/release/beam/KEYS
> --
> Got feedback? tinyurl.com/swegner-feedback
>


ElementCount PR + Proper package locations for file visibility in my element count PR.

2018-12-19 Thread Alex Amato
Hello Robert + beam community,

I have added element count metrics to the Java SDK in this PR. In doing so, I
complemented the Metrics.counter call with a LabeledMetrics.counter() call,
which allows constructing a metric for a MonitoringInfo URN and a set of
labels, so that they can be extracted properly with the PCollection label and
packaged into a MonitoringInfo.
https://github.com/apache/beam/pull/7272
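To make the intent concrete, usage looks roughly like this (a simplified sketch; the exact factory method names and signatures live in the PR and may differ from what I show here, and the label key is illustrative):

    import java.util.HashMap;
    import org.apache.beam.runners.core.metrics.LabeledMetrics;
    import org.apache.beam.runners.core.metrics.MonitoringInfoMetricName;
    import org.apache.beam.sdk.metrics.Counter;

    class ElementCountSketch {
      Counter elementCountFor(String pCollectionId) {
        HashMap<String, String> labels = new HashMap<>();
        labels.put("PCOLLECTION", pCollectionId); // illustrative label key
        // The URN and labels are carried by the metric name so the value can
        // later be packaged into a MonitoringInfo with the PCollection label.
        return LabeledMetrics.counter(
            MonitoringInfoMetricName.named("beam:metric:element_count:v1", labels));
      }
    }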

*I was hoping you could take a look and let me know if I am putting code in
the correct packages/projects:*

   - 
runners/core-java/src/main/java/org/apache/beam/runners/core/metrics/LabeledMetrics.java
   # visible to runner harness and SDK implementations only, not pipeline
   authors
   - 
runners/core-java/src/main/java/org/apache/beam/runners/core/metrics/MonitoringInfoMetricName.java
   # visible to runner harness and SDK implementations only, not pipeline
   authors
   - 
sdks/java/harness/src/main/java/org/apache/beam/fn/harness/data/ElementCountFnDataReceiver.java
   # visible to SDK implementations only, not pipeline authors
   - 
sdks/java/harness/src/main/java/org/apache/beam/fn/harness/data/PCollectionConsumerRegistry.java
   # visible to SDK implementations only, not pipeline authors



There is also a refactor to the construction of PCollection consumers using
a PCollectionConsumerRegistry, which allowed creating a spot in the code
where we could wrap all PCollection consumption with an ElementCount
counter.
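Conceptually, the wrapping looks like this (a simplified sketch of the idea, not the exact code from the PR):

    import org.apache.beam.sdk.fn.data.FnDataReceiver;
    import org.apache.beam.sdk.metrics.Counter;

    // Count every element flowing to the downstream consumer, then delegate.
    // The real ElementCountFnDataReceiver in the PR also takes care of the
    // metrics container scoping and the PCollection label.
    class CountingReceiver<T> implements FnDataReceiver<T> {
      private final FnDataReceiver<T> delegate;
      private final Counter elementCount;

      CountingReceiver(FnDataReceiver<T> delegate, Counter elementCount) {
        this.delegate = delegate;
        this.elementCount = elementCount;
      }

      @Override
      public void accept(T input) throws Exception {
        elementCount.inc();
        delegate.accept(input);
      }
    }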


Please let me know what you think. If you think this PR should be split,
please let me know so that I can do it tomorrow morning

Thanks again,
Alex


Re: ElementCount PR + Proper package locations for file visibility in my element count PR.

2018-12-19 Thread Kenneth Knowles
High-level point: the only SDK implementation that any of these could be
relevant for is Java, so "visible to SDK implementations" is just "visible
to the Java SDK Harness" (sdks/java/core must not depend on any of this).

Also, sdks/java/harness is not a library, but a (containerized) service, so
it shouldn't be visible to anything.

Kenn

On Wed, Dec 19, 2018 at 9:14 PM Alex Amato  wrote:

> Hello Robert + beam community,
>
> I have added the element count metrics to the Java SDK in this PR. In
> doing so, I enhanced the Metrics.counter call to have a
> LabeledMetrics.counter() call which allows constructing a metric for a
> MonitoringInfo urn and set of labels, so they can be extracted properly
> with the PCollection label and packaged into a MonitoringInfo.
> https://github.com/apache/beam/pull/7272
>
> *I was hoping you could take a look and let me know if I am putting code
> in the correct packages/projects:*
>
>- 
> runners/core-java/src/main/java/org/apache/beam/runners/core/metrics/LabeledMetrics.java
># visible to runner harness and SDK implementations only, not pipeline
>authors
>- 
> runners/core-java/src/main/java/org/apache/beam/runners/core/metrics/MonitoringInfoMetricName.java
># visible to runner harness and SDK implementations only, not pipeline
>authors
>- 
> sdks/java/harness/src/main/java/org/apache/beam/fn/harness/data/ElementCountFnDataReceiver.java
># visible to SDK implementations only, not pipeline authors
>- 
> sdks/java/harness/src/main/java/org/apache/beam/fn/harness/data/PCollectionConsumerRegistry.java
># visible to SDK implementations only, not pipeline authors
>
>
>
> There is also a refactor to the construction of PCollection consumers
> using a PCollectionConsumerRegistry, which allowed creating a spot in the
> code where we could wrap all PCollection consumption with an ElementCount
> counter.
>
>
> Please let me know what you think. If you think this PR should be split,
> please let me know so that I can do it tomorrow morning
>
> Thanks again,
> Alex
>


Re: Beam Summits!

2018-12-19 Thread Thomas Weise
I think for EU there is a proposal to have it next to Berlin Buzzwords in
June. That would provide better spacing and avoid conflict with ApacheCon.

Thomas


On Wed, Dec 19, 2018 at 3:09 PM Suneel Marthi  wrote:

> How about Beam Summit in Berlin on Sep 6 immediately following Flink
> Forward Berlin on the previous 2 days.
>
> Same may be for Asia also following Flink Forward Asia where and whenever
> it happens.
>
> On Wed, Dec 19, 2018 at 6:06 PM Austin Bennett <
> whatwouldausti...@gmail.com> wrote:
>
>> Hi All,
>>
>> I really enjoyed Beam Summit in London (Thanks Matthias!), and there was
>> much enthusiasm for continuations.  We had selected that location in a
>> large part due to the growing community there, and we have users in a
>> variety of locations.  In our 2019 calendar,
>> https://docs.google.com/spreadsheets/d/1CloF63FOKSPM6YIuu8eExjhX6xrIiOp5j4zPbSg3Apo/
>> shared in the past weeks, 3 Summits are tentatively slotted for this year.
>> Wanting to start running this by the group to get input.
>>
>> * Beam Summit NA, in San Francisco, approx 3 April 2019 (following Flink
>> Forward).  I can organize.
>> * Beam Summit Europe, in Stockholm, this was the runner up in voting
>> falling behind London.  Or perhaps Berlin?  October-ish 2019
>> * Beam Summit Asia, in Tokyo ??
>>
>> What are general thoughts on locations/dates?
>>
>> Looking forward to convening in person soon.
>>
>> Cheers,
>> Austin
>>
>