Re: [DISCUSS] Autoformat python code with Black

2019-05-29 Thread Kenneth Knowles
+1 pending good enough tooling (I can't quite tell - seems there are some
issues?)

On Wed, May 29, 2019 at 2:40 PM Katarzyna Kucharczyk <
ka.kucharc...@gmail.com> wrote:

> What else do we actually gain? My guess is faster PR review iteration. We
> will skip some conversations about code style.
>
...

> Last but not least, a new contributor may be less discouraged. When I
> started contributing I didn’t know how to format my code and I lost a lot of
> time adding pylint and adjusting IntelliJ. I eventually failed. Currently I
> write code intuitively and, when I don’t forget, I rerun tox.
>

This is a huge benefit. This is why I supported it so much for Java. It is
a community benefit. You do not have to be a contributor to the Python SDK
to support this. That is why I am writing here. Just eliminate all
discussion of formatting. It doesn't really matter what the resulting
format is, if it is not crazy to read. I strongly oppose maintaining a
non-default format.

Reformatting 20k lines or 200k is not hard. The Java global reformat touched
50k lines. It does not really matter how big it is. Definitely do it all at
once if you think the tool is good enough. And you should pin a version, so
churn is not a problem. You can upgrade the version and reformat in a PR
later and that is also easy.

It is a GitHub UI issue that you cannot easily browse past the reformat. It
is not actually that hard, but does take a couple extra clicks to get
GitHub to display blame before a reformat. It is easier with the command
line. I do a lot of code history digging and the global Java reformat is
not really a problem.
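
For example (the path is illustrative, and REFORMAT_SHA stands in for the big
formatting commit):

    # blame as of the parent of the reformat commit
    git blame REFORMAT_SHA^ -- sdks/python/apache_beam/pipeline.py
    # or simply ignore whitespace-only changes
    git blame -w sdks/python/apache_beam/pipeline.py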

Kenn



> Also, everything will be formatted in the same way, so eventually it will be
> easier to read.
>
> Moreover, as it was mentioned in previous emails - a lot of Jenkins
> failures won’t take place, so we save time and resources.
>
>
> One of the disadvantages is that our pipelines have custom syntax and after
> formatting they look a little bit weird, but maybe extending the only
> configurable option in Black - line length - from 88 to 110 would be a solution.
>
> The second one is that Black requires Python 3 to run. I don’t know how big
> an obstacle that would be.
>
> I believe there are two options for introducing Black. First: just do it;
> it will hurt, but then it will be ok (same as a dentist appointment). Of
> course it may require some work to adjust the linters. On the other hand,
> we could do it gradually and start including SDK parts one by one - maybe
> that will be less painful?
>
> As an example I can share one of the projects [2] I know that uses Black
> (they also use other cool checkers and pre-commit [3]). This is how their
> build with all checks looks [4].
>
> To sum up, I believe that if we want to improve our coding experience, we
> should improve our toolset. Black seems to be a recent and quite popular
> tool, which makes me think they won’t stop developing it.
>
> [1]
> https://stackoverflow.com/questions/4112410/git-change-styling-whitespace-without-changing-ownership-blame
>
> [2]  https://github.com/GoogleCloudPlatform/oozie-to-airflow
>
> [3] https://pre-commit.com
>
> [4]
> https://travis-ci.org/GoogleCloudPlatform/oozie-to-airflow/builds/538725689
>
>
> On Wed, May 29, 2019 at 2:01 PM Robert Bradshaw 
> wrote:
>
>> Reformatting to 4 spaces seems a non-starter to me, as it would change
>> nearly every single line in the codebase (and the loss of all context as
>> well as that particular line).
>>
>> This is probably why the 2-space fork exists. However, we don't conform
>> to that either--we use 2 spaces for indentation, but 4 for continuation
>> indentation. (As for the history of this, this goes back to Google's
>> internal style guide, probably motivated by consistency with C++, Java, ...
>> and the fact that with an indent level of 4 one ends up wrapping lines
>> quite frequently (it's telling that black's default line length is 88)).
>> This turns out to be an easy change to the codebase.
>>
>> Once we move beyond the 2 vs. 4 whitespace thing, I found that this tool
>> introduces a huge amount of vertical whitespace (e.g. closing parentheses
>> on their own line), e.g.
>>
>> def foo(
>> args
>> ):
>>   if (
>>   long expression)
>>   ):
>> func(
>> args
>> )
>>
>> I wrote a simple post-processor to put closing parentheses on the same
>> lines, as well as omit the newline after "if (", and disable formatting
>> of strings, which reduces the churn in our codebase to 15k lines (adding
>> about 4k) out of 200k total.
>>
>> https://github.com/apache/beam/pull/8712/files
>>
>> It's still very opinionated, often in different ways than me, and doesn't
>> understand the semantics of the code, but possibly something we could live
>> with given the huge advantages of an autoformatter.
>>
>> An intermediate point would be to allow, but not require, autoformatting
>> of changed lines.
>>
>> As for being beta quality, it looks like it's got a decent number of
>> contributors and in my book being in the python github project is a strong
>> positive signal. But, due to the above issues, I think we'd have to
>> maintain a fork. (The code is pretty lightweight, the 2 vs. 4 space issue
>> is a 2-line change, and the rest implemented as a post-processing step (for
>> now, incomplete), so it'd be easy to stay in sync with upstream.)

Request to add me as a contributor

2019-05-29 Thread Akshay Iyangar
Hello everyone,

My name is Akshay Iyangar, and I use the Beam repo extensively. There is a small
patch that I would like to push upstream:
https://issues.apache.org/jira/browse/BEAM-7442 . I’m working on this issue and
hope to become a contributor on Beam's JIRA issue tracker so that I can assign
it to myself.

My JIRA id is - aiyangar

Thanks,
Akshay Iyangar



Re: [DISCUSS] Cookbooks for users with knowledge in other frameworks

2019-05-29 Thread Reza Rokni
+1

I think there will be at least two layers of this:

Layer 1 - Using primitives : I do join, GBK, Aggregation... with system x
this way, what is the canonical equivalent in Beam.
Layer 2 - Patterns : I read and join Unbounded and Bounded Data in system x
this way, what is the canonical equivalent in Beam.

I suspect that as a first pass Layer 1 is reasonably well-bounded work. There
would need to be agreement on the "canonical" version of how to do something
in Beam, as this could be seen as opinionated and there are often a multitude
of ways of doing x.
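
To make Layer 1 concrete, a sketch of what one side-by-side entry could look
like, using a per-key sum (the snippets are illustrative, not a blessed
canonical form; pairs is assumed to be a collection of key/value tuples):

    # Spark (PySpark RDD API)
    counts = pairs.reduceByKey(lambda a, b: a + b)

    # Beam (Python SDK): the closest equivalent
    import apache_beam as beam
    counts = pairs | beam.CombinePerKey(sum)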


On Thu, 30 May 2019 at 08:56, Ahmet Altay  wrote:

> Hi all,
>
> Inspired by the user asking about a Spark feature in Beam [1] in the
> release thread, I searched the user@ list and noticed a few instances of
> people asking questions like "I can do X in Spark, how can I do that in
> Beam?" Would it make sense to add documentation explaining how certain
> tasks can be accomplished in Beam, with side-by-side examples of doing the
> same task in Beam/Spark etc.? It could help with on-boarding because it
> will be easier for people to leverage their existing knowledge. It could
> help other frameworks as well, because it will serve as a Rosetta stone
> with two translations.
>
> Questions I have are:
> - Would such a thing be a helpful?
> - Is it feasible? Would a few pages worth of examples can cover enough use
> cases?
>
> Thank you!
> Ahmet
>
> [1]
> https://lists.apache.org/thread.html/b73a54aa1e6e9933628f177b04a8f907c26cac854745fa081c478eff@%3Cdev.beam.apache.org%3E
>




[DISCUSS] Cookbooks for users with knowledge in other frameworks

2019-05-29 Thread Ahmet Altay
Hi all,

Inspired by the user asking about a Spark feature in Beam [1] in the
release thread, I searched the user@ list and noticed a few instances of
people asking questions like "I can do X in Spark, how can I do that in
Beam?" Would it make sense to add documentation explaining how certain
tasks can be accomplished in Beam, with side-by-side examples of doing the
same task in Beam/Spark etc.? It could help with on-boarding because it
will be easier for people to leverage their existing knowledge. It could
help other frameworks as well, because it will serve as a Rosetta stone
with two translations.

Questions I have are:
- Would such a thing be helpful?
- Is it feasible? Would a few pages' worth of examples cover enough use
cases?

Thank you!
Ahmet

[1]
https://lists.apache.org/thread.html/b73a54aa1e6e9933628f177b04a8f907c26cac854745fa081c478eff@%3Cdev.beam.apache.org%3E


Re: Proposal: Portability SDKHarness Docker Image Release with Beam Version Release.

2019-05-29 Thread Ankur Goenka
I agree. I think there are a few things which have to be thought through as
part of the portable image release.

* Where to host the images. We can of course have an alias for the image,
which can point to a different location, but the hosting location has to be
sorted out.
* Validation process for the images.
* Backward compatibility for the images. Though we can just tag them with the
release name.

I might not have immediate bandwidth to do this, so we need to prioritize
based on the other items we have.


On Tue, May 28, 2019 at 5:24 PM Ahmet Altay  wrote:

> Could we first figure out the process (where to push, how to push,
> permissions needed, how to validate etc.) as part of the snapshots and
> update the release guide based on that?
>
> On Tue, May 28, 2019 at 2:43 AM Robert Bradshaw 
> wrote:
>
>> In the future (read, next release) the SDK will likely have reference
>> to the containers, so this will have to be part of the release.
>
>
> Who is working on this change? Could they help with figuring out the
> publishing the containers part?
>
>
>> But I
>> agree for 2.13 it should be more about figuring out the process and
>> not necessarily holding back.
>>
>> On Mon, May 27, 2019 at 7:42 PM Ankur Goenka  wrote:
>> >
>> > +1
>> > We can release the images with 2.13 but we should not block 2.13
>> release for this.
>> >
>> > On Mon, May 27, 2019, 8:39 AM Thomas Weise  wrote:
>> >>
>> >> +1
>> >>
>> >>
>> >> On Mon, May 27, 2019 at 6:56 AM Ismaël Mejía 
>> wrote:
>> >>>
>> >>> +1
>> >>>
>> >>> On Mon, May 27, 2019 at 3:35 PM Maximilian Michels 
>> wrote:
>> >>> >
>> >>> > +1
>> >>> >
>> >>> > On 27.05.19 14:04, Robert Bradshaw wrote:
>> >>> > > Sounds like everyone's onboard with the plan. Any chance we could
>> >>> > > publish these for the upcoming 2.13 release?
>> >>> > >
>> >>> > > On Wed, Feb 6, 2019 at 6:29 PM Łukasz Gajowy 
>> wrote:
>> >>> > >>
>> >>> > >> +1 to have a registry for images accessible to anyone. For
>> snapshot images, I agree that gcr + apache-beam-testing project seems a
>> good and easy way to start with.
>> >>> > >>
>> >>> > >> Łukasz
>> >>> > >>
>> >>> > >> wt., 22 sty 2019 o 19:43 Mark Liu 
>> napisał(a):
>> >>> > >>>
>> >>> > >>> +1 to have an official Beam released container image.
>> >>> > >>>
>> >>> > >>> Also I would propose to add a verification step to (or after)
>> the release process to do smoke check. Python have ValidatesContainer test
>> that runs basic pipeline using newly built container for verification.
>> Other sdk languages can do similar thing or add a common framework.
>> >>> > >>>
>> >>> > >>> Mark
>> >>> > >>>
>> >>> > >>> On Thu, Jan 17, 2019 at 5:56 AM Alan Myrvold <
>> amyrv...@google.com> wrote:
>> >>> > 
>> >>> >  +1 This would be great. gcr.io seems like a good option for
>> snapshots due to the permissions from jenkins to upload and ability to keep
>> snapshots around.
>> >>> > 
>> >>> >  On Wed, Jan 16, 2019 at 6:51 PM Ruoyun Huang <
>> ruo...@google.com> wrote:
>> >>> > >
>> >>> > > +1 This would be a great thing to have.
>> >>> > >
>> >>> > > On Wed, Jan 16, 2019 at 6:11 PM Ankur Goenka <
>> goe...@google.com> wrote:
>> >>> > >>
>> >>> > >> gcr.io seems to be a good option. Given that we don't need
>> the hosting server name in the image name, it is easily changeable later.
>> >>> > >>
>> >>> > >> The Docker container for Apache Flink is named "flink" and they
>> have different tags for different releases and configurations:
>> https://hub.docker.com/_/flink . We can follow a similar model and can
>> name the image as "beam" (beam doesn't seem to be taken on docker hub) and
>> use tags to distinguish Java/Python/Go and versions etc.
>> >>> > >>
>> >>> > >> Tags will look like:
>> >>> > >> java-SNAPSHOT
>> >>> > >> java-2.10.1
>> >>> > >> python2-SNAPSHOT
>> >>> > >> python2-2.10.1
>> >>> > >> go-SNAPSHOT
>> >>> > >> go-2.10.1
>> >>> > >>
>> >>> > >>
>> >>> > >> On Wed, Jan 16, 2019 at 5:56 PM Ahmet Altay <
>> al...@google.com> wrote:
>> >>> > >>>
>> >>> > >>> For snapshots, we could use gcr.io. Permission would not
>> be a problem since Jenkins is already correctly setup. The cost will be
>> covered under apache-beam-testing project. And since this is only for
>> snapshots, it will be only for temporary artifacts not for release
>> artifacts.
>> >>> > >>>
>> >>> > >>> On Wed, Jan 16, 2019 at 5:50 PM Valentyn Tymofieiev <
>> valen...@google.com> wrote:
>> >>> > 
>> >>> >  +1, releasing containers is a useful process that we need
>> to build in Beam and it is required for FnApi users. Among other reasons,
>> having officially-released Beam SDK harness container images will make it
>> easier for users to do simple customizations to  container images, as they
>> will be able to use container image released by Beam as a base image.
>> >>> > 
>> >>> >  Good point about potential storage limitations on Bintray.
>> With Beam 

Re: [VOTE] Release 2.13.0, release candidate #1

2019-05-29 Thread Ankur Goenka
Thanks Valentyn for providing the fix for the BQ blocker.
All other cherry-picks mentioned in the mail thread are also done.
I will start RC2 now.

Thanks,
Ankur

On Wed, May 29, 2019 at 2:47 PM Valentyn Tymofieiev 
wrote:

> Hi Vasiullah,
>
> I am not aware of such function. I suggest that you start a new thread on
> u...@beam.apache.org mailing list for this question and describe your
> use-case there.
>
> On Wed, May 29, 2019 at 2:39 PM Vasiullah syed  wrote:
>
>> Hello All,
>>
>>
>>   Can anyone please help me out with an inbuilt function in Apache Spark
>> named "monotonically increasing id"? Is there any similar kind of function
>> in Apache Beam? If so, please reply with more detail. Thanks in advance.
>>
>> On Tue, May 28, 2019 at 9:49 PM Valentyn Tymofieiev 
>> wrote:
>>
>>> -1.
>>> I would like us to fix
>>> https://issues.apache.org/jira/browse/BEAM-7439 for 2.13.0. It is a
>>> regression that happened in 2.12.0, but was not caught by existing tests.
>>>
>>> Thanks,
>>> Valentyn
>>>
>>> On Wed, May 22, 2019, 4:30 PM Ankur Goenka  wrote:
>>>
 Hi everyone,

 Please review and vote on the release candidate #1 for the version
 2.13.0, as follows:

 [ ] +1, Approve the release
 [ ] -1, Do not approve the release (please provide specific comments)

 The complete staging area is available for your review, which includes:
 * JIRA release notes [1],
 * the official Apache source release to be deployed to dist.apache.org
 [2], which is signed with the key with fingerprint
 6356C1A9F089B0FA3DE8753688934A6699985948 [3],
 * all artifacts to be deployed to the Maven Central Repository [4],
 * source code tag "v2.13.0-RC1" [5],
 * website pull request listing the release [6] and publishing the API
 reference manual [7].
 * Python artifacts are deployed along with the source release to the
 dist.apache.org [2].
 * Validation sheet with a tab for 2.13.0 release to help with
 validation [8].

 The vote will be open for at least 72 hours. It is adopted by majority
 approval, with at least 3 PMC affirmative votes.

 Thanks,
 Ankur

 [1]
 https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12345166
 [2] https://dist.apache.org/repos/dist/dev/beam/2.13.0/
 [3] https://dist.apache.org/repos/dist/release/beam/KEYS
 [4]
 https://repository.apache.org/content/repositories/orgapachebeam-1069/
 [5] https://github.com/apache/beam/tree/v2.13.0-RC1
 [6] https://github.com/apache/beam/pull/8645
 [7] https://github.com/apache/beam-site/pull/589
 [8]
 https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1031196952

>>>


Re: Definition of Unified model

2019-05-29 Thread Jan Lukavský
> Offsets within a file, unordered between files seems exactly analogous
> with offsets within a partition, unordered between partitions, right?


Not exactly. The key difference is that partitions in streaming
stores are defined (on purpose, and with key impact on this discussion)
as unbounded sequences of appends. Files, on the other hand, are always
of finite size. This difference makes the semantics of offsets in a
partitioned stream useful, because they are guaranteed to only increase.
In batch stores such as files, these offsets would have to start from
zero after some (finite) time, which makes them useless for comparison.
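
A toy illustration of the difference (Python, purely illustrative):

    # Kafka-like partition: offsets only ever increase, so they order events
    partition = [(0, 'a'), (1, 'b'), (2, 'c')]   # (offset, value)

    # Files: offsets restart at zero in every file, and the files themselves
    # carry no total order, so offsets alone cannot be compared
    file_1 = [(0, 'a'), (1, 'b')]
    file_2 = [(0, 'c'), (1, 'd')]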


On 5/29/19 2:44 PM, Robert Bradshaw wrote:

On Tue, May 28, 2019 at 12:18 PM Jan Lukavský  wrote:

As I understood it, Kenn was supporting the idea that sequence metadata
is preferable over FIFO. I was trying to point out that it should even
provide the same functionality as FIFO, plus one important extra -
reproducibility and the ability to be persisted and reused the same way
in batch and streaming.

There is no doubt that sequence metadata can be stored in every
storage. But, regarding some implicit ordering that sources might have -
yes, of course, data written into HDFS or Cloud Storage has ordering,
but only partial - inside some bulk (e.g. a file) - and the ordering is
not defined correctly on the boundaries of these bulks (between files).
That is why I'd say that ordering of sources is relevant only for
(partitioned!) streaming sources and generally always reduces to
sequence metadata (e.g. offsets).

Offsets within a file, unordered between files seems exactly analogous
with offsets within a partition, unordered between partitions, right?


On 5/28/19 11:43 AM, Robert Bradshaw wrote:

Huge +1 to all Kenn said.

Jan, batch sources can have orderings too, just like Kafka. I think
it's reasonable (for both batch and streaming) that if a source has an
ordering that is an important part of the data, it should preserve
this ordering into the data itself (e.g. as sequence numbers, offsets,
etc.)

On Fri, May 24, 2019 at 10:35 PM Kenneth Knowles  wrote:

I strongly prefer explicit sequence metadata over FIFO requirements, because:

   - FIFO is complex to specify: for example Dataflow has "per stage key-to-key" FIFO 
today, but it is not guaranteed to remain so (plus "stage" is not a portable concept, nor 
even guaranteed to remain a Dataflow concept)
   - complex specifications are by definition poor usability (if necessary, 
then it is what it is)
   - overly restricts the runner, reduces parallelism, for example any non-stateful ParDo 
has per-element parallelism, not per "key"
   - another perspective on that: FIFO makes everyone pay rather than just the 
transform that requires exactly sequencing
   - previous implementation details like reshuffles become part of the model
   - I'm not even convinced the use cases involved are addressed by some careful FIFO 
restrictions; many sinks re-key and they would all have to become aware of how keying of 
a sequence of "stages" affects the end-to-end FIFO

A noop becoming a non-noop is essentially the mathematical definition of moving 
from higher-level to lower-level abstraction.

So this strikes at the core question of what level of abstraction Beam aims to 
represent. Lower-level means there are fewer possible implementations and it is 
more tied to the underlying architecture, and anything not near-exact match 
pays a huge penalty. Higher-level means there are more implementations possible 
with different tradeoffs, though they may all pay a minor penalty.

I could be convinced to change my mind, but it needs some extensive design, 
examples, etc. I think it is probably about the most consequential design 
decision in the whole Beam model, around the same level as the decision to use 
ParDo and GBK as the primitives IMO.

Kenn


Re: Question about building Go SDK

2019-05-29 Thread Robert Burke
Not a bother at all! The Gradle-based build for the Go SDK is brittle, and
there are a few issues around it. Notably, it doesn't really line up
with how end users will typically acquire dependencies, using the regular
Go toolchain.

As Luke says, if you're working with or *on* the SDK, one can live almost
entirely on the standard go toolchain.
It's a goal to get the Gradle dependencies to be backed by Go Modules in
the next few months. This will make sure that what Beam is testing more
closely matches what users will depend on.

If you're trying out the SDK, the main need for Gradle is to build an SDK
image for yourself. I don't believe we publish common use worker images for
the experimental SDK at this time.

On Wed, May 29, 2019, 2:39 PM Lukasz Cwik  wrote:

> I have gotten the same error on other dependencies. Gogradle sometimes
> fails to pull down the dependencies. Retrying ./gradlew
> :sdks:go:resolveBuildDependencies in the past did allow me to fetch the
> dependencies. Others who focus only on the Go SDK work using the standard
> Go development process by checking out the apache beam package into their
> GOPATH and editing it there. I rarely do this because I typically focus on
> changes that are either outside the Go SDK or span multiple languages.
>
>
> On Wed, May 29, 2019 at 5:00 AM Kengo Seki  wrote:
>
>> It was my fault, probably my gradle cache was broken.
>> Running gradlew succeeded after removing ~/.gradle.
>> I'm sorry for bothering you.
>>
>> Kengo Seki 
>>
>> On Wed, May 29, 2019 at 6:45 PM Kengo Seki  wrote:
>> >
>> > Hi Beam developers,
>> >
>> > I tried to build Go SDK on the master branch, but encountered the
>> > following error.
>> >
>> > ```
>> > $ ./gradlew :sdks:go:resolveBuildDependencies
>> >
>> > (snip)
>> >
>> > FAILURE: Build failed with an exception.
>> >
>> > * What went wrong:
>> > Execution failed for task ':sdks:go:resolveBuildDependencies'.
>> > > Exception in resolution, message is:
>> >   Cannot resolve dependency:github.com/coreos/etcd:
>> > commit='11214aa33bf5a47d3d9d8dafe0f6b97237dfe921',
>> > urls=[https://github.com/co
>> > reos/etcd.git, g...@github.com:coreos/etcd.git]
>> >   Resolution stack is:
>> >   +- github.com/apache/beam/sdks/go
>> > ```
>> >
>> > Replacing the "coreos" organizations with "etcd-io" in GitHub URLs in
>> > gogradle.lock seems to work.
>> >
>> > ```
>> > $ sed -i s,coreos/etcd,etcd-io/etcd,g sdks/go/gogradle.lock
>> > $ ./gradlew :sdks:go:resolveBuildDependencies
>> >
>> > (snip)
>> >
>> > BUILD SUCCESSFUL in 4s
>> > 2 actionable tasks: 2 executed
>> > ```
>> >
>> > But I'm not sure whether it's a valid solution, since I'm not familiar
>> > with Gogradle and it seems that gogradle.lock shouldn't be edited
>> > directly according to its header comment.
>> > Is the above approach valid, or is there a more proper way to solve
>> > this problem...?
>> >
>> > Kengo Seki 
>>
>


Re: [VOTE] Release 2.13.0, release candidate #1

2019-05-29 Thread Valentyn Tymofieiev
Hi Vasiullah,

I am not aware of such function. I suggest that you start a new thread on
u...@beam.apache.org mailing list for this question and describe your
use-case there.

On Wed, May 29, 2019 at 2:39 PM Vasiullah syed  wrote:

> Hello All,
>
>
>   Can anyone please help me out with an inbuilt function in Apache Spark
> named "monotonically increasing id"? Is there any similar kind of function
> in Apache Beam? If so, please reply with more detail. Thanks in advance.
>
> On Tue, May 28, 2019 at 9:49 PM Valentyn Tymofieiev 
> wrote:
>
>> -1.
>> I would like us to fix
>> https://issues.apache.org/jira/browse/BEAM-7439 for 2.13.0. It is a
>> regression that happened in 2.12.0, but was not caught by existing tests.
>>
>> Thanks,
>> Valentyn
>>
>> On Wed, May 22, 2019, 4:30 PM Ankur Goenka  wrote:
>>
>>> Hi everyone,
>>>
>>> Please review and vote on the release candidate #1 for the version
>>> 2.13.0, as follows:
>>>
>>> [ ] +1, Approve the release
>>> [ ] -1, Do not approve the release (please provide specific comments)
>>>
>>> The complete staging area is available for your review, which includes:
>>> * JIRA release notes [1],
>>> * the official Apache source release to be deployed to dist.apache.org
>>> [2], which is signed with the key with fingerprint
>>> 6356C1A9F089B0FA3DE8753688934A6699985948 [3],
>>> * all artifacts to be deployed to the Maven Central Repository [4],
>>> * source code tag "v2.13.0-RC1" [5],
>>> * website pull request listing the release [6] and publishing the API
>>> reference manual [7].
>>> * Python artifacts are deployed along with the source release to the
>>> dist.apache.org [2].
>>> * Validation sheet with a tab for 2.13.0 release to help with validation
>>> [8].
>>>
>>> The vote will be open for at least 72 hours. It is adopted by majority
>>> approval, with at least 3 PMC affirmative votes.
>>>
>>> Thanks,
>>> Ankur
>>>
>>> [1]
>>> https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12345166
>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.13.0/
>>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> [4]
>>> https://repository.apache.org/content/repositories/orgapachebeam-1069/
>>> [5] https://github.com/apache/beam/tree/v2.13.0-RC1
>>> [6] https://github.com/apache/beam/pull/8645
>>> [7] https://github.com/apache/beam-site/pull/589
>>> [8]
>>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1031196952
>>>
>>


Re: [DISCUSS] Autoformat python code with Black

2019-05-29 Thread Katarzyna Kucharczyk
I think all of the mentioned arguments are reasonable. As Ismaël said, this
all seems to be a tradeoff, but maybe one worth considering.

Maybe we should think about why it's worth introducing a tool which will make
us strictly follow PEP 8.

First of all, it should help with good coding practices. For example, with 4
spaces, if someone repeatedly nests a lot of constructs, that code will be
more difficult to read. So the author (hopefully) will refactor it and maybe
split it into functions.

Also, most of the rules would help keep our future commits cleaner. For
example, with a comma at the end of a list, adding a new item is a change of
1 line instead of 2.
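
A sketch of what that buys us in a diff:

    # without a trailing comma, adding 'c' touches two lines
    items = [
        'a',
        'b'
    ]

    # with a trailing comma, adding 'c' touches only one line
    items = [
        'a',
        'b',
    ]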

I agree with Max: it's understandable this may be a bit problematic when
searching git history, but I guess this is the same case as the Java SDK and
Spotless. Maybe there is a solution somewhere to filter that better? After a
quick search I found this [1] (so maybe the solution is in git?).

What else do we actually gain? My guess is faster PR review iteration. We
will skip some conversations about code style. Also, everything will be
formatted in the same way, so eventually it will be easier to read.

Moreover, as it was mentioned in previous emails - a lot of Jenkins
failures won’t take place, so we save time and resources.

Last but not least, a new contributor may be less discouraged. When I started
contributing I didn’t know how to format my code and I lost a lot of time
adding pylint and adjusting IntelliJ. I eventually failed. Currently I write
code intuitively and, when I don’t forget, I rerun tox.

One of the disadvantages is that our pipelines have custom syntax and after
formatting they look a little bit weird, but maybe extending the only
configurable option in Black - line length - from 88 to 110 would be a solution.
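
If we went that way, the configuration would be something like this sketch
(assuming Black's pyproject.toml support):

    [tool.black]
    line-length = 110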

The second one is that Black requires Python 3 to run. I don’t know how big
an obstacle that would be.

I believe there are two options for introducing Black. First: just do it; it
will hurt, but then it will be ok (same as a dentist appointment). Of course
it may require some work to adjust the linters. On the other hand, we could
do it gradually and start including SDK parts one by one - maybe that will be
less painful?

As an example I can share one of the projects [2] I know that uses Black
(they also use other cool checkers and pre-commit [3]). This is how their
build with all checks looks [4].

To sum up, I believe that if we want to improve our coding experience, we
should improve our toolset. Black seems to be a recent and quite popular
tool, which makes me think they won’t stop developing it.

[1]
https://stackoverflow.com/questions/4112410/git-change-styling-whitespace-without-changing-ownership-blame

[2]  https://github.com/GoogleCloudPlatform/oozie-to-airflow

[3] https://pre-commit.com

[4]
https://travis-ci.org/GoogleCloudPlatform/oozie-to-airflow/builds/538725689


On Wed, May 29, 2019 at 2:01 PM Robert Bradshaw  wrote:

> Reformatting to 4 spaces seems a non-starter to me, as it would change
> nearly every single line in the codebase (and the loss of all context as
> well as that particular line).
>
> This is probably why the 2-space fork exists. However, we don't conform to
> that either--we use 2 spaces for indentation, but 4 for continuation
> indentation. (As for the history of this, this goes back to Google's
> internal style guide, probably motivated by consistency with C++, Java, ...
> and the fact that with an indent level of 4 one ends up wrapping lines
> quite frequently (it's telling that black's default line length is 88)).
> This turns out to be an easy change to the codebase.
>
> Once we move beyond the 2 vs. 4 whitespace thing, I found that this tool
> introduces a huge amount of vertical whitespace (e.g. closing parentheses
> on their own line), e.g.
>
> def foo(
> args
> ):
>   if (
>   long expression)
>   ):
> func(
> args
> )
>
> I wrote a simple post-processor to put closing parentheses on the same
> lines, as well as omit the newline after "if (", and disable formatting
> of strings, which reduces the churn in our codebase to 15k lines (adding
> about 4k) out of 200k total.
>
> https://github.com/apache/beam/pull/8712/files
>
> It's still very opinionated, often in different ways than me, and doesn't
> understand the semantics of the code, but possibly something we could live
> with given the huge advantages of an autoformatter.
>
> An intermediate point would be to allow, but not require, autoformatting
> of changed lines.
>
> As for being beta quality, it looks like it's got a decent number of
> contributors and in my book being in the python github project is a strong
> positive signal. But, due to the above issues, I think we'd have to
> maintain a fork. (The code is pretty lightweight, the 2 vs. 4 space issue
> is a 2-line change, and the rest implemented as a post-processing step (for
> now, incomplete), so it'd be easy to stay in sync with upstream.)
>
> On Wed, May 29, 2019 at 11:03 AM Ismaël 

Re: Definition of Unified model

2019-05-29 Thread Lukasz Cwik
Expanding the dimensionality could be the basis for loops within the graph,
since loops could be modeled as (time, loop iteration #, nested loop
iteration #, nested nested loop iteration #, ...).
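
A small sketch of the idea (Python, names illustrative): composite sequence
keys compare lexicographically, so sorting recovers the intended order across
iterations.

    # (event_time, loop_iteration) as a composite sequence key
    events = [((3, 0), 'c'), ((1, 1), 'b'), ((1, 0), 'a')]
    ordered = sorted(events)  # [((1, 0), 'a'), ((1, 1), 'b'), ((3, 0), 'c')]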

On Tue, May 28, 2019 at 12:10 PM Jan Lukavský  wrote:

> Could this be solved by "expanding the dimensionality" of the time field?
> What I mean by that: if an input element to FlatMap has sequence number T,
> then the (stateless) FlatMap knows the ordering of its output elements,
> right? If it expands the field by which it will next sort the elements to
> (X, 1), (X, 2), ... (X, N), then it would be possible to sort the elements
> back later. There seems to be no need for state to achieve that, right?
>
> Jan
> On 5/28/19 6:52 PM, Reuven Lax wrote:
>
> A slightly larger concern: it also will force users to create stateful
> DoFns everywhere to generate these sequence numbers. If I have a ParDo that
> is not a simple 1:1 transform (i.e. not MapElements), then the ParDo will
> need to generate its own sequence numbers for ordering, and the only safe
> way to do so is to use a stateful DoFn. This turns what used to be a simple
> in-memory DoFn into one that has to access state. Also I believe many
> runners will not fuse stateful DoFns. While none of this poses a problem
> for the model, it could make ordering extremely expensive to achieve.
>
> Reuven
>
> On Tue, May 28, 2019 at 6:09 AM Jan Lukavský  wrote:
>
>> Hi Reuven,
>>
>> > It also gets awkward with Flatten - the sequence number is no longer
>> enough, you must also encode which side of the flatten each element came
>> from.
>>
>> That is a generic need. Even if you read data from Kafka, the offsets are
>> comparable only inside a single partition. So, for Kafka to work as a FIFO
>> for ordering, elements with same key have to be pushed to the same
>> partition (otherwise Kafka cannot act as FIFO, because different partitions
>> can be handled by different brokers, which means different observers and
>> they therefore might not agree on the order of events). So if we want to
>> emulate FIFO per key, then the sequence IDs also have to be per key.
>> On 5/28/19 2:33 PM, Reuven Lax wrote:
>>
>> Sequence metadata does have the disadvantage that users can no longer use
>> the types coming from the source. You must create a new type that contains
>> a sequence number (unless Beam provides this). It also gets awkward with
>> Flatten - the sequence number is no longer enough, you must also encode
>> which side of the flatten each element came from.
>>
>> On Tue, May 28, 2019 at 3:18 AM Jan Lukavský  wrote:
>>
>>> As I understood it, Kenn was supporting the idea that sequence metadata
>>> is preferable over FIFO. I was trying to point out that it should even
>>> provide the same functionality as FIFO, plus one important extra -
>>> reproducibility and the ability to be persisted and reused the same way
>>> in batch and streaming.
>>>
>>> There is no doubt that sequence metadata can be stored in every
>>> storage. But, regarding some implicit ordering that sources might have -
>>> yes, of course, data written into HDFS or Cloud Storage has ordering,
>>> but only partial - inside some bulk (e.g. a file) - and the ordering is
>>> not defined correctly on the boundaries of these bulks (between files).
>>> That is why I'd say that ordering of sources is relevant only for
>>> (partitioned!) streaming sources and generally always reduces to
>>> sequence metadata (e.g. offsets).
>>>
>>> Jan
>>>
>>> On 5/28/19 11:43 AM, Robert Bradshaw wrote:
>>> > Huge +1 to all Kenn said.
>>> >
>>> > Jan, batch sources can have orderings too, just like Kafka. I think
>>> > it's reasonable (for both batch and streaming) that if a source has an
>>> > ordering that is an important part of the data, it should preserve
>>> > this ordering into the data itself (e.g. as sequence numbers, offsets,
>>> > etc.)
>>> >
>>> > On Fri, May 24, 2019 at 10:35 PM Kenneth Knowles 
>>> wrote:
>>> >> I strongly prefer explicit sequence metadata over FIFO requirements,
>>> because:
>>> >>
>>> >>   - FIFO is complex to specify: for example Dataflow has "per stage
>>> key-to-key" FIFO today, but it is not guaranteed to remain so (plus "stage"
>>> is not a portable concept, nor even guaranteed to remain a Dataflow concept)
>>> >>   - complex specifications are by definition poor usability (if
>>> necessary, then it is what it is)
>>> >>   - overly restricts the runner, reduces parallelism, for example any
>>> non-stateful ParDo has per-element parallelism, not per "key"
>>> >>   - another perspective on that: FIFO makes everyone pay rather than
>>> just the transform that requires exactly sequencing
>>> >>   - previous implementation details like reshuffles become part of
>>> the model
>>> >>   - I'm not even convinced the use cases involved are addressed by
>>> some careful FIFO restrictions; many sinks re-key and they would all have
>>> to become aware of how keying of a sequence of "stages" affects the
>>> end-to-end FIFO
>>> >>

Re: [DISCUSS] Autoformat python code with Black

2019-05-29 Thread Ahmet Altay
Thank you all for doing the work. The demo PR looks decent. I still have 3
major concerns. (1) I prefer not to maintain a fork of a tool with our own
customizations; it will add to our maintenance costs. (2) Even the small
change touches about 20k lines of code. (3) On the beta note, the specific
wording is "until the formatter becomes stable, you should expect some
formatting to change in the future." That signals that we might see
formatting churn.

And as a minor point, I do not understand why some of the reformatting
happened. For example, was the following change due to PEP 8 compliance or
the tool being opinionated?

if (not hasattr(result, 'has_job')  # direct runner
    or result.has_job):  # not just a template creation

-->

if (not hasattr(result, 'has_job') or result.has_job  # direct runner
    ):  # not just a template creation

So far all of us agree on the benefits of an autoformatter. I do not wish to
block this. Consider me -0, as long as someone is willing to maintain a fork
that works for our purposes.

Ahmet

On Wed, May 29, 2019 at 5:01 AM Robert Bradshaw  wrote:

> Reformatting to 4 spaces seems a non-starter to me, as it would change
> nearly every single line in the codebase (and the loss of all context as
> well as that particular line).
>
> This is probably why the 2-space fork exists. However, we don't conform to
> that either--we use 2 spaces for indentation, but 4 for continuation
> indentation. (As for the history of this, this goes back to Google's
> internal style guide, probably motivated by consistency with C++, Java, ...
> and the fact that with an indent level of 4 one ends up wrapping lines
> quite frequently (it's telling that black's default line length is 88)).
> This turns out to be an easy change to the codebase.
>
> Once we move beyond the 2 vs. 4 whitespace thing, I found that this tool
> introduces a huge amount of vertical whitespace (e.g. closing parentheses
> on their own line), e.g.
>
> def foo(
> args
> ):
>   if (
>   long expression)
>   ):
> func(
> args
> )
>
> I wrote a simple post-processor to put closing parentheses on the same
> lines, as well as omit the newline after "if (", and disable formatting
> of strings, which reduces the churn in our codebase to 15k lines (adding
> about 4k) out of 200k total.
>
> https://github.com/apache/beam/pull/8712/files
>
> It's still very opinionated, often in different ways than me, and doesn't
> understand the semantics of the code, but possibly something we could live
> with given the huge advantages of an autoformatter.
>
> An intermediate point would be to allow, but not require, autoformatting
> of changed lines.
>
> As for being beta quality, it looks like it's got a decent number of
> contributors and in my book being in the python github project is a strong
> positive signal. But, due to the above issues, I think we'd have to
> maintain a fork. (The code is pretty lightweight, the 2 vs. 4 space issue
> is a 2-line change, and the rest implemented as a post-processing step (for
> now, incomplete), so it'd be easy to stay in sync with upstream.)
>
> On Wed, May 29, 2019 at 11:03 AM Ismaël Mejía  wrote:
> >
> > > I think the question is if it can be configured in a way to fit our
> > > current linter's style. I don't think it is feasible to reformat the
> > > entire Python SDK.
> >
> > It cannot be configured to do what we actually do because Black is
> > configurable only to support the standard python codestyle guidelines
> > (PEP-8) which recommends 4 spaces and is what most projects in the
> > python world use.
> >
> > > Reformatted lines don't allow quick access to the Git history. This
> > > effect is still visible in the Java SDK. However, I have the feeling
> > > that this might be less of a problem with Python because the linter has
> > > more rules than Checkstyle had.
> >
> > Yes that’s the bad side effect but there are always tradeoffs we have
> > to deal with.
> >
> >
> >
> >
> > On Wed, May 29, 2019 at 10:52 AM Maximilian Michels 
> wrote:
> > >
> > > I think the question is if it can be configured in a way to fit our
> > > current linter's style. I don't think it is feasible to reformat the
> > > entire Python SDK.
> > >
> > > Reformatted lines don't allow quick access to the Git history. This
> > > effect is still visible in the Java SDK. However, I have the feeling
> > > that this might be less of a problem with Python because the linter has
> > > more rules than Checkstyle had.
> > >
> > > -Max
> > >
> > > On 29.05.19 10:16, Ismaël Mejía wrote:
> > > >> My concerns are:
> > > >> - The product is clearly marked as beta with a big warning.
> > > >> - It looks like mostly a single person project. For the same reason
> I also strongly prefer not using a fork for a specific setting. Fork will
> only have less people looking at it.
> > > >
> > > > I suppose the project is marked as beta because it is recent, it was
> > > > presented in 2018’s pycon, and because some 

Re: Question about building Go SDK

2019-05-29 Thread Lukasz Cwik
I have gotten the same error on other dependencies. Gogradle sometimes
fails to pull down the dependencies. Retrying ./gradlew
:sdks:go:resolveBuildDependencies in the past did allow me to fetch the
dependencies. Others who focus only on the Go SDK work using the standard
Go development process by checking out the apache beam package into their
GOPATH and editing it there. I rarely do this because I typically focus on
changes that are either outside the Go SDK or span multiple languages.
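
For reference, a sketch of that GOPATH-based flow (paths illustrative):

    mkdir -p $GOPATH/src/github.com/apache
    git clone https://github.com/apache/beam.git $GOPATH/src/github.com/apache/beam
    cd $GOPATH/src/github.com/apache/beam/sdks/go
    go build ./... && go test ./...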


On Wed, May 29, 2019 at 5:00 AM Kengo Seki  wrote:

> It was my fault, probably my gradle cache was broken.
> Running gradlew succeeded after removing ~/.gradle.
> I'm sorry for bothering you.
>
> Kengo Seki 
>
> On Wed, May 29, 2019 at 6:45 PM Kengo Seki  wrote:
> >
> > Hi Beam developers,
> >
> > I tried to build Go SDK on the master branch, but encountered the
> > following error.
> >
> > ```
> > $ ./gradlew :sdks:go:resolveBuildDependencies
> >
> > (snip)
> >
> > FAILURE: Build failed with an exception.
> >
> > * What went wrong:
> > Execution failed for task ':sdks:go:resolveBuildDependencies'.
> > > Exception in resolution, message is:
> >   Cannot resolve dependency:github.com/coreos/etcd:
> > commit='11214aa33bf5a47d3d9d8dafe0f6b97237dfe921',
> > urls=[https://github.com/co
> > reos/etcd.git, g...@github.com:coreos/etcd.git]
> >   Resolution stack is:
> >   +- github.com/apache/beam/sdks/go
> > ```
> >
> > Replacing the "coreos" organizations with "etcd-io" in GitHub URLs in
> > gogradle.lock seems to work.
> >
> > ```
> > $ sed -i s,coreos/etcd,etcd-io/etcd,g sdks/go/gogradle.lock
> > $ ./gradlew :sdks:go:resolveBuildDependencies
> >
> > (snip)
> >
> > BUILD SUCCESSFUL in 4s
> > 2 actionable tasks: 2 executed
> > ```
> >
> > But I'm not sure whether it's a valid solution, since I'm not familiar
> > with Gogradle and it seems that gogradle.lock shouldn't be edited
> > directly according to its header comment.
> > Is the above approach valid, or is there a more proper way to solve
> > this problem...?
> >
> > Kengo Seki 
>


Re: Shuffling on apache beam

2019-05-29 Thread Pablo Estrada
If you add a stateful DoFn to your pipeline, you'll force Beam to shuffle
data to the corresponding worker per key. I am not sure what the latency cost
of doing this is (as the messages still need to be shuffled). But it may help
you accomplish this without adding windowing+triggering.
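
For illustration, a minimal sketch of such a stateful DoFn with the Python
SDK's userstate API (names illustrative; stateful DoFn support varies by
runner):

    import apache_beam as beam
    from apache_beam.coders import VarIntCoder
    from apache_beam.transforms.userstate import CombiningValueStateSpec

    class CountPerKey(beam.DoFn):
      # Per-key state: the runner has to route every element of a key to the
      # worker holding this state cell, which is what forces the per-key shuffle.
      COUNT = CombiningValueStateSpec('count', VarIntCoder(), sum)

      def process(self, element, count=beam.DoFn.StateParam(COUNT)):
        key, value = element
        count.add(1)
        yield key, value, count.read()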

-P.

On Wed, May 29, 2019 at 5:16 AM pasquale.bon...@gmail.com <
pasquale.bon...@gmail.com> wrote:

> Hi Reza,
> with a GlobalWindow with triggering I was able to reduce the hotspot issues,
> gaining satisfying performance for the BigTable update. Unfortunately, the
> latency when getting messages from PubSub remains around 1.5s, which is too
> much considering our NFR.
>
> This is the code I use to get the messages:
> PCollectionTuple rawTransactions = p
>     .apply("GetMessages",
>         PubsubIO.readMessagesWithAttributes()
>             .withIdAttribute(TRANSACTION_MESSAGE_ID_FIELD_NAME)
>             .withTimestampAttribute(TRANSACTION_MESSAGE_TIMESTAMP_FIELD_NAME)
>             .fromTopic(topic))
>     .apply(Window.configure()
>         .triggering(Repeatedly
>             .forever(AfterWatermark.pastEndOfWindow()
>                 .withEarlyFirings(AfterProcessingTime
>                     .pastFirstElementInPane()
>                     .plusDelayOf(Duration.millis(1)))
>                 // Fire on any late data
>                 .withLateFirings(AfterPane.elementCountAtLeast(1))))
>         .discardingFiredPanes());
>
> Messages are produced with a different dataflow:
>  Pipeline p = Pipeline.create(options);
>  p.apply(
>          "ReadFile",
>          TextIO.read()
>              .from(options.getInputLocation() + "/*.csv")
>              .watchForNewFiles(
>                  // Check for new files every 600 ms
>                  Duration.millis(600),
>                  // Never stop checking for new files
>                  Watch.Growth.never()))
>      .apply(
>          "create message",
>          ParDo.of(
>              new DoFn<String, PubsubMessage>() {
>                @ProcessElement
>                public void processElement(ProcessContext context) {
>                  String line = context.element();
>                  String payload = convertRow(line);
>                  long now = LocalDateTime.now().atZone(ZoneId.systemDefault())
>                      .toInstant().toEpochMilli();
>                  context.output(
>                      new PubsubMessage(
>                          payload.getBytes(),
>                          ImmutableMap.of(
>                              TRANSACTION_MESSAGE_ID_FIELD_NAME, payload.split(",")[6],
>                              TRANSACTION_MESSAGE_TIMESTAMP_FIELD_NAME, Long.toString(now))));
>                }
>              }))
>      .apply("publish message", PubsubIO.writeMessages().to(topic));
>
> I'm uploading a file containing 100 rows every 600 ms.
>
> I found different threads on Stack Overflow around this latency issue, but
> none has a solution.
>
>
>
>
> On 2019/05/24 07:19:02, Reza Rokni  wrote:
> > PS You can also make use of the GlobalWindow with a stateful DoFn.
> >
> > On Fri, 24 May 2019 at 15:13, Reza Rokni  wrote:
> >
> > > Hi,
> > >
> > > Have you explored the use of triggers with your use case?
> > >
> > >
> > >
> https://beam.apache.org/releases/javadoc/2.12.0/org/apache/beam/sdk/transforms/windowing/Trigger.html
> > >
> > > Cheers
> > >
> > > Reza
> > >
> > > On Fri, 24 May 2019 at 14:14, pasquale.bon...@gmail.com <
> > > pasquale.bon...@gmail.com> wrote:
> > >
> > >> Hi Reuven,
> > >> I would like to know if it is possible to guarantee that records are
> > >> processed by the same thread/task based on a key, as probably happens
> > >> in a combine/stateful operation, without adding the delay of a window.
> > >> This could increase the efficiency of caching and reduce some racing
> > >> conditions when writing data.
> > >> I understand that workers are not part of the programming model, so I
> > >> would like to know if it's possible to achieve this behaviour while
> > >> reducing the delay of windowing to a minimum. We don't need any
> > >> combine or state; we just want all records with a given key to be
> > >> sent to the same thread.
> > >>
> > >> Thanks
> > >>
> > >>
> > >> On 2019/05/24 03:20:13, Reuven Lax  wrote:
> > >> > Can you explain what you mean by worker? While every runner has
> workers
> > >> of
> > >> > course, workers are not part of the programming model.
> > >> >
> > >> > On Thu, May 23, 2019 at 8:13 PM pasquale.bon...@gmail.com <
> > >> > pasquale.bon...@gmail.com> wrote:
> > >> >
> > >> > > Hi all,
> > >> > > I would like to know if Apache Beam has functionality similar to
> > >> > > fieldsGrouping in Storm that allows sending records to a specific
> > >> > > task/worker based on a key.
> > >> > > I know that we can achieve that with a combine/groupByKey
> > >> > > operation, but that implies adding windowing to our pipeline,
> > >> > > which we don't want.
> > >> > > 

Re: [VOTE] Release 2.13.0, release candidate #1

2019-05-29 Thread Ahmet Altay
We have quite a few cherry-pick requests. Are they all for major/blocking
issues? Have we uncovered issues in release validation that are normally
missed by our daily tests?

On Wed, May 29, 2019 at 10:20 AM Thomas Weise  wrote:

> Added: https://github.com/apache/beam/pull/8714
>
>
> On Tue, May 28, 2019 at 4:03 PM Ankur Goenka  wrote:
>
>> Open cherry pick PRs for spark runner
>> https://github.com/apache/beam/pull/8705
>> https://github.com/apache/beam/pull/8706
>>
>> On Tue, May 28, 2019 at 3:42 PM Valentyn Tymofieiev 
>> wrote:
>>
>>> Yes, looking into that.
>>>
>>> On Tue, May 28, 2019 at 3:37 PM Ankur Goenka  wrote:
>>>
 Valentyn, Can you please send the cherry pick PR for
 https://issues.apache.org/jira/browse/BEAM-7439

 On Tue, May 28, 2019 at 3:04 PM Ankur Goenka  wrote:

> Sure, I will cherry pick those PRs.
>
> On Tue, May 28, 2019 at 2:19 PM Kyle Weaver 
> wrote:
>
>> Hi Ankur,
>>
>> It's not a blocker, but I'd like to see
>> https://github.com/apache/beam/pull/8558 and
>> https://github.com/apache/beam/pull/8569 be included so TFX examples
>> can be run without errors on the 2.13.0 Spark runner (
>> https://github.com/tensorflow/tfx/pull/84).
>>
>> Kyle Weaver | Software Engineer | github.com/ibzib |
>> kcwea...@google.com | +1650203
>>
>>
>> On Tue, May 28, 2019 at 11:53 AM Ankur Goenka 
>> wrote:
>>
>>> Thanks for the validation.
>>>
>>> I have marked fixed version of
>>> https://issues.apache.org/jira/browse/BEAM-7406
>>> https://issues.apache.org/jira/browse/BEAM-6380 to be 2.13.0 and
>>> will cherry pick the associated commits to the jira.
>>>
>>>
>>> On Tue, May 28, 2019 at 11:19 AM Lukasz Cwik 
>>> wrote:
>>>
 I would also suggest to get
 https://github.com/apache/beam/pull/8668 in to 2.13.0 since it
 fixes a logging setup issue on Dataflow (BEAM-7406).

 On Tue, May 28, 2019 at 10:22 AM Chamikara Jayalath <
 chamik...@google.com> wrote:

> I would also like to get https://github.com/apache/beam/pull/8661 in
> to 2.13.0 that fixes
> https://issues.apache.org/jira/browse/BEAM-6380. It's not a new
> issue but has affected a number of users.
>
> - Cham
>
> On Tue, May 28, 2019 at 9:31 AM Valentyn Tymofieiev <
> valen...@google.com> wrote:
>
>> Thanks, Juta Staes, for reporting this issue.
>>
>> On Tue, May 28, 2019, 9:19 AM Valentyn Tymofieiev <
>> valen...@google.com> wrote:
>>
>>> -1.
>>> I would like us to fix
>>> https://issues.apache.org/jira/browse/BEAM-7439 for 2.13.0. It
>>> is a regression that happened in 2.12.0, but was not caught by 
>>> existing
>>> tests.
>>>
>>> Thanks,
>>> Valentyn
>>>
>>> On Wed, May 22, 2019, 4:30 PM Ankur Goenka 
>>> wrote:
>>>
 Hi everyone,

 Please review and vote on the release candidate #1 for the
 version 2.13.0, as follows:

 [ ] +1, Approve the release
 [ ] -1, Do not approve the release (please provide specific
 comments)

 The complete staging area is available for your review, which
 includes:
 * JIRA release notes [1],
 * the official Apache source release to be deployed to
 dist.apache.org [2], which is signed with the key with
 fingerprint 6356C1A9F089B0FA3DE8753688934A6699985948 [3],
 * all artifacts to be deployed to the Maven Central Repository
 [4],
 * source code tag "v2.13.0-RC1" [5],
 * website pull request listing the release [6] and publishing
 the API reference manual [7].
 * Python artifacts are deployed along with the source release
 to the dist.apache.org [2].
 * Validation sheet with a tab for 2.13.0 release to help with
 validation [8].

 The vote will be open for at least 72 hours. It is adopted by
 majority approval, with at least 3 PMC affirmative votes.

 Thanks,
 Ankur

 [1]
 https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12345166
 [2] https://dist.apache.org/repos/dist/dev/beam/2.13.0/
 [3] https://dist.apache.org/repos/dist/release/beam/KEYS
 [4]
 https://repository.apache.org/content/repositories/orgapachebeam-1069/
 [5] https://github.com/apache/beam/tree/v2.13.0-RC1
 [6] https://github.com/apache/beam/pull/8645
 [7] https://github.com/apache/beam-site/pull/589
 [8]
 

Re: [DISCUSS] Autoformat python code with Black

2019-05-29 Thread Thomas Weise
Thanks Ismaël for bringing this up. Dealing with Py lint issues has been
painful and unproductive.

For the Java side it became ./gradlew spotlessApply and there is no looking
back. To get there it was also necessary to reformat some of the code, but
the net benefits are evident.

Even if we have to maintain a fork for the formatter, it will probably
still be a big win.

Thomas


On Wed, May 29, 2019 at 5:01 AM Robert Bradshaw  wrote:

> Reformatting to 4 spaces seems a non-starter to me, as it would change
> nearly every single line in the codebase (and the loss of all context as
> well as that particular line).
>
> This is probably why the 2-space fork exists. However, we don't conform to
> that either--we use 2 spaces for indentation, but 4 for continuation
> indentation. (As for the history of this, this goes back to Google's
> internal style guide, probably motivated by consistency with C++, Java, ...
> and the fact that with an indent level of 4 one ends up wrapping lines
> quite frequently (it's telling that black's default line length is 88)).
> This turns out to be an easy change to the codebase.
>
> Once we move beyond the 2 vs. 4 whitespace thing, I found that this tool
> introduces a huge amount of vertical whitespace (e.g. closing parentheses
> on their own line), e.g.
>
> def foo(
> args
> ):
>   if (
>   long expression)
>   ):
> func(
> args
> )
>
> I wrote a simple post-processor to put closing parentheses on the same
> lines, as well as omit the newline after "if (", and disable formatting
> of strings, which reduces the churn in our codebase to 15k lines (adding
> about 4k) out of 200k total.
>
> https://github.com/apache/beam/pull/8712/files
>
> It's still very opinionated, often in different ways than me, and doesn't
> understand the semantics of the code, but possibly something we could live
> with given the huge advantages of an autoformatter.
>
> An intermediate point would be to allow, but not require, autoformatting
> of changed lines.
>
> As for being beta quality, it looks like it's got a decent number of
> contributors and in my book being in the python github project is a strong
> positive signal. But, due to the above issues, I think we'd have to
> maintain a fork. (The code is pretty lightweight, the 2 vs. 4 space issue
> is a 2-line change, and the rest implemented as a post-processing step (for
> now, incomplete), so it'd be easy to stay in sync with upstream.)
>
> On Wed, May 29, 2019 at 11:03 AM Ismaël Mejía  wrote:
> >
> > > I think the question is if it can be configured in a way to fit our
> > > current linter's style. I don't think it is feasible to reformat the
> > > entire Python SDK.
> >
> > It cannot be configured to do what we actually do because Black is
> > configurable only to support the standard python codestyle guidelines
> > (PEP-8) which recommends 4 spaces and is what most projects in the
> > python world use.
> >
> > > Reformatted lines don't allow quick access to the Git history. This
> > > effect is still visible in the Java SDK. However, I have the feeling
> > > that this might be less of a problem with Python because the linter has
> > > more rules than Checkstyle had.
> >
> > Yes that’s the bad side effect but there are always tradeoffs we have
> > to deal with.
> >
> >
> >
> >
> > On Wed, May 29, 2019 at 10:52 AM Maximilian Michels 
> wrote:
> > >
> > > I think the question is if it can be configured in a way to fit our
> > > current linter's style. I don't think it is feasible to reformat the
> > > entire Python SDK.
> > >
> > > Reformatted lines don't allow quick access to the Git history. This
> > > effect is still visible in the Java SDK. However, I have the feeling
> > > that this might be less of a problem with Python because the linter has
> > > more rules than Checkstyle had.
> > >
> > > -Max
> > >
> > > On 29.05.19 10:16, Ismaël Mejía wrote:
> > > >> My concerns are:
> > > >> - The product is clearly marked as beta with a big warning.
> > > >> - It looks like mostly a single person project. For the same reason
> I also strongly prefer not using a fork for a specific setting. Fork will
> only have less people looking at it.
> > > >
> > > > I suppose the project is marked as beta because it is recent, it was
> > > > presented in 2018’s pycon, and because some things can change since
> > > > auto-formatters are pretty tricky beasts, I think beta in that case
> is
> > > > like our own ‘@Experimental’. If you look at the contribution page
> [1]
> > > > you can notice that it is less and less a single person project,
> there
> > > > have been 93 independent contributions since the project became
> > > > public, and the fact that it is hosted in the python organization
> > > > github [2] gives some confidence on the project continuity.
> > > >
> > > > You are right however about the fact that the main author seems to be
> > > > the ‘benevolent’ dictator, and in the 2-spaces issue he can seem
> > > > arbitrary, but he 

Re: [VOTE] Release 2.13.0, release candidate #1

2019-05-29 Thread Thomas Weise
Added: https://github.com/apache/beam/pull/8714


On Tue, May 28, 2019 at 4:03 PM Ankur Goenka  wrote:

> Open cherry pick PRs for spark runner
> https://github.com/apache/beam/pull/8705
> https://github.com/apache/beam/pull/8706
>
> On Tue, May 28, 2019 at 3:42 PM Valentyn Tymofieiev 
> wrote:
>
>> Yes, looking into that.
>>
>> On Tue, May 28, 2019 at 3:37 PM Ankur Goenka  wrote:
>>
>>> Valentyn, Can you please send the cherry pick PR for
>>> https://issues.apache.org/jira/browse/BEAM-7439
>>>
>>> On Tue, May 28, 2019 at 3:04 PM Ankur Goenka  wrote:
>>>
 Sure, I will cherry pick those PRs.

 On Tue, May 28, 2019 at 2:19 PM Kyle Weaver 
 wrote:

> Hi Ankur,
>
> It's not a blocker, but I'd like to see
> https://github.com/apache/beam/pull/8558 and
> https://github.com/apache/beam/pull/8569 be included so TFX examples
> can be run without errors on the 2.13.0 Spark runner (
> https://github.com/tensorflow/tfx/pull/84).
>
> Kyle Weaver | Software Engineer | github.com/ibzib |
> kcwea...@google.com | +1650203
>
>
> On Tue, May 28, 2019 at 11:53 AM Ankur Goenka 
> wrote:
>
>> Thanks for the validation.
>>
> >> I have marked the fixed version of
> >> https://issues.apache.org/jira/browse/BEAM-7406 and
> >> https://issues.apache.org/jira/browse/BEAM-6380 to be 2.13.0 and
> >> will cherry-pick the associated commits to the JIRA.
>>
>>
>> On Tue, May 28, 2019 at 11:19 AM Lukasz Cwik 
>> wrote:
>>
>>> I would also suggest to get https://github.com/apache/beam/pull/8668
>>> in to 2.13.0 since it fixes a logging setup issue on Dataflow 
>>> (BEAM-7406).
>>>
>>> On Tue, May 28, 2019 at 10:22 AM Chamikara Jayalath <
>>> chamik...@google.com> wrote:
>>>
 I would also like to get https://github.com/apache/beam/pull/8661 in
 to 2.13.0 that fixes
 https://issues.apache.org/jira/browse/BEAM-6380. It's not a new
 issue but has affected a number of users.

 - Cham

 On Tue, May 28, 2019 at 9:31 AM Valentyn Tymofieiev <
 valen...@google.com> wrote:

> Thanks, Juta Staes, for reporting this issue.
>
> On Tue, May 28, 2019, 9:19 AM Valentyn Tymofieiev <
> valen...@google.com> wrote:
>
>> -1.
>> I would like us to fix
>> https://issues.apache.org/jira/browse/BEAM-7439 for 2.13.0. It
>> is a regression that happened in 2.12.0, but was not caught by 
>> existing
>> tests.
>>
>> Thanks,
>> Valentyn
>>
>> On Wed, May 22, 2019, 4:30 PM Ankur Goenka 
>> wrote:
>>
>>> Hi everyone,
>>>
>>> Please review and vote on the release candidate #1 for the
>>> version 2.13.0, as follows:
>>>
>>> [ ] +1, Approve the release
>>> [ ] -1, Do not approve the release (please provide specific
>>> comments)
>>>
>>> The complete staging area is available for your review, which
>>> includes:
>>> * JIRA release notes [1],
>>> * the official Apache source release to be deployed to
>>> dist.apache.org [2], which is signed with the key with
>>> fingerprint 6356C1A9F089B0FA3DE8753688934A6699985948 [3],
>>> * all artifacts to be deployed to the Maven Central Repository
>>> [4],
>>> * source code tag "v2.13.0-RC1" [5],
>>> * website pull request listing the release [6] and publishing
>>> the API reference manual [7].
>>> * Python artifacts are deployed along with the source release to
>>> the dist.apache.org [2].
>>> * Validation sheet with a tab for 2.13.0 release to help with
>>> validation [8].
>>>
>>> The vote will be open for at least 72 hours. It is adopted by
>>> majority approval, with at least 3 PMC affirmative votes.
>>>
>>> Thanks,
>>> Ankur
>>>
>>> [1]
> >>> https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12345166
>>> [2] https://dist.apache.org/repos/dist/dev/beam/2.13.0/
>>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS
>>> [4]
>>> https://repository.apache.org/content/repositories/orgapachebeam-1069/
>>> [5] https://github.com/apache/beam/tree/v2.13.0-RC1
>>> [6] https://github.com/apache/beam/pull/8645
>>> [7] https://github.com/apache/beam-site/pull/589
>>> [8]
>>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1031196952
>>>
>>


Re: Hazelcast Jet Runner

2019-05-29 Thread Reza Rokni
Hi,

Over 800 usages under java; might be worth doing a few PRs...

Also suggest we use a very light review process: first round, go for
low-hanging fruit; if anyone does a -1 against a change, then we leave that
for round two.

Thoughts?

Cheers

Reza

On Wed, 29 May 2019 at 12:05, Kenneth Knowles  wrote:

>
>
> On Mon, May 27, 2019 at 4:05 PM Reza Rokni  wrote:
>
>> "Many APIs that have been in place for years and are used by most Beam
>> users are still marked Experimental."
>>
>> Should there be a formal process in place to start 'graduating' features
>> out of @Experimental? Perhaps even target an upcoming release with a PR to
>> remove the annotation from well-established APIs?
>>
>
> Good idea. I think a PR like this would be an opportunity to discuss
> whether the feature is non-experimental. Probably many of them are ready.
> It would help to address Ismael's very good point that this new practice
> could make users think the old Experimental stuff is not experimental.
> Maybe it is true that it is not really still Experimental.
>
> Kenn
>
>
>
>> On Tue, 28 May 2019 at 06:44, Reuven Lax  wrote:
>>
>>> We generally use Experimental for two different things, which leads to
>>> confusion.
>>>   1. Features that work stably, but where we think we might still make
>>> some changes to the API.
>>>   2. New features that we think might not yet be stable.
>>>
>>> This dual usage leads to a lot of confusion IMO. The fact that we tend
>>> to forget to remove the @Experimental tag also makes it somewhat useless.
>>> Many APIs that have been in place for years and are used by most Beam users
>>> are still marked Experimental.
>>>
>>> Reuven
>>>
>>> On Mon, May 27, 2019 at 2:16 PM Ismaël Mejía  wrote:
>>>
 > Personally, I think that it is good that moving from experimental to
 non-experimental is a breaking change in the dependency - one has
 backwards-incompatible changes and the other does not. If artifacts had
 separate versioning we could use 0.x for this.

 In theory it seems so, but in practice it is an annoyance to an end
 user who already took the ‘risk’ of using an experimental feature.
 Awareness is probably not the most important reason to break existing
 code (even if it could be easily fixed). The alternative of doing this
 with version numbers at least seems less impactful, but can be
 confusing.

 > But biggest motivation for me are these:
 >
 >  - using experimental features should be opt-in
 >  - should be impossible to use an experimental feature without
 knowing it (so "opt-in" to a normal-looking feature is not enough)
 > - developers of an experimental feature should be motivated to
 "graduate" it

 The fundamental problem of this approach is inconsistency with our
 present/past. So far we have ‘Experimental’ features everywhere, so
 suddenly becoming opt-in leaves us in an inconsistent state. For example,
 all IOs are marked internally as Experimental but not at the level of
 directories/artifacts. Adding this suffix to a new IO, apart from adding
 fear of use for the end users, may also give the false impression that
 the older ones not explicitly marked are not experimental.

 What will be the state, for example, of runner modules that contain
 both mature, well-tested runners, like the old Flink and Spark
 runners, vs. the more experimental new translations for Portability?
 Again, more confusion.

 > FWIW I don't think "experimental" should be viewed as a bad thing. It
 just means you are able to make backwards-incompatible changes, and that
 users should be aware that they will need to adjust APIs (probably only a
 little) with new releases. Most software is not very good until it has been
 around for a long time, and in my experience the problem is missing the
 mark on abstractions, so backwards compatibility *must* be broken to
 achieve quality. Freezing it early dooms it to never achieving high
 quality. I know of projects where the users explicitly requested that the
 developers not freeze the API but instead prioritize speed and quality.

 I agree 100% with the arguments, but let’s think in the reverse terms:
 highlighting lack of maturity can play against the intended goal of
 use and adoption, even if done for a noble reason. It is basic priming
 101 [1].

 > Maybe the word is just too negative-sounding? Alternatives might be
 "unstable" or "incubating".

 Yes! “experimental” should not be viewed as a bad thing, unless you are
 a company that has fewer resources and is trying to protect its
 investment; in that case you may hesitate to use it. In this case
 probably incubating is a better term, because it has less of the
 ‘tentative’ dimension associated with Experimental.

 > Now, for the Jet runner, most runners sit on a branch for a while,
 not being 

Re: Definition of Unified model

2019-05-29 Thread Robert Bradshaw
On Tue, May 28, 2019 at 2:34 PM Reuven Lax  wrote:
>
> Sequence metadata does have the disadvantage that users can no longer use the 
> types coming from the source. You must create a new type that contains a 
> sequence number (unless Beam provides this).

Yes. Well, the source would have a withSequenceNumber or withOffset or
similar to provide this capability.

> It also gets awkward with Flatten - the sequence number is no
> longer enough, you must also encode which side of the flatten each
> element came from.

Only if you want to order things across different PCollections. But,
yes, you'd probably want to do a tagged flatten.

I still think making it explicit, and providing the ability to
(cheaply) order on it, is better than trying to come up with an
implicit model of where things are and aren't ordered.

(As an aside, we often have various pieces of metadata that are
attached to elements: timestamps, windows, pane infos, possible
trigger metadata (for data-driven triggers), sequence numbers, ...).
It may be interesting to have the ability to project an element into a
"visible" portion and an invisible "attached" portion, where the
latter would flow through just as timestamps do (where possible), and
then later make it visible again. Maybe the work with schema field
unpacking, etc. to only look at the fields one is interested in is
sufficient to sideline the noise of extra fields, but it wouldn't
allow the application of a (more) arbitrary PTransform to a
PCollection.

> On Tue, May 28, 2019 at 3:18 AM Jan Lukavský  wrote:
>>
>> As I understood it, Kenn was supporting the idea that sequence metadata
>> is preferable over FIFO. I was trying to point out that it even should
>> provide the same functionality as FIFO, plus one important thing more:
>> reproducibility and the ability to be persisted and reused the same way
>> in batch and streaming.
>>
>> There is no doubt, that sequence metadata can be stored in every
>> storage. But, regarding some implicit ordering that sources might have -
>> yes, of course, data written into HDFS or Cloud Storage has ordering,
>> but only partial - inside some bulk (e.g. file) and the ordering is not
>> defined correctly on boundaries of these bulks (between files). That is
>> why I'd say, that ordering of sources is relevant only for
>> (partitioned!) streaming sources and generally always reduces to
>> sequence metadata (e.g. offsets).
>>
>> Jan
>>
>> On 5/28/19 11:43 AM, Robert Bradshaw wrote:
>> > Huge +1 to all Kenn said.
>> >
>> > Jan, batch sources can have orderings too, just like Kafka. I think
>> > it's reasonable (for both batch and streaming) that if a source has an
>> > ordering that is an important part of the data, it should preserve
>> > this ordering into the data itself (e.g. as sequence numbers, offsets,
>> > etc.)
>> >
>> > On Fri, May 24, 2019 at 10:35 PM Kenneth Knowles  wrote:
>> >> I strongly prefer explicit sequence metadata over FIFO requirements, 
>> >> because:
>> >>
>> >>   - FIFO is complex to specify: for example Dataflow has "per stage 
>> >> key-to-key" FIFO today, but it is not guaranteed to remain so (plus 
>> >> "stage" is not a portable concept, nor even guaranteed to remain a 
>> >> Dataflow concept)
>> >>   - complex specifications are by definition poor usability (if 
>> >> necessary, then it is what it is)
>> >>   - overly restricts the runner, reduces parallelism, for example any 
>> >> non-stateful ParDo has per-element parallelism, not per "key"
>> >>   - another perspective on that: FIFO makes everyone pay rather than just 
>> >> the transform that requires exactly sequencing
>> >>   - previous implementation details like reshuffles become part of the 
>> >> model
>> >>   - I'm not even convinced the use cases involved are addressed by some 
>> >> careful FIFO restrictions; many sinks re-key and they would all have to 
>> >> become aware of how keying of a sequence of "stages" affects the 
>> >> end-to-end FIFO
>> >>
>> >> A noop becoming a non-noop is essentially the mathematical definition of 
>> >> moving from higher-level to lower-level abstraction.
>> >>
>> >> So this strikes at the core question of what level of abstraction Beam 
>> >> aims to represent. Lower-level means there are fewer possible 
>> >> implementations and it is more tied to the underlying architecture, and 
>> >> anything not near-exact match pays a huge penalty. Higher-level means 
>> >> there are more implementations possible with different tradeoffs, though 
>> >> they may all pay a minor penalty.
>> >>
>> >> I could be convinced to change my mind, but it needs some extensive 
>> >> design, examples, etc. I think it is probably about the most 
>> >> consequential design decision in the whole Beam model, around the same 
>> >> level as the decision to use ParDo and GBK as the primitives IMO.
>> >>
>> >> Kenn
>> >>
>> >> On Thu, May 23, 2019 at 10:17 AM Reuven Lax  wrote:
>> >>> Not really. I'm suggesting that some variant of FIFO ordering is 
>> >>> 

Re: Definition of Unified model

2019-05-29 Thread Robert Bradshaw
On Tue, May 28, 2019 at 12:18 PM Jan Lukavský  wrote:
>
> As I understood it, Kenn was supporting the idea that sequence metadata
> is preferable over FIFO. I was trying to point out that it even should
> provide the same functionality as FIFO, plus one important thing more:
> reproducibility and the ability to be persisted and reused the same way
> in batch and streaming.
>
> There is no doubt, that sequence metadata can be stored in every
> storage. But, regarding some implicit ordering that sources might have -
> yes, of course, data written into HDFS or Cloud Storage has ordering,
> but only partial - inside some bulk (e.g. file) and the ordering is not
> defined correctly on boundaries of these bulks (between files). That is
> why I'd say, that ordering of sources is relevant only for
> (partitioned!) streaming sources and generally always reduces to
> sequence metadata (e.g. offsets).

Offsets within a file, unordered between files seems exactly analogous
with offsets within a partition, unordered between partitions, right?
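
(A small illustrative sketch in Python of the "reduces to sequence metadata"
point above; not a Beam API proposal, just the recovery step a user can
already do: group by partition, then sort each group's values by their
offset. The sample data and names are made up.)

```python
import apache_beam as beam

with beam.Pipeline() as p:
  ordered = (
      p
      # Assume elements already carry (partition, (offset, value)),
      # e.g. offsets attached by the source as discussed above.
      | beam.Create([('part-0', (2, 'c')), ('part-0', (0, 'a')),
                     ('part-0', (1, 'b'))])
      | beam.GroupByKey()
      # Within a partition, order is recoverable by sorting on the offset.
      | beam.Map(lambda kv: (kv[0], [v for _, v in sorted(kv[1])])))
```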

> On 5/28/19 11:43 AM, Robert Bradshaw wrote:
> > Huge +1 to all Kenn said.
> >
> > Jan, batch sources can have orderings too, just like Kafka. I think
> > it's reasonable (for both batch and streaming) that if a source has an
> > ordering that is an important part of the data, it should preserve
> > this ordering into the data itself (e.g. as sequence numbers, offsets,
> > etc.)
> >
> > On Fri, May 24, 2019 at 10:35 PM Kenneth Knowles  wrote:
> >> I strongly prefer explicit sequence metadata over FIFO requirements, 
> >> because:
> >>
> >>   - FIFO is complex to specify: for example Dataflow has "per stage 
> >> key-to-key" FIFO today, but it is not guaranteed to remain so (plus 
> >> "stage" is not a portable concept, nor even guaranteed to remain a 
> >> Dataflow concept)
> >>   - complex specifications are by definition poor usability (if necessary, 
> >> then it is what it is)
> >>   - overly restricts the runner, reduces parallelism, for example any 
> >> non-stateful ParDo has per-element parallelism, not per "key"
> >>   - another perspective on that: FIFO makes everyone pay rather than just 
> >> the transform that requires exactly sequencing
> >>   - previous implementation details like reshuffles become part of the 
> >> model
> >>   - I'm not even convinced the use cases involved are addressed by some 
> >> careful FIFO restrictions; many sinks re-key and they would all have to 
> >> become aware of how keying of a sequence of "stages" affects the 
> >> end-to-end FIFO
> >>
> >> A noop becoming a non-noop is essentially the mathematical definition of 
> >> moving from higher-level to lower-level abstraction.
> >>
> >> So this strikes at the core question of what level of abstraction Beam 
> >> aims to represent. Lower-level means there are fewer possible 
> >> implementations and it is more tied to the underlying architecture, and 
> >> anything not near-exact match pays a huge penalty. Higher-level means 
> >> there are more implementations possible with different tradeoffs, though 
> >> they may all pay a minor penalty.
> >>
> >> I could be convinced to change my mind, but it needs some extensive 
> >> design, examples, etc. I think it is probably about the most consequential 
> >> design decision in the whole Beam model, around the same level as the 
> >> decision to use ParDo and GBK as the primitives IMO.
> >>
> >> Kenn


Re: Shuffling on apache beam

2019-05-29 Thread pasquale . bonito
Hi Reza,
With GlobalWindow and triggering I was able to reduce hotspot issues, gaining
satisfying performance for the BigTable update. Unfortunately, latency when
getting messages from PubSub remains around 1.5s, which is too much considering
our NFR.

This is the code I use to get the messages:
PCollectionTuple rawTransactions = p //
    .apply("GetMessages",
        PubsubIO.readMessagesWithAttributes()
            .withIdAttribute(TRANSACTION_MESSAGE_ID_FIELD_NAME)
            .withTimestampAttribute(TRANSACTION_MESSAGE_TIMESTAMP_FIELD_NAME)
            .fromTopic(topic))
    .apply(Window.configure()
        .triggering(Repeatedly
            .forever(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(AfterProcessingTime
                    .pastFirstElementInPane()
                    .plusDelayOf(Duration.millis(1)))
                // Fire on any late data
                .withLateFirings(AfterPane.elementCountAtLeast(1))))
        .discardingFiredPanes())

Messages are produced with a different dataflow:
Pipeline p = Pipeline.create(options);
p.apply(
        "ReadFile",
        TextIO.read()
            .from(options.getInputLocation() + "/*.csv")
            .watchForNewFiles(
                // Check for new files every 600 ms
                Duration.millis(600),
                // Never stop checking for new files
                Watch.Growth.never()))
    .apply(
        "create message",
        ParDo.of(
            new DoFn<String, PubsubMessage>() {
              @ProcessElement
              public void processElement(ProcessContext context) {
                String line = context.element();
                String payload = convertRow(line);
                long now =
                    LocalDateTime.now().atZone(ZoneId.systemDefault()).toInstant().toEpochMilli();
                context.output(
                    new PubsubMessage(
                        payload.getBytes(),
                        ImmutableMap.of(
                            TRANSACTION_MESSAGE_ID_FIELD_NAME, payload.split(",")[6],
                            TRANSACTION_MESSAGE_TIMESTAMP_FIELD_NAME, Long.toString(now))));
              }
            }))
    .apply("publish message", PubsubIO.writeMessages().to(topic));

I'm uploading a file containing 100 rows every 600 ms.

I found different threads on Stack Overflow around this latency issue, but none
has a solution.
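
(For reference, the "GlobalWindow with a stateful DoFn" approach Reza suggests
below can be sketched with the Python SDK's userstate API. An illustrative
sketch only, not the code used in this pipeline: state is partitioned per key,
so all elements for a key are handled by the worker that owns that key's
state, without extra windowing delay.)

```python
import apache_beam as beam
from apache_beam.coders import StrUtf8Coder
from apache_beam.transforms.userstate import BagStateSpec

class PerKeySerialDoFn(beam.DoFn):
  # State cells are scoped to (key, window); in the GlobalWindow this
  # effectively pins each key's elements to one state-owning worker.
  SEEN = BagStateSpec('seen', StrUtf8Coder())

  def process(self, element, seen=beam.DoFn.StateParam(SEEN)):
    key, value = element  # expects a keyed PCollection of (key, str)
    seen.add(value)       # state access is serialized per key
    yield key, value
```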


 

On 2019/05/24 07:19:02, Reza Rokni  wrote: 
> PS You can also make use of the GlobalWindow with a stateful DoFn.
> 
> On Fri, 24 May 2019 at 15:13, Reza Rokni  wrote:
> 
> > Hi,
> >
> > Have you explored the use of triggers with your use case?
> >
> >
> > https://beam.apache.org/releases/javadoc/2.12.0/org/apache/beam/sdk/transforms/windowing/Trigger.html
> >
> > Cheers
> >
> > Reza
> >
> > On Fri, 24 May 2019 at 14:14, pasquale.bon...@gmail.com <
> > pasquale.bon...@gmail.com> wrote:
> >
> >> Hi Reuven,
> >> I would like to know if it is possible to guarantee that records are
> >> processed by the same thread/task based on a key, as probably happens in a
> >> combine/stateful operation, without adding the delay of a window.
> >> This could increase the efficiency of caching and reduce some racing
> >> conditions when writing data.
> >> I understand that workers are not part of the programming model, so I would
> >> like to know if it's possible to achieve this behaviour while reducing the
> >> delay of windowing to a minimum. We don't need any combine or state; we just
> >> want all records with a given key to be sent to the same thread.
> >>
> >> Thanks
> >>
> >>
> >> On 2019/05/24 03:20:13, Reuven Lax  wrote:
> >> > Can you explain what you mean by worker? While every runner has workers
> >> of
> >> > course, workers are not part of the programming model.
> >> >
> >> > On Thu, May 23, 2019 at 8:13 PM pasquale.bon...@gmail.com <
> >> > pasquale.bon...@gmail.com> wrote:
> >> >
> >> > > Hi all,
> >> > > I would like to know if Apache Beam has functionality similar to
> >> > > fieldsGrouping in Storm that allows sending records to a specific
> >> > > task/worker based on a key.
> >> > > I know that we can achieve that with a combine/groupByKey operation, but
> >> > > that implies adding windowing to our pipeline, which we don't want.
> >> > > I have also tried using a stateful transformation.
> >> > > I think that also in that case we should use windowing, but I see that a
> >> > > job with a stateful ParDo operation can be submitted on Google
> >> Dataflow 

Re: Timer support in Flink

2019-05-29 Thread Reza Rokni
Thanx Max!

Reza

On Wed, 29 May 2019, 16:38 Maximilian Michels,  wrote:

> Hi Reza,
>
> The detailed view of the capability matrix states: "The Flink Runner
> supports timers in non-merging windows."
>
> That is still the case. Other than that, timers should be working fine.
>
> > It makes very heavy use of Event.Time timers and has to do some manual
> DoFn cache work to get around some O(heavy) issues.
>
> If you are running on Flink 1.5, timer deletion suffers from O(n)
> complexity which has been fixed in newer versions.
>
> Cheers,
> Max
>
> On 29.05.19 03:27, Reza Rokni wrote:
> > Hi Flink experts,
> >
> > I am getting ready to push a PR around a utility class for
> > timeseries joins:
> >
> > left.timestamp matched to the closest right.timestamp where right.timestamp <=
> > left.timestamp.
> >
> > It makes very heavy use of Event.Time timers and has to do some manual
> > DoFn cache work to get around some O(heavy) issues. Wanted to test
> > things against Flink: In the capability matrix we have "~" for Timer
> > support in Flink:
> >
> > https://beam.apache.org/documentation/runners/capability-matrix/
> >
> > Is that page outdated, if not what are the areas that still need to be
> > addressed please?
> >
> > Cheers
> >
> > Reza
> >
> >
>
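
(A rough Python-SDK sketch of the kind of event-time timer DoFn under
discussion; illustrative only, since the utility class in the upcoming PR is
Java. The timer fires once the watermark passes the element's timestamp.)

```python
import apache_beam as beam
from apache_beam.coders import StrUtf8Coder
from apache_beam.transforms.timeutil import TimeDomain
from apache_beam.transforms.userstate import BagStateSpec, TimerSpec, on_timer

class BufferUntilWatermark(beam.DoFn):
  BUFFER = BagStateSpec('buffer', StrUtf8Coder())
  FLUSH = TimerSpec('flush', TimeDomain.WATERMARK)

  def process(self, element,
              ts=beam.DoFn.TimestampParam,
              buffer=beam.DoFn.StateParam(BUFFER),
              flush=beam.DoFn.TimerParam(FLUSH)):
    _, value = element  # expects a keyed PCollection
    buffer.add(value)
    flush.set(ts)       # event-time timer at this element's timestamp

  @on_timer(FLUSH)
  def flush_buffer(self, buffer=beam.DoFn.StateParam(BUFFER)):
    # Emit everything buffered so far, then reset the state cell.
    for value in buffer.read():
      yield value
    buffer.clear()
```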


Re: [DISCUSS] Autoformat python code with Black

2019-05-29 Thread Robert Bradshaw
Reformatting to 4 spaces seems a non-starter to me, as it would change
nearly every single line in the codebase (and the loss of all context as
well as that particular line).

This is probably why the 2-space fork exists. However, we don't conform to
that either--we use 2 spaces for indentation, but 4 for continuation
indentation. (As for the history of this, this goes back to Google's
internal style guide, probably motivated by consistency with C++, Java, ...
and the fact that with an indent level of 4 one ends up wrapping lines
quite frequently (it's telling that black's default line length is 88)).
This turns out to be an easy change to the codebase.

Once we move beyond the 2 vs. 4 whitespace thing, I found that this tool
introduces a huge amount of vertical whitespace (e.g. closing parentheses
on their own line), e.g.

def foo(
    args
):
  if (
      (long expression)
  ):
    func(
        args
    )

I wrote a simple post-processor to put closing parentheses on the same
lines, as well as omit the newline after "if (" and disable formatting
of strings, which reduces the churn in our codebase to 15k lines (adding
about 4k) out of 200k total.

https://github.com/apache/beam/pull/8712/files

It's still very opinionated, often in different ways than me, and doesn't
understand the semantics of the code, but possibly something we could live
with given the huge advantages of an autoformatter.

An intermediate point would be to allow, but not require, autoformatting of
changed lines.

As for being beta quality, it looks like it's got a decent number of
contributors and in my book being in the python github project is a strong
positive signal. But, due to the above issues, I think we'd have to
maintain a fork. (The code is pretty lightweight, the 2 vs. 4 space issue
is a 2-line change, and the rest implemented as a post-processing step (for
now, incomplete), so it'd be easy to stay in sync with upstream.)
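
(To make the "post-processing step" concrete: a toy sketch, not the actual
script in the PR above, that pulls a closing parenthesis left on its own line
back onto the previous line.)

```python
import re
import sys

def join_closing_parens(src):
  out = []
  for line in src.split('\n'):
    # Merge a line holding only a closing paren (optionally followed by
    # ':' or ',') into the previous line.
    if out and re.fullmatch(r'\s*\)[:,]?', line):
      out[-1] += line.strip()
    else:
      out.append(line)
  return '\n'.join(out)

if __name__ == '__main__':
  sys.stdout.write(join_closing_parens(sys.stdin.read()))
```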

On Wed, May 29, 2019 at 11:03 AM Ismaël Mejía  wrote:
>
> > I think the question is if it can be configured in a way to fit our
> > current linter's style. I don't think it is feasible to reformat the
> > entire Python SDK.
>
> It cannot be configured to do what we actually do because Black is
> configurable only to support the standard python codestyle guidelines
> (PEP-8) which recommends 4 spaces and is what most projects in the
> python world use.
>
> > Reformatted lines don't allow quick access to the Git history. This
> > effect is still visible in the Java SDK. However, I have the feeling
> > that this might be less of a problem with Python because the linter has
> > more rules than Checkstyle had.
>
> Yes, that’s the bad side effect, but there are always tradeoffs we have
> to deal with.
>
>
>
>
> On Wed, May 29, 2019 at 10:52 AM Maximilian Michels 
wrote:
> >
> > I think the question is if it can be configured in a way to fit our
> > current linter's style. I don't think it is feasible to reformat the
> > entire Python SDK.
> >
> > Reformatted lines don't allow quick access to the Git history. This
> > effect is still visible in the Java SDK. However, I have the feeling
> > that this might be less of a problem with Python because the linter has
> > more rules than Checkstyle had.
> >
> > -Max
> >
> > On 29.05.19 10:16, Ismaël Mejía wrote:
> > >> My concerns are:
> > >> - The product is clearly marked as beta with a big warning.
> > >> - It looks like mostly a single person project. For the same reason
I also strongly prefer not using a fork for a specific setting. Fork will
only have less people looking at it.
> > >
> > > I suppose the project is marked as beta because it is recent; it was
> > > presented at 2018’s PyCon, and because some things can change, since
> > > auto-formatters are pretty tricky beasts. I think beta in that case is
> > > like our own ‘@Experimental’. If you look at the contribution page [1]
> > > you can notice that it is less and less a single-person project: there
> > > have been 93 independent contributions since the project became
> > > public, and the fact that it is hosted in the python organization
> > > github [2] gives some confidence in the project’s continuity.
> > >
> > > You are right however about the fact that the main author seems to be
> > > the ‘benevolent’ dictator, and in the 2-spaces issue he can seem
> > > arbitrary, but he is just following pep8 style guide recommendations
> > > [3]. I am curious why we (Beam) do not follow the 4-space
> > > recommendation of PEP-8, or even Google's own Python style guide [4].
> > > So probably it should be up to us to reconsider the current policy to
> > > adapt to the standards (and the tool).
> > >
> > > I did a quick run of black with python 2.7 compatibility on
> > > sdks/python and got only 4 parsing errors which is positive given the
> > > size of our code base.
> > >
> > > 415 files reformatted, 45 files left unchanged, 4 files failed to
reformat.
> > >
> > > error: cannot format
> > >

Re: Question about building Go SDK

2019-05-29 Thread Kengo Seki
It was my fault; probably my Gradle cache was broken.
Running gradlew succeeded after removing ~/.gradle.
I'm sorry for bothering you.

Kengo Seki 

On Wed, May 29, 2019 at 6:45 PM Kengo Seki  wrote:
>
> Hi Beam developers,
>
> I tried to build Go SDK on the master branch, but encountered the
> following error.
>
> ```
> $ ./gradlew :sdks:go:resolveBuildDependencies
>
> (snip)
>
> FAILURE: Build failed with an exception.
>
> * What went wrong:
> Execution failed for task ':sdks:go:resolveBuildDependencies'.
> > Exception in resolution, message is:
>   Cannot resolve dependency:github.com/coreos/etcd:
> commit='11214aa33bf5a47d3d9d8dafe0f6b97237dfe921',
> urls=[https://github.com/coreos/etcd.git, g...@github.com:coreos/etcd.git]
>   Resolution stack is:
>   +- github.com/apache/beam/sdks/go
> ```
>
> Replacing the "coreos" organization with "etcd-io" in GitHub URLs in
> gogradle.lock seems to work.
>
> ```
> $ sed -i s,coreos/etcd,etcd-io/etcd,g sdks/go/gogradle.lock
> $ ./gradlew :sdks:go:resolveBuildDependencies
>
> (snip)
>
> BUILD SUCCESSFUL in 4s
> 2 actionable tasks: 2 executed
> ```
>
> But I'm not sure whether it's a valid solution, since I'm not familiar
> with Gogradle and it seems that gogradle.lock shouldn't be edited
> directly according to its header comment.
> Is the above approach valid, or is there a more proper way to solve
> this problem...?
>
> Kengo Seki 


Question about building Go SDK

2019-05-29 Thread Kengo Seki
Hi Beam developers,

I tried to build Go SDK on the master branch, but encountered the
following error.

```
$ ./gradlew :sdks:go:resolveBuildDependencies

(snip)

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':sdks:go:resolveBuildDependencies'.
> Exception in resolution, message is:
  Cannot resolve dependency:github.com/coreos/etcd:
commit='11214aa33bf5a47d3d9d8dafe0f6b97237dfe921',
urls=[https://github.com/coreos/etcd.git, g...@github.com:coreos/etcd.git]
  Resolution stack is:
  +- github.com/apache/beam/sdks/go
```

Replacing the "coreos" organization with "etcd-io" in GitHub URLs in
gogradle.lock seems to work.

```
$ sed -i s,coreos/etcd,etcd-io/etcd,g sdks/go/gogradle.lock
$ ./gradlew :sdks:go:resolveBuildDependencies

(snip)

BUILD SUCCESSFUL in 4s
2 actionable tasks: 2 executed
```

But I'm not sure whether it's a valid solution, since I'm not familiar
with Gogradle and it seems that gogradle.lock shouldn't be edited
directly according to its header comment.
Is the above approach valid, or is there a more proper way to solve
this problem...?

Kengo Seki 


Re: [DISCUSS] Autoformat python code with Black

2019-05-29 Thread Ismaël Mejía
> I think the question is if it can be configured in a way to fit our
> current linter's style. I don't think it is feasible to reformat the
> entire Python SDK.

It cannot be configured to do what we actually do because Black is
configurable only to support the standard python codestyle guidelines
(PEP-8) which recommends 4 spaces and is what most projects in the
python world use.

> Reformatted lines don't allow quick access to the Git history. This
> effect is still visible in the Java SDK. However, I have the feeling
> that this might be less of a problem with Python because the linter has
> more rules than Checkstyle had.

Yes, that’s the bad side effect, but there are always tradeoffs we have
to deal with.




On Wed, May 29, 2019 at 10:52 AM Maximilian Michels  wrote:
>
> I think the question is if it can be configured in a way to fit our
> current linter's style. I don't think it is feasible to reformat the
> entire Python SDK.
>
> Reformatted lines don't allow quick access to the Git history. This
> effect is still visible in the Java SDK. However, I have the feeling
> that this might be less of a problem with Python because the linter has
> more rules than Checkstyle had.
>
> -Max
>
> On 29.05.19 10:16, Ismaël Mejía wrote:
> >> My concerns are:
> >> - The product is clearly marked as beta with a big warning.
> >> - It looks like mostly a single person project. For the same reason I also 
> >> strongly prefer not using a fork for a specific setting. Fork will only 
> >> have less people looking at it.
> >
> > I suppose the project is marked as beta because it is recent; it was
> > presented at 2018’s PyCon, and because some things can change, since
> > auto-formatters are pretty tricky beasts. I think beta in that case is
> > like our own ‘@Experimental’. If you look at the contribution page [1]
> > you can notice that it is less and less a single-person project: there
> > have been 93 independent contributions since the project became
> > public, and the fact that it is hosted in the python organization
> > github [2] gives some confidence in the project’s continuity.
> >
> > You are right however about the fact that the main author seems to be
> > the ‘benevolent’ dictator, and in the 2-spaces issue he can seem
> > arbitrary, but he is just following pep8 style guide recommendations
> > [3]. I am curious why we (Beam) do not follow the 4-space
> > recommendation of PEP-8, or even Google's own Python style guide [4].
> > So probably it should be up to us to reconsider the current policy to
> > adapt to the standards (and the tool).
> >
> > I did a quick run of black with python 2.7 compatibility on
> > sdks/python and got only 4 parsing errors which is positive given the
> > size of our code base.
> >
> > 415 files reformatted, 45 files left unchanged, 4 files failed to reformat.
> >
> > error: cannot format
> > /home/ismael/upstream/beam/sdks/python/apache_beam/runners/interactive/display/display_manager.py:
> > Cannot parse: 47:22:   _display_progress = print
> > error: cannot format
> > /home/ismael/upstream/beam/sdks/python/apache_beam/runners/worker/log_handler.py:
> > Cannot parse: 151:18:   file=sys.stderr)
> > error: cannot format
> > /home/ismael/upstream/beam/sdks/python/apache_beam/runners/worker/sdk_worker.py:
> > Cannot parse: 160:34:   print(traceback_string, file=sys.stderr)
> > error: cannot format
> > /home/ismael/upstream/beam/sdks/python/apache_beam/typehints/trivial_inference.py:
> > Cannot parse: 335:51:   print('-->' if pc == last_pc else '',
> > end=' ')
> >
> > > I still think this can be positive for the project, but I am
> > > barely a contributor to the python code base, so I leave it to the python
> > > maintainers to reconsider this; in any case it seems like a good
> > > improvement for the project.
> >
> > [1] https://github.com/python/black/graphs/contributors
> > [2] https://github.com/python
> > [3] https://www.python.org/dev/peps/pep-0008/#indentation
> > [4] 
> > https://github.com/google/styleguide/blob/gh-pages/pyguide.md#34-indentation
> >
> > On Tue, May 28, 2019 at 11:15 PM Ahmet Altay  wrote:
> >>
> >> I am in the same boat with Robert, I am in favor of autoformatters but I 
> >> am not familiar with this one. My concerns are:
> >> - The product is clearly marked as beta with a big warning.
> >> - It looks like mostly a single person project. For the same reason I also 
> >> strongly prefer not using a fork for a specific setting. Fork will only 
> >> have less people looking at it.
> >>
> >> IMO, this is in an early stage for us. That said lint issues are real as 
> >> pointed in the thread. If someone would like to give it a try and see how 
> >> it would look like for us that would be interesting.
> >>
> >> On Tue, May 28, 2019 at 4:44 AM Katarzyna Kucharczyk 
> >>  wrote:
> >>>
> >>> This sounds really good. A lot of Jenkins job failures are caused by
> >>> lint problems.
> >>> I think it would be great to have something similar to Spotless in Java SDK
> >>> (I heard there is a problem with configuring Black with IntelliJ).

Re: [DISCUSS] Autoformat python code with Black

2019-05-29 Thread Maximilian Michels
I think the question is if it can be configured in a way to fit our 
current linter's style. I don't think it is feasible to reformat the 
entire Python SDK.


Reformatted lines don't allow quick access to the Git history. This 
effect is still visible in the Java SDK. However, I have the feeling 
that this might be less of a problem with Python because the linter has 
more rules than Checkstyle had.


-Max

On 29.05.19 10:16, Ismaël Mejía wrote:

My concerns are:
- The product is clearly marked as beta with a big warning.
- It looks like mostly a single person project. For the same reason I also 
strongly prefer not using a fork for a specific setting. Fork will only have 
less people looking at it.


I suppose the project is marked as beta because it is recent; it was
presented at 2018’s PyCon, and because some things can change, since
auto-formatters are pretty tricky beasts. I think beta in that case is
like our own ‘@Experimental’. If you look at the contribution page [1]
you can notice that it is less and less a single-person project: there
have been 93 independent contributions since the project became
public, and the fact that it is hosted in the python organization
github [2] gives some confidence in the project’s continuity.

You are right however about the fact that the main author seems to be
the ‘benevolent’ dictator, and in the 2-spaces issue he can seem
arbitrary, but he is just following pep8 style guide recommendations
[3]. I am curious why we (Beam) do not follow the 4-space
recommendation of PEP-8, or even Google's own Python style guide [4].
So probably it should be up to us to reconsider the current policy to
adapt to the standards (and the tool).

I did a quick run of black with python 2.7 compatibility on
sdks/python and got only 4 parsing errors which is positive given the
size of our code base.

415 files reformatted, 45 files left unchanged, 4 files failed to reformat.

error: cannot format
/home/ismael/upstream/beam/sdks/python/apache_beam/runners/interactive/display/display_manager.py:
Cannot parse: 47:22:   _display_progress = print
error: cannot format
/home/ismael/upstream/beam/sdks/python/apache_beam/runners/worker/log_handler.py:
Cannot parse: 151:18:   file=sys.stderr)
error: cannot format
/home/ismael/upstream/beam/sdks/python/apache_beam/runners/worker/sdk_worker.py:
Cannot parse: 160:34:   print(traceback_string, file=sys.stderr)
error: cannot format
/home/ismael/upstream/beam/sdks/python/apache_beam/typehints/trivial_inference.py:
Cannot parse: 335:51:   print('-->' if pc == last_pc else '',
end=' ')

I still think this can be positive for the project, but I am
barely a contributor to the python code base, so I leave it to the python
maintainers to reconsider this; in any case it seems like a good
improvement for the project.

[1] https://github.com/python/black/graphs/contributors
[2] https://github.com/python
[3] https://www.python.org/dev/peps/pep-0008/#indentation
[4] https://github.com/google/styleguide/blob/gh-pages/pyguide.md#34-indentation

On Tue, May 28, 2019 at 11:15 PM Ahmet Altay  wrote:


I am in the same boat with Robert, I am in favor of autoformatters but I am not 
familiar with this one. My concerns are:
- The product is clearly marked as beta with a big warning.
- It looks like mostly a single person project. For the same reason I also 
strongly prefer not using a fork for a specific setting. Fork will only have 
less people looking at it.

IMO, this is in an early stage for us. That said lint issues are real as 
pointed in the thread. If someone would like to give it a try and see how it 
would look like for us that would be interesting.

On Tue, May 28, 2019 at 4:44 AM Katarzyna Kucharczyk  
wrote:


This sounds really good. A lot of Jenkins job failures are caused by lint
problems.
I think it would be great to have something similar to Spotless in Java SDK (I
heard there is a problem with configuring Black with IntelliJ).

On Mon, May 27, 2019 at 10:52 PM Robert Bradshaw  wrote:


I'm generally in favor of autoformatters, though I haven't looked at
how well this particular one works. We might have to go with
https://github.com/desbma/black-2spaces given
https://github.com/python/black/issues/378 .

On Mon, May 27, 2019 at 10:43 PM Pablo Estrada  wrote:


This looks pretty good:) I know at least a couple people (myself included) 
who've been annoyed by having to take care of lint issues that maybe a code 
formatter could save us.
Thanks for sharing Ismael.
-P.


On Mon, May 27, 2019, 12:24 PM Ismaël Mejía  wrote:


I stumbled by chance into Black [1], a python code auto-formatter that
is becoming the 'de-facto' auto-formatter for python, and wanted to
bring it to the ML. Is there interest from the python people to get this
into the build?

The introduction of spotless for Java has been a good improvement and
maybe the python code base may benefit of this too.

WDYT?

[1] https://github.com/python/black


Re: Timer support in Flink

2019-05-29 Thread Maximilian Michels

Hi Reza,

The detailed view of the capability matrix states: "The Flink Runner 
supports timers in non-merging windows."


That is still the case. Other than that, timers should be working fine.


It makes very heavy use of Event.Time timers and has to do some manual DoFn 
cache work to get around some O(heavy) issues.


If you are running on Flink 1.5, timer deletion suffers from O(n) 
complexity which has been fixed in newer versions.


Cheers,
Max

On 29.05.19 03:27, Reza Rokni wrote:

Hi Flink experts,

I am getting ready to push a PR around a utility class for timeseries joins:

left.timestamp matched to the closest right.timestamp where right.timestamp <=
left.timestamp.


It makes very heavy use of Event.Time timers and has to do some manual 
DoFn cache work to get around some O(heavy) issues. Wanted to test 
things against Flink: In the capability matrix we have "~" for Timer 
support in Flink:


https://beam.apache.org/documentation/runners/capability-matrix/

Is that page outdated, if not what are the areas that still need to be 
addressed please?


Cheers

Reza






Re: [DISCUSS] Autoformat python code with Black

2019-05-29 Thread Ismaël Mejía
> My concerns are:
> - The product is clearly marked as beta with a big warning.
> - It looks like mostly a single person project. For the same reason I also 
> strongly prefer not using a fork for a specific setting. Fork will only have 
> less people looking at it.

I suppose the project is marked as beta because it is recent; it was
presented at 2018’s PyCon, and because some things can change, since
auto-formatters are pretty tricky beasts. I think beta in that case is
like our own ‘@Experimental’. If you look at the contribution page [1]
you can notice that it is less and less a single-person project: there
have been 93 independent contributions since the project became
public, and the fact that it is hosted in the python organization
github [2] gives some confidence in the project’s continuity.

You are right however about the fact that the main author seems to be
the ‘benevolent’ dictator, and in the 2-spaces issue he can seem
arbitrary, but he is just following pep8 style guide recommendations
[3]. I am curious why we (Beam) do not follow the 4-space
recommendation of PEP-8, or even Google's own Python style guide [4].
So probably it should be up to us to reconsider the current policy to
adapt to the standards (and the tool).

I did a quick run of black with python 2.7 compatibility on
sdks/python and got only 4 parsing errors which is positive given the
size of our code base.

415 files reformatted, 45 files left unchanged, 4 files failed to reformat.

error: cannot format
/home/ismael/upstream/beam/sdks/python/apache_beam/runners/interactive/display/display_manager.py:
Cannot parse: 47:22:   _display_progress = print
error: cannot format
/home/ismael/upstream/beam/sdks/python/apache_beam/runners/worker/log_handler.py:
Cannot parse: 151:18:   file=sys.stderr)
error: cannot format
/home/ismael/upstream/beam/sdks/python/apache_beam/runners/worker/sdk_worker.py:
Cannot parse: 160:34:   print(traceback_string, file=sys.stderr)
error: cannot format
/home/ismael/upstream/beam/sdks/python/apache_beam/typehints/trivial_inference.py:
Cannot parse: 335:51:   print('-->' if pc == last_pc else '',
end=' ')

I still think this can be positive for the project, but I am
barely a contributor to the python code base, so I leave it to the python
maintainers to reconsider this; in any case it seems like a good
improvement for the project.

[1] https://github.com/python/black/graphs/contributors
[2] https://github.com/python
[3] https://www.python.org/dev/peps/pep-0008/#indentation
[4] https://github.com/google/styleguide/blob/gh-pages/pyguide.md#34-indentation

On Tue, May 28, 2019 at 11:15 PM Ahmet Altay  wrote:
>
> I am in the same boat with Robert, I am in favor of autoformatters but I am 
> not familiar with this one. My concerns are:
> - The product is clearly marked as beta with a big warning.
> - It looks like mostly a single person project. For the same reason I also 
> strongly prefer not using a fork for a specific setting. Fork will only have 
> less people looking at it.
>
> IMO, this is in an early stage for us. That said lint issues are real as 
> pointed in the thread. If someone would like to give it a try and see how it 
> would look like for us that would be interesting.
>
> On Tue, May 28, 2019 at 4:44 AM Katarzyna Kucharczyk 
>  wrote:
>>
>> This sounds really good. A lot of Jenkins job failures are caused by lint
>> problems.
>> I think it would be great to have something similar to Spotless in Java SDK
>> (I heard there is a problem with configuring Black with IntelliJ).
>>
>> On Mon, May 27, 2019 at 10:52 PM Robert Bradshaw  wrote:
>>>
>>> I'm generally in favor of autoformatters, though I haven't looked at
>>> how well this particular one works. We might have to go with
>>> https://github.com/desbma/black-2spaces given
>>> https://github.com/python/black/issues/378 .
>>>
>>> On Mon, May 27, 2019 at 10:43 PM Pablo Estrada  wrote:
>>> >
>>> > This looks pretty good:) I know at least a couple people (myself 
>>> > included) who've been annoyed by having to take care of lint issues that 
>>> > maybe a code formatter could save us.
>>> > Thanks for sharing Ismael.
>>> > -P.
>>> >
>>> >
>>> > On Mon, May 27, 2019, 12:24 PM Ismaël Mejía  wrote:
>>> >>
>>> >> I stumbled by chance into Black [1], a python code auto-formatter that
>>> >> is becoming the 'de-facto' auto-formatter for python, and wanted to
>>> >> bring it to the ML. Is there interest from the python people to get this
>>> >> into the build?
>>> >>
>>> >> The introduction of spotless for Java has been a good improvement and
>>> >> maybe the python code base may benefit of this too.
>>> >>
>>> >> WDYT?
>>> >>
>>> >> [1] https://github.com/python/black
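
(For readers landing at the end of this thread: a small made-up before/after
illustration of Black's defaults, i.e. 4-space indents, 88-column lines, and
spacing normalized around keyword defaults.)

```python
# Before (hand-formatted):
def create(name,  timeout = 30,
           *, retries=3 ):
    return Client(name,
        timeout, retries)

# After running `black` with defaults:
def create(name, timeout=30, *, retries=3):
    return Client(name, timeout, retries)
```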