Re: [DISCUSS] Autoformat python code with Black
+1 pending good enough tooling (I can't quite tell - seems there are some issues?) On Wed, May 29, 2019 at 2:40 PM Katarzyna Kucharczyk < ka.kucharc...@gmail.com> wrote: > What else do we actually gain? My guess is faster PR review iteration. We > will skip some of the conversations about code style. > ... > Last but not least, new contributors may be less discouraged. When I > started contributing I didn’t know how to format my code and I lost a lot of > time to add pylint and adjust IntelliJ. I eventually failed. Currently I > write code intuitively and when I don’t forget I rerun tox. > This is a huge benefit. This is why I supported it so much for Java. It is a community benefit. You do not have to be a contributor to the Python SDK to support this. That is why I am writing here. Just eliminate all discussion of formatting. It doesn't really matter what the resulting format is, as long as it is not crazy to read. I strongly oppose maintaining a non-default format. Reformatting 20k lines or 200k is not hard. The Java global reformat touched 50k lines. It does not really matter how big it is. Definitely do it all at once if you think the tool is good enough. And you should pin a version, so churn is not a problem. You can upgrade the version and reformat in a PR later, and that is also easy. It is a GitHub UI issue that you cannot easily browse past the reformat. It is not actually that hard, but it does take a couple of extra clicks to get GitHub to display blame before a reformat. It is easier with the command line. I do a lot of code history digging and the global Java reformat is not really a problem. Kenn > Also everything will be formatted in the same way, so eventually it would be > easier to read. > > Moreover, as was mentioned in previous emails, a lot of Jenkins > failures won’t take place, so we save time and resources.
> > > One of the disadvantages is that our pipelines have custom syntax and after > formatting they look a little bit weird, but maybe extending the only > configurable option in Black - line length, from 88 to 110 - would be a solution. > > The second one is that Black requires Python 3 to run. I don’t know how big > an obstacle that would be. > > I believe there are two options for how Black could be introduced. > First: just do it, it will hurt but then it would be ok (same as a > dentist appointment). Of course it may require some work to adjust linters. > On the other hand we can do it gradually and start including SDK parts one > by one - maybe it will be less painful? > > As an example I can share one of the projects [2] I know that uses Black (they > also use other cool checkers and pre-commit [3]). This is how their > build with all checks looks [4]. > > To sum up, I believe that if we want to improve our coding experience, we > should improve our toolset. Black seems to be a recent and quite popular tool, > which makes me think they won’t stop developing it. > > [1] > https://stackoverflow.com/questions/4112410/git-change-styling-whitespace-without-changing-ownership-blame > > [2] https://github.com/GoogleCloudPlatform/oozie-to-airflow > > [3] https://pre-commit.com > > [4] > https://travis-ci.org/GoogleCloudPlatform/oozie-to-airflow/builds/538725689 > > > On Wed, May 29, 2019 at 2:01 PM Robert Bradshaw > wrote: > >> Reformatting to 4 spaces seems a non-starter to me, as it would change >> nearly every single line in the codebase (and cause the loss of all context as >> well as of that particular line). >> >> This is probably why the 2-space fork exists. However, we don't conform >> to that either--we use 2 spaces for indentation, but 4 for continuation >> indentation. (As for the history of this, it goes back to Google's >> internal style guide, probably motivated by consistency with C++, Java, ...
>> and the fact that with an indent level of 4 one ends up wrapping lines >> quite frequently (it's telling that black's default line length is 88)). >> This turns out to be an easy change to the codebase. >> >> Once we move beyond the 2 vs. 4 whitespace thing, I found that this tool >> introduces a huge amount of vertical whitespace (e.g. closing parentheses >> on their own line), e.g.
>>
>> def foo(
>>     args
>> ):
>>     if (
>>         long expression
>>     ):
>>         func(
>>             args
>>         )
>>
>> I wrote a simple post-processor to put closing parentheses on the same >> lines, as well as omit the newline after "if (", and disable formatting >> of strings, which reduced the churn in our codebase to 15k lines (adding >> about 4k) out of 200k total. >> >> https://github.com/apache/beam/pull/8712/files >> >> It's still very opinionated, often in different ways than me, and doesn't >> understand the semantics of the code, but possibly something we could live >> with given the huge advantages of an autoformatter. >> >> An intermediate point would be to allow, but not require, autoformatting >> of changed lines. >> >> As for being beta quality, it looks like it's got a decent number of >> contributors and in my book
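Kenn's point about pinning a version and Katarzyna's suggestion of raising the line length could both be captured in configuration. A hypothetical sketch (the key names follow Black's documented `pyproject.toml` support; the exact values would need agreement on the list):

```toml
# pyproject.toml -- hypothetical Black configuration for the repo
[tool.black]
# The single "lines" knob mentioned in the thread; Black's default is 88.
line-length = 110
```

The version pin itself would live in a dev requirements file (e.g. `black==19.3b0`, a release current around the time of this thread), so that a later upgrade-and-reformat is a single deliberate PR rather than ongoing churn.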
Request to add me as a contributor
Hello everyone, My name is Akshay Iyangar, and I use the Beam repo extensively. There is a small patch that I would like to push upstream: https://issues.apache.org/jira/browse/BEAM-7442 . I’m working on this issue and would like to be added as a contributor in Beam's JIRA issue tracker so that I can assign it to myself. My JIRA id is aiyangar. Thanks, Akshay Iyangar
Re: [DISCUSS] Cookbooks for users with knowledge in other frameworks
+1 I think there will be at least two layers of this. Layer 1 - Using primitives: I do join, GBK, aggregation... with system X this way; what is the canonical equivalent in Beam? Layer 2 - Patterns: I read and join unbounded and bounded data in system X this way; what is the canonical equivalent in Beam? I suspect as a first pass Layer 1 is reasonably well-bounded work, though there would need to be agreement on the "canonical" version of how to do something in Beam, as this could be seen to be opinionated and there are often a multitude of ways of doing X. On Thu, 30 May 2019 at 08:56, Ahmet Altay wrote: > Hi all, > > Inspired by the user asking about a Spark feature in Beam [1] in the > release thread, I searched the user@ list and noticed a few instances of > people asking questions like "I can do X in Spark, how can I do that in > Beam?" Would it make sense to add documentation explaining how certain > tasks can be accomplished in Beam, with side-by-side examples of doing > the same task in Beam/Spark etc.? It could help with on-boarding because it > will be easier for people to leverage their existing knowledge. It could > help users of other frameworks as well, because it will serve as a Rosetta > stone with two translations. > > Questions I have are: > - Would such a thing be helpful? > - Is it feasible? Could a few pages' worth of examples cover enough use > cases? > > Thank you! > Ahmet > > [1] > https://lists.apache.org/thread.html/b73a54aa1e6e9933628f177b04a8f907c26cac854745fa081c478eff@%3Cdev.beam.apache.org%3E
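As a sketch of what a "Layer 1" cookbook entry could look like, a group-by-key entry might show the Spark and Beam one-liners side by side over a plain-Python model of the shared semantics (the API forms in the comments are from memory and would need checking against current docs):

```python
from collections import defaultdict

# Spark:  grouped = pairs_rdd.groupByKey()
# Beam:   grouped = pairs | beam.GroupByKey()
# Both turn a collection of (key, value) pairs into (key, values) groups.
def group_by_key(pairs):
    """Plain-Python model of the GBK primitive shared by both APIs."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    # Sort only so the illustration has deterministic output; neither
    # Spark nor Beam guarantees any ordering of the resulting groups.
    return sorted(grouped.items())

print(group_by_key([("a", 1), ("b", 2), ("a", 3)]))
# [('a', [1, 3]), ('b', [2])]
```

Each cookbook page could follow this shape: the task in the "home" framework, the canonical Beam equivalent, and a language-neutral statement of the semantics.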
[DISCUSS] Cookbooks for users with knowledge in other frameworks
Hi all, Inspired by the user asking about a Spark feature in Beam [1] in the release thread, I searched the user@ list and noticed a few instances of people asking questions like "I can do X in Spark, how can I do that in Beam?" Would it make sense to add documentation explaining how certain tasks can be accomplished in Beam, with side-by-side examples of doing the same task in Beam/Spark etc.? It could help with on-boarding because it will be easier for people to leverage their existing knowledge. It could help users of other frameworks as well, because it will serve as a Rosetta stone with two translations. Questions I have are: - Would such a thing be helpful? - Is it feasible? Could a few pages' worth of examples cover enough use cases? Thank you! Ahmet [1] https://lists.apache.org/thread.html/b73a54aa1e6e9933628f177b04a8f907c26cac854745fa081c478eff@%3Cdev.beam.apache.org%3E
Re: Proposal: Portability SDKHarness Docker Image Release with Beam Version Release.
I agree, I think there are a few things that have to be thought through as part of the portable image release.
* Where to host the images. We can of course have an alias for the image which can point to a different location, but the hosting location has to be sorted out.
* Validation process for the images.
* Backward compatibility for the images. Though we can just tag them with the release name.
I might not have immediate bandwidth to do this so we need to prioritize based on other items we have. On Tue, May 28, 2019 at 5:24 PM Ahmet Altay wrote: > Could we first figure out the process (where to push, how to push, > permissions needed, how to validate etc.) as part of the snapshots and > update the release guide based on that? > > On Tue, May 28, 2019 at 2:43 AM Robert Bradshaw > wrote: > >> In the future (read, next release) the SDK will likely have a reference >> to the containers, so this will have to be part of the release. > > > Who is working on this change? Could they help with figuring out the > container publishing part? > > >> But I >> agree for 2.13 it should be more about figuring out the process and >> not necessarily holding back. >> >> On Mon, May 27, 2019 at 7:42 PM Ankur Goenka wrote: >> > >> > +1 >> > We can release the images with 2.13 but we should not block the 2.13 >> release for this. >> > >> > On Mon, May 27, 2019, 8:39 AM Thomas Weise wrote: >> >> >> >> +1 >> >> >> >> >> >> On Mon, May 27, 2019 at 6:56 AM Ismaël Mejía >> wrote: >> >>> >> >>> +1 >> >>> >> >>> On Mon, May 27, 2019 at 3:35 PM Maximilian Michels >> wrote: >> >>> > >> >>> > +1 >> >>> > >> >>> > On 27.05.19 14:04, Robert Bradshaw wrote: >> >>> > > Sounds like everyone's onboard with the plan. Any chance we could >> >>> > > publish these for the upcoming 2.13 release? >> >>> > > >> >>> > > On Wed, Feb 6, 2019 at 6:29 PM Łukasz Gajowy >> wrote: >> >>> > >> >> >>> > >> +1 to have a registry for images accessible to anyone.
For >> snapshot images, I agree that gcr + the apache-beam-testing project seems a >> good and easy way to start with. >> >>> > >> >> >>> > >> Łukasz >> >>> > >> >> >>> > >> On Tue, Jan 22, 2019 at 19:43 Mark Liu >> wrote: >> >>> > >>> >> >>> > >>> +1 to have an official Beam released container image. >> >>> > >>> >> >>> > >>> Also I would propose to add a verification step to (or after) >> the release process to do a smoke check. Python has a ValidatesContainer test >> that runs a basic pipeline using the newly built container for verification. >> Other SDK languages can do a similar thing or add a common framework. >> >>> > >>> >> >>> > >>> Mark >> >>> > >>> >> >>> > >>> On Thu, Jan 17, 2019 at 5:56 AM Alan Myrvold < >> amyrv...@google.com> wrote: >> >>> > >> >>> > +1 This would be great. gcr.io seems like a good option for >> snapshots due to the permissions from jenkins to upload and the ability to keep >> snapshots around. >> >>> > >> >>> > On Wed, Jan 16, 2019 at 6:51 PM Ruoyun Huang < >> ruo...@google.com> wrote: >> >>> > > >> >>> > > +1 This would be a great thing to have. >> >>> > > >> >>> > > On Wed, Jan 16, 2019 at 6:11 PM Ankur Goenka < >> goe...@google.com> wrote: >> >>> > >> >> >>> > >> gcr.io seems to be a good option. Given that we don't need >> the hosting server name in the image name, it is easily changeable later. >> >>> > >> >> >>> > >> The Docker container for Apache Flink is named "flink" and they >> have different tags for different releases and configurations >> https://hub.docker.com/_/flink . We can follow a similar model and can >> name the image as "beam" (beam doesn't seem to be taken on docker hub) and >> use tags to distinguish Java/Python/Go and versions etc.
>> >>> > >> Tags will look like:
>> >>> > >> java-SNAPSHOT
>> >>> > >> java-2.10.1
>> >>> > >> python2-SNAPSHOT
>> >>> > >> python2-2.10.1
>> >>> > >> go-SNAPSHOT
>> >>> > >> go-2.10.1
>> >>> > >> On Wed, Jan 16, 2019 at 5:56 PM Ahmet Altay < >> al...@google.com> wrote: >> >>> > >>> >> >>> > >>> For snapshots, we could use gcr.io. Permission would not >> be a problem since Jenkins is already correctly set up. The cost will be >> covered under the apache-beam-testing project. And since this is only for >> snapshots, it will be only for temporary artifacts, not for release >> artifacts. >> >>> > >>> >> >>> > >>> On Wed, Jan 16, 2019 at 5:50 PM Valentyn Tymofieiev < >> valen...@google.com> wrote: >> >>> > >> >>> > +1, releasing containers is a useful process that we need >> to build in Beam and it is required for FnApi users. Among other reasons, >> having officially-released Beam SDK harness container images will make it >> easier for users to do simple customizations to container images, as they >> will be able to use the container image released by Beam as a base image. >> >>> > >> >>> > Good point about potential storage limitations on Bintray. >> With Beam
Re: [VOTE] Release 2.13.0, release candidate #1
Thanks, Valentyn, for providing the fix for the BQ blocker. All other cherry-picks mentioned in the mail thread are also done. I will start RC2 now. Thanks, Ankur On Wed, May 29, 2019 at 2:47 PM Valentyn Tymofieiev wrote: > Hi Vasiullah, > > I am not aware of such a function. I suggest that you start a new thread on > the u...@beam.apache.org mailing list for this question and describe your > use-case there. > > On Wed, May 29, 2019 at 2:39 PM Vasiullah syed wrote: > >> Hello All, >> >> >> Can anyone please help me out with an inbuilt function >> in Apache Spark named monotonically_increasing_id? Is >> there any similar kind in Apache Beam? If so, please reply with more >> detail. Thanks in advance. >> >> On Tue, May 28, 2019 at 9:49 PM Valentyn Tymofieiev >> wrote: >> >>> -1. >>> I would like us to fix >>> https://issues.apache.org/jira/browse/BEAM-7439 for 2.13.0. It is a >>> regression that happened in 2.12.0, but was not caught by existing tests. >>> >>> Thanks, >>> Valentyn >>> >>> On Wed, May 22, 2019, 4:30 PM Ankur Goenka wrote: Hi everyone, Please review and vote on the release candidate #1 for the version 2.13.0, as follows: [ ] +1, Approve the release [ ] -1, Do not approve the release (please provide specific comments) The complete staging area is available for your review, which includes: * JIRA release notes [1], * the official Apache source release to be deployed to dist.apache.org [2], which is signed with the key with fingerprint 6356C1A9F089B0FA3DE8753688934A6699985948 [3], * all artifacts to be deployed to the Maven Central Repository [4], * source code tag "v2.13.0-RC1" [5], * website pull request listing the release [6] and publishing the API reference manual [7]. * Python artifacts are deployed along with the source release to the dist.apache.org [2]. * Validation sheet with a tab for 2.13.0 release to help with validation [8]. The vote will be open for at least 72 hours.
It is adopted by majority approval, with at least 3 PMC affirmative votes. Thanks, Ankur [1] https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12345166 [2] https://dist.apache.org/repos/dist/dev/beam/2.13.0/ [3] https://dist.apache.org/repos/dist/release/beam/KEYS [4] https://repository.apache.org/content/repositories/orgapachebeam-1069/ [5] https://github.com/apache/beam/tree/v2.13.0-RC1 [6] https://github.com/apache/beam/pull/8645 [7] https://github.com/apache/beam-site/pull/589 [8] https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1031196952 >>>
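Since Beam has no direct counterpart to Spark's monotonically_increasing_id, one common workaround (a sketch, not an official Beam recipe) is to attach universally unique rather than monotonic ids in a Map step, which needs no coordination between workers:

```python
import uuid

# In a pipeline this would be roughly:
#   pcoll | beam.Map(lambda x: (uuid.uuid4().hex, x))
# Here the same tagging is shown in plain Python for clarity.
def tag_with_unique_id(element):
    """Attach a globally unique (but NOT monotonically increasing) id."""
    return (uuid.uuid4().hex, element)

tagged = [tag_with_unique_id(x) for x in ["a", "b", "c"]]
ids = [i for i, _ in tagged]
assert len(set(ids)) == len(ids)  # every id is distinct
```

If ids genuinely must be monotonic per key, a stateful DoFn holding a counter would be needed instead, at the cost of the fusion/parallelism concerns discussed in the "Definition of Unified model" thread.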
Re: Definition of Unified model
> Offsets within a file, unordered between files seems exactly analogous with offsets within a partition, unordered between partitions, right? Not exactly. The key difference is that partitions in streaming stores are defined (on purpose, and with key impact on this discussion) as an unbounded sequence of appends. Files, on the other hand, are always of finite size. This difference makes the semantics of offsets in a partitioned stream useful, because they are guaranteed to only increase. On batch stores such as files, these offsets would have to start from zero after some (finite) time, which makes them useless for comparison. On 5/29/19 2:44 PM, Robert Bradshaw wrote: On Tue, May 28, 2019 at 12:18 PM Jan Lukavský wrote: As I understood it, Kenn was supporting the idea that sequence metadata is preferable over FIFO. I was trying to point out that it should even provide the same functionality as FIFO, plus one important thing more: reproducibility and the ability to be persisted and reused the same way in batch and streaming. There is no doubt that sequence metadata can be stored in every storage. But, regarding some implicit ordering that sources might have - yes, of course, data written into HDFS or Cloud Storage has ordering, but only partial - inside some bulk (e.g. file) and the ordering is not defined across the boundaries of these bulks (between files). That is why I'd say that the ordering of sources is relevant only for (partitioned!) streaming sources and generally always reduces to sequence metadata (e.g. offsets). Offsets within a file, unordered between files seems exactly analogous with offsets within a partition, unordered between partitions, right? On 5/28/19 11:43 AM, Robert Bradshaw wrote: Huge +1 to all Kenn said. Jan, batch sources can have orderings too, just like Kafka. I think it's reasonable (for both batch and streaming) that if a source has an ordering that is an important part of the data, it should preserve this ordering into the data itself (e.g.
as sequence numbers, offsets, etc.) On Fri, May 24, 2019 at 10:35 PM Kenneth Knowles wrote: I strongly prefer explicit sequence metadata over FIFO requirements, because:
- FIFO is complex to specify: for example Dataflow has "per stage key-to-key" FIFO today, but it is not guaranteed to remain so (plus "stage" is not a portable concept, nor even guaranteed to remain a Dataflow concept)
- complex specifications are by definition poor usability (if necessary, then it is what it is)
- overly restricts the runner, reduces parallelism, for example any non-stateful ParDo has per-element parallelism, not per "key"
- another perspective on that: FIFO makes everyone pay rather than just the transform that requires exact sequencing
- previous implementation details like reshuffles become part of the model
- I'm not even convinced the use cases involved are addressed by some careful FIFO restrictions; many sinks re-key and they would all have to become aware of how keying of a sequence of "stages" affects the end-to-end FIFO
A noop becoming a non-noop is essentially the mathematical definition of moving from higher-level to lower-level abstraction. So this strikes at the core question of what level of abstraction Beam aims to represent. Lower-level means there are fewer possible implementations and it is more tied to the underlying architecture, and anything that is not a near-exact match pays a huge penalty. Higher-level means there are more implementations possible with different tradeoffs, though they may all pay a minor penalty. I could be convinced to change my mind, but it needs some extensive design, examples, etc. I think it is probably about the most consequential design decision in the whole Beam model, around the same level as the decision to use ParDo and GBK as the primitives IMO. Kenn
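The "explicit sequence metadata" alternative Kenn prefers can be sketched in plain Python (an illustration of the idea, not Beam API): the source stamps each element with a per-key sequence number, and any downstream consumer that cares about order sorts on it, instead of the runner guaranteeing FIFO everywhere:

```python
from collections import defaultdict

def stamp_sequence(elements):
    """Source side: attach a per-key, monotonically increasing sequence number."""
    counters = defaultdict(int)
    stamped = []
    for key, value in elements:
        stamped.append((key, counters[key], value))
        counters[key] += 1
    return stamped

def restore_order(stamped):
    """Consumer side: recover per-key order even if the runner shuffled elements."""
    return sorted(stamped, key=lambda e: (e[0], e[1]))

events = [("k1", "a"), ("k2", "x"), ("k1", "b")]
shuffled = list(reversed(stamp_sequence(events)))  # simulate runner reordering
print(restore_order(shuffled))
# [('k1', 0, 'a'), ('k1', 1, 'b'), ('k2', 0, 'x')]
```

Only the one transform that needs ordering pays for the sort; everything else keeps per-element parallelism, which is exactly the trade-off argued for above.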
Re: Question about building Go SDK
Not a bother at all! The Gradle-based build for the Go SDK is brittle, and there are a few issues around it. Notably, it doesn't really line up with how end users will typically acquire dependencies, using the regular go toolchain. As Luke says, if you're working with or *on* the SDK, you can live almost entirely on the standard go toolchain. It's a goal to get the Gradle dependencies to be backed by Go Modules in the next few months. This will make sure that what Beam tests more closely matches what users will depend on. If you're trying out the SDK, the main need for Gradle is to build an SDK image for yourself. I don't believe we publish common-use worker images for the experimental SDK at this time. On Wed, May 29, 2019, 2:39 PM Lukasz Cwik wrote: > I have gotten the same error on other dependencies. Gogradle sometimes > fails to pull down the dependencies. Retrying ./gradlew > :sdks:go:resolveBuildDependencies in the past did allow me to fetch the > dependencies. Others who focus only on the Go SDK work using the standard > Go development process by checking out the apache beam package into their > GOPATH and editing it there. I rarely do this because I typically focus on > changes that are either outside the Go SDK or span multiple languages. > > > On Wed, May 29, 2019 at 5:00 AM Kengo Seki wrote: >> It was my fault, probably my gradle cache was broken. >> Running gradlew succeeded after removing ~/.gradle. >> I'm sorry for bothering you. >> >> Kengo Seki >> >> On Wed, May 29, 2019 at 6:45 PM Kengo Seki wrote: >> > >> > Hi Beam developers, >> > >> > I tried to build the Go SDK on the master branch, but encountered the >> > following error. >> > >> > ``` >> > $ ./gradlew :sdks:go:resolveBuildDependencies >> > >> > (snip) >> > >> > FAILURE: Build failed with an exception. >> > >> > * What went wrong: >> > Execution failed for task ':sdks:go:resolveBuildDependencies'.
>> > > Exception in resolution, message is: >> > Cannot resolve dependency:github.com/coreos/etcd: >> > commit='11214aa33bf5a47d3d9d8dafe0f6b97237dfe921', >> > urls=[https://github.com/co >> > reos/etcd.git, g...@github.com:coreos/etcd.git] >> > Resolution stack is: >> > +- github.com/apache/beam/sdks/go >> > ``` >> > >> > Replacing the "coreos" organization with "etcd-io" in the GitHub URLs in >> > gogradle.lock seems to work. >> > >> > ``` >> > $ sed -i s,coreos/etcd,etcd-io/etcd,g sdks/go/gogradle.lock >> > $ ./gradlew :sdks:go:resolveBuildDependencies >> > >> > (snip) >> > >> > BUILD SUCCESSFUL in 4s >> > 2 actionable tasks: 2 executed >> > ``` >> > >> > But I'm not sure whether it's a valid solution, since I'm not familiar >> > with Gogradle and it seems that gogradle.lock shouldn't be edited >> > directly according to its header comment. >> > Is the above approach valid, or is there a more proper way to solve >> > this problem...? >> > >> > Kengo Seki >>
Re: [VOTE] Release 2.13.0, release candidate #1
Hi Vasiullah, I am not aware of such a function. I suggest that you start a new thread on the u...@beam.apache.org mailing list for this question and describe your use-case there. On Wed, May 29, 2019 at 2:39 PM Vasiullah syed wrote: > Hello All, > > > Can anyone please help me out with an inbuilt function > in Apache Spark named monotonically_increasing_id? Is > there any similar kind in Apache Beam? If so, please reply with more > detail. Thanks in advance. > > On Tue, May 28, 2019 at 9:49 PM Valentyn Tymofieiev > wrote: > >> -1. >> I would like us to fix >> https://issues.apache.org/jira/browse/BEAM-7439 for 2.13.0. It is a >> regression that happened in 2.12.0, but was not caught by existing tests. >> >> Thanks, >> Valentyn >> >> On Wed, May 22, 2019, 4:30 PM Ankur Goenka wrote: >> >>> Hi everyone, >>> >>> Please review and vote on the release candidate #1 for the version >>> 2.13.0, as follows: >>> >>> [ ] +1, Approve the release >>> [ ] -1, Do not approve the release (please provide specific comments) >>> >>> The complete staging area is available for your review, which includes: >>> * JIRA release notes [1], >>> * the official Apache source release to be deployed to dist.apache.org >>> [2], which is signed with the key with fingerprint >>> 6356C1A9F089B0FA3DE8753688934A6699985948 [3], >>> * all artifacts to be deployed to the Maven Central Repository [4], >>> * source code tag "v2.13.0-RC1" [5], >>> * website pull request listing the release [6] and publishing the API >>> reference manual [7]. >>> * Python artifacts are deployed along with the source release to the >>> dist.apache.org [2]. >>> * Validation sheet with a tab for 2.13.0 release to help with validation >>> [8]. >>> >>> The vote will be open for at least 72 hours. It is adopted by majority >>> approval, with at least 3 PMC affirmative votes.
>>> >>> Thanks, >>> Ankur >>> >>> [1] >>> https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527&version=12345166 >>> [2] https://dist.apache.org/repos/dist/dev/beam/2.13.0/ >>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS >>> [4] >>> https://repository.apache.org/content/repositories/orgapachebeam-1069/ >>> [5] https://github.com/apache/beam/tree/v2.13.0-RC1 >>> [6] https://github.com/apache/beam/pull/8645 >>> [7] https://github.com/apache/beam-site/pull/589 >>> [8] >>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1031196952 >>>
Re: [DISCUSS] Autoformat python code with Black
I think all of the mentioned arguments are reasonable. As Ismaël said, this all seems to be a tradeoff, but maybe one worth considering. Maybe we should think about why it’s worth introducing a tool which will make us strictly follow PEP 8. First of all, it should help with good coding practices. For example, if we have 4 spaces and someone repeatedly nests a lot of constructs, the code will be more difficult to read. So the author (hopefully) will refactor it and maybe split it into functions. Also, most of the rules would help keep our future commits cleaner. For example, by adding a comma at the end of a list, adding a new item becomes a change of 1 line instead of 2. I agree with Max, it’s understandable this may be a bit problematic in searching git history, but I guess this is the same case as the Java SDK and Spotless. Maybe there is a solution somewhere to filter that better? After a quick search I found this [1] (so maybe the solution is in git?). What else do we actually gain? My guess is faster PR review iteration. We will skip some of the conversations about code style. Also everything will be formatted in the same way, so eventually it would be easier to read. Moreover, as was mentioned in previous emails, a lot of Jenkins failures won’t take place, so we save time and resources. Last but not least, new contributors may be less discouraged. When I started contributing I didn’t know how to format my code and I lost a lot of time to add pylint and adjust IntelliJ. I eventually failed. Currently I write code intuitively and when I don’t forget I rerun tox. One of the disadvantages is that our pipelines have custom syntax and after formatting they look a little bit weird, but maybe extending the only configurable option in Black - line length, from 88 to 110 - would be a solution. The second one is that Black requires Python 3 to run. I don’t know how big an obstacle that would be. I believe there are two options for how Black could be introduced.
First: just do it, it will hurt but then it would be ok (same as a dentist appointment). Of course it may require some work to adjust linters. On the other hand we can do it gradually and start including SDK parts one by one - maybe it will be less painful? As an example I can share one of the projects [2] I know that uses Black (they also use other cool checkers and pre-commit [3]). This is how their build with all checks looks [4]. To sum up, I believe that if we want to improve our coding experience, we should improve our toolset. Black seems to be a recent and quite popular tool, which makes me think they won’t stop developing it. [1] https://stackoverflow.com/questions/4112410/git-change-styling-whitespace-without-changing-ownership-blame [2] https://github.com/GoogleCloudPlatform/oozie-to-airflow [3] https://pre-commit.com [4] https://travis-ci.org/GoogleCloudPlatform/oozie-to-airflow/builds/538725689 On Wed, May 29, 2019 at 2:01 PM Robert Bradshaw wrote: > Reformatting to 4 spaces seems a non-starter to me, as it would change > nearly every single line in the codebase (and cause the loss of all context as > well as of that particular line). > > This is probably why the 2-space fork exists. However, we don't conform to > that either--we use 2 spaces for indentation, but 4 for continuation > indentation. (As for the history of this, it goes back to Google's > internal style guide, probably motivated by consistency with C++, Java, ... > and the fact that with an indent level of 4 one ends up wrapping lines > quite frequently (it's telling that black's default line length is 88)). > This turns out to be an easy change to the codebase. > > Once we move beyond the 2 vs. 4 whitespace thing, I found that this tool > introduces a huge amount of vertical whitespace (e.g. closing parentheses > on their own line), e.g.
>
> def foo(
>     args
> ):
>     if (
>         long expression
>     ):
>         func(
>             args
>         )
>
> I wrote a simple post-processor to put closing parentheses on the same > lines, as well as omit the newline after "if (", and disable formatting > of strings, which reduced the churn in our codebase to 15k lines (adding > about 4k) out of 200k total. > > https://github.com/apache/beam/pull/8712/files > > It's still very opinionated, often in different ways than me, and doesn't > understand the semantics of the code, but possibly something we could live > with given the huge advantages of an autoformatter. > > An intermediate point would be to allow, but not require, autoformatting > of changed lines. > > As for being beta quality, it looks like it's got a decent number of > contributors and in my book being in the python github project is a strong > positive signal. But, due to the above issues, I think we'd have to > maintain a fork. (The code is pretty lightweight, the 2 vs. 4 space issue > is a 2-line change, and the rest is implemented as a post-processing step (for > now, incomplete), so it'd be easy to stay in sync with upstream.) > > On Wed, May 29, 2019 at 11:03 AM Ismaël
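Katarzyna's trailing-comma point is easy to show concretely: with Black's magic-trailing-comma style, appending an item to a multi-line list touches one line in a diff instead of two. A small runnable illustration:

```python
# Without a trailing comma, appending "dill" changes two lines in the diff:
# the old last line gains a comma, plus the new line itself.
deps_no_trailing = [
    "avro",
    "dill"
]

# With a trailing comma on every element, appending touches only one line.
deps_trailing = [
    "avro",
    "dill",
]

# The trailing comma is purely stylistic; both lists are identical at runtime.
assert deps_no_trailing == deps_trailing
```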
Re: Definition of Unified model
Expanding the dimensionality could be the basis for loops within the graph, since loops could be modeled as (time, loop iteration #, nested loop iteration #, nested nested loop iteration #, ...) On Tue, May 28, 2019 at 12:10 PM Jan Lukavský wrote: > Could this be solved by "expanding the dimensionality" of the time field? What > I mean by that - if an input element to FlatMap has sequence number T, then > the (stateless) FlatMap knows the ordering of output elements, right? If it > would expand the field by which it will next sort the elements to (X, 1), > (X, 2), ... (X, N), then it would be possible to sort the elements back > later. There seems to be no need for state to achieve that, right? > > Jan > On 5/28/19 6:52 PM, Reuven Lax wrote: > > A slightly larger concern: it also will force users to create stateful > DoFns everywhere to generate these sequence numbers. If I have a ParDo that > is not a simple 1:1 transform (i.e. not MapElements), then the ParDo will > need to generate its own sequence numbers for ordering, and the only safe > way to do so is to use a stateful DoFn. This turns what used to be a simple > in-memory DoFn into one that has to access state. Also I believe many > runners will not fuse stateful DoFns. While none of this poses a problem > for the model, it could make ordering extremely expensive to achieve. > > Reuven > > On Tue, May 28, 2019 at 6:09 AM Jan Lukavský wrote: > >> Hi Reuven, >> >> > It also gets awkward with Flatten - the sequence number is no longer >> enough, you must also encode which side of the flatten each element came >> from. >> >> That is a generic need. Even if you read data from Kafka, the offsets are >> comparable only inside a single partition.
So, for Kafka to work as a FIFO >> for ordering, elements with the same key have to be pushed to the same >> partition (otherwise Kafka cannot act as FIFO, because different partitions >> can be handled by different brokers, which means different observers and >> they therefore might not agree on the order of events). So if we want to >> emulate FIFO per key, then the sequence IDs have to also be per key. >> On 5/28/19 2:33 PM, Reuven Lax wrote: >> >> Sequence metadata does have the disadvantage that users can no longer use >> the types coming from the source. You must create a new type that contains >> a sequence number (unless Beam provides this). It also gets awkward with >> Flatten - the sequence number is no longer enough, you must also encode >> which side of the flatten each element came from. >> >> On Tue, May 28, 2019 at 3:18 AM Jan Lukavský wrote: >> >>> As I understood it, Kenn was supporting the idea that sequence metadata >>> is preferable over FIFO. I was trying to point out, that it even should >>> provide the same functionality as FIFO, plus one more important thing - >>> reproducibility and ability to be persisted and reused the same way >>> in batch and streaming. >>> >>> There is no doubt, that sequence metadata can be stored in every >>> storage. But, regarding some implicit ordering that sources might have - >>> yes, of course, data written into HDFS or Cloud Storage has ordering, >>> but only partial - inside some bulk (e.g. file) and the ordering is not >>> defined correctly on boundaries of these bulks (between files). That is >>> why I'd say, that ordering of sources is relevant only for >>> (partitioned!) streaming sources and generally always reduces to >>> sequence metadata (e.g. offsets). >>> >>> Jan >>> >>> On 5/28/19 11:43 AM, Robert Bradshaw wrote: >>> > Huge +1 to all Kenn said. >>> > >>> > Jan, batch sources can have orderings too, just like Kafka.
I think >>> > it's reasonable (for both batch and streaming) that if a source has an >>> > ordering that is an important part of the data, it should preserve >>> > this ordering into the data itself (e.g. as sequence numbers, offsets, >>> > etc.) >>> > >>> > On Fri, May 24, 2019 at 10:35 PM Kenneth Knowles >>> wrote: >>> >> I strongly prefer explicit sequence metadata over FIFO requirements, >>> because: >>> >> >>> >> - FIFO is complex to specify: for example Dataflow has "per stage >>> key-to-key" FIFO today, but it is not guaranteed to remain so (plus "stage" >>> is not a portable concept, nor even guaranteed to remain a Dataflow concept) >>> >> - complex specifications are by definition poor usability (if >>> necessary, then it is what it is) >>> >> - overly restricts the runner, reduces parallelism, for example any >>> non-stateful ParDo has per-element parallelism, not per "key" >>> >> - another perspective on that: FIFO makes everyone pay rather than >>> just the transform that requires exactly sequencing >>> >> - previous implementation details like reshuffles become part of >>> the model >>> >> - I'm not even convinced the use cases involved are addressed by >>> some careful FIFO restrictions; many sinks re-key and they would all have >>> to become aware of how keying of a sequence of "stages" affects the >>> end-to-end FIFO >>> >>
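Jan's point above, that emulating per-key FIFO reduces to attaching per-key sequence metadata, can be sketched in a few lines of plain Python. This is illustrative only; `assign_per_key_seq` is a made-up name, not a Beam or Kafka API.

```python
from collections import defaultdict

def assign_per_key_seq(events):
    # Tag each (key, value) with a per-key sequence number: this is all
    # the metadata needed to emulate per-key FIFO downstream, even after
    # an arbitrary reshuffle.
    counters = defaultdict(int)
    tagged = []
    for key, value in events:
        tagged.append((key, counters[key], value))
        counters[key] += 1
    return tagged

events = [("a", 1), ("b", 10), ("a", 2), ("b", 20)]
tagged = assign_per_key_seq(events)
# Simulate a shuffle, then recover per-key order by sorting on (key, seq).
shuffled = list(reversed(tagged))
restored = sorted(shuffled, key=lambda t: (t[0], t[1]))
assert [v for k, s, v in restored if k == "a"] == [1, 2]
assert [v for k, s, v in restored if k == "b"] == [10, 20]
```

Note that, as Reuven observes in the thread, assigning such numbers safely inside a distributed pipeline would require a stateful DoFn; the counter above only works because it runs in one process.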
Re: [DISCUSS] Autoformat python code with Black
Thank you all for doing the work. The demo PR looks decent. I still have 3 major concerns. (1) I prefer not to maintain a fork with our own customization for a tool. It will add to our maintenance costs. (2) Even the minimized change touches about 20k lines of code. (3) On the beta note, the specific note is "until the formatter becomes stable, you should expect some formatting to change in the future." That signals that we might see formatting churn. And as a minor point I do not understand why some reformatting happened. For example, was the following change due to PEP-8 compliance or the tool being opinionated?

    if (not hasattr(result, 'has_job')  # direct runner
        or result.has_job):  # not just a template creation

-->

    if (not hasattr(result, 'has_job') or result.has_job  # direct runner
        ):  # not just a template creation

So far all of us agree on the benefits of an autoformatter. I do not wish to block this. Consider me -0 as long as someone is willing to maintain a fork that works for our purposes. Ahmet On Wed, May 29, 2019 at 5:01 AM Robert Bradshaw wrote: > Reformatting to 4 spaces seems a non-starter to me, as it would change > nearly every single line in the codebase (and the loss of all context as > well as that particular line). > > This is probably why the 2-space fork exists. However, we don't conform to > that either--we use 2 spaces for indentation, but 4 for continuation > indentation. (As for the history of this, this goes back to Google's > internal style guide, probably motivated by consistency with C++, Java, ... > and the fact that with an indent level of 4 one ends up wrapping lines > quite frequently (it's telling that black's default line length is 88)). > This turns out to be an easy change to the codebase. > > Once we move beyond the 2 vs. 4 whitespace thing, I found that this tool > introduces a huge amount of vertical whitespace (e.g. closing parentheses > on their own line), e.g.
>
> def foo(
>     args
> ):
>   if (
>       long expression)
>   ):
>     func(
>         args
>     )
>
> I wrote a simple post-processor to put closing parentheses on the same > lines, as well as omit the newline after "if (", and disable formatting > of strings, which reduces the churn in our codebase to 15k lines (adding > about 4k) out of 200k total. > > https://github.com/apache/beam/pull/8712/files > > It's still very opinionated, often in different ways than me, and doesn't > understand the semantics of the code, but possibly something we could live > with given the huge advantages of an autoformatter. > > An intermediate point would be to allow, but not require, autoformatting > of changed lines. > > As for being beta quality, it looks like it's got a decent number of > contributors and in my book being in the python github project is a strong > positive signal. But, due to the above issues, I think we'd have to > maintain a fork. (The code is pretty lightweight, the 2 vs. 4 space issue > is a 2-line change, and the rest implemented as a post-processing step (for > now, incomplete), so it'd be easy to stay in sync with upstream.) > > On Wed, May 29, 2019 at 11:03 AM Ismaël Mejía wrote: > > > > > I think the question is if it can be configured in a way to fit our > > > current linter's style. I don't think it is feasible to reformat the > > > entire Python SDK. > > > > It cannot be configured to do what we actually do because Black is > > configurable only to support the standard python codestyle guidelines > > (PEP-8) which recommends 4 spaces and is what most projects in the > > python world use. > > > > > Reformatted lines don't allow quick access to the Git history. This > > > effect is still visible in the Java SDK. However, I have the feeling > > > that this might be less of a problem with Python because the linter has > > > more rules than Checkstyle had. > > > > Yes that’s the bad side effect but there are always tradeoffs we have > > to deal with.
> > > > > > > > > > On Wed, May 29, 2019 at 10:52 AM Maximilian Michels > wrote: > > > > > > I think the question is if it can be configured in a way to fit our > > > current linter's style. I don't think it is feasible to reformat the > > > entire Python SDK. > > > > > > Reformatted lines don't allow quick access to the Git history. This > > > effect is still visible in the Java SDK. However, I have the feeling > > > that this might be less of a problem with Python because the linter has > > > more rules than Checkstyle had. > > > > > > -Max > > > > > > On 29.05.19 10:16, Ismaël Mejía wrote: > > > >> My concerns are: > > > >> - The product is clearly marked as beta with a big warning. > > > >> - It looks like mostly a single person project. For the same reason > I also strongly prefer not using a fork for a specific setting. Fork will > only have less people looking at it. > > > > > > > > I suppose the project is marked as beta because it is recent, it was > > > > presented in 2018’s pycon, and because some
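The kind of post-processing Robert describes, pulling a lone closing parenthesis back onto the previous line, can be approximated with a naive regex pass. This is a sketch of the idea only, not the code in his actual PR; a real tool would have to skip string literals and comments.

```python
import re

def join_closing_parens(src):
    # Move a closing bracket that sits alone on a line up onto the
    # previous line, dropping a trailing comma if one precedes it.
    # Naive: does not understand strings or comments.
    return re.sub(r",?\n\s*([)\]}])", r"\1", src)

before = "func(\n    args,\n)\n"
after = join_closing_parens(before)
assert after == "func(\n    args)\n"
```

Running such a pass after the formatter (rather than forking the formatter itself) is one way to keep the upstream tool unmodified, at the cost of output the formatter would no longer consider its own.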
Re: Question about building Go SDK
I have gotten the same error on other dependencies. Gogradle sometimes fails to pull down the dependencies. Retrying ./gradlew :sdks:go:resolveBuildDependencies in the past did allow me to fetch the dependencies. Others who focus only on the Go SDK work using the standard Go development process by checking out the apache beam package into their GOPATH editing it there. I rarely do this because I typically focus on changes that are either outside the Go SDK or span multiple languages. On Wed, May 29, 2019 at 5:00 AM Kengo Seki wrote: > It was my fault, probably my gradle cache was broken. > Running gradlew succeeded after removing ~/.gradle. > I'm sorry for bothering you. > > Kengo Seki > > On Wed, May 29, 2019 at 6:45 PM Kengo Seki wrote: > > > > Hi Beam developers, > > > > I tried to build Go SDK on the master branch, but encountered the > > following error. > > > > ``` > > $ ./gradlew :sdks:go:resolveBuildDependencies > > > > (snip) > > > > FAILURE: Build failed with an exception. > > > > * What went wrong: > > Execution failed for task ':sdks:go:resolveBuildDependencies'. > > > Exception in resolution, message is: > > Cannot resolve dependency:github.com/coreos/etcd: > > commit='11214aa33bf5a47d3d9d8dafe0f6b97237dfe921', > > urls=[https://github.com/co > > reos/etcd.git, g...@github.com:coreos/etcd.git] > > Resolution stack is: > > +- github.com/apache/beam/sdks/go > > ``` > > > > Replacing the "coreos" organizations with "etcd-io" in GitHub URLs in > > gogradle.lock seems to work. > > > > ``` > > $ sed -i s,coreos/etcd,etcd-io/etcd,g sdks/go/gogradle.lock > > $ ./gradlew :sdks:go:resolveBuildDependencies > > > > (snip) > > > > BUILD SUCCESSFUL in 4s > > 2 actionable tasks: 2 executed > > ``` > > > > But I'm not sure whether it's a valid solution, since I'm not familiar > > with Gogradle and it seems that gogradle.lock shouldn't be edited > > directly according to its header comment. 
> > Is the above approach valid, or is there a more proper way to solve > > this problem...? > > > > Kengo Seki >
Re: Shuffling on apache beam
If you add a stateful DoFn to your pipeline, you'll force Beam to shuffle data to their corresponding worker per key. I am not sure what is the latency cost of doing this (as the messages still need to be shuffled). But it may help you accomplish this without adding windowing+triggering. -P. On Wed, May 29, 2019 at 5:16 AM pasquale.bon...@gmail.com < pasquale.bon...@gmail.com> wrote: > Hi Reza, > with GlobalWindow with triggering I was able to reduce hotspot issues > gaining satisfying performance for BigTable update. Unfortunately latency > when getting messages from PubSub remains around 1.5s that it's too much > considering our NFR. > > This is the code I use to get the messages: > PCollectionTuple rawTransactions = p // > .apply("GetMessages", > > PubsubIO.readMessagesWithAttributes().withIdAttribute(TRANSACTION_MESSAGE_ID_FIELD_NAME) > > .withTimestampAttribute(TRANSACTION_MESSAGE_TIMESTAMP_FIELD_NAME).fromTopic(topic)) > .apply(Window.configure() > .triggering(Repeatedly > > .forever(AfterWatermark.pastEndOfWindow() > .withEarlyFirings( > AfterProcessingTime > > .pastFirstElementInPane() > > .plusDelayOf(Duration.millis(1))) > // Fire on any > late data > > .withLateFirings(AfterPane.elementCountAtLeast(1 > .discardingFiredPanes()) > > Messages are produced with a different dataflow: > Pipeline p = Pipeline.create(options); > p.apply( > "ReadFile", > TextIO.read() > .from(options.getInputLocation() + "/*.csv") > .watchForNewFiles( > // Check for new files every 1 seconds > Duration.millis(600), > // Never stop checking for new files > Watch.Growth.never())) > .apply( > "create message", > ParDo.of( > new DoFn() { > @ProcessElement > public void processElement(ProcessContext context) { > String line = context.element(); > > String payload = convertRow(line); > long now = > LocalDateTime.now().atZone(ZoneId.systemDefault()).toInstant().toEpochMilli(); > context.output( > new PubsubMessage( > payload.getBytes(), > 
ImmutableMap.of(TRANSACTION_MESSAGE_ID_FIELD_NAME, > payload.split(",")[6],TRANSACTION_MESSAGE_TIMESTAMP_FIELD_NAME, > Long.toString(now; > } })) > .apply("publish message", PubsubIO.writeMessages().to(topic)); > > I'm uploading a file containing 100 rows every 600 ms. > > I found several threads on Stack Overflow about this latency issue, but > none has a solution. > > > > > On 2019/05/24 07:19:02, Reza Rokni wrote: > > PS You can also make use of the GlobalWindow with a stateful DoFn. > > > > On Fri, 24 May 2019 at 15:13, Reza Rokni wrote: > > > > > Hi, > > > > > > Have you explored the use of triggers with your use case? > > > > > > > > > > https://beam.apache.org/releases/javadoc/2.12.0/org/apache/beam/sdk/transforms/windowing/Trigger.html > > > > > > Cheers > > > > > > Reza > > > > > > On Fri, 24 May 2019 at 14:14, pasquale.bon...@gmail.com < > > > pasquale.bon...@gmail.com> wrote: > > > > > >> Hi Reuven, > > >> I would like to know if it is possible to guarantee that records are > > >> processed by the same thread/task based on a key, as probably happens > in a > > >> combine/stateful operation, without adding the delay of a window. > > >> This could increase the efficiency of caching and reduce some racing > > >> conditions when writing data. > > >> I understand that workers are not part of the programming model so I would > > >> like to know if it's possible to achieve this behaviour reducing at > minimum > > >> the delay of windowing. We don't need any combine or state, we just > want all > > >> records with a given key to be sent to the same thread. > > >> > > >> Thanks > > >> > > >> > > >> On 2019/05/24 03:20:13, Reuven Lax wrote: > > >> > Can you explain what you mean by worker? While every runner has > workers > > >> of > > >> > course, workers are not part of the programming model.
> > >> > > > >> > On Thu, May 23, 2019 at 8:13 PM pasquale.bon...@gmail.com < > > >> > pasquale.bon...@gmail.com> wrote: > > >> > > > >> > > Hi all, > > >> > > I would like to know if Apache Beam has a functionality similar to > > >> > > fieldsGrouping in Storm that allows to send records to a specific > > >> > > task/worker based on a key. > > >> > > I know that we can achieve that with a combine/grouByKey > operation but > > >> > > that implies to add a windowing in our pipeline that we don't > want. > > >> > >
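The "same key goes to the same task" behaviour Pasquale asks about (Storm's fieldsGrouping) is, at bottom, deterministic hash partitioning, which is what keyed/stateful operations give a runner license to do. A stdlib sketch of the routing idea only; this is not what any Beam runner literally exposes, and `worker_for_key` is a made-up name.

```python
import zlib

NUM_WORKERS = 8  # hypothetical worker count, for illustration

def worker_for_key(key):
    # Deterministic hash partitioning: every record with the same key
    # is routed to the same worker slot, like Storm's fieldsGrouping.
    # crc32 is used instead of hash() so the result is stable across runs.
    return zlib.crc32(key.encode("utf-8")) % NUM_WORKERS

records = [("user-1", 10), ("user-2", 5), ("user-1", 7)]
routing = [(worker_for_key(k), k, v) for k, v in records]
# Both "user-1" records land on the same slot.
assert routing[0][0] == routing[2][0]
```

In Beam terms, adding a stateful DoFn (or any GroupByKey) is what obliges the runner to perform this kind of keyed routing; the model itself never names workers.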
Re: [VOTE] Release 2.13.0, release candidate #1
We have quite a few cherry-pick requests. Are they all for major/blocking issues? Have we uncovered issues in release validation that are normally missed by our daily tests? On Wed, May 29, 2019 at 10:20 AM Thomas Weise wrote: > Added: https://github.com/apache/beam/pull/8714 > > > On Tue, May 28, 2019 at 4:03 PM Ankur Goenka wrote: > >> Open cherry pick PRs for spark runner >> https://github.com/apache/beam/pull/8705 >> https://github.com/apache/beam/pull/8706 >> >> On Tue, May 28, 2019 at 3:42 PM Valentyn Tymofieiev >> wrote: >> >>> Yes, looking into that. >>> >>> On Tue, May 28, 2019 at 3:37 PM Ankur Goenka wrote: >>> Valentyn, Can you please send the cherry pick PR for https://issues.apache.org/jira/browse/BEAM-7439 On Tue, May 28, 2019 at 3:04 PM Ankur Goenka wrote: > Sure, I will cherry pick those PRs. > > On Tue, May 28, 2019 at 2:19 PM Kyle Weaver > wrote: > >> Hi Ankur, >> >> It's not a blocker, but I'd like to see >> https://github.com/apache/beam/pull/8558 and >> https://github.com/apache/beam/pull/8569 be included so TFX examples >> can be run without errors on the 2.13.0 Spark runner ( >> https://github.com/tensorflow/tfx/pull/84). >> >> Kyle Weaver | Software Engineer | github.com/ibzib | >> kcwea...@google.com | +1650203 >> >> >> On Tue, May 28, 2019 at 11:53 AM Ankur Goenka >> wrote: >> >>> Thanks for the validation. >>> >>> I have marked fixed version of >>> https://issues.apache.org/jira/browse/BEAM-7406 >>> https://issues.apache.org/jira/browse/BEAM-6380 to be 2.13.0 and >>> will cherry pick the associated commits to the jira. >>> >>> >>> On Tue, May 28, 2019 at 11:19 AM Lukasz Cwik >>> wrote: >>> I would also suggest to get https://github.com/apache/beam/pull/8668 in to 2.13.0 since it fixes a logging setup issue on Dataflow (BEAM-7406).
On Tue, May 28, 2019 at 10:22 AM Chamikara Jayalath < chamik...@google.com> wrote: > I would also like to get https://github.com/apache/beam/pull/8661 in > to 2.13.0 that fixes > https://issues.apache.org/jira/browse/BEAM-6380. It's not a new > issue but has affected a number of users. > > - Cham > > On Tue, May 28, 2019 at 9:31 AM Valentyn Tymofieiev < > valen...@google.com> wrote: > >> Thanks, Juta Staes, for reporting this issue. >> >> On Tue, May 28, 2019, 9:19 AM Valentyn Tymofieiev < >> valen...@google.com> wrote: >> >>> -1. >>> I would like us to fix >>> https://issues.apache.org/jira/browse/BEAM-7439 for 2.13.0. It >>> is a regression that happened in 2.12.0, but was not caught by >>> existing >>> tests. >>> >>> Thanks, >>> Valentyn >>> >>> On Wed, May 22, 2019, 4:30 PM Ankur Goenka >>> wrote: >>> Hi everyone, Please review and vote on the release candidate #1 for the version 2.13.0, as follows: [ ] +1, Approve the release [ ] -1, Do not approve the release (please provide specific comments) The complete staging area is available for your review, which includes: * JIRA release notes [1], * the official Apache source release to be deployed to dist.apache.org [2], which is signed with the key with fingerprint 6356C1A9F089B0FA3DE8753688934A6699985948 [3], * all artifacts to be deployed to the Maven Central Repository [4], * source code tag "v2.13.0-RC1" [5], * website pull request listing the release [6] and publishing the API reference manual [7]. * Python artifacts are deployed along with the source release to the dist.apache.org [2]. * Validation sheet with a tab for 2.13.0 release to help with validation [8]. The vote will be open for at least 72 hours. It is adopted by majority approval, with at least 3 PMC affirmative votes. 
Thanks, Ankur [1] https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12345166 [2] https://dist.apache.org/repos/dist/dev/beam/2.13.0/ [3] https://dist.apache.org/repos/dist/release/beam/KEYS [4] https://repository.apache.org/content/repositories/orgapachebeam-1069/ [5] https://github.com/apache/beam/tree/v2.13.0-RC1 [6] https://github.com/apache/beam/pull/8645 [7] https://github.com/apache/beam-site/pull/589 [8]
Re: [DISCUSS] Autoformat python code with Black
Thanks Ismaël for bringing this up. Dealing with pylint issues has been painful and unproductive. For the Java side it became ./gradlew spotlessApply and there is no looking back. To get there it was also necessary to reformat some of the code, but the net benefits are evident. Even if we have to maintain a fork for the formatter, it will probably still be a big win. Thomas On Wed, May 29, 2019 at 5:01 AM Robert Bradshaw wrote: > Reformatting to 4 spaces seems a non-starter to me, as it would change > nearly every single line in the codebase (and the loss of all context as > well as that particular line). > > This is probably why the 2-space fork exists. However, we don't conform to > that either--we use 2 spaces for indentation, but 4 for continuation > indentation. (As for the history of this, this goes back to Google's > internal style guide, probably motivated by consistency with C++, Java, ... > and the fact that with an indent level of 4 one ends up wrapping lines > quite frequently (it's telling that black's default line length is 88)). > This turns out to be an easy change to the codebase. > > Once we move beyond the 2 vs. 4 whitespace thing, I found that this tool > introduces a huge amount of vertical whitespace (e.g. closing parentheses > on their own line), e.g.
>
> def foo(
>     args
> ):
>   if (
>       long expression)
>   ):
>     func(
>         args
>     )
>
> I wrote a simple post-processor to put closing parentheses on the same > lines, as well as omit the newline after "if (", and disable formatting > of strings, which reduces the churn in our codebase to 15k lines (adding > about 4k) out of 200k total. > > https://github.com/apache/beam/pull/8712/files > > It's still very opinionated, often in different ways than me, and doesn't > understand the semantics of the code, but possibly something we could live > with given the huge advantages of an autoformatter. > > An intermediate point would be to allow, but not require, autoformatting > of changed lines.
> > As for being beta quality, it looks like it's got a decent number of > contributors and in my book being in the python github project is a strong > positive signal. But, due to the above issues, I think we'd have to > maintain a fork. (The code is pretty lightweight, the 2 vs. 4 space issue > is a 2-line change, and the rest implemented as a post-processing step (for > now, incomplete), so it'd be easy to stay in sync with upstream.) > > On Wed, May 29, 2019 at 11:03 AM Ismaël Mejía wrote: > > > > > I think the question is if it can be configured in a way to fit our > > > current linter's style. I don't think it is feasible to reformat the > > > entire Python SDK. > > > > It cannot be configured to do what we actually do because Black is > > configurable only to support the standard python codestyle guidelines > > (PEP-8) which recommends 4 spaces and is what most projects in the > > python world use. > > > > > Reformatted lines don't allow quick access to the Git history. This > > > effect is still visible in the Java SDK. However, I have the feeling > > > that this might be less of a problem with Python because the linter has > > > more rules than Checkstyle had. > > > > Yes that’s the bad side effect but there are always tradeoffs we have > > to deal with. > > > > > > > > > > On Wed, May 29, 2019 at 10:52 AM Maximilian Michels > wrote: > > > > > > I think the question is if it can be configured in a way to fit our > > > current linter's style. I don't think it is feasible to reformat the > > > entire Python SDK. > > > > > > Reformatted lines don't allow quick access to the Git history. This > > > effect is still visible in the Java SDK. However, I have the feeling > > > that this might be less of a problem with Python because the linter has > > > more rules than Checkstyle had. > > > > > > -Max > > > > > > On 29.05.19 10:16, Ismaël Mejía wrote: > > > >> My concerns are: > > > >> - The product is clearly marked as beta with a big warning. 
> > > >> - It looks like mostly a single person project. For the same reason > I also strongly prefer not using a fork for a specific setting. Fork will > only have less people looking at it. > > > > > > > > I suppose the project is marked as beta because it is recent, it was > > > > presented in 2018’s pycon, and because some things can change since > > > > auto-formatters are pretty tricky beasts, I think beta in that case > is > > > > like our own ‘@Experimental’. If you look at the contribution page > [1] > > > > you can notice that it is less and less a single person project, > there > > > > have been 93 independent contributions since the project became > > > > public, and the fact that it is hosted in the python organization > > > > github [2] gives some confidence on the project continuity. > > > > > > > > You are right however about the fact that the main author seems to be > > > > the ‘benevolent’ dictator, and in the 2-spaces issue he can seem > > > > arbitrary, but he
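If the project did adopt Black with the wider line length Katarzyna floats (from 88 to 110), the pinned configuration could be as small as this. A hypothetical `pyproject.toml` fragment: `[tool.black]` is Black's standard config table and `line-length` its only widely used formatting knob, but the exact value and file layout here are assumptions, not anything agreed in the thread.

```toml
[tool.black]
line-length = 110
```

Pinning the Black version in the dev requirements alongside this, as Kenn suggests for formatters generally, would confine reformat churn to deliberate upgrade PRs.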
Re: [VOTE] Release 2.13.0, release candidate #1
Added: https://github.com/apache/beam/pull/8714 On Tue, May 28, 2019 at 4:03 PM Ankur Goenka wrote: > Open cherry pick PRs for spark runner > https://github.com/apache/beam/pull/8705 > https://github.com/apache/beam/pull/8706 > > On Tue, May 28, 2019 at 3:42 PM Valentyn Tymofieiev > wrote: > >> Yes, looking into that. >> >> On Tue, May 28, 2019 at 3:37 PM Ankur Goenka wrote: >> >>> Valentyn, Can you please send the cherry pick PR for >>> https://issues.apache.org/jira/browse/BEAM-7439 >>> >>> On Tue, May 28, 2019 at 3:04 PM Ankur Goenka wrote: >>> Sure, I will cherry pick those PRs. On Tue, May 28, 2019 at 2:19 PM Kyle Weaver wrote: > Hi Ankur, > > It's not a blocker, but I'd like to see > https://github.com/apache/beam/pull/8558 and > https://github.com/apache/beam/pull/8569 be included so TFX examples > can be run without errors on the 2.13.0 Spark runner ( > https://github.com/tensorflow/tfx/pull/84). > > Kyle Weaver | Software Engineer | github.com/ibzib | > kcwea...@google.com | +1650203 > > > On Tue, May 28, 2019 at 11:53 AM Ankur Goenka > wrote: > >> Thanks for the validation. >> >> I have marked fixed version of >> https://issues.apache.org/jira/browse/BEAM-7406 >> https://issues.apache.org/jira/browse/BEAM-6380 to be 2.13.0 and >> will cherry pick the associated commits to the jira. >> >> >> On Tue, May 28, 2019 at 11:19 AM Lukasz Cwik >> wrote: >> >>> I would also suggest to get https://github.com/apache/beam/pull/8668 >>> in to 2.13.0 since it fixes a logging setup issue on Dataflow >>> (BEAM-7406). >>> >>> On Tue, May 28, 2019 at 10:22 AM Chamikara Jayalath < >>> chamik...@google.com> wrote: >>> I would also like to get https://github.com/apache/beam/pull/8661 in to 2.13.0 that fixes https://issues.apache.org/jira/browse/BEAM-6380. It's not a new issue but has affected a number of users. - Cham On Tue, May 28, 2019 at 9:31 AM Valentyn Tymofieiev < valen...@google.com> wrote: > Thanks, Juta Staes, for reporting this issue. 
> > On Tue, May 28, 2019, 9:19 AM Valentyn Tymofieiev < > valen...@google.com> wrote: > >> -1. >> I would like us to fix >> https://issues.apache.org/jira/browse/BEAM-7439 for 2.13.0. It >> is a regression that happened in 2.12.0, but was not caught by >> existing >> tests. >> >> Thanks, >> Valentyn >> >> On Wed, May 22, 2019, 4:30 PM Ankur Goenka >> wrote: >> >>> Hi everyone, >>> >>> Please review and vote on the release candidate #1 for the >>> version 2.13.0, as follows: >>> >>> [ ] +1, Approve the release >>> [ ] -1, Do not approve the release (please provide specific >>> comments) >>> >>> The complete staging area is available for your review, which >>> includes: >>> * JIRA release notes [1], >>> * the official Apache source release to be deployed to >>> dist.apache.org [2], which is signed with the key with >>> fingerprint 6356C1A9F089B0FA3DE8753688934A6699985948 [3], >>> * all artifacts to be deployed to the Maven Central Repository >>> [4], >>> * source code tag "v2.13.0-RC1" [5], >>> * website pull request listing the release [6] and publishing >>> the API reference manual [7]. >>> * Python artifacts are deployed along with the source release to >>> the dist.apache.org [2]. >>> * Validation sheet with a tab for 2.13.0 release to help with >>> validation [8]. >>> >>> The vote will be open for at least 72 hours. It is adopted by >>> majority approval, with at least 3 PMC affirmative votes. 
>>> >>> Thanks, >>> Ankur >>> >>> [1] >>> https://jira.apache.org/jira/secure/ReleaseNote.jspa?projectId=12319527=12345166 >>> [2] https://dist.apache.org/repos/dist/dev/beam/2.13.0/ >>> [3] https://dist.apache.org/repos/dist/release/beam/KEYS >>> [4] >>> https://repository.apache.org/content/repositories/orgapachebeam-1069/ >>> [5] https://github.com/apache/beam/tree/v2.13.0-RC1 >>> [6] https://github.com/apache/beam/pull/8645 >>> [7] https://github.com/apache/beam-site/pull/589 >>> [8] >>> https://docs.google.com/spreadsheets/d/1qk-N5vjXvbcEk68GjbkSZTR8AGqyNUM-oLFo_ZXBpJw/edit#gid=1031196952 >>> >>
Re: Hazelcast Jet Runner
Hi, Over 800 usages under Java; it might be worth splitting this into a few PRs... Also suggest we use a very light review process: first round, go for the low-hanging fruit; if anyone does a -1 against a change then we leave that for round two. Thoughts? Cheers Reza On Wed, 29 May 2019 at 12:05, Kenneth Knowles wrote: > > > On Mon, May 27, 2019 at 4:05 PM Reza Rokni wrote: > >> "Many APIs that have been in place for years and are used by most Beam >> users are still marked Experimental." >> >> Should there be a formal process in place to start 'graduating' features >> out of @Experimental? Perhaps even target an upcoming release with a PR to >> remove the annotation from well-established APIs? >> > > Good idea. I think a PR like this would be an opportunity to discuss > whether the feature is non-experimental. Probably many of them are ready. > It would help to address Ismael's very good point that this new practice > could make users think the old Experimental stuff is not experimental. > Maybe it is true that it is not really still Experimental. > > Kenn > > > >> On Tue, 28 May 2019 at 06:44, Reuven Lax wrote: >> >>> We generally use Experimental for two different things, which leads to >>> confusion. >>> 1. Features that work stably, but where we think we might still make >>> some changes to the API. >>> 2. New features that we think might not yet be stable. >>> >>> This dual usage leads to a lot of confusion IMO. The fact that we tend >>> to forget to remove the @Experimental tag also makes it somewhat useless. >>> Many APIs that have been in place for years and are used by most Beam users >>> are still marked Experimental. >>> >>> Reuven >>> >>> On Mon, May 27, 2019 at 2:16 PM Ismaël Mejía wrote: >>> > Personally, I think that it is good that moving from experimental to non-experimental is a breaking change in the dependency - one has backwards-incompatible changes and the other does not. If artifacts had separate versioning we could use 0.x for this.
In theory it seems so, but in practice it is an annoyance to an end user who already took the ‘risk’ of using an experimental feature. Awareness is probably not the most important reason to break existing code (even if it could be easily fixed). The alternative of doing this with version numbers at least seems less impactful, but can be confusing. > But the biggest motivations for me are these: > > - using experimental features should be opt-in > - should be impossible to use an experimental feature without knowing it (so "opt-in" to a normal-looking feature is not enough) > - developers of an experimental feature should be motivated to "graduate" it The fundamental problem of this approach is inconsistency with our present/past. So far we have ‘Experimental’ features everywhere, so suddenly becoming opt-in would leave us in an inconsistent state. For example, all IOs are marked internally as Experimental but not at the level of directories/artifacts. Adding this suffix to a new IO, apart from discouraging end users, may also give the false impression that the older ones not explicitly marked are not experimental. And what would the state be for runner modules that contain both mature, well-tested runners (the old Flink and Spark runners) and the more experimental new translations for Portability? Again, more confusion. > FWIW I don't think "experimental" should be viewed as a bad thing. It just means you are able to make backwards-incompatible changes, and that users should be aware that they will need to adjust APIs (probably only a little) with new releases. Most software is not very good until it has been around for a long time, and in my experience the problem is missing the mark on abstractions, so backwards compatibility *must* be broken to achieve quality. Freezing it early dooms it to never achieving high quality.
I know of projects where the users explicitly requested that the developers not freeze the API but instead prioritize speed and quality. I agree 100% on the arguments, but let’s think in reverse terms: highlighting lack of maturity can play against the intended goal of use and adoption, even if for a noble reason. It is basic priming 101 [1]. > Maybe the word is just too negative-sounding? Alternatives might be "unstable" or "incubating". Yes! “experimental” should not be viewed as a bad thing, unless you are a company with fewer resources that is trying to protect its investment; in that case you may hesitate to use it. In this case incubating is probably a better term because it has less of the ‘tentative’ dimension associated with Experimental. > Now, for the Jet runner, most runners sit on a branch for a while, not being
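Kenn's requirement that it be "impossible to use an experimental feature without knowing it" can be made concrete with a runtime warning. The decorator below is illustrative only: it is a generic stdlib sketch, not Beam's actual `@Experimental` annotation (Java) or its Python counterpart, and `experimental`/`new_feature` are made-up names.

```python
import functools
import warnings

def experimental(func):
    # Illustrative sketch: surface "experimental" at the call site with a
    # FutureWarning, so opting in cannot happen silently.
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            "%s is experimental and may change incompatibly" % func.__name__,
            FutureWarning, stacklevel=2)
        return func(*args, **kwargs)
    return wrapper

@experimental
def new_feature(x):
    return x * 2

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    assert new_feature(3) == 6
assert any(issubclass(w.category, FutureWarning) for w in caught)
```

A per-call warning is deliberately blunt; a real implementation would typically warn once per API, but the point stands: a name suffix can be overlooked, a warning cannot.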
Re: Definition of Unified model
On Tue, May 28, 2019 at 2:34 PM Reuven Lax wrote: > > Sequence metadata does have the disadvantage that users can no longer use the > types coming from the source. You must create a new type that contains a > sequence number (unless Beam provides this). Yes. Well, the source would have a withSequenceNumber or withOffset or similar to provide this capability. > It also gets awkward with Flatten - the sequence number is no longer enough, you must also encode which side of the flatten each element came from. Only if you want to order things across different PCollections. But, yes, you'd probably want to do a tagged flatten. I still think making it explicit, and providing the ability to (cheaply) order on it, is better than trying to come up with an implicit model of where things are and aren't ordered. (As an aside, we often have various pieces of metadata that are attached to elements: timestamps, windows, pane infos, possible trigger metadata (for data-driven triggers), sequence numbers, ...). It may be interesting to have the ability to project an element into a "visible" portion and an invisible "attached" portion, where the latter would flow through just as timestamps do (where possible), and then later make it visible again. Maybe the work with schema field unpacking, etc. to only look at the fields one is interested in is sufficient to sideline the noise of extra fields, but it wouldn't allow the application of a (more) arbitrary PTransform to a PCollection.) > On Tue, May 28, 2019 at 3:18 AM Jan Lukavský wrote: >> >> As I understood it, Kenn was supporting the idea that sequence metadata >> is preferable over FIFO. I was trying to point out, that it even should >> provide the same functionally as FIFO, plus one important more - >> reproducibility and ability to being persisted and reused the same way >> in batch and streaming. >> >> There is no doubt, that sequence metadata can be stored in every >> storage. 
But, regarding some implicit ordering that sources might have - >> yes, of course, data written into HDFS or Cloud Storage has ordering, >> but only partial - inside some bulk (e.g. file) and the ordering is not >> defined correctly on boundaries of these bulks (between files). That is >> why I'd say, that ordering of sources is relevant only for >> (partitioned!) streaming sources and generally always reduces to >> sequence metadata (e.g. offsets). >> >> Jan >> >> On 5/28/19 11:43 AM, Robert Bradshaw wrote: >> > Huge +1 to all Kenn said. >> > >> > Jan, batch sources can have orderings too, just like Kafka. I think >> > it's reasonable (for both batch and streaming) that if a source has an >> > ordering that is an important part of the data, it should preserve >> > this ordering into the data itself (e.g. as sequence numbers, offsets, >> > etc.) >> > >> > On Fri, May 24, 2019 at 10:35 PM Kenneth Knowles wrote: >> >> I strongly prefer explicit sequence metadata over FIFO requirements, >> >> because: >> >> >> >> - FIFO is complex to specify: for example Dataflow has "per stage >> >> key-to-key" FIFO today, but it is not guaranteed to remain so (plus >> >> "stage" is not a portable concept, nor even guaranteed to remain a >> >> Dataflow concept) >> >> - complex specifications are by definition poor usability (if >> >> necessary, then it is what it is) >> >> - overly restricts the runner, reduces parallelism, for example any >> >> non-stateful ParDo has per-element parallelism, not per "key" >> >> - another perspective on that: FIFO makes everyone pay rather than just >> >> the transform that requires exactly sequencing >> >> - previous implementation details like reshuffles become part of the >> >> model >> >> - I'm not even convinced the use cases involved are addressed by some >> >> careful FIFO restrictions; many sinks re-key and they would all have to >> >> become aware of how keying of a sequence of "stages" affects the >> >> end-to-end FIFO >> >> >> >> A noop 
becoming a non-noop is essentially the mathematical definition of >> >> moving from higher-level to lower-level abstraction. >> >> >> >> So this strikes at the core question of what level of abstraction Beam >> >> aims to represent. Lower-level means there are fewer possible >> >> implementations and it is more tied to the underlying architecture, and >> >> anything not near-exact match pays a huge penalty. Higher-level means >> >> there are more implementations possible with different tradeoffs, though >> >> they may all pay a minor penalty. >> >> >> >> I could be convinced to change my mind, but it needs some extensive >> >> design, examples, etc. I think it is probably about the most >> >> consequential design decision in the whole Beam model, around the same >> >> level as the decision to use ParDo and GBK as the primitives IMO. >> >> >> >> Kenn >> >> >> >> On Thu, May 23, 2019 at 10:17 AM Reuven Lax wrote: >> >>> Not really. I'm suggesting that some variant of FIFO ordering is >> >>>
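The "tagged flatten" with explicit sequence metadata discussed in this thread can be sketched in plain Python (a toy illustration with made-up tags and values, not the Beam API): each source attaches `(tag, sequence)` metadata, a flatten may interleave elements arbitrarily, and a consumer that needs ordering reorders explicitly instead of relying on implicit FIFO.

```python
# Toy sketch (plain Python, not Beam): explicit sequence metadata
# survives a flatten, where an implicit FIFO guarantee would not.

def with_sequence(tag, elements):
    """Attach (tag, seq) metadata to each element of one source."""
    return [((tag, seq), value) for seq, value in enumerate(elements)]

# Two "sources", each internally ordered.
left = with_sequence("kafka-0", ["a", "b", "c"])
right = with_sequence("kafka-1", ["x", "y"])

# A flatten may interleave elements arbitrarily.
flattened = [right[1], left[2], left[0], right[0], left[1]]

# A consumer that cares about order reorders explicitly, cheaply,
# using the attached (tag, seq) metadata.
reordered = sorted(flattened, key=lambda kv: kv[0])
print([value for _, value in reordered])
```

This is the sense in which ordering becomes an explicit, per-transform cost rather than a restriction that every runner and every stage must pay.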
Re: Definition of Unified model
On Tue, May 28, 2019 at 12:18 PM Jan Lukavský wrote: > > As I understood it, Kenn was supporting the idea that sequence metadata > is preferable over FIFO. I was trying to point out, that it even should > provide the same functionally as FIFO, plus one important more - > reproducibility and ability to being persisted and reused the same way > in batch and streaming. > > There is no doubt, that sequence metadata can be stored in every > storage. But, regarding some implicit ordering that sources might have - > yes, of course, data written into HDFS or Cloud Storage has ordering, > but only partial - inside some bulk (e.g. file) and the ordering is not > defined correctly on boundaries of these bulks (between files). That is > why I'd say, that ordering of sources is relevant only for > (partitioned!) streaming sources and generally always reduces to > sequence metadata (e.g. offsets). Offsets within a file, unordered between files seems exactly analogous with offsets within a partition, unordered between partitions, right? > On 5/28/19 11:43 AM, Robert Bradshaw wrote: > > Huge +1 to all Kenn said. > > > > Jan, batch sources can have orderings too, just like Kafka. I think > > it's reasonable (for both batch and streaming) that if a source has an > > ordering that is an important part of the data, it should preserve > > this ordering into the data itself (e.g. as sequence numbers, offsets, > > etc.) 
> > > > On Fri, May 24, 2019 at 10:35 PM Kenneth Knowles wrote: > >> I strongly prefer explicit sequence metadata over FIFO requirements, > >> because: > >> > >> - FIFO is complex to specify: for example Dataflow has "per stage > >> key-to-key" FIFO today, but it is not guaranteed to remain so (plus > >> "stage" is not a portable concept, nor even guaranteed to remain a > >> Dataflow concept) > >> - complex specifications are by definition poor usability (if necessary, > >> then it is what it is) > >> - overly restricts the runner, reduces parallelism, for example any > >> non-stateful ParDo has per-element parallelism, not per "key" > >> - another perspective on that: FIFO makes everyone pay rather than just > >> the transform that requires exactly sequencing > >> - previous implementation details like reshuffles become part of the > >> model > >> - I'm not even convinced the use cases involved are addressed by some > >> careful FIFO restrictions; many sinks re-key and they would all have to > >> become aware of how keying of a sequence of "stages" affects the > >> end-to-end FIFO > >> > >> A noop becoming a non-noop is essentially the mathematical definition of > >> moving from higher-level to lower-level abstraction. > >> > >> So this strikes at the core question of what level of abstraction Beam > >> aims to represent. Lower-level means there are fewer possible > >> implementations and it is more tied to the underlying architecture, and > >> anything not near-exact match pays a huge penalty. Higher-level means > >> there are more implementations possible with different tradeoffs, though > >> they may all pay a minor penalty. > >> > >> I could be convinced to change my mind, but it needs some extensive > >> design, examples, etc. I think it is probably about the most consequential > >> design decision in the whole Beam model, around the same level as the > >> decision to use ParDo and GBK as the primitives IMO. > >> > >> Kenn
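Robert's analogy above (offsets within a file, unordered between files, versus offsets within a partition, unordered between partitions) can be made concrete with a small plain-Python sketch; the file names and records here are made up for illustration.

```python
# Toy sketch: per-file offsets are analogous to per-partition offsets.
# Ordering is well defined inside each "bulk" (file or partition) and
# reduces to sequence metadata of the form (bulk_id, offset).

files = {
    "part-00000": ["r0", "r1", "r2"],  # ordered within the file
    "part-00001": ["s0", "s1"],        # ordered within the file
}

# Flatten into (bulk_id, offset, record) triples; the order across
# bulks is deliberately unspecified, just as cross-partition order is
# unspecified in Kafka.
records = [(f, off, rec)
           for f, recs in files.items()
           for off, rec in enumerate(recs)]

# Within one bulk, the offset recovers the original order.
part0 = sorted((off, rec) for f, off, rec in records if f == "part-00000")
print([rec for _, rec in part0])
```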
Re: Shuffling on apache beam
Hi Reza, with GlobalWindow and triggering I was able to reduce the hotspot issues and gain satisfying performance for the BigTable update. Unfortunately the latency when getting messages from PubSub remains around 1.5 s, which is too much considering our NFR. This is the code I use to get the messages:

PCollectionTuple rawTransactions = p //
    .apply("GetMessages",
        PubsubIO.readMessagesWithAttributes()
            .withIdAttribute(TRANSACTION_MESSAGE_ID_FIELD_NAME)
            .withTimestampAttribute(TRANSACTION_MESSAGE_TIMESTAMP_FIELD_NAME)
            .fromTopic(topic))
    .apply(Window.configure()
        .triggering(Repeatedly
            .forever(AfterWatermark.pastEndOfWindow()
                .withEarlyFirings(
                    AfterProcessingTime
                        .pastFirstElementInPane()
                        .plusDelayOf(Duration.millis(1)))
                // Fire on any late data
                .withLateFirings(AfterPane.elementCountAtLeast(1))))
        .discardingFiredPanes());

Messages are produced with a different dataflow:

Pipeline p = Pipeline.create(options);
p.apply(
    "ReadFile",
    TextIO.read()
        .from(options.getInputLocation() + "/*.csv")
        .watchForNewFiles(
            // Check for new files every 600 ms
            Duration.millis(600),
            // Never stop checking for new files
            Watch.Growth.never()))
 .apply(
    "create message",
    ParDo.of(
        new DoFn<String, PubsubMessage>() {
          @ProcessElement
          public void processElement(ProcessContext context) {
            String line = context.element();
            String payload = convertRow(line);
            long now = LocalDateTime.now().atZone(ZoneId.systemDefault()).toInstant().toEpochMilli();
            context.output(
                new PubsubMessage(
                    payload.getBytes(),
                    ImmutableMap.of(
                        TRANSACTION_MESSAGE_ID_FIELD_NAME, payload.split(",")[6],
                        TRANSACTION_MESSAGE_TIMESTAMP_FIELD_NAME, Long.toString(now))));
          }
        }))
 .apply("publish message", PubsubIO.writeMessages().to(topic));

I'm uploading a file containing 100 rows every 600 ms. I found several threads on Stack Overflow about this latency issue, but none offers a solution. On 2019/05/24 07:19:02, Reza Rokni wrote: > PS You can also make use of the GlobalWindow with a stateful DoFn. 
> > On Fri, 24 May 2019 at 15:13, Reza Rokni wrote: > > > Hi, > > > > Have you explored the use of triggers with your use case? > > > > > > https://beam.apache.org/releases/javadoc/2.12.0/org/apache/beam/sdk/transforms/windowing/Trigger.html > > > > Cheers > > > > Reza > > > > On Fri, 24 May 2019 at 14:14, pasquale.bon...@gmail.com < > > pasquale.bon...@gmail.com> wrote: > > > >> Hi Reuven, > >> I would like to know if is possible to guarantee that record are > >> processed by the same thread/task based on a key, as probably happens in a > >> combine/stateful operation, without adding the delay of a windows. > >> This could increase efficiency of caching and reduce same racing > >> condition when writing data. > >> I understand that workers are not part of programming model so I would > >> like to know if it's possible to achieve this behaviour reducing at minimum > >> the delay of windowing. We don't need any combine or state we just want the > >> all record with a given key are sent to same thread, > >> > >> Thanks > >> > >> > >> On 2019/05/24 03:20:13, Reuven Lax wrote: > >> > Can you explain what you mean by worker? While every runner has workers > >> of > >> > course, workers are not part of the programming model. > >> > > >> > On Thu, May 23, 2019 at 8:13 PM pasquale.bon...@gmail.com < > >> > pasquale.bon...@gmail.com> wrote: > >> > > >> > > Hi all, > >> > > I would like to know if Apache Beam has a functionality similar to > >> > > fieldsGrouping in Storm that allows to send records to a specific > >> > > task/worker based on a key. > >> > > I know that we can achieve that with a combine/grouByKey operation but > >> > > that implies to add a windowing in our pipeline that we don't want. > >> > > I have also tried using a stateful transformation. > >> > > I think that also in that case we should use a windowing, but I see > >> that a > >> > > job with a stateful ParDo operation can be submitted on Google > >> Dataflow > >>
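The Storm-style fieldsGrouping behaviour pasquale asks about amounts to deterministic key-based routing. A plain-Python toy sketch of that idea follows (not Beam code -- as Reuven notes above, workers are not part of the Beam model; the task count and key function here are made up):

```python
# Toy sketch of Storm-style fieldsGrouping: route each record to a
# fixed "task" by hashing its key, so every record with a given key
# lands on the same task. Illustration only.
import zlib

NUM_TASKS = 4

def task_for_key(key):
    # zlib.crc32 is stable across processes, unlike Python's hash().
    return zlib.crc32(key.encode("utf-8")) % NUM_TASKS

records = [("user1", 10), ("user2", 20), ("user1", 30)]
routed = {}
for key, value in records:
    routed.setdefault(task_for_key(key), []).append((key, value))

# Every occurrence of "user1" ends up in the same bucket.
buckets_for_user1 = {t for t, recs in routed.items()
                     for k, _ in recs if k == "user1"}
assert len(buckets_for_user1) == 1
```

In Beam, the closest model-level equivalents are a GroupByKey or a stateful ParDo, both of which give per-key processing without exposing the worker topology.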
Re: Timer support in Flink
Thanx Max! Reza On Wed, 29 May 2019, 16:38 Maximilian Michels, wrote: > Hi Reza, > > The detailed view of the capability matrix states: "The Flink Runner > supports timers in non-merging windows." > > That is still the case. Other than that, timers should be working fine. > > > It makes very heavy use of Event.Time timers and has to do some manual > DoFn cache work to get around some O(heavy) issues. > > If you are running on Flink 1.5, timer deletion suffers from O(n) > complexity which has been fixed in newer versions. > > Cheers, > Max > > On 29.05.19 03:27, Reza Rokni wrote: > > Hi Flink experts, > > > > I am getting ready to push a PR around a utility class for > timeseries join > > > > left.timestamp match to closest right.timestamp where right.timestamp <= > > left.timestamp. > > > > It makes very heavy use of Event.Time timers and has to do some manual > > DoFn cache work to get around some O(heavy) issues. Wanted to test > > things against Flink: In the capability matrix we have "~" for Timer > > support in Flink: > > > > https://beam.apache.org/documentation/runners/capability-matrix/ > > > > Is that page outdated, if not what are the areas that still need to be > > addressed please? > > > > Cheers > > > > Reza > > > > > > -- > > > > This email may be confidential and privileged. If you received this > > communication by mistake, please don't forward it to anyone else, please > > erase all copies and attachments, and please let me know that it has > > gone to the wrong person. > > > > The above terms reflect a potential business arrangement, are provided > > solely as a basis for further discussion, and are not intended to be and > > do not constitute a legally binding obligation. No legally binding > > obligations will be created, implied, or inferred until an agreement in > > final form is executed in writing by all parties involved. > > >
Re: [DISCUSS] Autoformat python code with Black
Reformatting to 4 spaces seems a non-starter to me, as it would change nearly every single line in the codebase (and lose all history context for each such line). This is probably why the 2-space fork exists. However, we don't conform to that either--we use 2 spaces for indentation, but 4 for continuation indentation. (As for the history of this, this goes back to Google's internal style guide, probably motivated by consistency with C++, Java, ... and the fact that with an indent level of 4 one ends up wrapping lines quite frequently (it's telling that black's default line length is 88)). This turns out to be an easy change to the codebase. Once we move beyond the 2 vs. 4 whitespace thing, I found that this tool introduces a huge amount of vertical whitespace (e.g. closing parentheses on their own line), e.g.

def foo(
  args
):
  if (
    (long expression)
  ):
    func(
      args
    )

I wrote a simple post-processor to put closing parentheses on the same lines, as well as omit the newline after "if (", and disable formatting of strings, which reduces the churn in our codebase to 15k lines (adding about 4k) out of 200k total. https://github.com/apache/beam/pull/8712/files It's still very opinionated, often in different ways than me, and doesn't understand the semantics of the code, but possibly something we could live with given the huge advantages of an autoformatter. An intermediate point would be to allow, but not require, autoformatting of changed lines. As for being beta quality, it looks like it's got a decent number of contributors, and in my book being in the python github project is a strong positive signal. But, due to the above issues, I think we'd have to maintain a fork. (The code is pretty lightweight, the 2 vs. 4 space issue is a 2-line change, and the rest is implemented as a post-processing step (for now, incomplete), so it'd be easy to stay in sync with upstream.) 
On Wed, May 29, 2019 at 11:03 AM Ismaël Mejía wrote: > > > I think the question is if it can be configured in a way to fit our > > current linter's style. I don't think it is feasible to reformat the > > entire Python SDK. > > It cannot be configured to do what we actually do because Black is > configurable only to support the standard python codestyle guidelines > (PEP-8) which recommends 4 spaces and is what most projects in the > python world use. > > > Reformatted lines don't allow quick access to the Git history. This > > effect is still visible in the Java SDK. However, I have the feeling > > that this might be less of a problem with Python because the linter has > > more rules than Checkstyle had. > > Yes that’s the bad side effect but there are always tradeoffs we have > to deal with. > > > > > On Wed, May 29, 2019 at 10:52 AM Maximilian Michels wrote: > > > > I think the question is if it can be configured in a way to fit our > > current linter's style. I don't think it is feasible to reformat the > > entire Python SDK. > > > > Reformatted lines don't allow quick access to the Git history. This > > effect is still visible in the Java SDK. However, I have the feeling > > that this might be less of a problem with Python because the linter has > > more rules than Checkstyle had. > > > > -Max > > > > On 29.05.19 10:16, Ismaël Mejía wrote: > > >> My concerns are: > > >> - The product is clearly marked as beta with a big warning. > > >> - It looks like mostly a single person project. For the same reason I also strongly prefer not using a fork for a specific setting. Fork will only have less people looking at it. > > > > > > I suppose the project is marked as beta because it is recent, it was > > > presented in 2018’s pycon, and because some things can change since > > > auto-formatters are pretty tricky beasts, I think beta in that case is > > > like our own ‘@Experimental’. 
If you look at the contribution page [1] > > > you can notice that it is less and less a single person project, there > > > have been 93 independent contributions since the project became > > > public, and the fact that it is hosted in the python organization > > > github [2] gives some confidence on the project continuity. > > > > > > You are right however about the fact that the main author seems to be > > > the ‘benevolent’ dictator, and in the 2-spaces issue he can seem > > > arbitrary, but he is just following pep8 style guide recommendations > > > [3]. I am curious of why we (Beam) do not follow the 4 spaces > > > recommendation of PEP-8 or even Google's own Python style guide [4], > > > So, probably it should be to us to reconsider the current policy to > > > adapt to the standards (and the tool). > > > > > > I did a quick run of black with python 2.7 compatibility on > > > sdks/python and got only 4 parsing errors which is positive given the > > > size of our code base. > > > > > > 415 files reformatted, 45 files left unchanged, 4 files failed to reformat. > > > > > > error: cannot format > > >
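The post-processing idea Robert describes in this thread (joining Black's lone closing parentheses back onto the previous line) can be illustrated with a deliberately simplified sketch; the real post-processor lives in the linked PR and handles many more cases.

```python
# Toy sketch of the post-processing idea: merge a line that holds only
# a closing bracket (plus an optional trailing `:` or `,`) back onto
# the previous line. Illustration only -- the actual post-processor in
# apache/beam#8712 is more careful than this one regex.
import re

def join_closing_brackets(src):
    return re.sub(r"\n\s*([\)\]\}][:,]?)", r"\1", src)

before = "func(\n    args\n)\n"
print(repr(join_closing_brackets(before)))  # 'func(\n    args)\n'
```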
Re: Question about building Go SDK
It was my fault, probably my gradle cache was broken. Running gradlew succeeded after removing ~/.gradle. I'm sorry for bothering you. Kengo Seki

On Wed, May 29, 2019 at 6:45 PM Kengo Seki wrote:
>
> Hi Beam developers,
>
> I tried to build Go SDK on the master branch, but encountered the
> following error.
>
> ```
> $ ./gradlew :sdks:go:resolveBuildDependencies
>
> (snip)
>
> FAILURE: Build failed with an exception.
>
> * What went wrong:
> Execution failed for task ':sdks:go:resolveBuildDependencies'.
> > Exception in resolution, message is:
> Cannot resolve dependency:github.com/coreos/etcd:
> commit='11214aa33bf5a47d3d9d8dafe0f6b97237dfe921',
> urls=[https://github.com/coreos/etcd.git, g...@github.com:coreos/etcd.git]
> Resolution stack is:
> +- github.com/apache/beam/sdks/go
> ```
>
> Replacing the "coreos" organization with "etcd-io" in the GitHub URLs in
> gogradle.lock seems to work.
>
> ```
> $ sed -i s,coreos/etcd,etcd-io/etcd,g sdks/go/gogradle.lock
> $ ./gradlew :sdks:go:resolveBuildDependencies
>
> (snip)
>
> BUILD SUCCESSFUL in 4s
> 2 actionable tasks: 2 executed
> ```
>
> But I'm not sure whether it's a valid solution, since I'm not familiar
> with Gogradle and it seems that gogradle.lock shouldn't be edited
> directly according to its header comment.
> Is the above approach valid, or is there a more proper way to solve
> this problem...?
>
> Kengo Seki
Question about building Go SDK
Hi Beam developers,

I tried to build Go SDK on the master branch, but encountered the following error.

```
$ ./gradlew :sdks:go:resolveBuildDependencies

(snip)

FAILURE: Build failed with an exception.

* What went wrong:
Execution failed for task ':sdks:go:resolveBuildDependencies'.
> Exception in resolution, message is:
Cannot resolve dependency:github.com/coreos/etcd:
commit='11214aa33bf5a47d3d9d8dafe0f6b97237dfe921',
urls=[https://github.com/coreos/etcd.git, g...@github.com:coreos/etcd.git]
Resolution stack is:
+- github.com/apache/beam/sdks/go
```

Replacing the "coreos" organization with "etcd-io" in the GitHub URLs in gogradle.lock seems to work.

```
$ sed -i s,coreos/etcd,etcd-io/etcd,g sdks/go/gogradle.lock
$ ./gradlew :sdks:go:resolveBuildDependencies

(snip)

BUILD SUCCESSFUL in 4s
2 actionable tasks: 2 executed
```

But I'm not sure whether it's a valid solution, since I'm not familiar with Gogradle and it seems that gogradle.lock shouldn't be edited directly according to its header comment. Is the above approach valid, or is there a more proper way to solve this problem...?

Kengo Seki
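For anyone without sed at hand, the one-liner above boils down to a trivial string replacement (sketch only; as noted in the thread, the proper fix should probably go through Gogradle rather than editing gogradle.lock directly):

```python
# Python equivalent of the `sed` one-liner above: rewrite the moved
# etcd GitHub organization in a lock file's contents.
def rewrite_etcd_urls(text):
    return text.replace("coreos/etcd", "etcd-io/etcd")

lock = "urls=[https://github.com/coreos/etcd.git]"
print(rewrite_etcd_urls(lock))  # urls=[https://github.com/etcd-io/etcd.git]
```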
Re: [DISCUSS] Autoformat python code with Black
> I think the question is if it can be configured in a way to fit our > current linter's style. I don't think it is feasible to reformat the > entire Python SDK. It cannot be configured to do what we actually do because Black is configurable only to support the standard python codestyle guidelines (PEP-8) which recommends 4 spaces and is what most projects in the python world use. > Reformatted lines don't allow quick access to the Git history. This > effect is still visible in the Java SDK. However, I have the feeling > that this might be less of a problem with Python because the linter has > more rules than Checkstyle had. Yes that’s the bad side effect but there are always tradeoffs we have to deal with. On Wed, May 29, 2019 at 10:52 AM Maximilian Michels wrote: > > I think the question is if it can be configured in a way to fit our > current linter's style. I don't think it is feasible to reformat the > entire Python SDK. > > Reformatted lines don't allow quick access to the Git history. This > effect is still visible in the Java SDK. However, I have the feeling > that this might be less of a problem with Python because the linter has > more rules than Checkstyle had. > > -Max > > On 29.05.19 10:16, Ismaël Mejía wrote: > >> My concerns are: > >> - The product is clearly marked as beta with a big warning. > >> - It looks like mostly a single person project. For the same reason I also > >> strongly prefer not using a fork for a specific setting. Fork will only > >> have less people looking at it. > > > > I suppose the project is marked as beta because it is recent, it was > > presented in 2018’s pycon, and because some things can change since > > auto-formatters are pretty tricky beasts, I think beta in that case is > > like our own ‘@Experimental’. 
If you look at the contribution page [1] > > you can notice that it is less and less a single person project, there > > have been 93 independent contributions since the project became > > public, and the fact that it is hosted in the python organization > > github [2] gives some confidence on the project continuity. > > > > You are right however about the fact that the main author seems to be > > the ‘benevolent’ dictator, and in the 2-spaces issue he can seem > > arbitrary, but he is just following pep8 style guide recommendations > > [3]. I am curious of why we (Beam) do not follow the 4 spaces > > recommendation of PEP-8 or even Google's own Python style guide [4], > > So, probably it should be to us to reconsider the current policy to > > adapt to the standards (and the tool). > > > > I did a quick run of black with python 2.7 compatibility on > > sdks/python and got only 4 parsing errors which is positive given the > > size of our code base. > > > > 415 files reformatted, 45 files left unchanged, 4 files failed to reformat. > > > > error: cannot format > > /home/ismael/upstream/beam/sdks/python/apache_beam/runners/interactive/display/display_manager.py: > > Cannot parse: 47:22: _display_progress = print > > error: cannot format > > /home/ismael/upstream/beam/sdks/python/apache_beam/runners/worker/log_handler.py: > > Cannot parse: 151:18: file=sys.stderr) > > error: cannot format > > /home/ismael/upstream/beam/sdks/python/apache_beam/runners/worker/sdk_worker.py: > > Cannot parse: 160:34: print(traceback_string, file=sys.stderr) > > error: cannot format > > /home/ismael/upstream/beam/sdks/python/apache_beam/typehints/trivial_inference.py: > > Cannot parse: 335:51: print('-->' if pc == last_pc else '', > > end=' ') > > > > I still think this can be positive for the project but well I am > > barely a contributor to the python code base so I let you the python > > maintainers to reconsider this, in any case it seems like a good > > improvement for the project. 
> > > > [1] https://github.com/python/black/graphs/contributors > > [2] https://github.com/python > > [3] https://www.python.org/dev/peps/pep-0008/#indentation > > [4] > > https://github.com/google/styleguide/blob/gh-pages/pyguide.md#34-indentation > > > > On Tue, May 28, 2019 at 11:15 PM Ahmet Altay wrote: > >> > >> I am in the same boat with Robert, I am in favor of autoformatters but I > >> am not familiar with this one. My concerns are: > >> - The product is clearly marked as beta with a big warning. > >> - It looks like mostly a single person project. For the same reason I also > >> strongly prefer not using a fork for a specific setting. Fork will only > >> have less people looking at it. > >> > >> IMO, this is in an early stage for us. That said lint issues are real as > >> pointed in the thread. If someone would like to give it a try and see how > >> it would look like for us that would be interesting. > >> > >> On Tue, May 28, 2019 at 4:44 AM Katarzyna Kucharczyk > >> wrote: > >>> > >>> This sounds really good. A lot of Jenkins jobs failures are caused by > >>> lint problems. > >>> I think it would be great to have something similar to
Re: [DISCUSS] Autoformat python code with Black
I think the question is if it can be configured in a way to fit our current linter's style. I don't think it is feasible to reformat the entire Python SDK. Reformatted lines don't allow quick access to the Git history. This effect is still visible in the Java SDK. However, I have the feeling that this might be less of a problem with Python because the linter has more rules than Checkstyle had. -Max On 29.05.19 10:16, Ismaël Mejía wrote: My concerns are: - The product is clearly marked as beta with a big warning. - It looks like mostly a single person project. For the same reason I also strongly prefer not using a fork for a specific setting. Fork will only have less people looking at it. I suppose the project is marked as beta because it is recent, it was presented in 2018’s pycon, and because some things can change since auto-formatters are pretty tricky beasts, I think beta in that case is like our own ‘@Experimental’. If you look at the contribution page [1] you can notice that it is less and less a single person project, there have been 93 independent contributions since the project became public, and the fact that it is hosted in the python organization github [2] gives some confidence on the project continuity. You are right however about the fact that the main author seems to be the ‘benevolent’ dictator, and in the 2-spaces issue he can seem arbitrary, but he is just following pep8 style guide recommendations [3]. I am curious of why we (Beam) do not follow the 4 spaces recommendation of PEP-8 or even Google's own Python style guide [4], So, probably it should be to us to reconsider the current policy to adapt to the standards (and the tool). I did a quick run of black with python 2.7 compatibility on sdks/python and got only 4 parsing errors which is positive given the size of our code base. 415 files reformatted, 45 files left unchanged, 4 files failed to reformat. 
error: cannot format /home/ismael/upstream/beam/sdks/python/apache_beam/runners/interactive/display/display_manager.py: Cannot parse: 47:22: _display_progress = print error: cannot format /home/ismael/upstream/beam/sdks/python/apache_beam/runners/worker/log_handler.py: Cannot parse: 151:18: file=sys.stderr) error: cannot format /home/ismael/upstream/beam/sdks/python/apache_beam/runners/worker/sdk_worker.py: Cannot parse: 160:34: print(traceback_string, file=sys.stderr) error: cannot format /home/ismael/upstream/beam/sdks/python/apache_beam/typehints/trivial_inference.py: Cannot parse: 335:51: print('-->' if pc == last_pc else '', end=' ') I still think this can be positive for the project but well I am barely a contributor to the python code base so I let you the python maintainers to reconsider this, in any case it seems like a good improvement for the project. [1] https://github.com/python/black/graphs/contributors [2] https://github.com/python [3] https://www.python.org/dev/peps/pep-0008/#indentation [4] https://github.com/google/styleguide/blob/gh-pages/pyguide.md#34-indentation On Tue, May 28, 2019 at 11:15 PM Ahmet Altay wrote: I am in the same boat with Robert, I am in favor of autoformatters but I am not familiar with this one. My concerns are: - The product is clearly marked as beta with a big warning. - It looks like mostly a single person project. For the same reason I also strongly prefer not using a fork for a specific setting. Fork will only have less people looking at it. IMO, this is in an early stage for us. That said lint issues are real as pointed in the thread. If someone would like to give it a try and see how it would look like for us that would be interesting. On Tue, May 28, 2019 at 4:44 AM Katarzyna Kucharczyk wrote: This sounds really good. A lot of Jenkins jobs failures are caused by lint problems. I think it would be great to have something similar to Spotless in Java SDK (I heard there is problem with configuring Black with IntelliJ). 
On Mon, May 27, 2019 at 10:52 PM Robert Bradshaw wrote: I'm generally in favor of autoformatters, though I haven't looked at how well this particular one works. We might have to go with https://github.com/desbma/black-2spaces given https://github.com/python/black/issues/378 . On Mon, May 27, 2019 at 10:43 PM Pablo Estrada wrote: This looks pretty good:) I know at least a couple people (myself included) who've been annoyed by having to take care of lint issues that maybe a code formatter could save us. Thanks for sharing Ismael. -P. On Mon, May 27, 2019, 12:24 PM Ismaël Mejía wrote: I stumbled by chance into Black [1] a python code auto formatter that is becoming the 'de-facto' auto-formatter for python, and wanted to bring to the ML Is there interest from the python people to get this into the build? The introduction of spotless for Java has been a good improvement and maybe the python code base may benefit of this too. WDYT? [1] https://github.com/python/black
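Several messages in this thread weigh reformatting churn (15k of 200k Python lines; 50k lines in the Java global reformat). A small sketch of how such churn can be estimated with the standard library's difflib follows (the input strings here are made-up toy examples):

```python
# Sketch: estimate reformatting "churn" -- how many lines a formatter
# would touch -- by diffing the original and reformatted text, as the
# thread does when weighing a global reformat.
import difflib

original = "def foo(x):\n    return (\n        x + 1\n    )\n"
reformatted = "def foo(x):\n    return x + 1\n"

diff = difflib.unified_diff(
    original.splitlines(), reformatted.splitlines(), lineterm="")
changed = sum(1 for line in diff
              if line.startswith(("+", "-"))
              and not line.startswith(("+++", "---")))
print(changed, "lines of churn")
```

Run over every file in the SDK, the same count gives the "15k out of 200k" style figure quoted above.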
Re: Timer support in Flink
Hi Reza, The detailed view of the capability matrix states: "The Flink Runner supports timers in non-merging windows." That is still the case. Other than that, timers should be working fine. It makes very heavy use of Event.Time timers and has to do some manual DoFn cache work to get around some O(heavy) issues. If you are running on Flink 1.5, timer deletion suffers from O(n) complexity which has been fixed in newer versions. Cheers, Max On 29.05.19 03:27, Reza Rokni wrote: Hi Flink experts, I am getting ready to push a PR around a utility class for timeseries join left.timestamp match to closest right.timestamp where right.timestamp <= left.timestamp. It makes very heavy use of Event.Time timers and has to do some manual DoFn cache work to get around some O(heavy) issues. Wanted to test things against Flink: In the capability matrix we have "~" for Timer support in Flink: https://beam.apache.org/documentation/runners/capability-matrix/ Is that page outdated, if not what are the areas that still need to be addressed please? Cheers Reza -- This email may be confidential and privileged. If you received this communication by mistake, please don't forward it to anyone else, please erase all copies and attachments, and please let me know that it has gone to the wrong person. The above terms reflect a potential business arrangement, are provided solely as a basis for further discussion, and are not intended to be and do not constitute a legally binding obligation. No legally binding obligations will be created, implied, or inferred until an agreement in final form is executed in writing by all parties involved.
Re: [DISCUSS] Autoformat python code with Black
> My concerns are:
> - The product is clearly marked as beta with a big warning.
> - It looks like mostly a single-person project. For the same reason I also
> strongly prefer not using a fork for a specific setting. A fork will only
> have fewer people looking at it.

I suppose the project is marked as beta because it is recent (it was presented at the 2018 PyCon) and because some things can still change, since auto-formatters are pretty tricky beasts; I think beta in this case is like our own '@Experimental'. If you look at the contribution page [1] you can see that it is less and less a single-person project: there have been 93 independent contributions since the project became public, and the fact that it is hosted in the python organization on GitHub [2] gives some confidence in the project's continuity. You are right, however, that the main author seems to be the 'benevolent dictator', and on the 2-spaces issue he can seem arbitrary, but he is just following the PEP 8 style guide recommendations [3].

I am curious why we (Beam) do not follow the 4-space recommendation of PEP 8, or even Google's own Python style guide [4]. So it should probably be up to us to reconsider the current policy and adapt to the standards (and the tool).

I did a quick run of Black with Python 2.7 compatibility on sdks/python and got only 4 parsing errors, which is positive given the size of our code base:

415 files reformatted, 45 files left unchanged, 4 files failed to reformat.
error: cannot format /home/ismael/upstream/beam/sdks/python/apache_beam/runners/interactive/display/display_manager.py: Cannot parse: 47:22: _display_progress = print
error: cannot format /home/ismael/upstream/beam/sdks/python/apache_beam/runners/worker/log_handler.py: Cannot parse: 151:18: file=sys.stderr)
error: cannot format /home/ismael/upstream/beam/sdks/python/apache_beam/runners/worker/sdk_worker.py: Cannot parse: 160:34: print(traceback_string, file=sys.stderr)
error: cannot format /home/ismael/upstream/beam/sdks/python/apache_beam/typehints/trivial_inference.py: Cannot parse: 335:51: print('-->' if pc == last_pc else '', end=' ')

I still think this can be positive for the project, but I am barely a contributor to the Python code base, so I leave it to the Python maintainers to reconsider; in any case it seems like a good improvement for the project.

[1] https://github.com/python/black/graphs/contributors
[2] https://github.com/python
[3] https://www.python.org/dev/peps/pep-0008/#indentation
[4] https://github.com/google/styleguide/blob/gh-pages/pyguide.md#34-indentation

On Tue, May 28, 2019 at 11:15 PM Ahmet Altay wrote:
>
> I am in the same boat with Robert: I am in favor of autoformatters, but I am
> not familiar with this one. My concerns are:
> - The product is clearly marked as beta with a big warning.
> - It looks like mostly a single-person project. For the same reason I also
> strongly prefer not using a fork for a specific setting. A fork will only
> have fewer people looking at it.
>
> IMO, this is at an early stage for us. That said, lint issues are real, as
> pointed out in the thread. If someone would like to give it a try and see
> how it would look for us, that would be interesting.
>
> On Tue, May 28, 2019 at 4:44 AM Katarzyna Kucharczyk wrote:
>>
>> This sounds really good. A lot of Jenkins job failures are caused by lint
>> problems.
>> I think it would be great to have something similar to Spotless in the
>> Java SDK (I heard there is a problem with configuring Black with IntelliJ).
>>
>> On Mon, May 27, 2019 at 10:52 PM Robert Bradshaw wrote:
>>>
>>> I'm generally in favor of autoformatters, though I haven't looked at
>>> how well this particular one works. We might have to go with
>>> https://github.com/desbma/black-2spaces given
>>> https://github.com/python/black/issues/378 .
>>>
>>> On Mon, May 27, 2019 at 10:43 PM Pablo Estrada wrote:
>>> >
>>> > This looks pretty good :) I know at least a couple of people (myself
>>> > included) who've been annoyed by having to take care of lint issues
>>> > that maybe a code formatter could save us.
>>> > Thanks for sharing, Ismael.
>>> > -P.
>>> >
>>> > On Mon, May 27, 2019, 12:24 PM Ismaël Mejía wrote:
>>> >>
>>> >> I stumbled by chance onto Black [1], a Python code auto-formatter
>>> >> that is becoming the 'de facto' auto-formatter for Python, and wanted
>>> >> to bring it to the ML. Is there interest from the Python people to
>>> >> get this into the build?
>>> >>
>>> >> The introduction of Spotless for Java has been a good improvement,
>>> >> and maybe the Python code base could benefit from this too.
>>> >>
>>> >> WDYT?
>>> >>
>>> >> [1] https://github.com/python/black
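A footnote on the four parse failures Ismaël reported earlier in the thread: they share a single root cause. `print(..., file=...)` is only valid under the Python 2 grammar when the module opts into the print function, so a formatter parsing in 2.7 mode can legitimately refuse those files. A small illustration (the snippet string is made up, modeled on the reported errors):

```python
import ast

# Keyword arguments to print() are valid Python 3, but valid Python 2 only
# when the module starts with `from __future__ import print_function`;
# without it, `print` is a statement and the keyword argument is a syntax
# error under the 2.x grammar.
snippet = (
    "from __future__ import print_function\n"
    "import sys\n"
    "print('traceback goes here', file=sys.stderr)\n"
)

# ast.parse uses the Python 3 grammar, so this parses cleanly; the point is
# that the __future__ import is what would make it parseable for 2.x too.
tree = ast.parse(snippet)
print(type(tree).__name__)  # -> Module
```

So the affected files would likely format once they carry the `__future__` import, rather than being cases Black fundamentally cannot handle.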