Adding my comments below. On Mon, Jul 15, 2019 at 2:52 PM Kenneth Knowles <[email protected]> wrote:
> Gradle comments inline > > On Mon, Jul 15, 2019 at 2:30 AM Frederik Bode <[email protected]> > wrote: > >> Hi Mark & others, >> >> +1 on using this structure. I don't see any other alternative to gradle >> as some of the Python tasks have Java tasks as >> a dependency. You can't debug that using just `python nosetests... or >> tox`. Parallelizing such tasks requires different >> projects (and I don't think gradle supports multiple projects per >> directory), so for each python >> version we need a different folder. Having seperate build.gradle files >> for each python version would also enable the >> different versions to diverge on which tests they execute (e.g. not >> running some tests for python 3.5 and 3.6 to >> reduce Jenkins footprint in a PreCommit). >> >> Valentyn, the code duplication problem can be addressed using a Gradle >> script plugin , which >> is a build.script (maybe another name is better?) in >> /sdks/python/test-suites/[runner], which you then import in >> /sdks/python/test-suites/[runner]/pyXX/build.gradle with the >> correct pythonVersion set, and then using `apply from`. See [1] for an >> example. The location of >> the python2 test-suite can be remedied by moving it to your >> suggestion. +1 on that as well. >> > > The `apply from` syntax is another way of authoring a gradle plugin, and a > bit more limited. BeamModulePlugin used to be a script `build_rules.groovy` > applied that way, and we moved it to the magical buildSrc/ directory so it > could have its own clear dependencies (its the buildSrc/build.gradle) and > be refactored into pieces over time. It is just as easy to make a plugin > there and I would recommend it. This is what we did with the vendored > artifacts. > +1 using Gradle plugin. New build.gradle also creates an unnecessary Gradle project which adds overheads in execution. Since we already have some plugin examples in buildSrc/, adding extra one for a particular purpose (or using existing plugins) would be preferred. > > On the coupling of projects or the structure not being natural, I think >> that we can look at >> this differently. Right now, in the Python SDK, all common code for tests >> that needs to be parallelized is >> in placed in BeamModulePlugin, which in turn couples all projects that >> use it. It's one centralized >> item that couples many different projects. It has code from Java, Python >> and Go, and >> currently has almost 2000 lines of code, which is IMHO not the way to go. >> > > Exactly. Each of BeamModulePlugin.applyXYZNature(...) are probably good to > separate as other plugins, for example you could start BeamPythonPlugin. > The reason these are separate methods instead of separate plugins is > because we started off in one big `build_rules.groovy` file; it is > historical and totally OK to fix up. > One extra bonus for separating BeamModulePlugin per language is that we can avoid irrelevant precommit tests being triggered when making changes in BeamModulePlugin. This came to me recently when I made Python specific changes. Some Java and Go precommit tests triggered because I touched BeamModulePlugin. Sometimes the test is flaky so I have to investigate/verify the failure is not related to my changes. > > Moving the python code >> from this binary plugin to a script plugin file defined in the the parent >> directory of children >> projects that use that code (as described in the paragraph above) moves >> the coupling of a lot of projects through >> BeamModulePlugin (a global coupling), to a per sub-tree coupling (a local >> coupling). >> > > Relative filesystem paths are also a weakness that caused pain plenty of > times. I highly recommend building a plugin with an id rather than `apply > from` relative paths such as parent directories. Or if there is a way to > `apply from` without a relative path, that is probably even more confusing. > buildSrc is the way to go. It is also much faster since it gains > incremental compilation of the plugin. > > Kenn > > >> In short, yes it might be unnatural, but it's still better than before. >> >> [1] https://github.com/apache/beam/pull/8877 >> <https://github.com/apache/beam/pull/8877/files> >> >> Thanks, >> Frederik Bode >> >> On Mon, Jun 3, 2019 at 5:13 PM Valentyn Tymofieiev <[email protected]> >> wrote: >> >>> Hey Mark & others, >>> >>> We've been following the structure proposed in this thread to extend >>> test coverage for Beam Python SDK on Python 3.5, 3.6, 3.7 interpreters, see >>> [1]. >>> >>> This structure allowed us to add 3.x suites without slowing down the >>> pre/postcommit execution time. We can actually see a drop in precommit >>> latency [2] around March 23 we first made some Python 3.x suites run in >>> parallel, and we have added more suites since then without slowing down >>> pre/postcommits. Therefore I am in favor of this proposal, especially since >>> AFAIK we don't have better one. Thanks a lot! >>> >>> I do have some feedback on this proposal: >>> >>> 1. There is a duplication of gradle code between test suites for >>> different python minor versions, for example see the identical definition >>> of DirectRunner PostCommitIT suite for Python 3.6 and Python 3.7 [4,5]. >>> >>> Possible solution to reduce the duplication is to move common code that >>> defines a task into a separate groovy file shared across multiple gradle >>> files. We have an example of this, where enablePythonPerformanceTest() is >>> defined in BeamModulePlugin.groovy, and used in several build.gradle files >>> to create a gradle task required for performance tests, see: [6]. I >>> followed the same example in a Python 3 test suite for Portable Flink >>> Runner I am working on [3], however I am not sure if BeamModulePlugin is >>> the best place to define common gradle tasks to needed for Python CI. >>> Perhaps we can make a separate groovy file for this purpose in >>> sdk/python/test-suites? >>> >> >> I would suggest placing the shared code in a new file in >> https://github.com/apache/beam/tree/master/buildSrc/src/main/groovy/org/apache/beam/gradle, >> we have several other groovy files related to building defined there >> already. >> >> >>> 2. Python 3 test suites currently live in sdks/python/test-suites, while >>> most Python 2 suites are still defined in sdks/python/build.gradle. >>> >>> This may cause confusion for folks working on adding new Python suites. >>> If there is an overall agreement on proposed structure I suggest to start >>> moving Python 2 CI tasks out of sdks/python/build.gradle into >>> sdks/python/test-suites/[runner]/py27/build.gradle, or a common groovy >>> file. If there are better alternatives we can continue discussing them here. >>> >> >> For the runner specific Java ITs, we have been trying to get the tests to >> be placed inside the runners own directory instead of underneath the SDK >> directories. This was to make it easy to associate test task -> runner. So >> an alternative would be to have runners/[runner]/test-suites/py27/... be >> that location. >> >> >>> Thanks, >>> Valenyn >>> >>> >>> [1] https://github.com/apache/beam/tree/master/sdks/python/test-suites >>> [2] >>> http://104.154.241.245/d/_TNndF2iz/pre-commit-test-latency?orgId=1&from=1546507894013&to=1554189164736 >>> [3] https://github.com/apache/beam/pull/8745 >>> [4] >>> https://github.com/apache/beam/blob/291f1e9fb5ce5ee4bb7e2519ffe40334fb5c08c5/sdks/python/test-suites/direct/py36/build.gradle#L27 >>> [5] >>> https://github.com/apache/beam/blob/291f1e9fb5ce5ee4bb7e2519ffe40334fb5c08c5/sdks/python/test-suites/direct/py37/build.gradle#L27 >>> [6] >>> https://github.com/apache/beam/search?q=enablePythonPerformanceTest&unscoped_q=enablePythonPerformanceTest >>> >>> >>> On Fri, Mar 29, 2019 at 9:45 AM Udi Meiri <[email protected]> wrote: >>> >>>> I don't use gradle commands for Python development either, because they >>>> are slow (no incremental testing). >>>> >>>> >>>> >>>> On Fri, Mar 29, 2019 at 9:16 AM Michael Luckey <[email protected]> >>>> wrote: >>>> >>>>> >>>>> >>>>> On Fri, Mar 29, 2019 at 2:31 PM Robert Bradshaw <[email protected]> >>>>> wrote: >>>>> >>>>>> On Fri, Mar 29, 2019 at 12:54 PM Michael Luckey <[email protected]> >>>>>> wrote: >>>>>> > >>>>>> > Really like the idea of improving here. >>>>>> > >>>>>> > Unfortunately, I haven't worked with python on that scale yet, so >>>>>> bear with my naive understandings in this regard. If I understand >>>>>> correctly, the suggestion will result in a couple of projects consisting >>>>>> only of a build,gradle file to kind of workaround on gradles decision not >>>>>> to parallelize within projects, right? In consequence, this also kind of >>>>>> decouples projects from their content - they stuff which constitutes the >>>>>> project - and forces the build file to 'somehow reach out to content of >>>>>> other (only python root?) projects. E.g couples projects. This somehow >>>>>> 'feels non natural' to me. But, of course, might be the path to go. As I >>>>>> said before, never worked on python on that scale. >>>>>> >>>>>> It feels a bit odd to me as well. Is it possible to have multiple >>>>>> projects per directory (e.g. a suite of testing ones) rather than >>>>>> having to break things up like this, especially if the goal is >>>>>> primarily to get parallel running of tests? Especially if we could >>>>>> automatically create the cross-product rather than manually? There >>>>>> also seems to be some redundancy with what tox is doing here. >>>>>> >>>>> >>>>> Not sure, whether I understand correctly. But I do not think that's >>>>> possible. If we are going to do some cross-product, we are probably better >>>>> of doing that on tasks, e.g. by leveraging task rules or programmatically >>>>> adding tasks (which is already done in parts). Of course, this will not >>>>> help with parallelisation (but might enable that, see below). >>>>> >>>>> >>>>>> >>>>>> > But I believe to remember Robert talking about using in project >>>>>> parallelisation for his development. Is this something which could also >>>>>> work on CI? Of course, that will not help with different python versions, >>>>>> but maybe that could be solved also by gradles variants which are >>>>>> introduced in 5.3 - definitely need some time to investigate the >>>>>> possibilities here. On first sight it feels like lots of duplication to >>>>>> create 'builds' for any python version. Or wouldn't that be the case? >>>>>> > >>>>>> > And another naive thought on my side, isn't that non >>>>>> parallelizability also caused by the monolithic setup of the python code >>>>>> base? E.g. if I understand correctly, java sdk is split into >>>>>> core/runners/ios etc, each encapsulate into full blown projects, i.e. >>>>>> buckets of sources, tests and build file. Would it technically possible >>>>>> to >>>>>> do something similar with python? I assume that being discussed before >>>>>> and >>>>>> teared apart, but couldn't find on mailing list. >>>>>> >>>>>> Neither the culture nor the tooling of Python supports lots of >>>>>> interdependent "sub-packages" for a single project--at least not >>>>>> something smaller than one would want to deploy to Pypi. So while one >>>>>> could do this, it'd be going against the grain. There are also much >>>>>> lower-hanging opportunities for parallelization (e.g. running the test >>>>>> suites for separate python versions in parallel). >>>>>> >>>>>> It's not very natural (as I understand it) with Go either. If we're >>>>>> talking directory re-organization, I think it would make sense to >>>>>> consider having top-level java, python, go, ... next to model, >>>>>> website, etc. >>>>>> >>>>> >>>>> Yes. We shouldn't work against common culture/practices, but try to >>>>> embrace native tooling and add support where required. >>>>> >>>>> To reiterate on parallelisation, there are (at least) three >>>>> opportunities: >>>>> >>>>> 1. Parallelise on test level. For python, this is detox? >>>>> >>>> >>>> This is actually 2 levels. :) >>>> 1a. Parallelise at the nosetest level - run unit tests in parallel in a >>>> single tox environment. (I have a PR in progress to migrate to pytest, and >>>> we should be able to do file-level parallelism provided we solve pickling >>>> issues.) >>>> 1b. Parallelise at the tox environment level, e..g, somehow running the >>>> multiple tox environments (py27,py27-cython,py35,...) in parallel. >>>> >>>> >>>>> 2. Parallelise on Gradle project level (--parallel option) >>>>> 3. Parallelise on CI level >>>>> >>>>> So what I ve done before, if 1. does not help, nor 2. cause project is >>>>> 'just to big' was 3, i.e. splitting on CI level. So the simplest thing I >>>>> could imagine right now would be - as suggested above - to split >>>>> pythonPreCommit into something like pythonX_YPrecommit, which then runs >>>>> those different python versions in parallel. Of course, that could be done >>>>> ad infinitum by splitting further into runners, IOs whatever. >>>>> >>>>> OTOH, we do already have tons of jobs running as we seem to map Gradle >>>>> tasks to Jenkins jobs. So it might be more appropriate to leverage CI >>>>> parallelization on jenkins pipeline level. Something like creating >>>>> pythonPrecommit as a pipeline, which itself runs several steps in >>>>> parallel. >>>>> Did not contemplate on the impacts, though, on PRs and phrase triggering. >>>>> Of course, we might need to improve on our 'verbosity' as currently it >>>>> seems not always clear what command line actually triggered etc. With this >>>>> approach, it seems possible to run tasks within same Gradle project in >>>>> parallel with whatever granularity we might want to choose. (Of course we >>>>> do add overhead here, but that's probably always the case by doing things >>>>> in parallel) >>>>> >>>>> Of course 3. would not help on parallelisation on developers machine, >>>>> but I never had a problem with that. >>>>> >>>>> >>>>>> >>>>>> > And as a last thought, will usage of pygradle help with better >>>>>> python/gradle integration? Currently, we mainly use gradle to call into >>>>>> shell scripts, which doesn't help gradle nor probably pythons tooling to >>>>>> do >>>>>> the job very well? But deeper integration might cause problems on python >>>>>> dev side, dunno :( >>>>>> >>>>>> Possibly. >>>>>> >>>>>> Are there any Python developers that primarily use the gradle >>>>>> commands? Personally, I only use them if I'm using Java (or sometimes >>>>>> work that is a mix of Java and Python, e.g. the Python-on-Flink >>>>>> tests). Otherwise I use tox, or "python setup.py test [-s ...]" >>>>>> directly. Gradle primarily has value as a top-level orchestration (in >>>>>> particular for CI) and easy way for those who only touch Python >>>>>> occasionally to run all the tests. If that's the case, optimizing our >>>>>> gradle scripts for CI seems best. >>>>>> >>>>> >>>>> My impression till today is, that nobody is actually using Gradle on >>>>> daily python development. So I assume it is currently used on CI only. But >>>>> I believe gradle could add value for devs also. We just need to improve >>>>> our >>>>> current integration. And to be very clear, gradle should not replace >>>>> python >>>>> tooling, but support usage in a way which looks familiar to any python >>>>> dev. >>>>> And probably render some shell scripting obsolete. >>>>> >>>>> >>>>>> >>>>>> > On Thu, Mar 28, 2019 at 6:37 PM Mark Liu <[email protected]> >>>>>> wrote: >>>>>> >> >>>>>> >> Thank you Ahmet. Answer your questions below: >>>>>> >> >>>>>> >>> >>>>>> >>> - Could you comment on what kind of parallelization we will gain >>>>>> by this? In terms of real numbers, how would this affect build and test >>>>>> times? >>>>>> >> >>>>>> >> >>>>>> >> The proposal is based on Gradle parallel execution: "you can force >>>>>> Gradle to execute tasks in parallel as long as those tasks are in >>>>>> different >>>>>> projects". In Beam, project is declared per build.gradle file and >>>>>> registered in settings.gradle. Tasks that are included in single Gradle >>>>>> execution will run in parallel only if they are declared in separate >>>>>> build.gradle files. >>>>>> >> >>>>>> >> An example of applying parallel is beam_PreCommit_Python job which >>>>>> runs :pythonPreCommit task that contains tasks distributed in 4 >>>>>> build.gradle. The execution graph looks like >>>>>> https://scans.gradle.com/s/4frpmto6o7hto/timeline: >>>>>> >> >>>>>> >> Without this proposal, all tasks will run in sequential which can >>>>>> be ~2x longer. If more py36 and py37 tests added in the future, things >>>>>> will >>>>>> be even worse. >>>>>> >> >>>>>> >>> - I am guessing this will reduce complexity. Is it possible to >>>>>> quantify the improvement related to this? >>>>>> >> >>>>>> >> >>>>>> >> The general code complexity of function/method/property may not >>>>>> change here since we basically group tasks in a different way without >>>>>> changing inside logic. I don't know if there is any tool to measure >>>>>> Gradle >>>>>> build complexity. Would love to try if there is. >>>>>> >> >>>>>> >>> >>>>>> >>> - Beyond the proposal, I am assuming you are willing to work on. >>>>>> Just want to clarify this. In either case, would you need help? >>>>>> >> >>>>>> >> >>>>>> >> Yes, I'd love to take on major refactor works. At the same time, >>>>>> I'll create jira for each kind of tests (like flink/protable/hdfs tests) >>>>>> in >>>>>> sdks/python/build.gradle to move into test-suites. Test owners or anyone >>>>>> interested to this work are welcome to contribute! >>>>>> >> >>>>>> >> Mark >>>>>> >> >>>>>> >> On Wed, Mar 27, 2019 at 3:53 PM Ahmet Altay <[email protected]> >>>>>> wrote: >>>>>> >>> >>>>>> >>> This sounds good to me. Thank you for doing this. Few questions: >>>>>> >>> - Could you comment on what kind of parallelization we will gain >>>>>> by this? In terms of real numbers, how would this affect build and test >>>>>> times? >>>>>> >>> - I am guessing this will reduce complexity. Is it possible to >>>>>> quantify the improvement related to this? >>>>>> >>> - Beyond the proposal, I am assuming you are willing to work on. >>>>>> Just want to clarify this. In either case, would you need help? >>>>>> >>> >>>>>> >>> Thank you, >>>>>> >>> Ahmet >>>>>> >>> >>>>>> >>> On Wed, Mar 27, 2019 at 10:19 AM Mark Liu <[email protected]> >>>>>> wrote: >>>>>> >>>> >>>>>> >>>> Hi Python SDK Developers, >>>>>> >>>> >>>>>> >>>> You may notice that Gradle files changed a lot recently as >>>>>> parallelization applied to Python tests and more python versions were >>>>>> enabled in testing. There are tricks over the build scripts and tests are >>>>>> grown naturally and distributed under sdks/python, which caused frictions >>>>>> (like rollback PR-8059). >>>>>> >>>> >>>>>> >>>> Thus, I created BEAM-6907 and would like to initiate some works >>>>>> to cleanup and standardize Gradle structure in Python SDK. In general, I >>>>>> think we want to: >>>>>> >>>> >>>>>> >>>> - Apply parallel execution >>>>>> >>>> - Share common tasks >>>>>> >>>> - Centralize test related tasks >>>>>> >>>> - Have a clear Gradle structure for projects/tasks >>>>>> >>>> >>>>>> >>>> This is Gradle directory structure I proposed: >>>>>> >>>> >>>>>> >>>> sdks/python/ >>>>>> >>>> >>>>>> >>>> build.gradle --> hold builds, snapshot, analytic tasks >>>>>> >>>> test-suites/ --> all pre/post/VR test suites under here >>>>>> >>>> >>>>>> >>>> README.md >>>>>> >>>> >>>>>> >>>> dataflow/ --> grouped by runner or unit test (tox) >>>>>> >>>> >>>>>> >>>> py27/ --> grouped by py version >>>>>> >>>> >>>>>> >>>> build.gradle >>>>>> >>>> >>>>>> >>>> py35/ >>>>>> >>>> >>>>>> >>>> ... >>>>>> >>>> >>>>>> >>>> direct/ >>>>>> >>>> >>>>>> >>>> py27/ >>>>>> >>>> >>>>>> >>>> ... >>>>>> >>>> >>>>>> >>>> flink/ >>>>>> >>>> >>>>>> >>>> tox/ >>>>>> >>>> ... >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> The ideas are: >>>>>> >>>> - Only keep builds, snapshot and analytic jobs in >>>>>> sdks/python/build.gradle >>>>>> >>>> - Move all test related tasks to sdks/python/test-suites/ >>>>>> >>>> - In sdks/python/test-suites, we first group by runners, unit >>>>>> test or other testing that can't fit to them, and then group by py >>>>>> versions >>>>>> if needed. >>>>>> >>>> - An example of ../test-suites/../py35/build.gradle is this. >>>>>> >>>> >>>>>> >>>> Please feel free to explore existing Gradle scripts in Python >>>>>> SDK and bring any thoughts on this proposal if you have. >>>>>> >>>> >>>>>> >>>> Thanks! >>>>>> >>>> Mark >>>>> >>>>> [image: https://ml6.eu] >> <https://urldefense.proofpoint.com/v2/url?u=https-3A__ml6.eu_&d=DwMFaQ&c=fP4tf--1dS0biCFlB0saz0I0kjO5v7-GLPtvShAo4cc&r=pVqtPRV3xHPbewK5Cnv1OugvWbha6Poxqp5n4ssIg74&m=FLed4d0BjB5-R2hz9IHrat47LfDj7YhMNHbEVeZ0dw8&s=yd_him24QhfROm7uRZLbfSsUHaA68_8FMl6s1MgT5sM&e=> >> >> >> * Frederik Bode* >> >> ML6 Ghent >> <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.google.be_maps_place_ML6_-4051.037408-2C3.7044893-2C17z_data-3D-213m1-214b1-214m5-213m4-211s0x47c37161feeca14b-3A0xb8f72585fdd21c90-218m2-213d51.037408-214d3.706678-3Fhl-3Dnl&d=DwMFaQ&c=fP4tf--1dS0biCFlB0saz0I0kjO5v7-GLPtvShAo4cc&r=pVqtPRV3xHPbewK5Cnv1OugvWbha6Poxqp5n4ssIg74&m=FLed4d0BjB5-R2hz9IHrat47LfDj7YhMNHbEVeZ0dw8&s=26TZxPGXg0A_mqgeiw1lMeZYekpkExBAZ5MpavpUZmw&e=> >> +32 4 92 78 96 18 >> >> >> **** DISCLAIMER **** >> >> This email and any files transmitted with it are confidential and >> intended solely for the use of the individual or entity to whom they are >> addressed. If you have received this email in error please notify the >> system manager. This message contains confidential information and is >> intended only for the individual named. If you are not the named addressee >> you should not disseminate, distribute or copy this e-mail. Please notify >> the sender immediately by e-mail ifyou have received this e-mail by mistake >> and delete this e-mail from your system. If you are not the intended >> recipient you are notified that disclosing, copying, distributing or taking >> any action in reliance on the contents of this information is strictly >> prohibited. >> >
