Github user mccheah commented on the issue:

    https://github.com/apache/spark/pull/19954
  
    > I disagree. For example, just to pick one example I spotted while 
glancing at the code quickly. There are many steps that take a 
kubernetesResourceNamePrefix as a constructor argument. Then the code that 
instantiates those steps needs to know each of those steps, and that they take 
that prefix as an argument to their constructor. I don't see that as 
abstracting much.
    
    > Instead, for example, if you encapsulate the state of a "kubernetes 
submission" into some helper type that wraps SparkConf and other common things 
needed to submit an app, those steps could potentially just take that helper 
type as an argument to the abstract method that does the processing (e.g. 
configureDriver). There's less step-specific logic needed everywhere, and the 
list of arguments to constructors and methods shrinks a lot.
    
    @vanzin An excellent idea - this is very much in line with, e.g., `SQLConf` in Spark SQL. @liyinan926 it would be great to introduce this and swap out our parameters in various places. Our original approach was to pass down parameters that would otherwise have to be retrieved from the `SparkConf` by the same config key over and over again, but we can build an abstraction for that. This essentially moves us from primitive parameters to parameter objects, which is an elegant solution.
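
    To make the parameter-object idea concrete, here is a rough sketch. All names here (`KubernetesSubmissionContext`, `SparkConfLike`, the config key, the step classes) are hypothetical illustrations, not the actual API in this PR; `SparkConfLike` stands in for `SparkConf` so the sketch is self-contained:

    ```scala
    // Minimal stand-in for SparkConf, just for illustration.
    final class SparkConfLike(entries: Map[String, String]) {
      def get(key: String, default: String): String = entries.getOrElse(key, default)
    }

    // Hypothetical parameter object: bundles the submission-wide state that
    // individual steps previously took as separate constructor arguments.
    final case class KubernetesSubmissionContext(
        sparkConf: SparkConfLike,
        appId: String,
        resourceNamePrefix: String,
        driverLabels: Map[String, String])

    object KubernetesSubmissionContext {
      def fromConf(conf: SparkConfLike, appId: String): KubernetesSubmissionContext = {
        // The prefix is derived once here, instead of in every step.
        val prefix = conf.get("spark.kubernetes.resource.name.prefix", s"spark-$appId")
        KubernetesSubmissionContext(conf, appId, prefix, Map("spark-app-id" -> appId))
      }
    }

    // A step now receives the context through the abstract method rather
    // than taking each individual value through its constructor.
    trait DriverConfigurationStep {
      def configureDriver(ctx: KubernetesSubmissionContext): Map[String, String]
    }

    class BaseDriverStep extends DriverConfigurationStep {
      override def configureDriver(ctx: KubernetesSubmissionContext): Map[String, String] =
        Map("pod-name" -> s"${ctx.resourceNamePrefix}-driver") ++ ctx.driverLabels
    }
    ```

    The code that instantiates the steps then no longer needs to know which values each step consumes; it just hands every step the same context.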
    
    > That's the thing. That's not an abstraction, that's just code that lives 
in a separate method. You can test the "method that creates a list of steps" 
separately too without having to stash it in a completely separate type and 
create more coupling in your code. Abstraction is meant to reduce coupling, not 
increase it ...
    
    > ... You just have code living in separate classes with the goal of making 
testing easier, but you can achieve the same thing without the code living in 
separate classes, and things would probably be simpler overall.
    
    A design goal here was to make every class do exactly one thing, so that the test for that class can focus solely on that single thing. Using `@VisibleForTesting` on public methods, or something similar, to test multiple methods in the same class makes that single unit-test class difficult to read. Alternatively, we would have to split the unit tests for a given class across multiple test classes, but then it becomes difficult to track down all of the test classes that together cover that class. So although I agree that we're not creating abstractions per se - we're not creating generic APIs with multiple implementations - we are creating helper objects and submodules that decompose the submission process into smaller chunks that can be tested one at a time.
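
    As an illustration of that testing pattern (the class and its behavior below are hypothetical, not code from this PR), a single-responsibility step pairs with a test that exercises only that one behavior:

    ```scala
    // Hypothetical single-responsibility class: it only resolves the
    // driver pod's name, nothing else.
    final class DriverPodNameStep {
      def resolvePodName(appId: String, explicitName: Option[String]): String =
        explicitName.getOrElse(s"$appId-driver")
    }

    // The matching unit test covers all code paths of that single unit
    // and stays short and readable.
    object DriverPodNameStepTest {
      def run(): Unit = {
        val step = new DriverPodNameStep
        assert(step.resolvePodName("spark-app", None) == "spark-app-driver")
        assert(step.resolvePodName("spark-app", Some("custom")) == "custom")
      }
    }
    ```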
    
    A key reason the code is decomposed the way it is, therefore, is to make the tests as readable as possible: each test class focuses on covering all code paths in a single unit. Again, I'm open to ideas on how to make that more readable in the main code.
    
    > An abstraction for an orchestrator would mean that there is a code path 
that can take a different orchestrator to achieve different things. For 
example, if you had code that started containers, and given a different 
orchestrator, could start either a driver or an executor, or even be used by 
tests to mock some behavior you want to verify.
    
    I'm interested in exploring this idea - will have to think a bit more on 
what that would specifically look like.
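
    One possible shape for that - purely speculative on my part, and every name below is made up for illustration - is a trait whose implementations build different pod specs, so a single launching code path can start a driver, an executor, or a test double:

    ```scala
    // Illustrative pod description; the real code would build a
    // Kubernetes Pod object via the client library.
    final case class PodSpec(name: String, image: String, env: Map[String, String])

    // Hypothetical abstraction: the launching code depends only on this
    // trait, and the orchestrator supplied decides what gets started.
    trait PodOrchestrator {
      def buildPodSpec(appId: String): PodSpec
    }

    class DriverOrchestrator(image: String) extends PodOrchestrator {
      override def buildPodSpec(appId: String): PodSpec =
        PodSpec(s"$appId-driver", image, Map("SPARK_ROLE" -> "driver"))
    }

    class ExecutorOrchestrator(image: String, executorId: Int) extends PodOrchestrator {
      override def buildPodSpec(appId: String): PodSpec =
        PodSpec(s"$appId-exec-$executorId", image, Map("SPARK_ROLE" -> "executor"))
    }

    // The shared code path neither knows nor cares which kind of pod it
    // starts; tests could pass a mock orchestrator here.
    def launch(orchestrator: PodOrchestrator, appId: String): PodSpec = {
      val spec = orchestrator.buildPodSpec(appId)
      // ...the real code would submit `spec` to the Kubernetes API here...
      spec
    }
    ```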
    
    > I think the biggest thing that makes reviewing this PR challenging is the 
way the code is structured, not the amount of changes. This PR is actually both 
shorter and conceptually simpler than the first two big ones, IMO. Unless we do 
a thorough refactoring of the code, I personally don't think splitting into 
smaller ones will help much on reducing the confusion. Splitting also risks 
making half baked feature into 2.3, which I personally don't think is a good 
idea.
    
    @liyinan926 the code structure is the main issue, yes, but we can tackle it more effectively by decomposing the review process as well. We have three distinct components that are relatively independent, so we can separate them out and consider the architecture of each individually. Here, for example, we may want to focus on how to change the submission and executor pod configuration - but what if, in focusing on those, we miss crucial issues in the design of the init container? Breaking up the PR lets each piece address its architectural considerations more thoroughly, and it makes the discussion around each part easier to follow.


---
