Hi Matthias,

On Sat, Apr 2, 2016 at 9:34 PM, Matthias Boehm <mbo...@us.ibm.com> wrote:
>
> (1) Simplicity: Given that the primary usecase of MLContext calls a script
> exactly once, I'm wondering if the separation into Script, ScriptFactory,
> ScriptExecutor and MLContext adds unnecessary complexity by requiring more
> code to setup. It would be great to see old vs new examples side by side.
>

With the introduction of notebooks, we might need to be calling scripts more than once in the near future.

The current API is very procedural. For example, there are 24 execute() and executeScript() methods on the MLContext class. Encapsulating concepts such as 'Script' can bring a significant amount of power, flexibility, and actually simplicity to the system. These 24 execute methods can be replaced by a single execute(Script script) method on MLContext. We can also include a second execute method, execute(Script script, ScriptExecutor scriptExecutor), so that an advanced user can easily modify the execution steps.

The existing API is not extensible in this way. A user would need to modify the source code of the MLContext executeUsingSimplifiedCompilationChain in order to do this, whereas with this redesign a user can subclass ScriptExecutor and modify it as needed. If a user wants to use the default execution, the user can just call the MLContext execute(Script script) method.

This is an example of simplifying the end user experience (replace 24 execute methods with 1 execute method). However, it is also nice to add extensibility (via a second execute method that takes a ScriptExecutor) for advanced cases. A normal user probably would not really care about ScriptExecutor and wouldn't need to use it.

As another potential benefit of Script objects, we could conceivably do things like encapsulate a namespace into the script object, or have a script object encapsulate a list of other script objects, etc.

As a further example of simplicity, the current 18 registerInput methods on MLContext can be replaced by a single Script in(String str, Object obj) method.
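To make the shape of this concrete, here is a rough Java sketch of the idea described above: one in(String, Object) method that returns the Script, one default execute(Script), and a second execute that accepts a ScriptExecutor. The class bodies, the MLContextSketch name, the accessor methods, and the String return value are all placeholders I made up for illustration; this is not the actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the proposed Script class (placeholder body).
class Script {
    private final String text;
    private final Map<String, Object> inputs = new LinkedHashMap<>();

    Script(String text) {
        this.text = text;
    }

    // Single entry point for all inputs; returns this so calls can be chained.
    Script in(String name, Object value) {
        inputs.put(name, value);
        return this;
    }

    String getText() {
        return text;
    }

    Map<String, Object> getInputs() {
        return inputs;
    }
}

// Hypothetical default executor; an advanced user could subclass it
// and override execute() to change the execution steps.
class ScriptExecutor {
    String execute(Script script) {
        // Placeholder result: a real executor would compile and run the script.
        return "executed with " + script.getInputs().size() + " input(s)";
    }
}

// Hypothetical stand-in for MLContext with only the two proposed methods.
class MLContextSketch {
    // Default path: most users would only ever call this.
    String execute(Script script) {
        return execute(script, new ScriptExecutor());
    }

    // Extension point: the caller supplies its own executor.
    String execute(Script script, ScriptExecutor executor) {
        return executor.execute(script);
    }
}
```

With something like this, a normal user would write new MLContextSketch().execute(script) and never see ScriptExecutor, while an advanced user could pass in a subclass that overrides execute().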
Chaining method calls by returning a Script object from the method call (script.in("$a", 5).in("$b", true)...) is a convenient way of setting multiple inputs in a single line of code consisting of multiple method calls.

In terms of user interaction, at its most basic, a script consists of some text (string), has a type (DML or PYDML), can have inputs, and can have outputs. If things are broken down any further and we lose the encapsulation, then we have a procedural API with lots of registerInputs and executes, for example.

WRT factories, they can be a very useful design pattern. Here's an example of creating a DML script from a String, a DML script from a file, a PYDML script from a URL, and a PYDML script from an input stream. That's 4 scripts from four different sources in four lines of code with no boilerplate or boolean flags.

Script scr1 = ScriptFactory.createDMLScriptFromString("print('hi');");
Script scr2 = ScriptFactory.createDMLScriptFromFile("ex.dml");
Script scr3 = ScriptFactory.createPYDMLScriptFromUrl("http://example.com/alg.pydml");
Script scr4 = ScriptFactory.createPYDMLScriptFromInputStream(myInputStream);

As for other code examples, we can replace MLContext's 18 registerInputs:

registerInput(String, DataFrame)
registerInput(String, DataFrame, boolean)
registerInput(String, JavaPairRDD<LongWritable, Text>, String, long, long, long, FileFormatProperties)
registerInput(String, JavaPairRDD<MatrixIndexes, MatrixBlock>, long, long)
registerInput(String, JavaPairRDD<MatrixIndexes, MatrixBlock>, long, long, int, int)
registerInput(String, JavaPairRDD<MatrixIndexes, MatrixBlock>, long, long, int, int, long)
registerInput(String, JavaPairRDD<MatrixIndexes, MatrixBlock>, MatrixCharacteristics)
registerInput(String, JavaRDD<String>, String)
registerInput(String, JavaRDD<String>, String, boolean, String, boolean, double)
registerInput(String, JavaRDD<String>, String, boolean, String, boolean, double, long, long, long)
registerInput(String, JavaRDD<String>, String, long, long)
registerInput(String, JavaRDD<String>, String, long, long, long)
registerInput(String, MLMatrix)
registerInput(String, RDD<String>, String)
registerInput(String, RDD<String>, String, boolean, String, boolean, double)
registerInput(String, RDD<String>, String, boolean, String, boolean, double, long, long, long)
registerInput(String, RDD<String>, String, long, long)
registerInput(String, RDD<String>, String, long, long, long)

with one Script in(String, Object) method (and perhaps another in() method for a Scala immutable Map).

We can replace the existing 24 execute and executeScript methods on MLContext:

execute(String)
execute(String, ArrayList<String>)
execute(String, ArrayList<String>, boolean)
execute(String, ArrayList<String>, boolean, String)
execute(String, ArrayList<String>, String)
execute(String, boolean)
execute(String, boolean, String)
execute(String, HashMap<String, String>)
execute(String, HashMap<String, String>, boolean)
execute(String, HashMap<String, String>, boolean, String)
execute(String, HashMap<String, String>, String)
execute(String, Map<String, String>)
execute(String, Map<String, String>, boolean)
execute(String, String)
execute(String, String[])
execute(String, String[], boolean)
execute(String, String[], boolean, String)
execute(String, String[], String)
executeScript(String)
executeScript(String, HashMap<String, String>)
executeScript(String, HashMap<String, String>, String)
executeScript(String, Map<String, String>)
executeScript(String, Map<String, String>, String)
executeScript(String, String)

with one MLContext execute(Script script) method.

I think the following example from the design doc is quite concise. We have (1) a String of DML, (2) we create a Script object for it, (3) we set inputs and outputs (both input parameters and binding to variables), (4) we execute the script, (5) we get back the results.

1) String str = "x=$X; A=read($Ain); B=A+x; write(B, 'temp');";
2) Script script = ScriptFactory.dml(str);
3) script.in("$X", 10).in("A", sc.textFile("m.csv")).regOut("B");
4) ml.execute(script);
5) BinaryBlockMatrix bbm = script.out("B");

These 5 steps can even be combined into a single line through method chaining, which is interesting but not very readable. However, we could combine 1&2 above and 4&5 above to provide a 3-line example that's still readable.

1) Script script = ScriptFactory.dml("x=$X; A=read($Ain); B=A+x; write(B, 'temp');");
2) script.in("$X", 10).in("A", sc.textFile("m.csv")).regOut("B");
3) BinaryBlockMatrix bbm = ml.execute(script).out("B");

The only thing that seems 'extra' to me is registering the output ("B"). I don't know if it's possible internally, but it might be nice to not require a user to register an output.

The equivalent using the current API would be something like:

ml.reset();
ml.registerInput("A", sc.textFile("m.csv"), "csv");
HashMap<String, String> cmdLineArgs = new HashMap<String, String>();
cmdLineArgs.put("X", "10");
cmdLineArgs.put("Ain", " ");
ml.registerOutput("B");
MLOutput output = ml.executeScript("x=$X; A=read($Ain); B=A+x; write(B, 'temp');", cmdLineArgs);
JavaPairRDD<MatrixIndexes, MatrixBlock> binaryBlockedRDD = output.getBinaryBlockedRDD("B");

Although these examples are basically logically equivalent and do the same thing, personally I have a much easier time conceptualizing the example using a Script object.

Deron