Hi Matthias,

On Sat, Apr 2, 2016 at 9:34 PM, Matthias Boehm <mbo...@us.ibm.com> wrote:

>
> (1) Simplicity: Given that the primary usecase of MLContext calls a script
> exactly once, I'm wondering if the separation into Script, ScriptFactory,
> ScriptExecutor and MLContext adds unnecessary complexity by requiring more
> code to setup. It would be great to see old vs new examples side by side.
>
>

With the introduction of notebooks, we may soon need to call a script more
than once.


The current API is very procedural. For example, there are 24 execute() and
executeScript() methods on the MLContext class. Encapsulating concepts such
as 'Script' can bring a significant amount of power, flexibility, and even
simplicity to the system. These 24 methods can be replaced by a single
execute(Script script) method on MLContext. We can also include a second
method, execute(Script script, ScriptExecutor scriptExecutor), so that an
advanced user can easily modify the execution steps. The existing API is not
extensible in this way: a user would need to modify the source code of the
MLContext executeUsingSimplifiedCompilationChain method in order to do this,
whereas with this redesign a user can subclass ScriptExecutor and modify it
as needed. A user who wants the default execution can just call the
MLContext execute(Script script) method.
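
To make the extensibility point concrete, here is a minimal sketch of how
the two execute methods and a subclassable ScriptExecutor could fit
together. This is not code from the design doc: the classes are simplified
stand-ins, and the step names (parse, optimize, run), the log list, and
NoOptimizeExecutor are hypothetical, chosen only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the proposed Script class: it only holds the text.
class Script {
    private final String text;
    Script(String text) { this.text = text; }
    String getText() { return text; }
}

// Hypothetical executor: each execution step is an overridable method.
// The step names are assumptions, not the actual SystemML compilation chain.
class ScriptExecutor {
    protected final List<String> log = new ArrayList<>();

    // Runs the steps in a fixed order and returns the record of what ran.
    public final List<String> execute(Script script) {
        parse(script);
        optimize(script);
        run(script);
        return log;
    }

    protected void parse(Script script)    { log.add("parse"); }
    protected void optimize(Script script) { log.add("optimize"); }
    protected void run(Script script)      { log.add("run"); }
}

// Stand-in for MLContext exposing only the two proposed execute methods.
class MLContext {
    // Default path: a normal user never sees ScriptExecutor.
    public List<String> execute(Script script) {
        return execute(script, new ScriptExecutor());
    }

    // Advanced path: the caller supplies a customized executor.
    public List<String> execute(Script script, ScriptExecutor scriptExecutor) {
        return scriptExecutor.execute(script);
    }
}

// An advanced user changes one step by subclassing, without touching MLContext.
class NoOptimizeExecutor extends ScriptExecutor {
    @Override
    protected void optimize(Script script) { log.add("skip optimize"); }
}
```

With this shape, ml.execute(script) uses the default steps, while
ml.execute(script, new NoOptimizeExecutor()) swaps in the modified step.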


This is an example of simplifying the end-user experience (replacing 24
execute methods with 1). It is also nice to add extensibility (via a second
execute method that takes a ScriptExecutor) for advanced cases. A typical
user would probably not care about ScriptExecutor and wouldn't need to use
it.


As another potential benefit of Script objects, we could conceivably do
things like encapsulate a namespace into the script object, or have a
script object encapsulate a list of other script objects, etc.


As a further example of simplicity, the current 18 registerInput methods on
MLContext can be replaced by a single Script in(String str, Object obj)
method. Because the method returns the Script object, calls can be chained
(script.in("$a", 5).in("$b", true)...), which is a convenient way of setting
multiple inputs in a single line of code.
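
As a minimal sketch of the chaining idea, assuming a simplified Script that
just stores its inputs in a map (the inputs field and the getInputs()
accessor are hypothetical, added here only so the result can be inspected):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-in for the proposed Script class.
class Script {
    private final String text;
    // Insertion-ordered map of input name to value (an assumption of this sketch).
    private final Map<String, Object> inputs = new LinkedHashMap<>();

    Script(String text) { this.text = text; }

    // Registers one input and returns this Script so calls can be chained.
    public Script in(String name, Object value) {
        inputs.put(name, value);
        return this;
    }

    public String getText() { return text; }

    // Hypothetical accessor used only to show what was registered.
    public Map<String, Object> getInputs() { return inputs; }
}
```

Since in() accepts an Object, the same method takes an int, a boolean, or an
RDD; deciding how to handle each runtime type would happen inside the
Script or the executor rather than in 18 overloads.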


In terms of user interaction, at its most basic, a script consists of some
text (string), has a type (DML or PYDML), can have inputs, and can have
outputs. If things are broken down any further and we lose the
encapsulation, then we have a procedural API with lots of registerInputs
and executes, for example.


WRT factories, they can be a very useful design pattern. Here's an example
of creating a DML script from a String, a DML script from a file, a PYDML
script from a URL, and a PYDML script from an input stream. That's four
scripts from four different sources in four lines of code with no
boilerplate or boolean flags.


Script scr1 = ScriptFactory.createDMLScriptFromString("print('hi');");

Script scr2 = ScriptFactory.createDMLScriptFromFile("ex.dml");

Script scr3 = ScriptFactory.createPYDMLScriptFromUrl("http://example.com/alg.pydml");

Script scr4 = ScriptFactory.createPYDMLScriptFromInputStream(myInputStream);
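
Behind those one-liners, the factory could be little more than a set of
static methods that read the text and tag it with a type. The following is
a rough sketch under my own assumptions, not the design doc's
implementation: the ScriptType enum, the UTF-8 decoding, and wrapping
IOException in an unchecked exception are all illustrative choices, and the
URL variant is omitted.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical enum for the two script languages.
enum ScriptType { DML, PYDML }

// Simplified stand-in for the proposed Script class: text plus type.
class Script {
    private final String text;
    private final ScriptType type;

    Script(String text, ScriptType type) {
        this.text = text;
        this.type = type;
    }

    String getText() { return text; }
    ScriptType getType() { return type; }
}

// Sketch of a factory: the method name carries the source and the type,
// so callers need no boolean flags.
class ScriptFactory {
    static Script createDMLScriptFromString(String text) {
        return new Script(text, ScriptType.DML);
    }

    static Script createDMLScriptFromFile(String path) {
        try {
            byte[] bytes = Files.readAllBytes(Paths.get(path));
            // Assumes the script file is UTF-8 encoded.
            return new Script(new String(bytes, StandardCharsets.UTF_8), ScriptType.DML);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Script createPYDMLScriptFromInputStream(InputStream in) {
        try {
            // Drain the stream into a buffer, then decode it as UTF-8.
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            String text = new String(buf.toByteArray(), StandardCharsets.UTF_8);
            return new Script(text, ScriptType.PYDML);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```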


As for other code examples, we can replace MLContext's 18 registerInputs:


registerInput(String, DataFrame)

registerInput(String, DataFrame, boolean)

registerInput(String, JavaPairRDD<LongWritable, Text>, String, long, long,
long, FileFormatProperties)

registerInput(String, JavaPairRDD<MatrixIndexes, MatrixBlock>, long, long)

registerInput(String, JavaPairRDD<MatrixIndexes, MatrixBlock>, long, long,
int, int)

registerInput(String, JavaPairRDD<MatrixIndexes, MatrixBlock>, long, long,
int, int, long)

registerInput(String, JavaPairRDD<MatrixIndexes, MatrixBlock>,
MatrixCharacteristics)

registerInput(String, JavaRDD<String>, String)

registerInput(String, JavaRDD<String>, String, boolean, String, boolean,
double)

registerInput(String, JavaRDD<String>, String, boolean, String, boolean,
double, long, long, long)

registerInput(String, JavaRDD<String>, String, long, long)

registerInput(String, JavaRDD<String>, String, long, long, long)

registerInput(String, MLMatrix)

registerInput(String, RDD<String>, String)

registerInput(String, RDD<String>, String, boolean, String, boolean, double)

registerInput(String, RDD<String>, String, boolean, String, boolean,
double, long, long, long)

registerInput(String, RDD<String>, String, long, long)

registerInput(String, RDD<String>, String, long, long, long)


with one Script in(String, Object) method (and perhaps another in() method
for a Scala immutable Map).


We can replace the existing 24 execute and executeScript methods on
MLContext:


execute(String)

execute(String, ArrayList<String>)

execute(String, ArrayList<String>, boolean)

execute(String, ArrayList<String>, boolean, String)

execute(String, ArrayList<String>, String)

execute(String, boolean)

execute(String, boolean, String)

execute(String, HashMap<String, String>)

execute(String, HashMap<String, String>, boolean)

execute(String, HashMap<String, String>, boolean, String)

execute(String, HashMap<String, String>, String)

execute(String, Map<String, String>)

execute(String, Map<String, String>, boolean)

execute(String, String)

execute(String, String[])

execute(String, String[], boolean)

execute(String, String[], boolean, String)

execute(String, String[], String)

executeScript(String)

executeScript(String, HashMap<String, String>)

executeScript(String, HashMap<String, String>, String)

executeScript(String, Map<String, String>)

executeScript(String, Map<String, String>, String)

executeScript(String, String)


with one MLContext execute(Script script) method.


I think the following example from the design doc is quite concise. We (1)
start with a String of DML, (2) create a Script object for it, (3) set
inputs and outputs (both input parameters and bindings to variables), (4)
execute the script, and (5) get back the results.


1) String str = "x=$X; A=read($Ain); B=A+x; write(B, 'temp');";

2) Script script = ScriptFactory.dml(str);

3) script.in("$X", 10).in("A", sc.textFile("m.csv")).regOut("B");

4) ml.execute(script);

5) BinaryBlockMatrix bbm = script.out("B");


These 5 steps can even be combined into a single line through method
chaining, which is interesting but not very readable. However, we could
combine 1&2 above and 4&5 above to provide a 3-line example that's still
readable.


1) Script script = ScriptFactory.dml("x=$X; A=read($Ain); B=A+x; write(B, 'temp');");

2) script.in("$X", 10).in("A", sc.textFile("m.csv")).regOut("B");

3) BinaryBlockMatrix bbm = ml.execute(script).out("B");


The only thing that seems 'extra' to me is registering the output ("B"). I
don't know if it's possible internally, but it might be nice to not require
a user to register an output.


The equivalent using the current API would be something like:


ml.reset();

ml.registerInput("A", sc.textFile("m.csv"), "csv");

HashMap<String, String> cmdLineArgs = new HashMap<String, String>();

cmdLineArgs.put("X", "10");

cmdLineArgs.put("Ain", " ");

ml.registerOutput("B");

MLOutput output = ml.executeScript("x=$X; A=read($Ain); B=A+x; write(B, 'temp');", cmdLineArgs);

JavaPairRDD<MatrixIndexes, MatrixBlock> binaryBlockedRDD =
output.getBinaryBlockedRDD("B");


Although these two examples are logically equivalent, personally I have a
much easier time conceptualizing the one that uses a Script object.


Deron
