[ 
https://issues.apache.org/jira/browse/SYSTEMML-593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280832#comment-15280832
 ] 

Niketan Pansare commented on SYSTEMML-593:
------------------------------------------

Thanks [~deron] for creating the design document. It improves the usability of 
MLContext a lot.

I like the common interface "in" that allows users to pass both data and 
command-line arguments. I also like that we use the $ prefix for command-line 
variables in the "in" method. Thus, in(String, RDD/DataFrame) maps to 
registerInput and in(String, boolean/double/float/int/String) maps to 
command-line arguments. I also like that this design avoids the need to cast 
boolean/double/float/int to String.
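
To make the mapping above concrete, here is a minimal sketch of how the overloaded "in" could dispatch. It is not the actual SystemML Script class: ScriptSketch, its fields, and its accessors are hypothetical, and a plain Iterable stands in for RDD/DataFrame so the example runs without Spark.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not the real SystemML Script class.
final class ScriptSketch {
    private final Map<String, Object> dataInputs = new LinkedHashMap<>();
    private final Map<String, String> cmdLineArgs = new LinkedHashMap<>();

    // Data input (Iterable is a stand-in for RDD/DataFrame).
    // In the real API this is where registerInput would be called.
    ScriptSketch in(String name, Iterable<?> data) {
        if (name.startsWith("$")) {
            throw new IllegalArgumentException("data inputs must not use the $ prefix: " + name);
        }
        dataInputs.put(name, data);
        return this;
    }

    // Scalar overloads: the caller never has to cast to String.
    ScriptSketch in(String name, boolean value) { return arg(name, String.valueOf(value)); }
    ScriptSketch in(String name, double value)  { return arg(name, String.valueOf(value)); }
    ScriptSketch in(String name, float value)   { return arg(name, String.valueOf(value)); }
    ScriptSketch in(String name, int value)     { return arg(name, String.valueOf(value)); }
    ScriptSketch in(String name, String value)  { return arg(name, value); }

    // Command-line arguments are stored as strings and must carry the $ prefix.
    private ScriptSketch arg(String name, String value) {
        if (!name.startsWith("$")) {
            throw new IllegalArgumentException("command-line arguments need the $ prefix: " + name);
        }
        cmdLineArgs.put(name, value);
        return this;
    }

    Map<String, Object> dataInputs()  { return dataInputs; }
    Map<String, String> cmdLineArgs() { return cmdLineArgs; }
}
```

With this shape, a call such as new ScriptSketch().in("X", rows).in("$maxIter", 5) keeps data and command-line arguments in one fluent chain while still routing them to different places internally.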

I also like the Script abstraction as it avoids overloaded execute methods (for 
example: PyDML, DML, ...).

A few thoughts/suggestions:
1. The current MLContext allows users to pass an RDD/DataFrame to the script 
using "registerInput". In the proposed document, we pass the RDD/DataFrame 
through ".in(...)". However, the registerInput method also allows passing the 
format and the meta-data information. In some cases, the format is required but 
the meta-data is optional, and in other cases both are required. We need to add 
appropriate guards in our new MLContext.
For example: we should not support `script.in("A", sc.textFile("m.csv"))`, as 
an RDD<String> can refer to either "csv" or "text" format. Also, `script.in("A", 
sc.textFile("m.text"), "text")` should throw an error stating that meta-data is 
required.
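
A rough sketch of the kind of guard meant in point 1, assuming only the two rules stated there for RDD<String> inputs (format always required; meta-data required for "text"). The class StringRddInputGuard, the Format enum, and MatrixMeta are hypothetical names for illustration, not existing SystemML types.

```java
// Hypothetical guard for RDD<String> inputs; not an existing SystemML class.
final class StringRddInputGuard {
    enum Format { CSV, TEXT }

    // Hypothetical stand-in for the matrix meta-data (dimensions).
    static final class MatrixMeta {
        final long rows;
        final long cols;
        MatrixMeta(long rows, long cols) { this.rows = rows; this.cols = cols; }
    }

    // Throws if the combination of format and meta-data is not sufficient.
    static void check(Format format, MatrixMeta meta) {
        if (format == null) {
            // An RDD<String> alone is ambiguous: it may hold csv or text data.
            throw new IllegalArgumentException(
                "format is required for RDD<String> input (csv or text)");
        }
        if (format == Format.TEXT && meta == null) {
            throw new IllegalArgumentException(
                "meta-data is required for text format");
        }
        // Under the rules assumed here, csv is accepted with or without meta-data.
    }
}
```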

2. The DML language semantics should be respected. For example: if the script 
has the following line `X = read($fileX)`, then providing .in("X", ...) but not 
.in("$fileX", ...) should throw an error.
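
One possible way to detect the situation in point 2 is sketched below. It is a naive regex scan over the script text, which is an assumption made for brevity: a real implementation would presumably rely on the DML parser, and this sketch ignores comments, string literals, and any default-value mechanism. DollarArgCheck and unbound are hypothetical names.

```java
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; a naive text scan, not the DML parser.
final class DollarArgCheck {
    // Matches $-prefixed identifiers such as $fileX.
    private static final Pattern DOLLAR_VAR =
        Pattern.compile("\\$[A-Za-z_][A-Za-z0-9_]*");

    // Returns the $-variables referenced in the script but not provided.
    static Set<String> unbound(String dml, Set<String> providedNames) {
        Set<String> missing = new LinkedHashSet<>();
        Matcher m = DOLLAR_VAR.matcher(dml);
        while (m.find()) {
            String name = m.group();
            if (!providedNames.contains(name)) {
                missing.add(name);
            }
        }
        return missing;
    }
}
```

For the example above, passing only "X" leaves "$fileX" in the returned set, which the new MLContext could turn into an error before execution.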

3. Please remember that a DataFrame is an unordered collection, whereas the 
matrix we return is an ordered structure. So, please return the DataFrame with 
an "ID" column, as we do in our current MLOutput class; otherwise we are 
potentially breaking the contract.

4. Please support the following types of DataFrame:
- With an ID column and one DF column of type double for every column of the 
matrix. This is a safe way for the user to pass a DataFrame to SystemML and 
still be able to do pre-processing.
- Without an ID column, but with one DF column of type double for every column 
of the matrix. This is potentially unsafe: the user must ensure that the rows 
are sorted.
- With an ID column and a DF column of Vector DataType. This is often used in 
MLPipeline wrappers.
- Without an ID column, but with a DF column of Vector DataType. This is often 
used in MLPipeline wrappers.
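
The four cases in point 4 could be told apart roughly as follows. This is only a sketch under simplifying assumptions: the schema is modelled as a map from column name to a type string ("double" or "vector") instead of a Spark schema, the ID column is assumed to be named "ID", and DataFrameLayoutSketch and Layout are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical classifier; the schema is a simplified column-name -> type map.
final class DataFrameLayoutSketch {
    enum Layout { DOUBLES_WITH_ID, DOUBLES_NO_ID, VECTOR_WITH_ID, VECTOR_NO_ID }

    static Layout classify(Map<String, String> schema) {
        boolean hasId = schema.containsKey("ID");

        // Look only at the data columns, i.e. everything except the ID column.
        Map<String, String> data = new LinkedHashMap<>(schema);
        data.remove("ID");
        if (data.isEmpty()) {
            throw new IllegalArgumentException("no data columns found");
        }

        // A single column of Vector type (the MLPipeline-style layout).
        if (data.size() == 1 && data.containsValue("vector")) {
            return hasId ? Layout.VECTOR_WITH_ID : Layout.VECTOR_NO_ID;
        }

        // Otherwise every data column must be a double.
        for (Map.Entry<String, String> column : data.entrySet()) {
            if (!"double".equals(column.getValue())) {
                throw new IllegalArgumentException(
                    "unsupported column type for " + column.getKey());
            }
        }
        return hasId ? Layout.DOUBLES_WITH_ID : Layout.DOUBLES_NO_ID;
    }
}
```

The two NO_ID results are the ones where row order is not guaranteed, so they are the natural place to warn the user or document the sortedness requirement.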

5. With the exception of DataFrame, all the RDDs that we pass map to the 
formats we support in read(): RDD<String>/JavaRDD<String>/JavaPairRDD<LongWritable, 
Text>/... for csv and text formats + RDD<MI, MB>/JavaPairRDD<MI,MB> for 
binaryblock. For non-read formats, we implement RDDConverterUtils.

Please support all the read formats either directly or via an abstraction (for 
example: the proposed BinaryBlockMatrix, which is a wrapper of 
JavaPairRDD<MI,MB> and MC). In particular, users might prefer to stick with 
BinaryBlockMatrix if they want to pass it to another DML script, but might want 
a DataFrame if they want to apply SQL. Why? For extremely wide matrices, 
DataFrame is an extremely inefficient format.

An alternate suggestion: You could support registering only one type of 
DataFrame/RDD and have many constructors/factory methods for it. For example, 
please see org.apache.sysml.api.MLMatrix (for a reference implementation of 
BinaryBlockMatrix), which essentially is a two-column DataFrame that supports 
simple matrix algebra. It also fits well into the Spark Datasource API: 
ml.read(sqlContext, "W_small.mtx", "binary").

[~reinwald] [~mboehm7] [~mwdus...@us.ibm.com]

> MLContext Redesign
> ------------------
>
>                 Key: SYSTEMML-593
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-593
>             Project: SystemML
>          Issue Type: Improvement
>          Components: APIs
>            Reporter: Deron Eriksson
>            Assignee: Deron Eriksson
>         Attachments: Design Document - MLContext API Redesign.pdf
>
>
> This JIRA proposes a redesign of the Java MLContext API with several goals:
> •     Simplify the user experience
> •     Encapsulate primary entities using object-oriented concepts
> •     Make API extensible for external users
> •     Make API extensible for SystemML developers
> •     Locate all user-interaction classes, interfaces, etc under a single API 
> package
> •     Extensive Javadocs for all classes in the API
> •     Potentially fold JMLC API into MLContext so as to have a single 
> programmatic API



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
