Thank you, Niketan, for providing such useful information. The RDDConverterUtilsExt javadoc example is great.
The MLContext API has a tremendous amount of potential given that it has such
clean integration with Spark (for example, it's so easy to create an MLContext
from a SparkContext in the Spark shell). I'm really interested in seeing how
data scientists and developers embrace it in the coming months.

Deron

On Mon, Dec 7, 2015 at 3:31 PM, Niketan Pansare <[email protected]> wrote:
> Thanks Deron for your response :)
>
> Sourav: A few additional comments:
>
> 1. MLContext allows users to pass RDDs to SystemML, and MLOutput allows
> them to fetch the result RDDs after the execution of a DML script.
>
> 2. MLContext exposes the registerInput("variableName", RDD) interface,
> while MLOutput has get...("variableName") methods, e.g. getDF,
> getBinaryBlockedRDD, ...
>
> 3. With the exception of DataFrame, the RDDs supported by these classes
> mirror the RDDs in the symbol table and the formats supported by the
> read()/write() built-in functions. The following types of RDDs are
> supported by these classes:
> a. Binary-blocked RDD (JavaPairRDD<MatrixIndexes, MatrixBlock>) =>
> corresponds to format="binary"
> b. String-based RDD (JavaRDD<String>) => corresponds to format="csv" or
> format="text"
> c. DataFrame
>
> See
> http://apache.github.io/incubator-systemml/dml-language-reference.html#readwrite-built-in-functions
> for more details about the formats supported by the read()/write()
> built-in functions.
>
> 4. For all other types of RDDs, we decided to expose them through
> converter utils:
> https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/utils/RDDConverterUtils.java
> https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/instructions/spark/utils/RDDConverterUtilsExt.java
>
> 5. The utility functions in RDDConverterUtilsExt are not yet tested for
> performance and robustness. Once they are tested, they will be moved into
> RDDConverterUtils.
> Most of these utils have javadocs within the code, and we will add both a
> usage guide and external javadoc for them. The following types of
> conversions are supported by the converter utils:
> a. CoordinateMatrix to binary-blocked RDD (see
> coordinateMatrixToBinaryBlock in RDDConverterUtilsExt).
> b. Binary-blocked RDD to String RDD.
> c. DataFrame with a Vector UDT column to binary block and vice versa.
> This is useful while working with RDD<LabeledPoint>. (See
> vectorDataFrameToBinaryBlock and binaryBlockToVectorDataFrame in
> RDDConverterUtilsExt.)
> d. DataFrame with double columns (see dataFrameToBinaryBlock in
> RDDConverterUtilsExt). Since a DataFrame/RDD is a collection, not an
> indexed/ordered sequence (at least not at the API level), an ID column is
> inserted by MLOutput to denote the row index.
> e. Binary block to labeled points (see binaryBlockToLabeledPoints in
> RDDConverterUtils).
> f. Conversion between text/cell/csv formats to and from binary-blocked RDD
> (see RDDConverterUtils).
>
> 6. The MLContext interface is Scala compatible, i.e. we support both
> JavaRDD and RDD, JavaSparkContext and SparkContext, java.util.HashMap and
> scala.collection.immutable.Map, and so on.
>
> 7. MatrixCharacteristics is used to provide the metadata (such as the
> number of rows, number of columns, block row length, block column length,
> and number of non-zeros) of an RDD to SystemML's optimizer. In some cases
> it is required (for example: text, binary formats), while in other cases
> it can be skipped (for example: csv, dataframe). MLContext exposes
> convenient wrappers such as void registerInput(String varName,
> JavaPairRDD<MatrixIndexes, MatrixBlock> rdd, long rlen, long clen,
> int brlen, int bclen) to avoid creating a MatrixCharacteristics.
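[Editor's note: a minimal sketch of the MLContext round-trip described in points 1, 2, and 7 above. Only registerInput and the MLOutput getters are confirmed by this thread; registerOutput, execute, the script name, and the matrix dimensions are assumptions for illustration.]

```scala
// Sketch only: registerInput and the MLOutput getters are confirmed above;
// registerOutput, execute, "scale.dml", and the dimensions are assumptions.
import org.apache.sysml.api.{MLContext, MLOutput}

val ml = new MLContext(sc)  // sc: an existing SparkContext (e.g. in spark-shell)

// Register a binary-blocked RDD together with its metadata so the optimizer
// does not need an extra pass over the data (point 7): here a hypothetical
// 10000 x 1000 matrix in 1000 x 1000 blocks. binBlocks could come from one
// of the converter utils, e.g. coordinateMatrixToBinaryBlock.
ml.registerInput("X", binBlocks, 10000L, 1000L, 1000, 1000)
ml.registerOutput("Y")                        // assumed companion to registerInput
val out: MLOutput = ml.execute("scale.dml")   // assumed execution entry point
val resultDF = out.getDF(sqlContext, "Y")     // fetch the result as a DataFrame
```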
> Here is the source code if you are interested:
> https://github.com/apache/incubator-systemml/blob/master/src/main/java/org/apache/sysml/runtime/matrix/MatrixCharacteristics.java
>
> A good example of using MatrixCharacteristics and the converter utils is
> provided in RDDConverterUtilsExt's javadoc:
>
> import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtilsExt
> import org.apache.sysml.runtime.matrix.MatrixCharacteristics
> import org.apache.spark.api.java.JavaSparkContext
> import org.apache.spark.mllib.linalg.distributed.MatrixEntry
> import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix
>
> val matRDD = sc.textFile("ratings.text").map(_.split(" ")).map(x =>
>   new MatrixEntry(x(0).toLong, x(1).toLong, x(2).toDouble)).filter(_.value != 0).cache
> require(matRDD.filter(x => x.i == 0 || x.j == 0).count == 0,
>   "Expected 1-based ratings file")
> val nnz = matRDD.count
> val numRows = matRDD.map(_.i).max
> val numCols = matRDD.map(_.j).max
> val coordinateMatrix = new CoordinateMatrix(matRDD, numRows, numCols)
> val mc = new MatrixCharacteristics(numRows, numCols, 1000, 1000, nnz)
> val binBlocks = RDDConverterUtilsExt.coordinateMatrixToBinaryBlock(
>   new JavaSparkContext(sc), coordinateMatrix, mc, true)
>
> Thanks,
>
> Niketan Pansare
> IBM Almaden Research Center
> E-mail: npansar At us.ibm.com
> http://researcher.watson.ibm.com/researcher/view.php?person=us-npansar
>
> From: Deron Eriksson <[email protected]>
> To: [email protected]
> Date: 12/07/2015 02:50 PM
> Subject: Re: API documentation for SystemML
> ------------------------------
>
> Hi Sourav,
>
> One way to generate Javadocs for the entire SystemML
> project is "mvn javadoc:javadoc".
>
> Unfortunately, classes such as MatrixCharacteristics and RDDConverterUtils
> currently have very minimal API documentation. We are hoping to address
> this in the near future. However, you may find that the following
> documentation link could be of assistance in getting started, given your
> interest in Scala:
>
> http://apache.github.io/incubator-systemml/mlcontext-programming-guide.html
>
> Deron
>
> On Mon, Dec 7, 2015 at 1:58 PM, Sourav Mazumder <
> [email protected]> wrote:
> > Hi,
> >
> > Is there any Scala/Java API documentation available for classes like
> > MatrixCharacteristics, RDDConverterUtils?
> >
> > What I need to understand is what helper utilities are available
> > and the details of their signatures/APIs.
> >
> > Regards,
> >
> > Sourav
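[Editor's note: as a concrete illustration of the DataFrame conversion discussed in point 5d of the thread above, here is a sketch of dataFrameToBinaryBlock. The exact parameter list is an assumption; consult the RDDConverterUtilsExt javadoc for the authoritative signature.]

```scala
// Sketch only: the signature of dataFrameToBinaryBlock and the meaning of
// its final flag are assumptions; see the RDDConverterUtilsExt javadoc.
import org.apache.spark.api.java.JavaSparkContext
import org.apache.spark.sql.SQLContext
import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtilsExt
import org.apache.sysml.runtime.matrix.MatrixCharacteristics

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// A 3 x 2 matrix represented as a DataFrame of double columns.
val df = sc.parallelize(Seq((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))).toDF("c1", "c2")

// Metadata for the optimizer: 3 rows, 2 cols, 1000 x 1000 blocks, 6 non-zeros.
val mc = new MatrixCharacteristics(3, 2, 1000, 1000, 6)

// Assumed flag: false => the DataFrame carries no explicit row-ID column.
val binBlocks = RDDConverterUtilsExt.dataFrameToBinaryBlock(
  new JavaSparkContext(sc), df, mc, false)
```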
