Hi SystemML folks,

I'm trying to pass some data from Spark to a DML script via the MLContext
API. The data comes from a Parquet file containing a DataFrame with the
schema [label: Integer, features: SparseVector]. I'm doing the
following:

        // imports for the MLContext API (paraphrasing my setup)
        import org.apache.sysml.api.mlcontext._
        import org.apache.sysml.api.mlcontext.ScriptFactory._
        import org.apache.sysml.api.mlcontext.MatrixFormat._

        val input_data = spark.read.parquet(inputPath)
        val x = input_data.select("features")
        val y = input_data.select("label")
        val x_meta = new MatrixMetadata(DF_VECTOR)
        val y_meta = new MatrixMetadata(DF_DOUBLES)
        val script = dmlFromFile(s"${script_path}/script.dml").
                in("X", x, x_meta).
                in("Y", y, y_meta)
        ...
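
(In case it's relevant: after that I just run the script through MLContext
in the usual way, roughly along these lines:)

        val ml = new MLContext(spark)
        val results = ml.execute(script)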

However, this results in an error from SystemML:
java.lang.ArrayIndexOutOfBoundsException: 0
I'm guessing this has something to do with Spark ML being zero-indexed and
SystemML being one-indexed. Is there something I should be doing differently
here? Note that I also tried converting the DataFrame to a CoordinateMatrix
and then creating an RDD[String] in IJV format (roughly as in the snippet at
the end of this message); that too resulted in an
ArrayIndexOutOfBoundsException. I'm guessing there's something simple I'm
doing wrong here, but I haven't been able to figure out exactly what. Please
let me know if you need more information (I can send along the full error
stack trace if that would be helpful)!
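
For reference, the IJV attempt looked roughly like this (reconstructed from
memory, and assuming the features column deserializes to the newer
ml.linalg Vector type; I shifted both indices to 1-based before handing the
RDD[String] to SystemML):

        import org.apache.spark.ml.linalg.Vector
        import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

        // one MatrixEntry per non-zero, using a 0-based row index from zipWithIndex
        val entries = input_data.select("features").rdd
                .map(_.getAs[Vector]("features").toSparse)
                .zipWithIndex()
                .flatMap { case (vec, rowIdx) =>
                        vec.indices.zip(vec.values).map { case (colIdx, v) =>
                                MatrixEntry(rowIdx, colIdx, v)
                        }
                }
        val coordMat = new CoordinateMatrix(entries)

        // IJV is "<row> <col> <value>" per line; SystemML expects 1-based indices
        val x_ijv = coordMat.entries.map(e => s"${e.i + 1} ${e.j + 1} ${e.value}")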

Thanks,

Anthony
