Well, let's do the following to figure this out:

1) If the schema is indeed [label: Integer, features: SparseVector], please change the third line to val y = input_data.select("label").

2) For debugging, I would recommend using a simple script like "print(sum(X));" and converting X and y separately to isolate the problem.

3) If it's still failing, it would be helpful to know (a) whether the issue is in converting X, y, or both, and (b) the full stacktrace.

4) As a workaround, you might also call our internal converter directly via:
RDDConverterUtils.dataFrameToBinaryBlock(jsc, df, mc, containsID, isVector), where jsc is the Java Spark context, df is the dataset, mc is the matrix characteristics (if unknown, simply use new MatrixCharacteristics()), containsID indicates whether the dataset contains a column "__INDEX" with the row indexes, and isVector indicates whether the passed dataset contains vectors or basic types such as double.
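
To make the steps above concrete, here is a rough sketch of how they could look in a spark-shell session. It is only an illustration under a few assumptions: `spark` (a SparkSession) and `inputPath` are already defined as in your snippet, the import paths are the ones I would expect in a SystemML 1.0-era build and may differ in your version, and containsID is set to false on the assumption that your data has no "__INDEX" column.

```scala
import org.apache.spark.api.java.JavaSparkContext
import org.apache.sysml.api.mlcontext.{MLContext, MatrixFormat, MatrixMetadata}
import org.apache.sysml.api.mlcontext.ScriptFactory.dml
// Assumed package locations; please check them against your SystemML version.
import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtils
import org.apache.sysml.runtime.matrix.MatrixCharacteristics

// Assumes `spark` and `inputPath` already exist, as in the original snippet.
val ml = new MLContext(spark)
val input_data = spark.read.parquet(inputPath)

// Step 1: select the columns that are actually in the schema.
val x = input_data.select("features")
val y = input_data.select("label")

// Steps 2 and 3: convert X and y in separate, trivial scripts,
// so that a failure points at one of the two conversions.
ml.execute(dml("print(sum(X));").in("X", x, new MatrixMetadata(MatrixFormat.DF_VECTOR)))
ml.execute(dml("print(sum(Y));").in("Y", y, new MatrixMetadata(MatrixFormat.DF_DOUBLES)))

// Step 4: call the internal converter directly.
// containsID = false assumes there is no "__INDEX" column.
val jsc = new JavaSparkContext(spark.sparkContext)
val xBlocks = RDDConverterUtils.dataFrameToBinaryBlock(
  jsc, x, new MatrixCharacteristics(), false, true)   // isVector = true: vector column
val yBlocks = RDDConverterUtils.dataFrameToBinaryBlock(
  jsc, y, new MatrixCharacteristics(), false, false)  // isVector = false: basic types

// Force evaluation so that conversion errors surface here.
println(s"X blocks: ${xBlocks.count()}, y blocks: ${yBlocks.count()}")
```

If the two print(sum(...)) scripts run but the full script fails, the problem is more likely in the script itself than in the data conversion.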


Regards,
Matthias

On 12/22/2017 12:00 AM, Anthony Thomas wrote:
Hi SystemML folks,

I'm trying to pass some data from Spark to a DML script via the MLContext
API. The data is derived from a parquet file containing a dataframe with
the schema: [label: Integer, features: SparseVector]. I am doing the
following:

        val input_data = spark.read.parquet(inputPath)
        val x = input_data.select("features")
        val y = input_data.select("y")
        val x_meta = new MatrixMetadata(DF_VECTOR)
        val y_meta = new MatrixMetadata(DF_DOUBLES)
        val script = dmlFromFile(s"${script_path}/script.dml").
                in("X", x, x_meta).
                in("Y", y, y_meta)
        ...

However, this results in an error from SystemML:
java.lang.ArrayIndexOutOfBoundsException: 0
I'm guessing this has something to do with SparkML being zero indexed and
SystemML being 1 indexed. Is there something I should be doing differently
here? Note that I also tried converting the dataframe to a CoordinateMatrix
and then creating an RDD[String] in IJV format. That too resulted in
"ArrayIndexOutOfBoundsExceptions." I'm guessing there's something simple
I'm doing wrong here, but I haven't been able to figure out exactly what.
Please let me know if you need more information (I can send along the full
error stacktrace if that would be helpful)!

Thanks,

Anthony
