Re: Passing a CoordinateMatrix to SystemML

2017-12-24 Thread Anthony Thomas
Thanks Matthias - unfortunately I'm still running into an
ArrayIndexOutOfBoundsException both when reading the file as IJV and when
calling dataFrameToBinaryBlock. Just to confirm: I downloaded and compiled
the latest version using:

git clone https://github.com/apache/systemml
cd systemml
mvn clean package

mvn -version
Apache Maven 3.3.9
Maven home: /usr/share/maven
Java version: 1.8.0_151, vendor: Oracle Corporation
Java home: /usr/lib/jvm/java-8-oracle/jre
Default locale: en_US, platform encoding: UTF-8
OS name: "linux", version: "4.4.0-103-generic", arch: "amd64", family: "unix"

I have a simple driver script written in Scala which calls the API methods.
I compile the script using SBT (version 1.0.4) and submit using
spark-submit (spark version 2.2.0). Here's how I'm calling the methods:

val x = spark.read.parquet(inputPath).select(featureNames)
val mc = new MatrixCharacteristics(199563535L, 71403L, 1024, 1024,
  2444225947L) // as far as I know 1024x1024 is the default block size in sysml?
println("Reading Direct")
val xrdd = RDDConverterUtils.dataFrameToBinaryBlock(jsc, x, mc, false, true)
xrdd.count
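
(For what it's worth, a quick sanity check along these lines should confirm
that the hard-coded dimensions and non-zeros actually match the DataFrame -
this is only a sketch, assuming the vector column is named "features":)

// hypothetical sanity check, not part of the driver above: compare the
// hard-coded MatrixCharacteristics against what the DataFrame actually holds
import org.apache.spark.ml.linalg.Vector

val actualRows = x.count()                              // expect 199563535
val actualCols = x.head.getAs[Vector]("features").size  // expect 71403
val actualNnz  = x.rdd
  .map(_.getAs[Vector]("features").numNonzeros.toLong)
  .reduce(_ + _)                                        // expect ~2.44e9
println(s"rows=$actualRows cols=$actualCols nnz=$actualNnz")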

here is the stacktrace from calling dataFrameToBinaryBlock:

java.lang.ArrayIndexOutOfBoundsException: 0
    at org.apache.sysml.runtime.matrix.data.SparseRowVector.append(SparseRowVector.java:196)
    at org.apache.sysml.runtime.matrix.data.SparseBlockMCSR.append(SparseBlockMCSR.java:267)
    at org.apache.sysml.runtime.matrix.data.MatrixBlock.appendValue(MatrixBlock.java:685)
    at org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtils$DataFrameToBinaryBlockFunction.call(RDDConverterUtils.java:1067)
    at org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtils$DataFrameToBinaryBlockFunction.call(RDDConverterUtils.java:999)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:186)
    at org.apache.spark.api.java.JavaRDDLike$$anonfun$fn$7$1.apply(JavaRDDLike.scala:186)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

and here is the stacktrace from calling "read()" directly:

java.lang.ArrayIndexOutOfBoundsException: 2
    at org.apache.sysml.runtime.matrix.data.SparseBlockCOO.sort(SparseBlockCOO.java:399)
    at org.apache.sysml.runtime.matrix.data.MatrixBlock.mergeIntoSparse(MatrixBlock.java:1784)
    at org.apache.sysml.runtime.matrix.data.MatrixBlock.merge(MatrixBlock.java:1687)
    at org.apache.sysml.runtime.instructions.spark.utils.RDDAggregateUtils$MergeBlocksFunction.call(RDDAggregateUtils.java:627)
    at org.apache.sysml.runtime.instructions.spark.utils.RDDAggregateUtils$MergeBlocksFunction.call(RDDAggregateUtils.java:596)
    at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction2$1.apply(JavaPairRDD.scala:1037)
    at org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:189)
    at org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:188)
    at org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:150)
    at org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:194)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Best,

Anthony


On Sun, Dec 24, 2017 at 3:14 AM, Matthias Boehm  wrote:

> Thanks again for c

Re: Build server down

2017-12-24 Thread Ted Yu
The slowdown was widespread.

I logged INFRA-15716.

On Sun, Dec 24, 2017 at 8:22 AM, Glenn Weidner  wrote:

> It appears there's an issue with Jenkins service since I'm able to access
> https://sparktc.ibmcloud.com/repo/latest/ but not
> https://sparktc.ibmcloud.com/jenkins/. Unfortunately, I don't have direct
> access to the machine to debug further.
>
> Regards,
> Glenn
>
>
>
> From: Matthias Boehm 
> To: dev@systemml.apache.org
> Date: 12/24/2017 07:21 AM
> Subject: Build server down
> --
>
>
>
> Hi all,
>
> could somebody at the STC please find out what's wrong with our build
> server? Thanks and Happy Holidays.
>
> Regards,
> Matthias
>
>
>
>
>


Re: Build server down

2017-12-24 Thread Glenn Weidner

It appears there's an issue with Jenkins service since I'm able to access
https://sparktc.ibmcloud.com/repo/latest/ but not
https://sparktc.ibmcloud.com/jenkins/.  Unfortunately, I don't have direct
access to the machine to debug further.

Regards,
Glenn




From:   Matthias Boehm 
To: dev@systemml.apache.org
Date:   12/24/2017 07:21 AM
Subject: Build server down



Hi all,

could somebody at the STC please find out what's wrong with our build
server? Thanks and Happy Holidays.

Regards,
Matthias





Build server down

2017-12-24 Thread Matthias Boehm

Hi all,

could somebody at the STC please find out what's wrong with our build 
server? Thanks and Happy Holidays.


Regards,
Matthias


Re: Passing a CoordinateMatrix to SystemML

2017-12-24 Thread Matthias Boehm
Thanks again for catching this issue, Anthony - this IJV reblock issue 
with large ultra-sparse matrices is now fixed in master. It likely did 
not show up on the 1% sample because the data was small enough to read 
it directly into the driver.


However, the dataFrameToBinaryBlock path might be a separate issue that I
have not been able to reproduce yet, so it would be very helpful if you
could give it another try. Thanks.
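
For example, something along these lines - just a sketch, where the Spark
session (spark) and the converted dataset (x) stand in for your setup, and
note that SystemML's default block size is 1000 x 1000:

import org.apache.spark.api.java.JavaSparkContext
import org.apache.sysml.runtime.matrix.MatrixCharacteristics
import org.apache.sysml.runtime.instructions.spark.utils.RDDConverterUtils

// pass explicit dimensions, block sizes, and non-zeros so that no meta data
// has to be inferred during the conversion
val jsc  = new JavaSparkContext(spark.sparkContext)
val mc   = new MatrixCharacteristics(199563535L, 71403L, 1000, 1000, 2444225947L)
val xrdd = RDDConverterUtils.dataFrameToBinaryBlock(jsc, x, mc,
  false /* containsID */, true /* isVector */)
xrdd.count()  // force the conversion so any exception surfaces here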


Regards,
Matthias

On 12/24/2017 9:57 AM, Matthias Boehm wrote:

Hi Anthony,

thanks for helping to debug this issue. There are no limits other than
the dimensions and number of non-zeros being of type long. It sounds
more like an issue of converting special cases of ultra-sparse
matrices. I'll try to reproduce this issue and give an update as soon as
I know more. In the meantime, could you please (a) also provide the
stacktrace of calling dataFrameToBinaryBlock with SystemML 1.0, and (b)
try calling your IJV conversion script via spark-submit to exclude that
this issue is API-related? Thanks.

Regards,
Matthias

On 12/24/2017 1:40 AM, Anthony Thomas wrote:

Okay, thanks for the suggestions - I upgraded to 1.0 and tried providing
dimensions and blocksizes to dataFrameToBinaryBlock, both without success. I
additionally wrote out the matrix to HDFS in IJV format and am still
getting the same error when calling "read()" directly in the DML. However,
I created a 1% sample of the original data in IJV format and SystemML was
able to read the smaller file without any issue. This would seem to suggest
that either there is some corruption in the full file or I'm running into
some limit. The matrix is on the larger side: 1.9e8 rows by 7e4 cols with
2.4e9 nonzero values, but this seems like it should be well within the
limits of what SystemML/Spark can handle. I also checked for obvious data
errors (e.g., the file not being 1-indexed or containing blank lines). In case
it's helpful, the stacktrace from reading the data from HDFS in IJV format is
attached. Thanks again for your help - I really appreciate it.

00:24:18 WARN TaskSetManager: Lost task 30.0 in stage 0.0 (TID 126, 10.11.10.13, executor 0): java.lang.ArrayIndexOutOfBoundsException
    at java.lang.System.arraycopy(Native Method)
    at org.apache.sysml.runtime.matrix.data.SparseBlockCOO.shiftRightByN(SparseBlockCOO.java:594)
    at org.apache.sysml.runtime.matrix.data.SparseBlockCOO.set(SparseBlockCOO.java:323)
    at org.apache.sysml.runtime.matrix.data.MatrixBlock.mergeIntoSparse(MatrixBlock.java:1790)
    at org.apache.sysml.runtime.matrix.data.MatrixBlock.merge(MatrixBlock.java:1736)
    at org.apache.sysml.runtime.instructions.spark.utils.RDDAggregateUtils$MergeBlocksFunction.call(RDDAggregateUtils.java:627)
    at org.apache.sysml.runtime.instructions.spark.utils.RDDAggregateUtils$MergeBlocksFunction.call(RDDAggregateUtils.java:596)
    at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction2$1.apply(JavaPairRDD.scala:1037)
    at org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:189)
    at org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:188)
    at org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:150)
    at org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:194)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Anthony


On Sat, Dec 23, 2017 at 4:27 AM, Matthias Boehm  wrote:


Given the line numbers from the stacktrace, it seems that you are using a
rather old version of SystemML. Hence, I would recommend upgrading to
SystemML 1.0, or at least 0.15, first.

If the error persists or you're not able to upgrade, please try to call
dataFrameToBinaryBlock with provided matrix characteristics of dimensions
and blocksizes. The issue you've shown usually originates from incorrect
meta data (e.g., a negative number of columns or block sizes), which prevents
the sparse rows from growing to the necessary sizes.

Regards,
Matthias

On 12/22/2017 10:42 PM, Anthony Thomas wrote:


Hi Matthias,

Thanks for the help! In response to your questions:

   1. Sorry - this was a typo: the correct schema is: [y: int, features:
   vector] - the column "features" was created using Spark's Vecto

Re: Passing a CoordinateMatrix to SystemML

2017-12-24 Thread Matthias Boehm

Hi Anthony,

thanks for helping to debug this issue. There are no limits other than 
the dimensions and number of non-zeros being of type long. It sounds 
more like an issue of converting special cases of ultra-sparse 
matrices. I'll try to reproduce this issue and give an update as soon as 
I know more. In the meantime, could you please (a) also provide the 
stacktrace of calling dataFrameToBinaryBlock with SystemML 1.0, and (b) 
try calling your IJV conversion script via spark-submit to exclude that 
this issue is API-related? Thanks.


Regards,
Matthias

On 12/24/2017 1:40 AM, Anthony Thomas wrote:

Okay, thanks for the suggestions - I upgraded to 1.0 and tried providing
dimensions and blocksizes to dataFrameToBinaryBlock, both without success. I
additionally wrote out the matrix to HDFS in IJV format and am still
getting the same error when calling "read()" directly in the DML. However,
I created a 1% sample of the original data in IJV format and SystemML was
able to read the smaller file without any issue. This would seem to suggest
that either there is some corruption in the full file or I'm running into
some limit. The matrix is on the larger side: 1.9e8 rows by 7e4 cols with
2.4e9 nonzero values, but this seems like it should be well within the
limits of what SystemML/Spark can handle. I also checked for obvious data
errors (e.g., the file not being 1-indexed or containing blank lines). In case
it's helpful, the stacktrace from reading the data from HDFS in IJV format is
attached. Thanks again for your help - I really appreciate it.

00:24:18 WARN TaskSetManager: Lost task 30.0 in stage 0.0 (TID 126, 10.11.10.13, executor 0): java.lang.ArrayIndexOutOfBoundsException
    at java.lang.System.arraycopy(Native Method)
    at org.apache.sysml.runtime.matrix.data.SparseBlockCOO.shiftRightByN(SparseBlockCOO.java:594)
    at org.apache.sysml.runtime.matrix.data.SparseBlockCOO.set(SparseBlockCOO.java:323)
    at org.apache.sysml.runtime.matrix.data.MatrixBlock.mergeIntoSparse(MatrixBlock.java:1790)
    at org.apache.sysml.runtime.matrix.data.MatrixBlock.merge(MatrixBlock.java:1736)
    at org.apache.sysml.runtime.instructions.spark.utils.RDDAggregateUtils$MergeBlocksFunction.call(RDDAggregateUtils.java:627)
    at org.apache.sysml.runtime.instructions.spark.utils.RDDAggregateUtils$MergeBlocksFunction.call(RDDAggregateUtils.java:596)
    at org.apache.spark.api.java.JavaPairRDD$$anonfun$toScalaFunction2$1.apply(JavaPairRDD.scala:1037)
    at org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:189)
    at org.apache.spark.util.collection.ExternalSorter$$anonfun$5.apply(ExternalSorter.scala:188)
    at org.apache.spark.util.collection.AppendOnlyMap.changeValue(AppendOnlyMap.scala:150)
    at org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(SizeTrackingAppendOnlyMap.scala:32)
    at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:194)
    at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Anthony


On Sat, Dec 23, 2017 at 4:27 AM, Matthias Boehm  wrote:


Given the line numbers from the stacktrace, it seems that you are using a
rather old version of SystemML. Hence, I would recommend upgrading to
SystemML 1.0, or at least 0.15, first.

If the error persists or you're not able to upgrade, please try to call
dataFrameToBinaryBlock with provided matrix characteristics of dimensions
and blocksizes. The issue you've shown usually originates from incorrect
meta data (e.g., a negative number of columns or block sizes), which prevents
the sparse rows from growing to the necessary sizes.

Regards,
Matthias

On 12/22/2017 10:42 PM, Anthony Thomas wrote:


Hi Matthias,

Thanks for the help! In response to your questions:

   1. Sorry - this was a typo: the correct schema is: [y: int, features:
   vector] - the column "features" was created using Spark's VectorAssembler
   (a sketch of this step is included at the end of this message) and the
   underlying type is an org.apache.spark.ml.linalg.SparseVector.
   Calling x.schema results in: org.apache.spark.sql.types.StructType =
   StructType(StructField(features,org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7,true)
   2. "y" converts fine - it appears the only issue is with X. The script
   still crashes when running "print(sum(X))". The full stack trace is
   attached at the end of the message.
   3. Unfortunately, the error persists when calling
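
For reference, the "features" column from item 1 above was assembled roughly
as follows - only a sketch, where "raw" and the input column names are
placeholders for the actual Parquet data:

import org.apache.spark.ml.feature.VectorAssembler

// assemble the raw feature columns into a single vector column "features"
val assembler = new VectorAssembler()
  .setInputCols(Array("col1", "col2"))  // hypothetical raw feature columns
  .setOutputCol("features")
val assembled = assembler.transform(raw).select("y", "features")
assembled.printSchema()  // y: integer, features: vector (VectorUDT)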