[ https://issues.apache.org/jira/browse/FLINK-5423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15824436#comment-15824436 ]
ASF GitHub Bot commented on FLINK-5423:
---------------------------------------

Github user Fokko commented on a diff in the pull request:

    https://github.com/apache/flink/pull/3077#discussion_r96289143

    --- Diff: flink-libraries/flink-ml/src/test/scala/org/apache/flink/ml/outlier/StochasticOutlierSelectionITSuite.scala ---
    @@ -0,0 +1,240 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one
    + * or more contributor license agreements.  See the NOTICE file
    + * distributed with this work for additional information
    + * regarding copyright ownership.  The ASF licenses this file
    + * to you under the Apache License, Version 2.0 (the
    + * "License"); you may not use this file except in compliance
    + * with the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.flink.ml.outlier
    +
    +import breeze.linalg.{sum, DenseVector => BreezeDenseVector}
    +import org.apache.flink.api.scala._
    +import org.apache.flink.ml.common.LabeledVector
    +import org.apache.flink.ml.math.DenseVector
    +import org.apache.flink.ml.outlier.StochasticOutlierSelection.BreezeLabeledVector
    +import org.apache.flink.ml.util.FlinkTestBase
    +import org.scalatest.{FlatSpec, Matchers}
    +
    +class StochasticOutlierSelectionITSuite extends FlatSpec with Matchers with FlinkTestBase {
    +  behavior of "Stochastic Outlier Selection algorithm"
    +  val EPSILON = 1e-16
    +
    +  /*
    +    Unit-tests created based on the Python scripts of the algorithms author'
    +    https://github.com/jeroenjanssens/scikit-sos
    +
    +    For more information about SOS, see https://github.com/jeroenjanssens/sos
    +    J.H.M. Janssens, F. Huszar, E.O. Postma, and H.J. van den Herik. Stochastic
    +    Outlier Selection. Technical Report TiCC TR 2012-001, Tilburg University,
    +    Tilburg, the Netherlands, 2012.
    +   */
    +
    +  val perplexity = 3
    +  val errorTolerance = 0
    +  val maxIterations = 5000
    +  val parameters = new StochasticOutlierSelection().setPerplexity(perplexity).parameters
    +
    +  val env = ExecutionEnvironment.getExecutionEnvironment
    +
    +  it should "Compute the perplexity of the vector and return the correct error" in {
    +    val vector = BreezeDenseVector(Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 8.0, 9.0, 10.0))
    +
    +    val output = Array(
    +      0.39682901665799636,
    +      0.15747326846175236,
    +      0.06248996227359784,
    +      0.024797830280027126,
    +      0.009840498605275054,
    +      0.0039049953849556816,
    +      6.149323865970302E-4,
    +      2.4402301428445443E-4,
    +      9.683541280042027E-5
    +    )
    +
    +    val search = StochasticOutlierSelection.binarySearch(
    +      vector,
    +      Math.log(perplexity),
    +      maxIterations,
    +      errorTolerance
    +    ).toArray
    +
    +    search should be(output)
    +  }
    +
    +  it should "Compute the distance matrix and give symmetrical distances" in {
    +
    +    val data = env.fromCollection(List(
    +      BreezeLabeledVector(0, BreezeDenseVector(Array(1.0, 3.0))),
    +      BreezeLabeledVector(1, BreezeDenseVector(Array(5.0, 1.0)))
    +    ))
    +
    +    val distanceMatrix = StochasticOutlierSelection
    +      .computeDissimilarityVectors(data)
    +      .map(_.data)
    +      .collect()
    +      .toArray
    +
    +    print(distanceMatrix)
    --- End diff --

    Oops, still in there from the debugging.


> Implement Stochastic Outlier Selection
> --------------------------------------
>
>                 Key: FLINK-5423
>                 URL: https://issues.apache.org/jira/browse/FLINK-5423
>             Project: Flink
>          Issue Type: Improvement
>          Components: Machine Learning Library
>            Reporter: Fokko Driesprong
>            Assignee: Fokko Driesprong
>
> I've implemented the Stochastic Outlier Selection (SOS) algorithm by Jeroen
> Janssens.
> http://jeroenjanssens.com/2013/11/24/stochastic-outlier-selection.html
> Integrated as much as possible with the components from the machine learning
> library.
> The algorithm itself has been compared to four other algorithms, and it
> shows that SOS has a higher performance on most of these real-world datasets.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)