Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15721#discussion_r93171343

    --- Diff: mllib/src/test/scala/org/apache/spark/ml/util/MLTestingUtils.scala ---
    @@ -224,4 +208,139 @@ object MLTestingUtils extends SparkFunSuite {
         }.toDF()
         (overSampledData, weightedData)
       }
    +
    +  /**
    +   * Generates a linear prediction function where the coefficients are generated randomly.
    +   * The function produces a continuous (numClasses = 0) or categorical (numClasses > 0) label.
    +   */
    +  def getRandomLinearPredictionFunction(
    +      numFeatures: Int,
    +      numClasses: Int,
    +      seed: Long): (Vector => Double) = {
    +    val rng = new scala.util.Random(seed)
    +    val trueNumClasses = if (numClasses == 0) 1 else numClasses
    +    val coefArray = Array.fill(numFeatures * trueNumClasses)(rng.nextDouble - 0.5)
    +    (features: Vector) => {
    +      if (numClasses == 0) {
    +        BLAS.dot(features, new DenseVector(coefArray))
    +      } else {
    +        val margins = new DenseVector(new Array[Double](numClasses))
    +        val coefMat = new DenseMatrix(numClasses, numFeatures, coefArray)
    +        BLAS.gemv(1.0, coefMat, features, 1.0, margins)
    +        margins.argmax.toDouble
    +      }
    +    }
    +  }
    +
    +  /**
    +   * A helper function to generate synthetic data. Generates random feature values,
    +   * both categorical and continuous, according to `categoricalFeaturesInfo`. The label is generated
    +   * from a random prediction function, and noise is added to the true label.
    +   *
    +   * @param numPoints The number of data points to generate.
    +   * @param numClasses The number of classes the outcome can take on. 0 for continuous labels.
    +   * @param numFeatures The number of features in the data.
    +   * @param categoricalFeaturesInfo Map of (featureIndex -> numCategories) for categorical features.
    +   * @param seed Random seed.
    +   * @param noiseLevel A number in [0.0, 1.0] indicating how much noise to add to the label.
    +   * @return Generated sequence of noisy instances.
    +   */
    +  def generateNoisyData(
    --- End diff --

I am a bit worried about whether we should provide this general noisy data generation function:
* It would be better to generate data following the rules of the specific algorithm. For example, for ```LogisticRegression``` the user would provide the coefficients and the mean and variance of the generated features.
* Some generators, such as [```LinearDataGenerator.generateLinearInput```](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/util/LinearDataGenerator.scala#L97), already take the noise level into account. Following ```LinearDataGenerator.generateLinearInput```, I think we should add an ```eps``` argument to the other generators, such as ```LogisticRegressionSuite.generateLogisticInput```, ```LogisticRegressionSuite.generateMultinomialLogisticInput``` and ```NaiveBayesSuite.generateNaiveBayesInput```, so that they output noisy labels natively.
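To make the two points above concrete, here is a rough sketch of what an algorithm-specific logistic generator with an ```eps``` argument might look like. It is not the existing ```LogisticRegressionSuite.generateLogisticInput```: the ```Point``` class, the ```NoisyLogisticSketch``` object, the parameter names and the choice to add ```eps```-scaled Gaussian noise to the margin are all assumptions made for illustration, and the code avoids Spark types so that it runs on its own.

```scala
import scala.util.Random

// Hypothetical stand-in for LabeledPoint so the sketch runs without Spark.
final case class Point(label: Double, features: Array[Double])

object NoisyLogisticSketch {

  /**
   * Hypothetical logistic data generator. The caller supplies the coefficients
   * and the per-feature mean and variance; `eps` scales Gaussian noise that is
   * added to the margin before the label is sampled (an assumed noise model).
   */
  def generateLogisticInput(
      intercept: Double,
      coefficients: Array[Double],
      xMean: Array[Double],
      xVariance: Array[Double],
      nPoints: Int,
      seed: Int,
      eps: Double = 0.0): Seq[Point] = {
    require(coefficients.length == xMean.length && xMean.length == xVariance.length,
      "coefficients, xMean and xVariance must have the same length")
    val rnd = new Random(seed)
    Seq.fill(nPoints) {
      // Draw each feature from a Gaussian with the requested mean and variance.
      val features = Array.tabulate(coefficients.length) { j =>
        xMean(j) + math.sqrt(xVariance(j)) * rnd.nextGaussian()
      }
      // Linear margin plus eps-scaled Gaussian noise.
      val dot = features.zip(coefficients).map { case (x, w) => x * w }.sum
      val margin = intercept + dot + eps * rnd.nextGaussian()
      // Sample a binary label from the logistic probability of the noisy margin.
      val p = 1.0 / (1.0 + math.exp(-margin))
      val label = if (rnd.nextDouble() < p) 1.0 else 0.0
      Point(label, features)
    }
  }
}
```

With ```eps = 0.0``` this reduces to plain logistic sampling, so existing callers of such a generator would not need to change; a test that needs noisier labels would pass a larger ```eps```.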