[ https://issues.apache.org/jira/browse/SPARK-7529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14562199#comment-14562199 ]
Joseph K. Bradley edited comment on SPARK-7529 at 5/28/15 3:05 AM:
-------------------------------------------------------------------

*spark.mllib: Issues found in a pass through the spark.mllib package*
* _I have marked certain items with_ "*SHOULD FIX*" _or_ "_(not important)_".
** Some APIs cannot be changed, so "fixing" may mean adding a Java version of a method.

h3. Classification

LogisticRegressionModel + SVMModel
* scala.Option<Object> getThreshold() *--> SHOULD FIX: make Java version?*

NaiveBayesModel
* "Java-friendly constructor": NaiveBayesModel(Iterable<Object> labels, Iterable<Object> pi, Iterable<Iterable<Object>> theta) *--> SHOULD FIX*

h3. Clustering

DistributedLDAModel
* RDD<scala.Tuple2<Object,Vector>> topicDistributions() *--> SHOULD FIX: make Java version?*

GaussianMixtureModel + KMeansModel + NaiveBayesModel
* RDD<Object> predict(RDD<Vector> points) *--> SHOULD FIX with Java version*

StreamingKMeans *--> SHOULD FIX with Java versions*
* DStream<Object> predictOn(DStream<Vector> data)
* <K> DStream<scala.Tuple2<K,Object>> predictOnValues(DStream<scala.Tuple2<K,Vector>> data, scala.reflect.ClassTag<K> evidence$1)

h3. Evaluation

AreaUnderCurve *--> SHOULD FIX with Java versions*
* static double of(scala.collection.Iterable<scala.Tuple2<Object,Object>> curve)
* static double of(RDD<scala.Tuple2<Object,Object>> curve)

BinaryClassificationMetrics *--> SHOULD FIX*
* LOTS (everything taking/returning an RDD)

RankingMetrics constructor
* RankingMetrics(RDD<scala.Tuple2<Object,Object>> predictionAndLabels, scala.reflect.ClassTag<T> evidence$1) *--> SHOULD FIX with Java version*

h3. Feature

Word2VecModel
* scala.Tuple2<String,Object>[] findSynonyms *--> SHOULD FIX with class to replace tuple?*

h3. Linalg

SparseMatrix
* static SparseMatrix fromCOO(int numRows, int numCols, scala.collection.Iterable<scala.Tuple3<Object,Object,Object>> entries) *--> SHOULD FIX with Java version?*

Vectors
* static Vector sparse(int size, scala.collection.Seq<scala.Tuple2<Object,Object>> elements) *--> SHOULD FIX with Java version?*

BlockMatrix *--> SHOULD FIX with Java versions*
* RDD<scala.Tuple2<scala.Tuple2<Object,Object>,Matrix>> blocks()
** _This issue appears in the constructors too._

h3. Optimization

_(lower priority b/c DeveloperApi, which needs to be updated anyway)_

Optimizer
* Vector optimize(RDD<scala.Tuple2<Object,Vector>> data, Vector initialWeights)
* _The same issue appears elsewhere, wherever Double is used in a tuple._

Gradient
* scala.Tuple2<Vector,Object> compute(Vector data, double label, Vector weights)

h3. Recommendation

MatrixFactorizationModel *--> SHOULD FIX with Java versions*
* _constructor_: MatrixFactorizationModel(int rank, RDD<scala.Tuple2<Object,double[]>> userFeatures, RDD<scala.Tuple2<Object,double[]>> productFeatures)
* RDD<scala.Tuple2<Object,double[]>> productFeatures()
* RDD<scala.Tuple2<Object,Rating[]>> recommendProductsForUsers(int num)
* RDD<scala.Tuple2<Object,Rating[]>> recommendUsersForProducts(int num)
* RDD<scala.Tuple2<Object,double[]>> userFeatures()

h3. Stats

Statistics *--> SHOULD FIX with Java versions*
* static double corr(RDD<Object> x, RDD<Object> y)
* static double corr(RDD<Object> x, RDD<Object> y, String method)

h3. Trees

DecisionTreeModel
* JavaRDD<Object> predict(JavaRDD<Vector> features) *--> SHOULD FIX*
** _This is because we use Double instead of java.lang.Double (unlike in, e.g., TreeEnsembleModel)._

Split _(low priority, but should fix with Java version)_
* scala.collection.immutable.List<Object> categories()

h3. util

DataValidators *--> SHOULD FIX with Java versions*
* static scala.Function1<RDD<LabeledPoint>,Object> binaryLabelValidator()
* static scala.Function1<RDD<LabeledPoint>,Object> multiLabelValidator(int k)
> Java compatibility check for MLlib 1.4
> --------------------------------------
>
>                 Key: SPARK-7529
>                 URL: https://issues.apache.org/jira/browse/SPARK-7529
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML, MLlib
>    Affects Versions: 1.4.0
>            Reporter: Xiangrui Meng
>            Assignee: Joseph K. Bradley
>
> Check Java compatibility for MLlib 1.4. We should create separate JIRAs for
> each possible issue.
> Checking compatibility means:
> * comparing with the Scala doc
> * verifying that Java docs are not messed up by Scala type incompatibilities
> (E.g., check for generic "Object" types where Java cannot understand complex
> Scala types.
> Also check Scala objects (especially with nesting!) carefully.)
> * If needed for complex issues, create small Java unit tests which execute
> each method. (The correctness can be checked in Scala.)

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
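A self-contained sketch of the issue behind most of the signatures flagged above: scala.Double is a primitive, so inside generics it erases to Object, and Java callers of types like scala.Tuple2<Object,Object> must cast every element. The Pair class, sumScores, and sumScoresFriendly names below are hypothetical stand-ins (no Spark or Scala dependency), not actual MLlib APIs; Pair mimics scala.Tuple2.

```java
import java.util.ArrayList;
import java.util.List;

public class ObjectErasureDemo {
    // Toy stand-in for scala.Tuple2<A, B>.
    static class Pair<A, B> {
        final A first;
        final B second;
        Pair(A first, B second) { this.first = first; this.second = second; }
    }

    // Roughly what Java sees for a Scala method typed over (Double, Double)
    // tuples: both components erase to Object, so the caller must cast.
    static double sumScores(List<Pair<Object, Object>> curve) {
        double sum = 0.0;
        for (Pair<Object, Object> p : curve) {
            sum += (Double) p.second;  // unchecked cast forced on every access
        }
        return sum;
    }

    // The "Java-friendly version" style suggested in the comment: expose
    // boxed java.lang.Double directly, so no casts are needed from Java.
    static double sumScoresFriendly(List<Pair<String, Double>> scored) {
        double sum = 0.0;
        for (Pair<String, Double> p : scored) sum += p.second;
        return sum;
    }

    public static void main(String[] args) {
        List<Pair<Object, Object>> curve = new ArrayList<>();
        curve.add(new Pair<Object, Object>(0.0, 1.0));
        curve.add(new Pair<Object, Object>(0.5, 2.0));
        System.out.println(sumScores(curve));  // 3.0

        List<Pair<String, Double>> scored = new ArrayList<>();
        scored.add(new Pair<>("spark", 1.5));
        System.out.println(sumScoresFriendly(scored));  // 1.5
    }
}
```

The same pattern applies to the RDD<Object> and Option<Object> entries: the usual fix is an overload (or Java-specific wrapper) whose signature uses java.lang.Double and Java collection types.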