Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/3637#discussion_r21566135
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/LabeledPoint.scala ---
    @@ -0,0 +1,52 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *    http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +
    +package org.apache.spark.ml
    +
    +import scala.beans.BeanInfo
    +
    +import org.apache.spark.annotation.AlphaComponent
    +import org.apache.spark.mllib.linalg.Vector
    +
    +/**
    + * :: AlphaComponent ::
    + * Class that represents an instance (data point) for prediction tasks.
    + *
    + * @param label Label to predict
    + * @param features List of features describing this instance
    + * @param weight Instance weight
    + */
    +@AlphaComponent
    +@BeanInfo
    +case class LabeledPoint(label: Double, features: Vector, weight: Double) {
    --- End diff ---
    
    For optimizing storage, I think we can do it in either case (Vector for 
weak typing, or Array[Double] for strong typing).
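
    A minimal sketch of the two storage options under discussion (illustrative only, not the actual Spark API; `FeatureVector`, `DenseFeatures`, `WeakPoint`, and `StrongPoint` are hypothetical names):

    ```scala
    // Weak typing: features behind an abstract vector interface, which
    // leaves room for dense and sparse implementations.
    trait FeatureVector {
      def apply(i: Int): Double
      def size: Int
    }

    final case class DenseFeatures(values: Array[Double]) extends FeatureVector {
      def apply(i: Int): Double = values(i)
      def size: Int = values.length
    }

    final case class WeakPoint(label: Double, features: FeatureVector, weight: Double)

    // Strong typing: features pinned to Array[Double] -- simpler, but a
    // sparse representation would need a separate wrapper.
    final case class StrongPoint(label: Double, features: Array[Double], weight: Double)
    ```

    Either way the on-disk/in-memory layout can be optimized; the difference is only in what the type system promises to callers.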
    
    I agree that both options are good & just wanted to hash out the pros & cons.
    
    > Is there really a case where the user doesn't know schema types, suggests 
a type, and lets the framework override it?
    
    That's not quite what I meant.  I was thinking of 2 use cases:
    1. An expert user specifies types.  The algorithm should use those types.
    2. A beginner user does not explicitly specify types, but uses the typed 
API, so that loaded data are given default types (probably continuous).  
Consistent with the first case, the algorithm should use the given (default) 
types; intuitively, though, the algorithm should infer the types instead.
    
    This is less of an issue if the typed interface is designated as an expert 
API.
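
    A sketch of how those two cases could play out (hypothetical; `FeatureType`, `Continuous`, `Categorical`, and `resolveTypes` are illustrative names, not Spark API):

    ```scala
    sealed trait FeatureType
    case object Continuous extends FeatureType
    case object Categorical extends FeatureType

    // Case 1: an expert supplies explicit per-feature types; use them as given.
    // Case 2: no types supplied; loading falls back to a default (continuous)
    // for every feature -- which is exactly where "use the given types" and
    // "infer the types" diverge.
    def resolveTypes(numFeatures: Int,
                     userTypes: Option[Array[FeatureType]]): Array[FeatureType] =
      userTypes.getOrElse(Array.fill(numFeatures)(Continuous))
    ```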
    
    I'm starting to wonder if the typed interface should be completely public.  
I'll move this to a non-inline comment so the rest of our discussion does not 
get hidden in the future.

