[GitHub] spark pull request #16699: [SPARK-18710][ML] Add offset in GLM

yanboliang Thu, 29 Jun 2017 03:25:36 -0700

Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/16699#discussion_r124753530
  
    --- Diff: mllib/src/main/scala/org/apache/spark/ml/regression/GeneralizedLinearRegression.scala ---
    @@ -961,14 +1008,16 @@ class GeneralizedLinearRegressionModel private[ml] (
       }
     
       override protected def transformImpl(dataset: Dataset[_]): DataFrame = {
    --- End diff --
    
    I have summarized the four cases for making predictions below:
    
    Estimator (training data) | Transformer (prediction data) | How R predicts | How Spark predicts
    ------------------------- | ----------------------------- | -------------- | ------------------
    w/ offset column | w/ offset column | uses offset of prediction data | uses offset of prediction data
    w/ offset column | w/o offset column | uses offset of training data | does not use offset
    w/o offset column | w/ offset column | does not use offset | does not use offset
    w/o offset column | w/o offset column | does not use offset | does not use offset
    
    Cases 1 and 4 are not controversial.
    For case 2, we handle it differently from R because we can't store all the
```offset``` data in our model the way R does, but we should log a warning
to let users know that this behavior differs from R.
    For case 3, your current implementation ignores whether the model was
trained with an offset, which I think is worth discussing. I think the correct
approach should consider whether the model was trained with an offset: if it
was trained without one, we should ignore the offset column when making
predictions on a new dataset, or at least print a warning to remind users.
    However, I think we can discuss and resolve this issue in follow-up work.
@actuaryzhang What do you think of my proposal for how Spark should make
predictions? Thanks.
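
As a rough illustration of the four cases in the table above, the decision can be sketched as a small pure function. This is only a sketch under stated assumptions: `OffsetPredictionSketch`, `effectiveOffset`, and `linearPredictor` are hypothetical names that do not appear in the pull request, and the warning texts are invented for illustration. It assumes the usual GLM convention that an offset is added to the linear predictor before the inverse link is applied.

```scala
// Hypothetical sketch: none of these names come from Spark's
// GeneralizedLinearRegression; they only illustrate the table above.
object OffsetPredictionSketch {

  /** Returns the offset to add to the linear predictor, plus an optional
    * warning, following the "How Spark predicts" column of the table. */
  def effectiveOffset(
      trainedWithOffset: Boolean,
      predictionOffset: Option[Double]): (Double, Option[String]) =
    (trainedWithOffset, predictionOffset) match {
      // Case 1: both sides have an offset, so use the prediction data's offset.
      case (true, Some(offset)) => (offset, None)
      // Case 2: R would reuse the training offsets; here no offset is applied.
      case (true, None) =>
        (0.0, Some("Model was trained with an offset but none was supplied; this differs from R."))
      // Case 3: the model never saw an offset, so the column is ignored.
      case (false, Some(_)) =>
        (0.0, Some("Model was trained without an offset; ignoring the offset column."))
      // Case 4: no offset anywhere.
      case (false, None) => (0.0, None)
    }

  /** Linear predictor eta = x . beta + intercept + offset. */
  def linearPredictor(
      features: Array[Double],
      coefficients: Array[Double],
      intercept: Double,
      offset: Double): Double = {
    require(features.length == coefficients.length,
      "features and coefficients must have equal length")
    features.zip(coefficients).map { case (x, b) => x * b }.sum + intercept + offset
  }
}
```

Returning the warning as a value instead of logging it directly keeps the sketch free of any logging dependency; a real implementation would presumably route it through the project's own logging.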



