Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/19979

> assume that Dataset.collect() returns the Rows in a fixed order.

I'm quite sure that:
* When the Dataset has been constructed without any shuffles or repartitions, the Rows are always in a fixed order.
* When there has been a shuffle, the Rows will likely not follow a fixed order.
* Spark APIs never guarantee a fixed order, except when sorting has been performed.

Basically, we should avoid design patterns which assume a fixed Row order. It may be safe sometimes, but the assumption can lead to mistakes.

> There are two cases which can use globalCheckFunction:
> * testing statistics (such as min/max) on global transformer output
> * getting a global result array and comparing it with hard-coded array values

For testing statistics, globalCheckFunction makes sense.
* But none of the tests in this PR require it. Are there any unit tests in MLlib which do?

For comparing results with expected values, I *much* prefer those values to be in a column of the original input dataset. That has 2 benefits:
* It makes tests easier to read, since inputs and expected values are side-by-side in the code.
* We don't have to worry about Row order.
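The side-by-side pattern above can be sketched without Spark at all. Below is a minimal, hypothetical Python illustration (plain dicts stand in for Rows, and a `random.shuffle` simulates the nondeterministic order a shuffled Dataset's `collect()` can return); none of these names come from the PR itself:

```python
import random

# Hypothetical stand-ins: each "Row" is a dict carrying both the input
# and its expected value, side by side.
rows = [{"input": x, "expected": x * 2} for x in [1, 2, 3, 4]]

def transform(rows):
    # A toy "transformer" that doubles the input and, like a shuffle,
    # does not preserve row order.
    out = [dict(r, output=r["input"] * 2) for r in rows]
    random.shuffle(out)
    return out

result = transform(rows)

# Fragile pattern: comparing against a separate hard-coded array assumes
# a fixed row order, which a shuffle can break:
#   assert [r["output"] for r in result] == [2, 4, 6, 8]  # may fail

# Robust pattern: the expected value travels with its row, so the check
# holds regardless of the order in which rows come back.
for r in result:
    assert r["output"] == r["expected"]
```

The same idea in a real MLlib test would mean adding an `expected` column to the input DataFrame and asserting per-row equality on the transformer's output, rather than zipping `collect()` results against a hard-coded array.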