Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/19979

> assume that Dataset.collect() returns the Rows in a fixed order.

I'm quite sure that:
* When the Dataset has been constructed without any shuffles or repartitions, the Rows are always in a fixed order.
* When there has been a shuffle, the Rows will likely not follow a fixed order.
* Spark APIs never guarantee a fixed order, except when sorting has been performed.

Basically, we should avoid design patterns which assume a fixed Row order. It may be safe sometimes, but the assumption can lead to mistakes.

> There are two cases which can use globalCheckFunction:
> * testing statistics (such as min/max) on global transformer output
> * getting a global result array and comparing it with hard-coded array values

For testing statistics, globalCheckFunction makes sense.
* But none of the tests in this PR require it. Are there any unit tests in MLlib which do?

For comparing results with expected values, I *much* prefer those values to be in a column of the original input dataset. That has 2 benefits:
* It makes tests easier to read, since inputs and expected values are side-by-side in the code.
* We don't have to worry about Row order.
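The side-by-side pattern above can be sketched without Spark at all. Below is a minimal, hypothetical Python illustration (plain dicts stand in for Rows, and a `random.shuffle` simulates the nondeterministic order a shuffled Dataset's `collect()` can return); none of these names come from the PR itself:

```python
import random

# Hypothetical stand-ins: each "Row" is a dict carrying both the input
# and its expected value, side by side.
rows = [{"input": x, "expected": x * 2} for x in [1, 2, 3, 4]]

def transform(rows):
    # A toy "transformer" that doubles the input and, like a shuffle,
    # does not preserve row order.
    out = [dict(r, output=r["input"] * 2) for r in rows]
    random.shuffle(out)
    return out

result = transform(rows)

# Fragile pattern: comparing against a separate hard-coded array assumes
# a fixed row order, which a shuffle can break:
#   assert [r["output"] for r in result] == [2, 4, 6, 8]  # may fail

# Robust pattern: the expected value travels with its row, so the check
# holds regardless of the order in which rows come back.
for r in result:
    assert r["output"] == r["expected"]
```

The same idea in a real MLlib test would mean adding an `expected` column to the input DataFrame and asserting per-row equality on the transformer's output, rather than zipping `collect()` results against a hard-coded array.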