Github user Dooyoung-Hwang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22219#discussion_r213200433

    --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
    @@ -3237,6 +3237,28 @@ class Dataset[T] private[sql](
         files.toSet.toArray
       }

    +  /**
    +   * Returns a tuple of the row count and an iterator over all rows in this Dataset.
    +   *
    +   * The iterator consumes as much memory as the total size of the serialized results,
    +   * which can be limited with the config 'spark.driver.maxResultSize'. Rows are
    +   * deserialized lazily while the returned iterator is consumed, so callers can decide
    +   * whether to collect all deserialized rows at once or to iterate over them
    +   * incrementally, based on the total row count and the available driver memory.
    +   */
    +  def collectCountAndIterator(): (Long, Iterator[T]) =
    --- End diff --

    Ok. I agree with you.
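For context, here is a minimal usage sketch of the API being discussed, assuming collectCountAndIterator() lands with the signature shown in the diff above. Note it is the method under review, not an existing Spark API, and the dataset and size threshold below are made up for illustration:

    import org.apache.spark.sql.SparkSession

    object CollectCountAndIteratorSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("collect-count-and-iterator-sketch")
          .master("local[*]")
          // Serialized results buffered for the iterator count against this limit.
          .config("spark.driver.maxResultSize", "1g")
          .getOrCreate()
        import spark.implicits._

        val ds = spark.range(0, 1000000L).map(_.toString)

        // The proposed API: the count is known up front, while rows are
        // deserialized lazily as the iterator is consumed.
        val (count, rows) = ds.collectCountAndIterator()

        if (count < 10000L) {
          // Small result: materializing every row in driver memory is safe.
          val all = rows.toArray
          println(s"Collected ${all.length} rows eagerly")
        } else {
          // Large result: consume incrementally to bound deserialized memory.
          rows.take(10).foreach(println)
          println(s"Streamed a sample of $count rows")
        }

        spark.stop()
      }
    }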