GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/21145
SPARK-24073: Rename DataReaderFactory to ReadTask.

## What changes were proposed in this pull request?

This reverses the changes in SPARK-23219, which renamed ReadTask to DataReaderFactory. The intent of that change was to make the read and write API match (write side uses DataWriterFactory), but the underlying problem is that the two classes are not equivalent.

ReadTask/DataReader function as Iterable/Iterator. One ReadTask is a specific read task for a partition of the data to be read, in contrast to DataWriterFactory, where the same factory instance is used in all write tasks. ReadTask's purpose is to manage the lifecycle of DataReader with an explicit create operation to mirror the close operation. This is no longer clear from the API, where DataReaderFactory appears to be more generic than it is, and it isn't clear why a set of them is produced for a read.

## How was this patch tested?

Existing tests, which have been updated to use the new name.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rdblue/spark SPARK-24073-revert-data-reader-factory-rename

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21145.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21145

----
commit c364c05d3141bbe0ed29a2b02cecfa541d9c8212
Author: Ryan Blue <blue@...>
Date: 2018-04-24T19:55:25Z

    SPARK-24073: Rename DataReaderFactory to ReadTask.

----
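To illustrate the Iterable/Iterator analogy from the description above, here is a minimal sketch. The `ReadTask` and `DataReader` interfaces below are simplified stand-ins written for this example, not the actual Spark interfaces, and `RangeReadTask`, `planReadTasks`, and `runTask` are hypothetical names. The sketch assumes one task per partition, with each task creating its own reader that is closed when the task finishes.

```java
import java.io.Closeable;
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

public class ReadTaskSketch {
    /** Simplified stand-in for DataReader: iterator-like, with an explicit close. */
    interface DataReader<T> extends Closeable {
        boolean next();
        T get();
        @Override
        void close(); // narrowed so this sketch needs no checked-exception handling
    }

    /** Simplified stand-in for ReadTask: iterable-like, one instance per partition. */
    interface ReadTask<T> extends Serializable {
        DataReader<T> createDataReader();
    }

    /** Hypothetical task that reads one fixed range [start, end) of integers. */
    static final class RangeReadTask implements ReadTask<Integer> {
        private final int start;
        private final int end;

        RangeReadTask(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public DataReader<Integer> createDataReader() {
            // The explicit create operation mirrors the close operation below.
            return new DataReader<Integer>() {
                private int current = start - 1;

                @Override
                public boolean next() {
                    current++;
                    return current < end;
                }

                @Override
                public Integer get() {
                    return current;
                }

                @Override
                public void close() {
                    // Nothing to release in this in-memory sketch.
                }
            };
        }
    }

    /** Planning a read produces a set of tasks, each bound to one partition. */
    static List<ReadTask<Integer>> planReadTasks(int total, int partitions) {
        List<ReadTask<Integer>> tasks = new ArrayList<>();
        int size = (total + partitions - 1) / partitions; // ceiling division
        for (int start = 0; start < total; start += size) {
            tasks.add(new RangeReadTask(start, Math.min(start + size, total)));
        }
        return tasks;
    }

    /** Runs one task: create the reader, iterate over it, then close it. */
    static List<Integer> runTask(ReadTask<Integer> task) {
        List<Integer> rows = new ArrayList<>();
        try (DataReader<Integer> reader = task.createDataReader()) {
            while (reader.next()) {
                rows.add(reader.get());
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        for (ReadTask<Integer> task : planReadTasks(10, 3)) {
            System.out.println(runTask(task));
        }
    }
}
```

In this sketch each task carries its own partition bounds, which is why a read produces a set of them. A write-side factory, as described above, would instead be a single instance shared by all write tasks.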