GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/21237
[SPARK-23325][WIP] Test parquet returning internal row

## What changes were proposed in this pull request?

This updates `ParquetFileFormat` to return `InternalRow` instead of `UnsafeRow`, to get a rough assessment of how many code paths depend on interfaces that return `InternalRow` actually returning `UnsafeRow`.

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rdblue/spark test-parquet-returning-internal-row

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21237.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21237

----

commit cdf2b4db2f8a94cbe72fe08bfdbde55f12290679
Author: Ryan Blue <blue@...>
Date: 2018-04-24T19:55:25Z

    SPARK-24073: Rename DataReaderFactory to ReadTask.

    This reverses the changes in SPARK-23219, which renamed ReadTask to
    DataReaderFactory. The intent of that change was to make the read and
    write APIs match (the write side uses DataWriterFactory), but the
    underlying problem is that the two classes are not equivalent.
    ReadTask/DataReader function as Iterable/Iterator. One ReadTask is a
    specific read task for a partition of the data to be read, in contrast
    to DataWriterFactory, where the same factory instance is used in all
    write tasks. ReadTask's purpose is to manage the lifecycle of
    DataReader, with an explicit create operation to mirror the close
    operation. This is no longer clear from the API, where
    DataReaderFactory appears to be more generic than it is, and it is not
    clear why a set of them is produced for a read.

commit 609ec1409892228b5833375fc915fb48b6fb0137
Author: Ryan Blue <blue@...>
Date: 2018-04-25T00:04:17Z

    SPARK-24073: Update private methods in DataSourceV2ScanExec.
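The Iterable/Iterator relationship described in the SPARK-24073 commit message above can be sketched as follows. This is a minimal illustration, not Spark's actual code: the `ReadTask`/`DataReader` interfaces here are simplified stand-ins for the 2.3-era v2 reader API, and `taskFor` is a hypothetical helper; the point is that each ReadTask is bound to one partition and owns the create/close lifecycle of its DataReader.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Simplified stand-in interfaces (not Spark's real classes) showing the
// Iterable/Iterator analogy: one ReadTask per input partition, each
// creating a DataReader whose lifecycle it manages.
interface ReadTask<T> {
    DataReader<T> createReader();          // mirrors Iterable.iterator()
}

interface DataReader<T> extends Closeable {
    boolean next() throws IOException;     // advance, like Iterator.hasNext()
    T get();                               // current element, like Iterator.next()
}

public class ReadTaskSketch {
    // A ReadTask bound to one partition's rows. Unlike a DataWriterFactory
    // on the write side, the same instance is NOT shared across partitions.
    static ReadTask<String> taskFor(List<String> partitionRows) {
        return () -> new DataReader<String>() {
            private int i = -1;
            public boolean next() { return ++i < partitionRows.size(); }
            public String get() { return partitionRows.get(i); }
            public void close() { /* release per-partition resources here */ }
        };
    }

    public static void main(String[] args) throws IOException {
        ReadTask<String> task = taskFor(Arrays.asList("a", "b", "c"));
        // The explicit create mirrors the explicit close: try-with-resources
        // makes the DataReader lifecycle visible at the call site.
        try (DataReader<String> reader = task.createReader()) {
            StringBuilder out = new StringBuilder();
            while (reader.next()) out.append(reader.get());
            System.out.println(out);  // prints "abc"
        }
    }
}
```

Under the DataReaderFactory naming, this per-partition, lifecycle-owning role is obscured, which is the commit's argument for reverting to ReadTask.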
commit faf8fd3c5b4c1bb7d82cadbe4828955bb90a60b0
Author: Ryan Blue <blue@...>
Date: 2018-04-24T19:55:25Z

    SPARK-24073: Rename DataReaderFactory to ReadTask.

    This reverses the changes in SPARK-23219, which renamed ReadTask to
    DataReaderFactory. The intent of that change was to make the read and
    write APIs match (the write side uses DataWriterFactory), but the
    underlying problem is that the two classes are not equivalent.
    ReadTask/DataReader function as Iterable/Iterator. One ReadTask is a
    specific read task for a partition of the data to be read, in contrast
    to DataWriterFactory, where the same factory instance is used in all
    write tasks. ReadTask's purpose is to manage the lifecycle of
    DataReader, with an explicit create operation to mirror the close
    operation. This is no longer clear from the API, where
    DataReaderFactory appears to be more generic than it is, and it is not
    clear why a set of them is produced for a read.

commit 16f1b6e7fd8b658feabf25d7c1a354390dcd8eaa
Author: Ryan Blue <blue@...>
Date: 2018-04-20T20:15:58Z

    SPARK-23325: Use InternalRow when reading with DataSourceV2.

    This updates the DataSourceV2 API to use InternalRow instead of Row for
    the default case with no scan mix-ins. Because the API is changing
    significantly in the same places, this also renames ReaderFactory back
    to ReadTask. Support for readers that produce Row is added through
    SupportsDeprecatedScanRow, which matches the previous API. Readers that
    used Row now implement this class and should be migrated to
    InternalRow. Readers that previously implemented SupportsScanUnsafeRow
    have been migrated to use no SupportsScan mix-ins and produce
    InternalRow.

commit ab8bf231d8a2092ab4f37510c613afd4eae5a999
Author: Ryan Blue <blue@...>
Date: 2018-05-04T19:03:11Z

    Return InternalRow from ParquetFileFormat.

----

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org