GitHub user rdblue opened a pull request:

    https://github.com/apache/spark/pull/21237

    [SPARK-23325][WIP] Test parquet returning internal row

    ## What changes were proposed in this pull request?
    
    This updates `ParquetFileFormat` to return `InternalRow` instead of
`UnsafeRow`, to get a rough assessment of how many code paths depend on
interfaces that are declared to return `InternalRow` actually returning
`UnsafeRow` at runtime.
    
    ## How was this patch tested?
    
    Existing tests.
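To illustrate the failure mode this change probes, here is a minimal sketch in plain Java. The classes below are hypothetical, simplified stand-ins for Spark's row types (not the real `org.apache.spark.sql.catalyst` API): a caller that receives an `InternalRow` but silently downcasts to `UnsafeRow` works only as long as every source happens to return `UnsafeRow`.

```java
// Hypothetical stand-ins for the row hierarchy; names mirror Spark's but the
// bodies are invented for illustration only.
interface InternalRow { int numFields(); }

class UnsafeRow implements InternalRow {
    public int numFields() { return 2; }
    // Only the unsafe representation exposes raw-memory style accessors.
    public long backingOffset() { return 0L; }
}

class GenericInternalRow implements InternalRow {
    public int numFields() { return 2; }
}

public class RowCastDemo {
    // A caller that depends on the interface's *actual* runtime type, the kind
    // of code path the PR is trying to flush out.
    static boolean assumesUnsafe(InternalRow row) {
        try {
            UnsafeRow unsafe = (UnsafeRow) row; // fine while sources return UnsafeRow
            return unsafe.backingOffset() >= 0;
        } catch (ClassCastException e) {
            return false; // breaks once a source returns a generic InternalRow
        }
    }

    public static void main(String[] args) {
        System.out.println(assumesUnsafe(new UnsafeRow()));          // true
        System.out.println(assumesUnsafe(new GenericInternalRow())); // false
    }
}
```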

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/rdblue/spark 
test-parquet-returning-internal-row

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21237.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21237
    
----
commit cdf2b4db2f8a94cbe72fe08bfdbde55f12290679
Author: Ryan Blue <blue@...>
Date:   2018-04-24T19:55:25Z

    SPARK-24073: Rename DataReaderFactory to ReadTask.
    
    This reverses the changes in SPARK-23219, which renamed ReadTask to
    DataReaderFactory. The intent of that change was to make the read and
    write API match (write side uses DataWriterFactory), but the underlying
    problem is that the two classes are not equivalent.
    
    ReadTask/DataReader function as Iterable/Iterator. One ReadTask is a
    specific read task for a partition of the data to be read, in contrast
    to DataWriterFactory where the same factory instance is used in all
    write tasks. ReadTask's purpose is to manage the lifecycle of DataReader
    with an explicit create operation to mirror the close operation. None of
    this is clear from the current API, where DataReaderFactory appears more
    generic than it is and it is not obvious why a set of factories is
    produced for a single read.
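The Iterable/Iterator analogy in the commit message can be sketched as follows. This is a hypothetical, trimmed-down version of the interfaces described (method shapes are an assumption, not the real Spark signatures): one `ReadTask` per partition plays the Iterable role, existing only to create the Iterator-like `DataReader`, so the explicit create has a matching explicit close.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.List;

// Hypothetical trimmed-down interfaces; names follow the pre-rename API the
// commit describes, bodies are invented for illustration.
interface DataReader<T> extends Closeable {
    boolean next();
    T get();
    void close();
}

// One ReadTask per partition: like Iterable, its only job is to create the
// Iterator-like DataReader and pair the create with the close.
interface ReadTask<T> {
    DataReader<T> createDataReader();
}

class RangeReadTask implements ReadTask<Integer> {
    private final int start, end;
    RangeReadTask(int start, int end) { this.start = start; this.end = end; }

    public DataReader<Integer> createDataReader() {
        return new DataReader<Integer>() {
            private int current = start - 1;
            public boolean next() { return ++current < end; }
            public Integer get() { return current; }
            public void close() { /* release per-partition resources here */ }
        };
    }
}

public class ReadTaskDemo {
    // An executor drives the create -> next/get -> close lifecycle for the
    // single partition its task covers.
    static List<Integer> runTask(ReadTask<Integer> task) {
        List<Integer> out = new ArrayList<>();
        try (DataReader<Integer> reader = task.createDataReader()) {
            while (reader.next()) out.add(reader.get());
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(runTask(new RangeReadTask(0, 3))); // [0, 1, 2]
    }
}
```

The contrast with the write side is that a single `DataWriterFactory` instance is shared by all write tasks, whereas each `ReadTask` above is itself one partition's task.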

commit 609ec1409892228b5833375fc915fb48b6fb0137
Author: Ryan Blue <blue@...>
Date:   2018-04-25T00:04:17Z

    SPARK-24073: Update private methods in DataSourceV2ScanExec.

commit faf8fd3c5b4c1bb7d82cadbe4828955bb90a60b0
Author: Ryan Blue <blue@...>
Date:   2018-04-24T19:55:25Z

    SPARK-24073: Rename DataReaderFactory to ReadTask.
    
    This reverses the changes in SPARK-23219, which renamed ReadTask to
    DataReaderFactory. The intent of that change was to make the read and
    write API match (write side uses DataWriterFactory), but the underlying
    problem is that the two classes are not equivalent.
    
    ReadTask/DataReader function as Iterable/Iterator. One ReadTask is a
    specific read task for a partition of the data to be read, in contrast
    to DataWriterFactory where the same factory instance is used in all
    write tasks. ReadTask's purpose is to manage the lifecycle of DataReader
    with an explicit create operation to mirror the close operation. None of
    this is clear from the current API, where DataReaderFactory appears more
    generic than it is and it is not obvious why a set of factories is
    produced for a single read.

commit 16f1b6e7fd8b658feabf25d7c1a354390dcd8eaa
Author: Ryan Blue <blue@...>
Date:   2018-04-20T20:15:58Z

    SPARK-23325: Use InternalRow when reading with DataSourceV2.
    
    This updates the DataSourceV2 API to use InternalRow instead of Row for
    the default case with no scan mix-ins. Because the API is changing
    significantly in the same places, this also renames ReaderFactory back
    to ReadTask.
    
    Support for readers that produce Row is added through
    SupportsDeprecatedScanRow, which matches the previous API. Readers that
    used Row now implement this class and should be migrated to InternalRow.
    
    Readers that previously implemented SupportsScanUnsafeRow have been
    migrated to use no SupportsScan mix-ins and produce InternalRow.
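The mix-in dispatch this commit describes can be sketched as below. The interface names follow the commit message, but the bodies and the planner logic are simplified assumptions, not the real Spark implementation: the default path produces `InternalRow` with no conversion, and `SupportsDeprecatedScanRow` opts a legacy reader back into `Row`.

```java
// Hypothetical, simplified sketch of the scan mix-in pattern; names follow the
// commit message, bodies are invented for illustration.
interface InternalRow {}
class Row {}

interface DataSourceReader {
    // Default case: no scan mix-in, rows come back as InternalRow.
}

interface SupportsDeprecatedScanRow extends DataSourceReader {
    Row readRow(); // legacy Row-producing path, to be migrated to InternalRow
}

public class MixinDispatchDemo {
    // The planner inspects which mix-ins a reader implements and picks a path.
    static String chooseScanPath(DataSourceReader reader) {
        if (reader instanceof SupportsDeprecatedScanRow) {
            return "row";          // legacy: convert Row -> InternalRow after the scan
        }
        return "internal-row";     // default: no conversion needed
    }

    public static void main(String[] args) {
        DataSourceReader modern = new DataSourceReader() {};
        DataSourceReader legacy = new SupportsDeprecatedScanRow() {
            public Row readRow() { return new Row(); }
        };
        System.out.println(chooseScanPath(modern)); // internal-row
        System.out.println(chooseScanPath(legacy)); // row
    }
}
```

Readers that previously used `SupportsScanUnsafeRow` simply drop all mix-ins and land on the default branch above.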

commit ab8bf231d8a2092ab4f37510c613afd4eae5a999
Author: Ryan Blue <blue@...>
Date:   2018-05-04T19:03:11Z

    Return InternalRow from ParquetFileFormat.

----

