[GitHub] spark pull request #21237: [SPARK-23325][WIP] Test parquet returning interna...

2018-07-26 Thread rdblue
Github user rdblue closed the pull request at:

https://github.com/apache/spark/pull/21237


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #21237: [SPARK-23325][WIP] Test parquet returning interna...

2018-05-04 Thread rdblue
GitHub user rdblue opened a pull request:

https://github.com/apache/spark/pull/21237

[SPARK-23325][WIP] Test parquet returning internal row

## What changes were proposed in this pull request?

This updates `ParquetFileFormat` to return `InternalRow` instead of
`UnsafeRow`. The goal is a rough assessment of how many code paths rely on
interfaces that are declared to return `InternalRow` actually returning
`UnsafeRow`.
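
As an illustration of the kind of hidden dependency this change is meant to surface, here is a minimal, self-contained sketch. The `InternalRow`, `UnsafeRow`, and `GenericInternalRow` types below are simplified, hypothetical stand-ins defined only for this example (they are not Spark's real classes), and `rowSize` and `breaksOn` are invented names. The sketch assumes the dependency takes the form of a downcast, which is only one way it could show up.

```java
public class RowCastSketch {
    // Hypothetical, simplified stand-ins for illustration; not Spark's real classes.
    interface InternalRow {
        int getInt(int ordinal);
    }

    static final class GenericInternalRow implements InternalRow {
        private final int[] values;
        GenericInternalRow(int[] values) { this.values = values; }
        public int getInt(int ordinal) { return values[ordinal]; }
    }

    static final class UnsafeRow implements InternalRow {
        private final int[] values;
        UnsafeRow(int[] values) { this.values = values; }
        public int getInt(int ordinal) { return values[ordinal]; }
        // Extra method that only the concrete subtype offers.
        int getSizeInBytes() { return values.length * Integer.BYTES; }
    }

    // The signature only promises InternalRow, but the body assumes UnsafeRow.
    static int rowSize(InternalRow row) {
        return ((UnsafeRow) row).getSizeInBytes();
    }

    // Reports whether the code path above fails for the given row.
    static boolean breaksOn(InternalRow row) {
        try {
            rowSize(row);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(breaksOn(new UnsafeRow(new int[] {1, 2})));
        System.out.println(breaksOn(new GenericInternalRow(new int[] {1, 2})));
    }
}
```

In this sketch the cast succeeds only while the producer happens to return the concrete subtype, so swapping the producer's return type makes the failure visible in tests.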

## How was this patch tested?

Existing tests.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/rdblue/spark test-parquet-returning-internal-row

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21237.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21237


commit cdf2b4db2f8a94cbe72fe08bfdbde55f12290679
Author: Ryan Blue 
Date:   2018-04-24T19:55:25Z

SPARK-24073: Rename DataReaderFactory to ReadTask.

This reverses the changes in SPARK-23219, which renamed ReadTask to
DataReaderFactory. The intent of that change was to make the read and
write API match (write side uses DataWriterFactory), but the underlying
problem is that the two classes are not equivalent.

ReadTask/DataReader function as Iterable/Iterator. One ReadTask is a
specific read task for a partition of the data to be read, in contrast
to DataWriterFactory where the same factory instance is used in all
write tasks. ReadTask's purpose is to manage the lifecycle of DataReader
with an explicit create operation to mirror the close operation. This is
no longer clear from the API, where DataReaderFactory appears to be more
generic than it is and it isn't clear why a set of them is produced for
a read.
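
The Iterable/Iterator relationship described above can be sketched roughly as follows. This is a self-contained illustration under stated assumptions, not the actual DataSourceV2 interfaces: the `ReadTask` and `DataReader` shapes are simplified, and `RangeReadTask` and `readPartition` are hypothetical names introduced only for this example.

```java
import java.io.Closeable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReadTaskSketch {
    // Simplified stand-in for DataReader: iterator-like, with an explicit close.
    interface DataReader<T> extends Closeable {
        boolean next();
        T get();
        @Override
        void close();
    }

    // Simplified stand-in for ReadTask: one instance per partition, and the
    // create call mirrors the reader's close call.
    interface ReadTask<T> {
        DataReader<T> createDataReader();
    }

    // Hypothetical task that reads the integers in [start, end).
    static final class RangeReadTask implements ReadTask<Integer> {
        private final int start;
        private final int end;

        RangeReadTask(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public DataReader<Integer> createDataReader() {
            return new DataReader<Integer>() {
                private int current = start - 1;

                @Override
                public boolean next() {
                    current++;
                    return current < end;
                }

                @Override
                public Integer get() {
                    return current;
                }

                @Override
                public void close() {
                    // Nothing to release in this sketch.
                }
            };
        }
    }

    // Reads one partition: create the reader, drain it, then close it.
    static <T> List<T> readPartition(ReadTask<T> task) {
        List<T> out = new ArrayList<>();
        try (DataReader<T> reader = task.createDataReader()) {
            while (reader.next()) {
                out.add(reader.get());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A read produces a set of tasks, one per partition.
        List<ReadTask<Integer>> tasks =
            Arrays.asList(new RangeReadTask(0, 3), new RangeReadTask(3, 5));
        for (ReadTask<Integer> task : tasks) {
            System.out.println(readPartition(task)); // [0, 1, 2] then [3, 4]
        }
    }
}
```

Each task in the sketch is bound to a specific slice of the data, which is why a read yields a set of them, whereas a single writer factory could be shared across all write tasks.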

commit 609ec1409892228b5833375fc915fb48b6fb0137
Author: Ryan Blue 
Date:   2018-04-25T00:04:17Z

SPARK-24073: Update private methods in DataSourceV2ScanExec.

commit faf8fd3c5b4c1bb7d82cadbe4828955bb90a60b0
Author: Ryan Blue 
Date:   2018-04-24T19:55:25Z

SPARK-24073: Rename DataReaderFactory to ReadTask.

This reverses the changes in SPARK-23219, which renamed ReadTask to
DataReaderFactory. The intent of that change was to make the read and
write API match (write side uses DataWriterFactory), but the underlying
problem is that the two classes are not equivalent.

ReadTask/DataReader function as Iterable/Iterator. One ReadTask is a
specific read task for a partition of the data to be read, in contrast
to DataWriterFactory where the same factory instance is used in all
write tasks. ReadTask's purpose is to manage the lifecycle of DataReader
with an explicit create operation to mirror the close operation. This is
no longer clear from the API, where DataReaderFactory appears to be more
generic than it is and it isn't clear why a set of them is produced for
a read.

commit 16f1b6e7fd8b658feabf25d7c1a354390dcd8eaa
Author: Ryan Blue 
Date:   2018-04-20T20:15:58Z

SPARK-23325: Use InternalRow when reading with DataSourceV2.

This updates the DataSourceV2 API to use InternalRow instead of Row for
the default case with no scan mix-ins. Because the API is changing
significantly in the same places, this also renames ReaderFactory back
to ReadTask.

Support for readers that produce Row is added through
SupportsDeprecatedScanRow, which matches the previous API. Readers that
used Row now implement this class and should be migrated to InternalRow.

Readers that previously implemented SupportsScanUnsafeRow have been
migrated to use no SupportsScan mix-ins and produce InternalRow.

commit ab8bf231d8a2092ab4f37510c613afd4eae5a999
Author: Ryan Blue 
Date:   2018-05-04T19:03:11Z

Return InternalRow from ParquetFileFormat.



