[GitHub] spark pull request #21122: [SPARK-24017] [SQL] Refactor ExternalCatalog to b...

2018-05-04 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21122#discussion_r186119360 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala --- @@ -1354,7 +1354,8 @@ class HiveDDLSuite val

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-03 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 @maropu, I suspect that the problem is that comparison is different for strings: `"17297598712"` is less than `"5"` wi
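The pitfall here is lexicographic ordering: comparing the values as strings gives the opposite answer to comparing them as numbers, which is why min/max pushdown on string columns must use string ordering. A minimal Scala illustration:

```scala
// Lexicographic (string) ordering disagrees with numeric ordering:
// "17297598712" sorts before "5" because '1' < '5'.
val a = "17297598712"
val b = "5"
println(a < b)               // true  -- string comparison
println(a.toLong < b.toLong) // false -- numeric comparison
```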

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-03 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 @gatorsmile, are you happy committing this with the benchmark results? @maropu, thanks for taking the time to add these benchmarks, it is really great to have them so we can monitor

[GitHub] spark issue #21118: SPARK-23325: Use InternalRow when reading with DataSourc...

2018-05-03 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21118 @cloud-fan, actually I tried a lot of different queries yesterday, including joins and aggregations. The only thing that didn't work was `collect` for a `select * from t` because `SparkPlan` assumes
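A sketch of the failure mode being described (illustrative, not the exact `SparkPlan` code path): anything that casts rows to `UnsafeRow` works only by accident when a source happens to emit unsafe rows, and fails for other `InternalRow` implementations:

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{GenericInternalRow, UnsafeRow}

// A v2 source is free to return any InternalRow implementation...
val row: InternalRow = new GenericInternalRow(Array[Any](1L))

// ...so code that assumes the unsafe format blows up at runtime.
val unsafe = row.asInstanceOf[UnsafeRow] // throws ClassCastException
```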

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-03 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 @maropu, looking at the pushdown benchmark, it looks like ORC and Parquet either both benefit or both do not benefit from pushdown. In some cases ORC is much faster, which is due to the fact

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-03 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 @maropu, are you sure about the INT and FLOAT columns? I think you might have that assessment backwards. Here's the INT results from the PR gist: ``` SQL Single INT Column Scan

[GitHub] spark issue #21118: SPARK-23325: Use InternalRow when reading with DataSourc...

2018-05-02 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21118 @cloud-fan and @jose-torres: I looked at `explain codegen` for reading from a Parquet table (with vectorized reads disabled) and it doesn't look like there is a dependency on `UnsafeRow

[GitHub] spark issue #21143: [SPARK-24072][SQL] clearly define pushed filters

2018-05-02 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21143 @cloud-fan, that's kind of the point I was trying to make. It is too difficult to do whole-stage codegen, but we could add a codegen filter before whole-stage codegen. Why make the implementation

[GitHub] spark pull request #21145: [SPARK-24073][SQL]: Rename DataReaderFactory to R...

2018-05-01 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21145#discussion_r185310952 --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchReader.scala --- @@ -299,13 +299,13 @@ private[kafka010

[GitHub] spark issue #21143: [SPARK-24072][SQL] clearly define pushed filters

2018-05-01 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21143 @cloud-fan, union doesn't really help. I already have support for mixed-formats working just fine. The format isn't the problem, it is filtering (and a similar problem with projection). Parquet

[GitHub] spark issue #21145: [SPARK-24073][SQL]: Rename DataReaderFactory to ReadTask...

2018-05-01 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21145 @cloud-fan and @henryr, do you have an opinion about naming here?

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-01 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 @mswit-databricks, I wouldn't worry about that. We've limited the length of binary and string fields. In the next version of Parquet, we're planning on releasing page indexes, which

[GitHub] spark issue #21122: [SPARK-24017] [SQL] Refactor ExternalCatalog to be an in...

2018-04-30 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21122 Thanks for pointing this out, @henryr. This looks like a good change to support multiple catalogs. I think it looks fine, other than exposing `unwrapped` to get the Hive client. I think

[GitHub] spark pull request #21122: [SPARK-24017] [SQL] Refactor ExternalCatalog to b...

2018-04-30 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21122#discussion_r185138677 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala --- @@ -31,10 +30,16 @@ import

[GitHub] spark pull request #21123: [SPARK-24045][SQL]Create base class for file data...

2018-04-30 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21123#discussion_r185055954 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala --- @@ -213,6 +215,26 @@ case class

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-30 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 Is it necessary to block this commit on benchmarks? We know that it doesn't degrade performance from the Parquet benchmarks and TPC-DS run. What do you think needs to be done before this can move

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-30 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 Yes, it is safe to use push-down for string columns in data written with this version.

[GitHub] spark issue #21145: [SPARK-24073][SQL]: Rename DataReaderFactory to ReadTask...

2018-04-27 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21145 I think `ReadTask` is fine. That name does not imply that you can use the object itself to read, but it does correctly show that it is one task in a larger operation. I think the name implies

[GitHub] spark issue #21145: [SPARK-24073][SQL]: Rename DataReaderFactory to ReadTask...

2018-04-26 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21145 @arunmahadevan, the problem is that the current naming is misleading. This is not a factory (it only produces one specific reader) and it does not have the same lifecycle as the write-side factory
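For readers following along, the rename argument is about shape and lifecycle (sketched below in illustrative Scala, not the exact DataSourceV2 interfaces): the object is serialized to an executor and creates exactly one reader, so it behaves like a task, not a general-purpose factory:

```scala
// Illustrative sketch of the proposed naming, not the exact Spark API.
trait DataReader[T] extends AutoCloseable {
  def next(): Boolean // advance to the next record
  def get(): T        // current record
}

// One ReadTask = one partition of the scan; it is shipped to an executor
// and creates exactly one DataReader there.
trait ReadTask[T] extends Serializable {
  def preferredLocations: Array[String] // scheduling hint, may be empty
  def createDataReader(): DataReader[T]
}
```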

[GitHub] spark issue #21143: [SPARK-24072][SQL] clearly define pushed filters

2018-04-25 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21143 Thanks for working on this, @cloud-fan! I was thinking about needing it just recently so that data sources can delegate to Spark when needed. I'll have a thorough look at it tomorrow

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-25 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 Thank you @maropu! What resources does the run require? Is it something we could create a Jenkins job to run

[GitHub] spark pull request #21145: [SPARK-24073][SQL]: Rename DataReaderFactory to R...

2018-04-25 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21145#discussion_r184235329 --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchReader.scala --- @@ -299,13 +299,13 @@ private[kafka010

[GitHub] spark pull request #21118: SPARK-23325: Use InternalRow when reading with Da...

2018-04-25 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21118#discussion_r184216025 --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReader.scala --- @@ -86,7 +87,7 @@ class

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-25 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 The fix for PARQUET-686 was to suppress min/max stats. It is safe to push filters, but those filters can't be used without the stats. 1.10.0 has the correct stats and can use those filters
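A simplified sketch of why the stats matter (illustration only, not parquet-mr's actual evaluator): a pushed predicate can skip a row group only when the group's min/max are present, so suppressed stats make the filter a no-op:

```scala
// Simplified row-group pruning for a predicate like `col > v`.
case class IntStats(min: Int, max: Int)

def canSkipRowGroup(v: Int, stats: Option[IntStats]): Boolean =
  stats match {
    case Some(s) => s.max <= v // no value in the group can satisfy col > v
    case None    => false      // stats suppressed (PARQUET-686): must read
  }
```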

[GitHub] spark pull request #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10....

2018-04-25 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21070#discussion_r184156731 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala --- @@ -503,7 +503,7 @@ class

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-25 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 There are two main reasons to update. First, the problem behind SPARK-17213 is fixed, hence the new min and max fields. Second, this updates the internal byte array management, which is needed

[GitHub] spark pull request #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10....

2018-04-25 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21070#discussion_r184139736 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala --- @@ -503,7 +503,7 @@ class

[GitHub] spark pull request #21145: SPARK-24073: Rename DataReaderFactory to ReadTask...

2018-04-24 Thread rdblue
GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/21145 SPARK-24073: Rename DataReaderFactory to ReadTask. ## What changes were proposed in this pull request? This reverses the changes in SPARK-23219, which renamed ReadTask

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-24 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 > Based on the previous upgrade (e.g., #13280 (comment)), we hit the performance regressions when we upgrade Parquet and we did the revert at the end. I should point out that the regress

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-24 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 Okay, I don't have the time to set up and run benchmarks without a stronger case for this causing a regression (given the Parquet testing), but other people should feel free to pick this up

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-24 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 @cloud-fan, is there a Jenkins job to run it?

[GitHub] spark issue #21118: SPARK-23325: Use InternalRow when reading with DataSourc...

2018-04-20 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21118 Yeah, we should probably add a projection. It's probably only working because the InternalRows that are produced are all UnsafeRow
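The projection being suggested could look roughly like this (a sketch against catalyst's API; the surrounding scan plumbing is elided and the schema is a stand-in):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.UnsafeProjection
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Normalize whatever InternalRow implementation a source emits into
// UnsafeRow before operators that assume the unsafe format.
val schema = StructType(Seq(StructField("id", LongType)))
val toUnsafe = UnsafeProjection.create(schema)

def normalize(rows: Iterator[InternalRow]): Iterator[InternalRow] =
  rows.map(toUnsafe) // note: the projection reuses its buffer; copy() if buffering rows
```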

[GitHub] spark issue #21118: SPARK-23325: Use InternalRow when reading with DataSourc...

2018-04-20 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21118 @jose-torres, @cloud-fan, can you take a look at this? It updates the v2 API to use InternalRow by default.

[GitHub] spark pull request #21118: SPARK-23325: Use InternalRow when reading with Da...

2018-04-20 Thread rdblue
GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/21118 SPARK-23325: Use InternalRow when reading with DataSourceV2. ## What changes were proposed in this pull request? This updates the DataSourceV2 API to use InternalRow instead of Row

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-20 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 Retest this please.

[GitHub] spark issue #21111: [SPARK-23877][SQL][followup] use PhysicalOperation to si...

2018-04-20 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21111 Thanks for doing this as a follow-up. I had one minor comment, but otherwise I'm +1. I see what you mean about using `PhysicalOperation` now. It is slightly cleaner and I guess `PhysicalOperation

[GitHub] spark pull request #21111: [SPARK-23877][SQL][followup] use PhysicalOperatio...

2018-04-20 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21111#discussion_r183101013 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala --- @@ -114,11 +119,8 @@ case class

[GitHub] spark pull request #20988: [SPARK-23877][SQL]: Use filter predicates to prun...

2018-04-19 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20988#discussion_r182858087 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala --- @@ -129,35 +151,41 @@ case class

[GitHub] spark pull request #21029: [SPARK-23952] remove type parameter in DataReader...

2018-04-18 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21029#discussion_r182596471 --- Diff: external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchReader.scala --- @@ -146,7 +146,7 @@ private[kafka010] class

[GitHub] spark pull request #21029: [SPARK-23952] remove type parameter in DataReader...

2018-04-18 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21029#discussion_r182596392 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceRDD.scala --- @@ -17,52 +17,85 @@ package

[GitHub] spark pull request #21029: [SPARK-23952] remove type parameter in DataReader...

2018-04-18 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21029#discussion_r182596274 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/streaming/MicroBatchReader.java --- @@ -17,12 +17,12 @@ package

[GitHub] spark pull request #21029: [SPARK-23952] remove type parameter in DataReader...

2018-04-18 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21029#discussion_r182596153 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataReaderFactory.java --- @@ -52,10 +56,46

[GitHub] spark issue #20988: [SPARK-23877][SQL]: Use filter predicates to prune parti...

2018-04-18 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20988 @cloud-fan, I've added the test. Thanks for letting me know about HiveCatalogMetrics, that's exactly what I needed

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 I backported the Hadoop zstd codec to 2.7.3 without much trouble. But either way, I think that's a separate concern. This is about getting Parquet updated, not about worrying whether users can

[GitHub] spark pull request #20988: [SPARK-23877][SQL]: Use filter predicates to prun...

2018-04-18 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20988#discussion_r182579387 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala --- @@ -129,35 +151,41 @@ case class

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-04-18 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 @scottcarey, Parquet will use the compressors if they are available. You can add them from an external Jar and it will work. LZ4 should also work out of the box because it is included in Hadoop 2.7
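For example, once the codec classes are on the classpath, choosing one is just configuration (a sketch; the output paths are placeholders, and zstd assumes the Hadoop codec discussed above is available):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("codec-demo").getOrCreate()
val df = spark.range(1000).toDF("id")

// Session-wide default codec for Parquet writes.
spark.conf.set("spark.sql.parquet.compression.codec", "lz4")
df.write.mode("overwrite").parquet("/tmp/lz4_table")

// Or per write; requires the zstd codec on the classpath.
df.write.mode("overwrite").option("compression", "zstd").parquet("/tmp/zstd_table")
```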

[GitHub] spark pull request #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10....

2018-04-18 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21070#discussion_r182563063 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java --- @@ -63,115 +59,159 @@ public final

[GitHub] spark pull request #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10....

2018-04-16 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21070#discussion_r181899509 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java --- @@ -63,115 +58,139 @@ public final

[GitHub] spark pull request #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10....

2018-04-16 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21070#discussion_r181883674 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java --- @@ -63,115 +58,139 @@ public final

[GitHub] spark issue #21071: [SPARK-21962][CORE] Distributed Tracing in Spark

2018-04-16 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21071 Some metrics to convince ourselves that using the null scope has no performance impact would be great.

[GitHub] spark pull request #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10....

2018-04-16 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21070#discussion_r181864492 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java --- @@ -63,115 +58,139 @@ public final

[GitHub] spark issue #21071: [SPARK-21962][CORE] Distributed Tracing in Spark

2018-04-16 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21071 @devaraj-kavali, do you have any measurements to quantify how this impacts overall performance? We would want to know before releasing this for use because using HTrace means having it on all

[GitHub] spark issue #21060: [SPARK-23942][PYTHON][SQL][BRANCH-2.3] Makes collect in ...

2018-04-16 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21060 I agree with what @srowen said:

[GitHub] spark pull request #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10....

2018-04-16 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21070#discussion_r181826671 --- Diff: pom.xml --- @@ -129,7 +129,7 @@ 1.2.1 10.12.1.1 -1.8.2 +1.10.0 --- End diff -- I excluded

[GitHub] spark pull request #21070: SPARK-23972: Update Parquet to 1.10.0.

2018-04-13 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21070#discussion_r181535144 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java --- @@ -63,115 +58,139 @@ public final

[GitHub] spark issue #21070: SPARK-23972: Update Parquet to 1.10.0.

2018-04-13 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 Upstream benchmarks for buffer management changes are here: https://github.com/apache/parquet-mr/pull/390#issuecomment-338505426 That doesn't show the GC benefit for smaller buffer

[GitHub] spark pull request #21070: SPARK-23972: Update Parquet to 1.10.0.

2018-04-13 Thread rdblue
GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/21070 SPARK-23972: Update Parquet to 1.10.0. ## What changes were proposed in this pull request? This updates Parquet to 1.10.0 and updates the vectorized path for buffer management changes

[GitHub] spark pull request #20933: [SPARK-23817][SQL]Migrate ORC file format read pa...

2018-04-13 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20933#discussion_r181509305 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala --- @@ -368,8 +368,7 @@ case class FileSourceScanExec

[GitHub] spark issue #20988: [SPARK-23877][SQL]: Use filter predicates to prune parti...

2018-04-13 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20988 @cloud-fan or @gatorsmile, could you review this?

[GitHub] spark pull request #20933: [SPARK-23817][SQL]Migrate ORC file format read pa...

2018-04-06 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20933#discussion_r179811255 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala --- @@ -368,8 +368,7 @@ case class FileSourceScanExec

[GitHub] spark pull request #20988: [SPARK-23877][SQL]: Use filter predicates to prun...

2018-04-05 Thread rdblue
GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/20988 [SPARK-23877][SQL]: Use filter predicates to prune partitions in metadata-only queries ## What changes were proposed in this pull request? This updates the OptimizeMetadataOnlyQuery rule

[GitHub] spark pull request #20933: [SPARK-23817][SQL]Migrate ORC file format read pa...

2018-04-05 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20933#discussion_r179521697 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -1185,6 +1185,13 @@ object SQLConf { .stringConf

[GitHub] spark pull request #20933: [SPARK-23817][SQL]Migrate ORC file format read pa...

2018-04-05 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20933#discussion_r179521502 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala --- @@ -213,6 +215,26 @@ case class

[GitHub] spark pull request #20933: [SPARK-23817][SQL]Migrate ORC file format read pa...

2018-04-05 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20933#discussion_r179521100 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartitionUtil.scala --- @@ -0,0 +1,225 @@ +/* + * Licensed

[GitHub] spark pull request #20933: [SPARK-23817][SQL]Migrate ORC file format read pa...

2018-04-05 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20933#discussion_r179520643 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcDataSourceV2.scala --- @@ -0,0 +1,194 @@ +/* + * Licensed

[GitHub] spark pull request #20933: [SPARK-23817][SQL]Migrate ORC file format read pa...

2018-04-05 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20933#discussion_r179516352 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileReaderFactory.scala --- @@ -0,0 +1,50 @@ +/* + * Licensed

[GitHub] spark pull request #20933: [SPARK-23817][SQL]Migrate ORC file format read pa...

2018-04-05 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20933#discussion_r179514000 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala --- @@ -368,8 +368,7 @@ case class FileSourceScanExec

[GitHub] spark pull request #20933: [SPARK-23817][SQL]Migrate ORC file format read pa...

2018-04-05 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20933#discussion_r179513582 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala --- @@ -187,6 +189,14 @@ class DataFrameReader private[sql](sparkSession

[GitHub] spark issue #20397: [SPARK-23219][SQL]Rename ReadTask to DataReaderFactory i...

2018-03-22 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20397 @cloud-fan, can we revert this?

[GitHub] spark issue #20579: [SPARK-23372][SQL] Writing empty struct in parquet fails...

2018-03-14 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20579 I agree, we should probably add a check for storing a DataFrame with no columns for now. This is normally caught by the pre-insert rules, but since the table is getting "created" in

[GitHub] spark issue #20752: [SPARK-23559][SS] Create StreamingDataWriterFactory for ...

2018-03-06 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20752 > Right now, we're still at the point where we aren't quite sure what a streaming API needs to look like. We're starting from basically ground zero; the V1 streaming API just throws a DataFr

[GitHub] spark issue #20752: [SPARK-23559][SS] Create StreamingDataWriterFactory for ...

2018-03-06 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20752 Thanks for the clear summary, @cloud-fan. I think we want to make it easy to support batch, and then easy to reuse those internals to support streaming by adding new mix-in interfaces. Streaming
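Sketching that layering (illustrative interfaces, not Spark's final API): the batch contract stays minimal, and streaming-only concerns like the epoch arrive via a mix-in:

```scala
// Illustrative "batch first, streaming as a mix-in" shape.
trait DataWriter[T] {
  def write(record: T): Unit
  def commit(): Unit
  def abort(): Unit
}

// Batch contract: no epoch anywhere.
trait DataWriterFactory[T] extends Serializable {
  def createWriter(partitionId: Int, taskId: Long): DataWriter[T]
}

// Streaming sinks opt in to epochs by implementing the mix-in.
trait StreamingDataWriterFactory[T] extends DataWriterFactory[T] {
  def createWriter(partitionId: Int, taskId: Long, epochId: Long): DataWriter[T]
}
```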

[GitHub] spark pull request #20752: [SPARK-23559][SS] Create StreamingDataWriterFacto...

2018-03-06 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20752#discussion_r172621472 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/streaming/StreamingDataWriterFactory.java --- @@ -0,0 +1,51

[GitHub] spark pull request #20752: [SPARK-23559][SS] Create StreamingDataWriterFacto...

2018-03-06 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20752#discussion_r172620679 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/streaming/StreamWriter.java --- @@ -27,6 +28,9 @@ * * Streaming queries

[GitHub] spark pull request #20752: [SPARK-23559][SS] Create StreamingDataWriterFacto...

2018-03-06 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20752#discussion_r172616831 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala --- @@ -54,7 +55,14 @@ case class

[GitHub] spark pull request #20752: [SPARK-23559][SS] Create StreamingDataWriterFacto...

2018-03-06 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20752#discussion_r172616601 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/streaming/StreamWriter.java --- @@ -27,6 +28,9 @@ * * Streaming queries

[GitHub] spark pull request #20752: [SPARK-23559][SS] Create StreamingDataWriterFacto...

2018-03-06 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20752#discussion_r172613993 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/streaming/StreamingDataWriterFactory.java --- @@ -0,0 +1,51

[GitHub] spark issue #20710: [SPARK-23559][SS] Add epoch ID to DataWriterFactory.

2018-03-05 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20710 Epoch ID is not a valid part of the logical place in a query for batch. I think we should separate batch and streaming, as they are already coming from different interfaces. There's no need to pass

[GitHub] spark issue #20710: [SPARK-23559][SS] Add epoch ID to DataWriterFactory.

2018-03-05 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20710 > Data source writers need to be able to reason about what progress they've made, which is impossible in the streaming case if each epoch is its own disconnected query. I don't th

[GitHub] spark issue #20710: [SPARK-23559][SS] Add epoch ID to DataWriterFactory.

2018-03-05 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20710 My question is: why can't we use a batch interface for batch and micro-batch (which behaves like batch) and add a separate streaming interface for continuous streaming? I see no reason to have epoch

[GitHub] spark issue #20710: [SPARK-23559][SS] Add epoch ID to DataWriterFactory.

2018-03-05 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20710 Could the non-continuous streaming mode just use the batch interface, since each write is basically separate

[GitHub] spark issue #20710: [SPARK-23559][SS] Add epoch ID to DataWriterFactory.

2018-03-05 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20710 @jose-torres, can you explain that more for me? Why would callers only use one interface but not the other? Wouldn't streaming use one and batch the other? Why would batch need to know about

[GitHub] spark issue #20710: [SPARK-23559][SS] Add epoch ID to DataWriterFactory.

2018-03-05 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20710 @tdas, thanks for letting us know. I'm really wondering if we should be using the same interfaces between batch and streaming. The epoch id strikes me as strange for data sources that won't support

[GitHub] spark pull request #20726: [SPARK-23574][CORE] Report SinglePartition in Dat...

2018-03-05 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20726#discussion_r172324207 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala --- @@ -46,34 +48,46 @@ case class

[GitHub] spark pull request #20726: [SPARK-23574][CORE] Report SinglePartition in Dat...

2018-03-05 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20726#discussion_r172320487 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala --- @@ -46,34 +48,46 @@ case class

[GitHub] spark pull request #20710: [SPARK-23559][SS] Add epoch ID to DataWriterFacto...

2018-03-05 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20710#discussion_r172319605 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataWriterFactory.java --- @@ -48,6 +48,9 @@ * same

[GitHub] spark issue #20726: [SPARK-23574][CORE] Report SinglePartition in DataSource...

2018-03-05 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20726 Looks good to me other than a minor point on the private `readerFactories` val.

[GitHub] spark pull request #20726: [SPARK-23574][CORE] Report SinglePartition in Dat...

2018-03-05 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20726#discussion_r172317377 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala --- @@ -46,34 +48,46 @@ case class

[GitHub] spark issue #20692: [SPARK-23531][SQL] Show attribute type in explain

2018-03-01 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20692 Good point on nested types. I don't think heavy nesting is the usual case, but we can definitely improve the explain result in the long term by separating it out. Maybe just using a high-level type

[GitHub] spark issue #20692: [SPARK-23531][SQL] Show attribute type in explain

2018-03-01 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20692 My intent is just to advocate for clear feedback on the content of PRs. Good to hear your confidence in @mgaido91, and if he wants to work on a better explain, that's great too

[GitHub] spark issue #20692: [SPARK-23531][SQL] Show attribute type in explain

2018-03-01 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20692 > I think you can add the patch to your fork. My primary concern isn't the feature, although I do think it is useful. My concern is how we work with contributors. My worry h

[GitHub] spark issue #20692: [SPARK-23531][SQL] Show attribute type in explain

2018-03-01 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20692 > this is not urgent based on our release schedule. Marco is contributing this right now. It is a bad idea to ask contributors to show up in 4 months, if we don't have a better opt

[GitHub] spark issue #20692: [SPARK-23531][SQL] Show attribute type in explain

2018-03-01 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20692 I don't think it is a good idea to block a fix like this on a redesign of explain plans. If that redesign was currently being worked on and this made it more difficult, it might be reasonable

[GitHub] spark issue #20692: [SPARK-23531][SQL] Show attribute type in explain

2018-03-01 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20692 I don't see a compelling reason to block this work, which is well-defined and a reasonably small patch. What is here would help users with debugging. @gatorsmile, If you have an alternative

[GitHub] spark issue #20692: [SPARK-23531][SQL] Show attribute type in explain

2018-03-01 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20692 @mgaido91, I'm happy without another option. Let's add it in the future if we find that we need it, but not assume that we will. This change should give most of the information needed to debug types

[GitHub] spark pull request #20692: [SPARK-23531][SQL] Show attribute type in explain

2018-03-01 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20692#discussion_r171626542 --- Diff: sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala --- @@ -94,7 +94,7 @@ case class

[GitHub] spark pull request #20692: [SPARK-23531][SQL] Show attribute type in explain

2018-03-01 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20692#discussion_r171624644 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala --- @@ -106,7 +106,7 @@ abstract class Attribute

[GitHub] spark issue #20692: [SPARK-23531][SQL] Show attribute type in explain

2018-03-01 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20692 I agree with @cloud-fan that the type information doesn't need to be in every node of the plan, just in the scans and generators. We want enough information that we can tell what types
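In the meantime, roughly the same information can be dug out of a plan by hand (a sketch against Spark internals; assumes a `spark` session as in the shell, and the table `t` is a stand-in):

```scala
// Print each leaf (scan) node with its output columns and types --
// approximately what this PR surfaces in explain.
val df = spark.sql("select * from t")
df.queryExecution.executedPlan.collectLeaves().foreach { scan =>
  val cols = scan.output.map(a => s"${a.name}: ${a.dataType.simpleString}")
  println(s"${scan.nodeName} -> ${cols.mkString(", ")}")
}
```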

[GitHub] spark issue #20647: [SPARK-23303][SQL] improve the explain result for data s...

2018-03-01 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20647 Looks like there's an unnecessary redact call in there and I found a couple of nits, but I think this is ready to go. +1

[GitHub] spark pull request #20647: [SPARK-23303][SQL] improve the explain result for...

2018-03-01 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/20647#discussion_r171618852 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2StringFormat.scala --- @@ -0,0 +1,98
