[GitHub] spark pull request #21574: [SPARK-24478][SQL][followup] Move projection and ...

2018-06-18 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21574#discussion_r196173289 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala --- @@ -17,51 +17,115 @@ package

[GitHub] spark pull request #21574: [SPARK-24478][SQL][followup] Move projection and ...

2018-06-18 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21574#discussion_r196172875 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -105,117 +105,57 @@ case class

[GitHub] spark issue #21574: [SPARK-24478][SQL][followup] Move projection and filter ...

2018-06-18 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21574 @rxin, we can also add the second pushdown (in the stats visitor) to get better stats with a property to turn it on or off. We're going to add it back in our branch anyway

[GitHub] spark issue #21503: [SPARK-24478][SQL] Move projection and filter push down ...

2018-06-13 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21503 @cloud-fan, tests are passing for c8517e145b1a460a8be07164c17ce20b1db86659, which has all of the functional changes. The Jenkins job ran out of memory for the last commit, but the only change

[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-13 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21558 > IMO your change is the right fix, not just a workaround @squito, part of the problem is that the output commit coordinator -- that ensures only one attempt of a task commits -- rel

[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-13 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21558 > So the problem here is, when we retry a stage, Spark doesn't kill the tasks of the old stage and just launch tasks for the new stage I think that's something that should be fi

[GitHub] spark issue #21558: [SPARK-24552][SQL] Use task ID instead of attempt number...

2018-06-13 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21558 @cloud-fan, this is a work-around for SPARK-24552. I'm not sure of the right way to fix this besides fixing the scheduler so that it doesn't use task attempt numbers twice, but I think this works

[GitHub] spark pull request #21558: [SPARK-24552][SQL] Use task ID instead of attempt...

2018-06-13 Thread rdblue
GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/21558 [SPARK-24552][SQL] Use task ID instead of attempt number for v2 writes. ## What changes were proposed in this pull request? This passes the unique task attempt id instead of attempt number
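A minimal sketch of why a unique task attempt id is safer than an attempt number when a stage is retried while tasks from the old stage attempt are still running. This is not Spark code: `TaskAttempt`, its fields, and the two key functions are hypothetical names made up for illustration, and the sketch assumes that attempt numbers restart at 0 for each stage attempt.

```python
from dataclasses import dataclass, field
from itertools import count

# Hypothetical application-wide counter standing in for a unique task attempt id.
_task_ids = count()


@dataclass(frozen=True)
class TaskAttempt:
    stage_attempt: int
    partition: int
    attempt_number: int  # assumed to restart at 0 for every stage attempt
    task_id: int = field(default_factory=lambda: next(_task_ids))


def key_by_attempt_number(task):
    # Identifier a writer would see if it were handed only the attempt number.
    return (task.partition, task.attempt_number)


def key_by_task_id(task):
    # Identifier a writer would see if it were handed the unique task id.
    return (task.partition, task.task_id)


# A task from the old stage attempt that was never killed, and its
# counterpart launched by the retried stage for the same partition.
old = TaskAttempt(stage_attempt=0, partition=3, attempt_number=0)
new = TaskAttempt(stage_attempt=1, partition=3, attempt_number=0)

collides = key_by_attempt_number(old) == key_by_attempt_number(new)
distinct = key_by_task_id(old) != key_by_task_id(new)
```

Under these assumptions the two running tasks are indistinguishable by attempt number but remain distinct by task id, which is the property the pull request title relies on.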

[GitHub] spark issue #21503: [SPARK-24478][SQL] Move projection and filter push down ...

2018-06-13 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21503 Updated the stats interface. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail

[GitHub] spark pull request #21503: [SPARK-24478][SQL] Move projection and filter pus...

2018-06-13 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21503#discussion_r195173700 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala --- @@ -32,79 +31,35 @@ import

[GitHub] spark issue #21503: [SPARK-24478][SQL] Move projection and filter push down ...

2018-06-13 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21503 Retest this please.

[GitHub] spark pull request #21503: [SPARK-24478][SQL] Move projection and filter pus...

2018-06-13 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21503#discussion_r195138932 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala --- @@ -17,15 +17,56 @@ package

[GitHub] spark pull request #21503: [SPARK-24478][SQL] Move projection and filter pus...

2018-06-12 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21503#discussion_r194875888 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala --- @@ -17,15 +17,56 @@ package

[GitHub] spark pull request #21503: [SPARK-24478][SQL] Move projection and filter pus...

2018-06-12 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21503#discussion_r194861645 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala --- @@ -17,15 +17,56 @@ package

[GitHub] spark pull request #21503: [SPARK-24478][SQL] Move projection and filter pus...

2018-06-12 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21503#discussion_r194841328 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala --- @@ -17,15 +17,56 @@ package

[GitHub] spark pull request #21503: [SPARK-24478][SQL] Move projection and filter pus...

2018-06-12 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21503#discussion_r194829647 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala --- @@ -17,15 +17,56 @@ package

[GitHub] spark issue #21503: [SPARK-24478][SQL] Move projection and filter push down ...

2018-06-12 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21503 @cloud-fan, this is the PR for moving push-down to the physical plan conversion and reporting the stats correctly. Sorry for the confusion because I sent a link to just the second commit

[GitHub] spark issue #21319: [SPARK-24267][SQL] explicitly keep DataSourceReader in D...

2018-06-06 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21319 Here's the commit with my changes to support v2 stats in the visitor, sorry it took so long for me to find the time! https://github.com/apache/spark/pull/21503/commits

[GitHub] spark pull request #21503: [SPARK-24478][SQL] Move projection and filter pus...

2018-06-06 Thread rdblue
GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/21503 [SPARK-24478][SQL] Move projection and filter push down to physical conversion ## What changes were proposed in this pull request? This removes the v2 optimizer rule for push-down

[GitHub] spark issue #21319: [SPARK-24267][SQL] explicitly keep DataSourceReader in D...

2018-05-25 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21319 I'll open a PR to demonstrate what I'm proposing. It wouldn't change the plan, it would use a reader and push-down to report stats correctly as a temporary fix. I'm -1 on this PR. I'll have

[GitHub] spark pull request #21308: SPARK-24253: Add DeleteSupport mix-in for DataSou...

2018-05-25 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21308#discussion_r191021410 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/DeleteSupport.java --- @@ -0,0 +1,51 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #21308: SPARK-24253: Add DeleteSupport mix-in for DataSou...

2018-05-24 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21308#discussion_r190723800 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/DeleteSupport.java --- @@ -0,0 +1,51 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #21308: SPARK-24253: Add DeleteSupport mix-in for DataSou...

2018-05-24 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21308#discussion_r190712870 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/DeleteSupport.java --- @@ -0,0 +1,51 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-05-24 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21305#discussion_r190711154 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala --- @@ -344,6 +344,36 @@ case class Join

[GitHub] spark pull request #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-05-24 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21305#discussion_r190685985 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala --- @@ -344,6 +344,36 @@ case class Join

[GitHub] spark issue #21295: [SPARK-24230][SQL] Fix SpecificParquetRecordReaderBase w...

2018-05-24 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21295 Thanks for merging, @cloud-fan, and thanks for the reviews everyone! I've opened PARQUET-1309 to track the Parquet fix for the properties

[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...

2018-05-24 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r190683568 --- Diff: python/pyspark/sql/dataframe.py --- @@ -347,13 +347,30 @@ def show(self, n=20, truncate=True, vertical=False): name | Bob

[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...

2018-05-24 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r190683035 --- Diff: python/pyspark/sql/dataframe.py --- @@ -347,13 +347,30 @@ def show(self, n=20, truncate=True, vertical=False): name | Bob

[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...

2018-05-24 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r190682693 --- Diff: docs/configuration.md --- @@ -456,6 +456,29 @@ Apart from these, the following properties are also available, and may be useful from JVM

[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...

2018-05-24 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r190682484 --- Diff: docs/configuration.md --- @@ -456,6 +456,29 @@ Apart from these, the following properties are also available, and may be useful from JVM

[GitHub] spark issue #21411: [SPARK-24367][SQL]Parquet: use JOB_SUMMARY_LEVEL instead...

2018-05-24 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21411 I'm fine with whatever changes you want to make here because we don't use Parquet summary files. As always, I'll note that I think it is a bad idea to support the summary files in general

[GitHub] spark issue #21319: [SPARK-24267][SQL] explicitly keep DataSourceReader in D...

2018-05-23 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21319 @cloud-fan, what about adding support for v2 pushdown in the stats visitor instead? Here's the idea: when the visitor hits a `Filter` or a `Project`, it tries to match the plan using

[GitHub] spark issue #21295: [SPARK-24230][SQL] Fix SpecificParquetRecordReaderBase w...

2018-05-23 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21295 Thanks for looking at this, everyone. Sorry for the delay in updating it, I'm currently out on paternity leave and don't have a lot of time. I'll get an update pushed sometime soon though

[GitHub] spark pull request #21295: [SPARK-24230][SQL] Fix SpecificParquetRecordReade...

2018-05-23 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21295#discussion_r190379077 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java --- @@ -147,7 +147,8 @@ public

[GitHub] spark pull request #21295: [SPARK-24230][SQL] Fix SpecificParquetRecordReade...

2018-05-23 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21295#discussion_r190378887 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java --- @@ -225,7 +226,8 @@ protected

[GitHub] spark pull request #21295: [SPARK-24230][SQL] Fix SpecificParquetRecordReade...

2018-05-23 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21295#discussion_r190377736 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala --- @@ -879,6 +879,18 @@ class

[GitHub] spark pull request #21295: [SPARK-24230][SQL] Fix SpecificParquetRecordReade...

2018-05-21 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21295#discussion_r189748585 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala --- @@ -879,6 +879,18 @@ class

[GitHub] spark pull request #21295: [SPARK-24230][SQL] Fix SpecificParquetRecordReade...

2018-05-21 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21295#discussion_r189748452 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java --- @@ -147,7 +147,8 @@ public

[GitHub] spark pull request #21295: [SPARK-24230][SQL] Fix SpecificParquetRecordReade...

2018-05-21 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21295#discussion_r189748419 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala --- @@ -879,6 +879,18 @@ class

[GitHub] spark issue #21242: [SPARK-23657][SQL] Document and expose the internal data...

2018-05-21 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21242 @rxin, after a few attempts at cleanly exposing these classes, I tend to agree that it isn't going to be worth it. But the problem is that we need an API for data sources to produce. What's worse

[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...

2018-05-21 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r189733089 --- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala --- @@ -237,9 +238,13 @@ class Dataset[T] private[sql]( * @param truncate If set

[GitHub] spark pull request #21370: [SPARK-24215][PySpark] Implement _repr_html_ for ...

2018-05-21 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21370#discussion_r189732998 --- Diff: docs/configuration.md --- @@ -456,6 +456,29 @@ Apart from these, the following properties are also available, and may be useful from JVM

[GitHub] spark issue #21370: [SPARK-24215][PySpark] Implement _repr_html_ for datafra...

2018-05-21 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21370 @rxin, `__repr__` is the equivalent for ipython and the python REPL. `_repr_html_` is the convention used by jupyter to replicate `__repr__` in notebooks with HTML output
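The convention described above can be sketched with a small class. This is an illustration only, not the PySpark implementation: `Table` is a hypothetical stand-in for a DataFrame-like object. The Python REPL and IPython call `__repr__`, while Jupyter looks for `_repr_html_` when rendering a value in a notebook cell.

```python
from html import escape


class Table:
    """Hypothetical stand-in for a DataFrame-like object."""

    def __init__(self, columns, rows):
        self.columns = list(columns)
        self.rows = [tuple(r) for r in rows]

    def __repr__(self):
        # Plain-text form used by the Python REPL and IPython.
        header = " | ".join(self.columns)
        body = "\n".join(" | ".join(str(v) for v in row) for row in self.rows)
        return header + "\n" + body

    def _repr_html_(self):
        # HTML form that Jupyter picks up for notebook output.
        head = "".join(f"<th>{escape(c)}</th>" for c in self.columns)
        body = "".join(
            "<tr>" + "".join(f"<td>{escape(str(v))}</td>" for v in row) + "</tr>"
            for row in self.rows
        )
        return f"<table><tr>{head}</tr>{body}</table>"


t = Table(["name", "age"], [("Bob", 5)])
text_form = repr(t)
html_form = t._repr_html_()
```

Escaping cell values in the HTML form is a design choice of this sketch, so that data containing `<` or `&` cannot break the rendered table.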

[GitHub] spark issue #21326: [SPARK-24275][SQL] Revise doc comments in InputPartition

2018-05-21 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21326 Belated +1. In the future, I think it is better for readable docs to pluralize linked terms as I suggested in a comment

[GitHub] spark pull request #21326: [SPARK-24275][SQL] Revise doc comments in InputPa...

2018-05-21 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21326#discussion_r189730557 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataSourceReader.java --- @@ -59,21 +59,21 @@ * Returns the actual schema

[GitHub] spark pull request #21308: SPARK-24253: Add DeleteSupport mix-in for DataSou...

2018-05-21 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21308#discussion_r189730152 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/DeleteSupport.java --- @@ -0,0 +1,51 @@ +/* + * Licensed to the Apache Software

[GitHub] spark issue #21319: [SPARK-24267][SQL] explicitly keep DataSourceReader in D...

2018-05-21 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21319 @cloud-fan, is this necessary if we make the stats changes necessary to move push-down to when the logical plan is converted to the physical plan? I don't think it will be because we don't create

[GitHub] spark pull request #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-05-21 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21305#discussion_r189728174 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala --- @@ -344,6 +344,36 @@ case class Join

[GitHub] spark pull request #21295: [SPARK-24230][SQL] Fix SpecificParquetRecordReade...

2018-05-13 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21295#discussion_r187809226 --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala --- @@ -879,6 +879,18 @@ class

[GitHub] spark pull request #21295: [SPARK-24230][SQL] Fix SpecificParquetRecordReade...

2018-05-13 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21295#discussion_r187809171 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java --- @@ -147,7 +147,8 @@ public

[GitHub] spark pull request #21308: SPARK-24253: Add DeleteSupport mix-in for DataSou...

2018-05-11 Thread rdblue
GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/21308 SPARK-24253: Add DeleteSupport mix-in for DataSourceV2. ## What changes were proposed in this pull request? Adds `DeleteSupport` mix-in for `DataSourceV2`. This mix-in provides a method

[GitHub] spark issue #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog s...

2018-05-11 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21306 @henryr, @cloud-fan, @marmbrus, here's a first pass at adding a catalog mix-in to the v2 API. Please have a look and leave comments on what you'd like to change. One thing that I don't

[GitHub] spark pull request #21306: [SPARK-24252][SQL] Add DataSourceV2 mix-in for ca...

2018-05-11 Thread rdblue
GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/21306 [SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog support. ## What changes were proposed in this pull request? This adds a mix-in to `DataSourceV2` that allows implementations

[GitHub] spark issue #20488: [SPARK-23321][SQL]: Validate datasource v2 writes

2018-05-11 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/20488 Closing this in favor of #21305.

[GitHub] spark pull request #20488: [SPARK-23321][SQL]: Validate datasource v2 writes

2018-05-11 Thread rdblue
Github user rdblue closed the pull request at: https://github.com/apache/spark/pull/20488

[GitHub] spark pull request #21305: [SPARK-24251][SQL] Add AppendData logical plan.

2018-05-11 Thread rdblue
GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/21305 [SPARK-24251][SQL] Add AppendData logical plan. ## What changes were proposed in this pull request? This adds a new logical plan, AppendData, that was proposed in SPARK-23521: Standardize

[GitHub] spark issue #21302: [SPARK-23852][SQL] Upgrade to Parquet 1.8.3

2018-05-11 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21302 @gatorsmile, that is correct. https://github.com/apache/parquet-mr/commits/apache-parquet-1.8.3

[GitHub] spark issue #21302: [SPARK-23852][SQL] Upgrade to Parquet 1.8.3

2018-05-11 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21302 +1 when tests are passing.

[GitHub] spark issue #21302: [SPARK-23852][SQL] Upgrade to Parquet 1.8.3

2018-05-11 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21302 @henryr, why not backport the test case in this commit? I don't think it makes sense to separate the two because that test verifies this commit

[GitHub] spark issue #21295: [SPARK-24230][SQL] Fix SpecificParquetRecordReaderBase w...

2018-05-11 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21295 @gatorsmile and @dongjoon-hyun, can you take a look at this? It is a commit I missed for the Parquet 1.10.0 update

[GitHub] spark pull request #21118: SPARK-23325: Use InternalRow when reading with Da...

2018-05-10 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21118#discussion_r187465762 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanUnsafeRow.java --- @@ -1,46 +0,0 @@ -/* - * Licensed

[GitHub] spark issue #21118: SPARK-23325: Use InternalRow when reading with DataSourc...

2018-05-10 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21118 @cloud-fan, @jose-torres, I think this is ready for final review. I've rebased on top of the rename to `InputPartition`. I've also added a projection when this produces a physical plan

[GitHub] spark pull request #21295: [SPARK-24230][SQL] Fix SpecificParquetRecordReade...

2018-05-10 Thread rdblue
GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/21295 [SPARK-24230][SQL] Fix SpecificParquetRecordReaderBase with dictionary filters. ## What changes were proposed in this pull request? I missed this commit when preparing #21070

[GitHub] spark issue #21230: [SPARK-24172][SQL] we should not apply operator pushdown...

2018-05-10 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21230 +1 (assuming tests pass)

[GitHub] spark issue #21145: [SPARK-24073][SQL]: Rename DataReaderFactory to InputPar...

2018-05-09 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21145 Thanks @jose-torres! I appreciate not blocking this commit on those changes, since it would be difficult to keep this up to date from the other paths changing, while we discussed what to call

[GitHub] spark pull request #21145: [SPARK-24073][SQL]: Rename DataReaderFactory to I...

2018-05-09 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21145#discussion_r187085278 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataSourceReader.java --- @@ -76,5 +76,5 @@ * If this method fails

[GitHub] spark issue #21143: [SPARK-24072][SQL] clearly define pushed filters

2018-05-08 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21143 @cloud-fan, here's a commit that demonstrates the idea and implementation: https://github.com/rdblue/spark/commit/b41eb1ef2af38c510f5426a096c586a93e4a5556 That adds `residualFilters

[GitHub] spark issue #21143: [SPARK-24072][SQL] clearly define pushed filters

2018-05-08 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21143 I think we would only need `DataSourceReader` to implement `SupportsPushDownFilter` because it is primarily used to push filters to the data source and the query's filters are determined while

[GitHub] spark issue #21145: [SPARK-24073][SQL]: Rename DataReaderFactory to InputPar...

2018-05-08 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21145 @cloud-fan, I've updated this PR to use `InputPartition` and similar names, since we seem to have consensus around

[GitHub] spark pull request #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10....

2018-05-08 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21070#discussion_r186793493 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java --- @@ -63,115 +59,157 @@ public final

[GitHub] spark pull request #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10....

2018-05-08 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21070#discussion_r186789552 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java --- @@ -619,32 +608,37 @@ private int

[GitHub] spark pull request #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10....

2018-05-08 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21070#discussion_r186785996 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java --- @@ -63,115 +59,157 @@ public final

[GitHub] spark issue #21118: SPARK-23325: Use InternalRow when reading with DataSourc...

2018-05-08 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21118 > If we want to go this way, I think we should fully bring back #10511 to make this contract explicitly, i.e. which operator produce unsafe row and which operator only accepts unsafe row as in

[GitHub] spark issue #21230: [SPARK-24172][SQL] we should not apply operator pushdown...

2018-05-08 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21230 Sounds good to me. Let's plan on getting this one in to fix the current problem, and commit the other approach when stats are fixed

[GitHub] spark pull request #21145: [SPARK-24073][SQL]: Rename DataReaderFactory to R...

2018-05-07 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21145#discussion_r186587220 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ReadTask.java --- @@ -22,20 +22,20 @@ import

[GitHub] spark issue #21118: SPARK-23325: Use InternalRow when reading with DataSourc...

2018-05-07 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21118 @cloud-fan: This PR is also related to #21262 because that PR updates the conversion from logical to physical plan and handles projections and filtering. We could modify that strategy to always

[GitHub] spark issue #21230: [SPARK-24172][SQL] we should not apply operator pushdown...

2018-05-07 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21230 @cloud-fan, I opened #21262 that is similar to this, but does pushdown when converting to a physical plan. You might like that as an alternative because it cleans up `DataSourceV2Relation` quite

[GitHub] spark pull request #21262: [SPARK-24172][SQL]: Push projection and filters o...

2018-05-07 Thread rdblue
GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/21262 [SPARK-24172][SQL]: Push projection and filters once when converting to physical plan. ## What changes were proposed in this pull request? This removes `PruneFileSourcePartitions

[GitHub] spark pull request #21145: [SPARK-24073][SQL]: Rename DataReaderFactory to R...

2018-05-07 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21145#discussion_r186498369 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ReadTask.java --- @@ -22,20 +22,20 @@ import

[GitHub] spark issue #21118: SPARK-23325: Use InternalRow when reading with DataSourc...

2018-05-07 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21118 > We expect data source to produce `ColumnarBatch` for better performance, and the row interface performance is not that important. I disagree. The vectorized path isn't used for all Parq

[GitHub] spark issue #21230: [SPARK-24172][SQL] we should not apply operator pushdown...

2018-05-07 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21230 So it was the use of `transformUp` that caused this rule to match multiple times, right? In that case, would it make more sense to do what @marmbrus suggested in the immutable plan PR and make

[GitHub] spark pull request #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10....

2018-05-07 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21070#discussion_r186469029 --- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala --- @@ -345,7 +345,7 @@ object SQLConf { "snappy, gzip

[GitHub] spark pull request #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10....

2018-05-07 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21070#discussion_r186468896 --- Diff: dev/deps/spark-deps-hadoop-2.7 --- @@ -163,13 +163,13 @@ orc-mapreduce-1.4.3-nohive.jar oro-2.0.8.jar osgi-resource-locator-1.0.1.jar

[GitHub] spark pull request #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10....

2018-05-07 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21070#discussion_r186464674 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java --- @@ -63,115 +59,157 @@ public final

[GitHub] spark pull request #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10....

2018-05-07 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21070#discussion_r186464557 --- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java --- @@ -63,115 +59,157 @@ public final

[GitHub] spark issue #21118: SPARK-23325: Use InternalRow when reading with DataSourc...

2018-05-07 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21118 > Actually the `SupportsScanUnsafeRow` is only there to avoid perf regression for migrating file sources. If you think that's not a good public API, we can move it to internal package and only

[GitHub] spark pull request #21122: [SPARK-24017] [SQL] Refactor ExternalCatalog to b...

2018-05-07 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21122#discussion_r186452677 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala --- @@ -1354,7 +1354,8 @@ class HiveDDLSuite val

[GitHub] spark pull request #21122: [SPARK-24017] [SQL] Refactor ExternalCatalog to b...

2018-05-04 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21122#discussion_r186240975 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala --- @@ -1354,7 +1354,8 @@ class HiveDDLSuite val

[GitHub] spark pull request #21242: [SPARK-23657][SQL] Document and expose the intern...

2018-05-04 Thread rdblue
GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/21242 [SPARK-23657][SQL] Document and expose the internal data API ## What changes were proposed in this pull request? This makes the `InternalRow`, `ArrayData`, and `MapData` classes public

[GitHub] spark pull request #21122: [SPARK-24017] [SQL] Refactor ExternalCatalog to b...

2018-05-04 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21122#discussion_r186205362 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala --- @@ -1354,7 +1354,8 @@ class HiveDDLSuite val

[GitHub] spark issue #21237: [SPARK-23325][WIP] Test parquet returning internal row

2018-05-04 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21237 This is a follow-up to #21118.

[GitHub] spark pull request #21237: [SPARK-23325][WIP] Test parquet returning interna...

2018-05-04 Thread rdblue
GitHub user rdblue opened a pull request: https://github.com/apache/spark/pull/21237 [SPARK-23325][WIP] Test parquet returning internal row ## What changes were proposed in this pull request? This updates `ParquetFileFormat` to return `InternalRow` instead of `UnsafeRow

[GitHub] spark pull request #21145: [SPARK-24073][SQL]: Rename DataReaderFactory to R...

2018-05-04 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21145#discussion_r186163175 --- Diff: sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ReadTask.java --- @@ -22,20 +22,20 @@ import

[GitHub] spark issue #21145: [SPARK-24073][SQL]: Rename DataReaderFactory to ReadTask...

2018-05-04 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21145 @gatorsmile, the Spark UI has used the term "task" for years to refer to the same thing. I don't think it is unreasonable to use the same term. ![tasks](ht

[GitHub] spark pull request #21122: [SPARK-24017] [SQL] Refactor ExternalCatalog to b...

2018-05-04 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21122#discussion_r186152832 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala --- @@ -1354,7 +1354,8 @@ class HiveDDLSuite val

[GitHub] spark issue #21118: SPARK-23325: Use InternalRow when reading with DataSourc...

2018-05-04 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21118 I just did a performance test based on our 2.1.1 and a real table. I tested a full scan of an hour of data with a single data filter. The scan had 13,083 tasks and read 1084.8 GB. I used

[GitHub] spark pull request #21122: [SPARK-24017] [SQL] Refactor ExternalCatalog to b...

2018-05-04 Thread rdblue
Github user rdblue commented on a diff in the pull request: https://github.com/apache/spark/pull/21122#discussion_r186138866 --- Diff: sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala --- @@ -1354,7 +1354,8 @@ class HiveDDLSuite val

[GitHub] spark issue #21070: [SPARK-23972][BUILD][SQL] Update Parquet to 1.10.0.

2018-05-04 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21070 @maropu, I'd recommend looking at the Parquet files using [`parquet-cli`](http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22parquet-cli%22) to see if you're getting reasonable min/max stats for your

[GitHub] spark issue #21118: SPARK-23325: Use InternalRow when reading with DataSourc...

2018-05-04 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21118 @cloud-fan, let me clarify what I'm getting at here. It appears that Spark makes at least one copy of data to unsafe when reading any Parquet row. If the projection includes partition

[GitHub] spark issue #21145: [SPARK-24073][SQL]: Rename DataReaderFactory to ReadTask...

2018-05-04 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/21145 @gengliangwang, we can follow up with a rename for the streaming classes that already use this API. But there is no need to do that right now and make this commit larger. I think I've
