Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21574#discussion_r196173289
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala
---
@@ -17,51 +17,115 @@
package
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21574#discussion_r196172875
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
---
@@ -105,117 +105,57 @@ case class
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21574
@rxin, we can also add the second pushdown (in the stats visitor) to get
better stats with a property to turn it on or off. We're going to add it back
in our branch anyway
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21503
@cloud-fan, tests are passing for c8517e145b1a460a8be07164c17ce20b1db86659,
which has all of the functional changes. The Jenkins job ran out of memory for
the last commit, but the only change
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21558
> IMO your change is the right fix, not just a workaround
@squito, part of the problem is that the output commit coordinator -- that
ensures only one attempt of a task commits -- rel
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21558
> So the problem here is, when we retry a stage, Spark doesn't kill the
tasks of the old stage and just launch tasks for the new stage
I think that's something that should be fi
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21558
@cloud-fan, this is a work-around for SPARK-24552. I'm not sure the right
way to fix this besides fixing the scheduler so that it doesn't use task
attempt numbers twice, but I think this works
GitHub user rdblue opened a pull request:
https://github.com/apache/spark/pull/21558
[SPARK-24552][SQL] Use task ID instead of attempt number for v2 writes.
## What changes were proposed in this pull request?
This passes the unique task attempt id instead of attempt number
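The truncated description above concerns replacing the per-stage attempt number with the globally unique task attempt ID in v2 writes. A minimal sketch (illustrative only, not Spark's scheduler code; all names here are hypothetical) of why the distinction matters when a stage is retried:

```python
# Illustrative sketch: the per-task attempt number restarts at 0 for each
# stage attempt and so can collide across a stage retry, while a global
# task attempt ID drawn from one counter cannot.
from itertools import count

task_id_counter = count()  # one monotonically increasing counter per app

def launch_stage_attempt(num_partitions):
    """Return (partition, attempt_number, task_attempt_id) triples."""
    return [(p, 0, next(task_id_counter)) for p in range(num_partitions)]

first = launch_stage_attempt(2)   # original stage attempt
retry = launch_stage_attempt(2)   # stage retried: attempt numbers repeat

# The (partition, attemptNumber) pairs repeat across the two stage attempts...
assert {(p, n) for p, n, _ in first} == {(p, n) for p, n, _ in retry}
# ...but every task attempt ID is unique, so it is the safer value for
# distinguishing writers during commit coordination.
ids = [t for _, _, t in first + retry]
assert len(ids) == len(set(ids))
```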
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21503
Updated the stats interface.
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21503#discussion_r195173700
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Relation.scala
---
@@ -32,79 +31,35 @@ import
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21503
Retest this please.
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21503#discussion_r195138932
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala
---
@@ -17,15 +17,56 @@
package
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21503#discussion_r194875888
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala
---
@@ -17,15 +17,56 @@
package
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21503#discussion_r194861645
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala
---
@@ -17,15 +17,56 @@
package
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21503#discussion_r194841328
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala
---
@@ -17,15 +17,56 @@
package
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21503#discussion_r194829647
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2Strategy.scala
---
@@ -17,15 +17,56 @@
package
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21503
@cloud-fan, this is the PR for moving push-down to the physical plan
conversion and reporting the stats correctly. Sorry for the confusion because I
sent a link to just the second commit
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21319
Here's the commit with my changes to support v2 stats in the visitor, sorry
it took so long for me to find the time!
https://github.com/apache/spark/pull/21503/commits
GitHub user rdblue opened a pull request:
https://github.com/apache/spark/pull/21503
[SPARK-24478][SQL] Move projection and filter push down to physical
conversion
## What changes were proposed in this pull request?
This removes the v2 optimizer rule for push-down
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21319
I'll open a PR to demonstrate what I'm proposing. It wouldn't change the
plan; it would use a reader and push-down to report stats correctly as a
temporary fix.
I'm -1 on this PR. I'll have
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21308#discussion_r191021410
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/sources/v2/DeleteSupport.java ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21308#discussion_r190723800
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/sources/v2/DeleteSupport.java ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21308#discussion_r190712870
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/sources/v2/DeleteSupport.java ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21305#discussion_r190711154
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
---
@@ -344,6 +344,36 @@ case class Join
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21305#discussion_r190685985
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
---
@@ -344,6 +344,36 @@ case class Join
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21295
Thanks for merging, @cloud-fan, and thanks for the reviews everyone! I've
opened PARQUET-1309 to track the Parquet fix for the properties
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21370#discussion_r190683568
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -347,13 +347,30 @@ def show(self, n=20, truncate=True, vertical=False):
name | Bob
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21370#discussion_r190683035
--- Diff: python/pyspark/sql/dataframe.py ---
@@ -347,13 +347,30 @@ def show(self, n=20, truncate=True, vertical=False):
name | Bob
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21370#discussion_r190682693
--- Diff: docs/configuration.md ---
@@ -456,6 +456,29 @@ Apart from these, the following properties are also
available, and may be useful
from JVM
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21370#discussion_r190682484
--- Diff: docs/configuration.md ---
@@ -456,6 +456,29 @@ Apart from these, the following properties are also
available, and may be useful
from JVM
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21411
I'm fine with whatever changes you want to make here because we don't use
Parquet summary files.
As always, I'll note that I think it is a bad idea to support the summary
files in general
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21319
@cloud-fan, what about adding support for v2 pushdown in the stats visitor
instead?
Here's the idea: when the visitor hits a `Filter` or a `Project`, it tries
to match the plan using
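The idea in the comment above is a stats visitor that pushes filters and projections into the source before estimating sizes. A toy sketch of that shape (class and function names here are illustrative stand-ins, not Spark's `Filter`/`Project` operators or its stats code):

```python
# Toy plan-visitor sketch: when the walk reaches a filter or projection
# above a source relation, fold it into the estimate instead of ignoring
# it, so the reported stats reflect pushdown.
from dataclasses import dataclass, field

@dataclass
class Relation:
    row_count: int
    pushed_filters: list = field(default_factory=list)

@dataclass
class FilterNode:
    condition: str
    child: object

@dataclass
class ProjectNode:
    columns: list
    child: object

def estimate_rows(plan, selectivity=0.5):
    """Walk the plan, folding filters/projections into the estimate."""
    if isinstance(plan, Relation):
        return plan.row_count
    if isinstance(plan, FilterNode):
        # Apply an assumed selectivity for the pushed filter.
        return int(estimate_rows(plan.child, selectivity) * selectivity)
    if isinstance(plan, ProjectNode):
        # Projection changes width, not row count.
        return estimate_rows(plan.child, selectivity)
    raise TypeError(f"unknown plan node: {plan!r}")

plan = ProjectNode(["a"], FilterNode("a > 0", Relation(row_count=1000)))
assert estimate_rows(plan) == 500
```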
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21295
Thanks for looking at this, everyone. Sorry for the delay in updating it,
I'm currently out on paternity leave and don't have a lot of time. I'll get an
update pushed sometime soon though
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21295#discussion_r190379077
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
---
@@ -147,7 +147,8 @@ public
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21295#discussion_r190378887
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
---
@@ -225,7 +226,8 @@ protected
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21295#discussion_r190377736
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala
---
@@ -879,6 +879,18 @@ class
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21295#discussion_r189748585
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala
---
@@ -879,6 +879,18 @@ class
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21295#discussion_r189748452
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
---
@@ -147,7 +147,8 @@ public
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21295#discussion_r189748419
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala
---
@@ -879,6 +879,18 @@ class
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21242
@rxin, after a few attempts at cleanly exposing these classes, I tend to
agree that it isn't going to be worth it. But the problem is that we need an
API for data sources to produce. What's worse
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21370#discussion_r189733089
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala ---
@@ -237,9 +238,13 @@ class Dataset[T] private[sql](
* @param truncate If set
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21370#discussion_r189732998
--- Diff: docs/configuration.md ---
@@ -456,6 +456,29 @@ Apart from these, the following properties are also
available, and may be useful
from JVM
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21370
@rxin, `__repr__` is the equivalent for IPython and the Python REPL.
`_repr_html_` is the convention used by Jupyter to replicate `__repr__` in
notebooks with HTML output
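The display protocol the comment refers to works like this: Jupyter looks for a `_repr_html_` method on an object and falls back to `__repr__` when it is absent. A small self-contained example (the `Table` class is made up for illustration):

```python
# Demonstrates the IPython/Jupyter rich display convention:
# __repr__ gives the plain-text form, _repr_html_ the notebook HTML form.
class Table:
    def __init__(self, rows):
        self.rows = rows

    def __repr__(self):
        # Plain-text form used by the Python/IPython REPL.
        return "\n".join(" | ".join(map(str, r)) for r in self.rows)

    def _repr_html_(self):
        # HTML form that Jupyter renders in notebook output cells.
        cells = "".join(
            "<tr>" + "".join(f"<td>{c}</td>" for c in r) + "</tr>"
            for r in self.rows
        )
        return f"<table>{cells}</table>"

t = Table([["name", "Bob"], ["age", 32]])
assert repr(t) == "name | Bob\nage | 32"
assert t._repr_html_().startswith("<table><tr><td>name</td>")
```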
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21326
Belated +1.
In the future, I think it is better for readability to pluralize linked
terms as I suggested in a comment
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21326#discussion_r189730557
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataSourceReader.java
---
@@ -59,21 +59,21 @@
* Returns the actual schema
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21308#discussion_r189730152
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/sources/v2/DeleteSupport.java ---
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21319
@cloud-fan, is this necessary if we make the stats changes necessary to
move push-down to when the logical plan is converted to the physical plan? I
don't think it will be because we don't create
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21305#discussion_r189728174
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala
---
@@ -344,6 +344,36 @@ case class Join
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21295#discussion_r187809226
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala
---
@@ -879,6 +879,18 @@ class
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21295#discussion_r187809171
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java
---
@@ -147,7 +147,8 @@ public
GitHub user rdblue opened a pull request:
https://github.com/apache/spark/pull/21308
SPARK-24253: Add DeleteSupport mix-in for DataSourceV2.
## What changes were proposed in this pull request?
Adds `DeleteSupport` mix-in for `DataSourceV2`. This mix-in provides a
method
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21306
@henryr, @cloud-fan, @marmbrus, here's a first pass at adding a catalog
mix-in to the v2 API. Please have a look and leave comments on what you'd like
to change.
One thing that I don't
GitHub user rdblue opened a pull request:
https://github.com/apache/spark/pull/21306
[SPARK-24252][SQL] Add DataSourceV2 mix-in for catalog support.
## What changes were proposed in this pull request?
This adds a mix-in to `DataSourceV2` that allows implementations
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/20488
Closing this in favor of #21305.
Github user rdblue closed the pull request at:
https://github.com/apache/spark/pull/20488
GitHub user rdblue opened a pull request:
https://github.com/apache/spark/pull/21305
[SPARK-24251][SQL] Add AppendData logical plan.
## What changes were proposed in this pull request?
This adds a new logical plan, AppendData, that was proposed in SPARK-23521:
Standardize
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21302
@gatorsmile, that is correct.
https://github.com/apache/parquet-mr/commits/apache-parquet-1.8.3
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21302
+1 when tests are passing.
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21302
@henryr, why not backport the test case in this commit? I don't think it
makes sense to separate the two because that test verifies this commit
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21295
@gatorsmile and @dongjoon-hyun, can you take a look at this? It is a commit
I missed for the Parquet 1.10.0 update
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21118#discussion_r187465762
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/SupportsScanUnsafeRow.java
---
@@ -1,46 +0,0 @@
-/*
- * Licensed
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21118
@cloud-fan, @jose-torres, I think this is ready for final review. I've
rebased on top of the rename to `InputPartition`.
I've also added a projection when this produces a physical plan
GitHub user rdblue opened a pull request:
https://github.com/apache/spark/pull/21295
[SPARK-24230][SQL] Fix SpecificParquetRecordReaderBase with dictionary
filters.
## What changes were proposed in this pull request?
I missed this commit when preparing #21070
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21230
+1 (assuming tests pass)
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21145
Thanks @jose-torres! I appreciate not blocking this commit on those
changes, since it would have been difficult to keep this up to date as the
other paths changed while we discussed what to call
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21145#discussion_r187085278
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataSourceReader.java
---
@@ -76,5 +76,5 @@
* If this method fails
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21143
@cloud-fan, here's a commit that demonstrates the idea and implementation:
https://github.com/rdblue/spark/commit/b41eb1ef2af38c510f5426a096c586a93e4a5556
That adds `residualFilters
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21143
I think we would only need `DataSourceReader` to implement
`SupportsPushDownFilter` because it is primarily used to push filters to the
data source and the query's filters are determined while
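The contract discussed in these comments is that a source accepts the filters it can evaluate and hands the rest back (the residual filters mentioned in the linked commit) for Spark to apply after the scan. A toy sketch of that split, with a hypothetical operator set rather than the actual DataSourceV2 API:

```python
# Illustrative filter pushdown split: the source keeps filters it can
# evaluate and returns the unsupported ones as residuals for the engine
# to apply on top of the scan.
SUPPORTED_OPS = {">", "<", "="}  # assumed capability of the toy source

def push_filters(filters):
    """Split (column, op, value) filters into (pushed, residual)."""
    pushed = [f for f in filters if f[1] in SUPPORTED_OPS]
    residual = [f for f in filters if f[1] not in SUPPORTED_OPS]
    return pushed, residual

pushed, residual = push_filters([("a", ">", 5), ("b", "like", "x%")])
assert pushed == [("a", ">", 5)]
assert residual == [("b", "like", "x%")]
```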
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21145
@cloud-fan, I've updated this PR to use `InputPartition` and similar names,
since we seem to have consensus around
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21070#discussion_r186793493
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java
---
@@ -63,115 +59,157 @@ public final
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21070#discussion_r186789552
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedRleValuesReader.java
---
@@ -619,32 +608,37 @@ private int
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21070#discussion_r186785996
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java
---
@@ -63,115 +59,157 @@ public final
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21118
> If we want to go this way, I think we should fully bring back #10511 to
make this contract explicitly, i.e. which operator produce unsafe row and which
operator only accepts unsafe row as in
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21230
Sounds good to me. Let's plan on getting this one in to fix the current
problem, and commit the other approach when stats are fixed
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21145#discussion_r186587220
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ReadTask.java ---
@@ -22,20 +22,20 @@
import
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21118
@cloud-fan: This PR is also related to #21262 because that PR updates the
conversion from logical to physical plan and handles projections and filtering.
We could modify that strategy to always
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21230
@cloud-fan, I opened #21262 that is similar to this, but does pushdown when
converting to a physical plan. You might like that as an alternative because it
cleans up `DataSourceV2Relation` quite
GitHub user rdblue opened a pull request:
https://github.com/apache/spark/pull/21262
[SPARK-24172][SQL]: Push projection and filters once when converting to
physical plan.
## What changes were proposed in this pull request?
This removes `PruneFileSourcePartitions
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21145#discussion_r186498369
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ReadTask.java ---
@@ -22,20 +22,20 @@
import
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21118
> We expect data source to produce `ColumnarBatch` for better performance,
and the row interface performance is not that important.
I disagree. The vectorized path isn't used for all Parq
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21230
So it was the use of `transformUp` that caused these rules to match multiple
times, right? In that case, would it make more sense to do what @marmbrus
suggested in the immutable plan PR and make
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21070#discussion_r186469029
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -345,7 +345,7 @@ object SQLConf {
"snappy, gzip
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21070#discussion_r186468896
--- Diff: dev/deps/spark-deps-hadoop-2.7 ---
@@ -163,13 +163,13 @@ orc-mapreduce-1.4.3-nohive.jar
oro-2.0.8.jar
osgi-resource-locator-1.0.1.jar
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21070#discussion_r186464674
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java
---
@@ -63,115 +59,157 @@ public final
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21070#discussion_r186464557
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java
---
@@ -63,115 +59,157 @@ public final
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21118
> Actually the `SupportsScanUnsafeRow` is only there to avoid perf
regression for migrating file sources. If you think that's not a good public
API, we can move it to internal package and only
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21122#discussion_r186452677
--- Diff:
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
---
@@ -1354,7 +1354,8 @@ class HiveDDLSuite
val
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21122#discussion_r186240975
--- Diff:
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
---
@@ -1354,7 +1354,8 @@ class HiveDDLSuite
val
GitHub user rdblue opened a pull request:
https://github.com/apache/spark/pull/21242
[SPARK-23657][SQL] Document and expose the internal data API
## What changes were proposed in this pull request?
This makes the `InternalRow`, `ArrayData`, and `MapData` classes public
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21122#discussion_r186205362
--- Diff:
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
---
@@ -1354,7 +1354,8 @@ class HiveDDLSuite
val
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21237
This is a follow-up to #21118.
GitHub user rdblue opened a pull request:
https://github.com/apache/spark/pull/21237
[SPARK-23325][WIP] Test parquet returning internal row
## What changes were proposed in this pull request?
This updates `ParquetFileFormat` to return `InternalRow` instead of
`UnsafeRow
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21145#discussion_r186163175
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ReadTask.java ---
@@ -22,20 +22,20 @@
import
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21145
@gatorsmile, the Spark UI has used the term "task" for years to refer to
the same thing. I don't think it is unreasonable to use the same term.
![tasks](ht
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21122#discussion_r186152832
--- Diff:
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
---
@@ -1354,7 +1354,8 @@ class HiveDDLSuite
val
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21118
I just did a performance test based on our 2.1.1 and a real table. I tested
a full scan of an hour of data with a single data filter.
The scan had 13,083 tasks and read 1084.8 GB. I used
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21122#discussion_r186138866
--- Diff:
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
---
@@ -1354,7 +1354,8 @@ class HiveDDLSuite
val
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
@maropu, I'd recommend looking at the Parquet files using
[`parquet-cli`](http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22parquet-cli%22)
to see if you're getting reasonable min/max stats for your
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21118
@cloud-fan, let me clarify what I'm getting at here.
It appears that Spark makes at least one copy of data to unsafe when
reading any Parquet row. If the projection includes partition
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21145
@gengliangwang, we can follow up with a rename for the streaming classes
that already use this API. But there is no need to do that right now and make
this commit larger.
I think I've