Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21122#discussion_r186119360
--- Diff:
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveDDLSuite.scala
---
@@ -1354,7 +1354,8 @@ class HiveDDLSuite
val
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
@maropu, I suspect that the problem is that comparison is different for
strings: `"17297598712"` is less than `"5"` wi
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
@gatorsmile, are you happy committing this with the benchmark results?
@maropu, thanks for taking the time to add these benchmarks; it is really
great to have them so we can monitor performance.
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21118
@cloud-fan, actually I tried a lot of different queries yesterday,
including joins and aggregations. The only thing that didn't work was `collect`
for a `select * from t` because `SparkPlan` assumes rows are `UnsafeRow`.
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
@maropu, looking at the pushdown benchmark, it looks like ORC and Parquet
either both benefit or both do not benefit from pushdown. In some cases ORC is
much faster, which is due to the fact
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
@maropu, are you sure about the INT and FLOAT columns? I think you might
have that assessment backwards. Here are the INT results from the PR gist:
```
SQL Single INT Column Scan
```
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21118
@cloud-fan and @jose-torres: I looked at `explain codegen` for reading from
a Parquet table (with vectorized reads disabled) and it doesn't look like there
is a dependency on `UnsafeRow`.
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21143
@cloud-fan, that's kind of the point I was trying to make. It is too
difficult to do whole-stage codegen, but we could add a codegen filter before
whole-stage codegen. Why make the implementation
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21145#discussion_r185310952
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchReader.scala
---
@@ -299,13 +299,13 @@ private[kafka010
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21143
@cloud-fan, union doesn't really help. I already have support for
mixed-formats working just fine. The format isn't the problem, it is filtering
(and a similar problem with projection). Parquet
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21145
@cloud-fan and @henryr, do you have an opinion about naming here?
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
@mswit-databricks, I wouldn't worry about that. We've limited the length of
binary and string fields.
In the next version of Parquet, we're planning on releasing page indexes,
which
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21122
Thanks for pointing this out, @henryr. This looks like a good change to
support multiple catalogs.
I think it looks fine, other than exposing `unwrapped` to get the Hive
client. I think
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21122#discussion_r185138677
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/ExternalCatalog.scala
---
@@ -31,10 +30,16 @@ import
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21123#discussion_r185055954
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
---
@@ -213,6 +215,26 @@ case class
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
Is it necessary to block this commit on benchmarks? We know that it doesn't
degrade performance from the Parquet benchmarks and TPC-DS run. What do you
think needs to be done before this can move forward?
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
Yes, it is safe to use push-down for string columns in data written with
this version.
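For context, a hedged sketch of exercising string pushdown from the SQL side; the path and session setup are illustrative, though `spark.sql.parquet.filterPushdown` is the conf that gates this behavior:
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("pushdown-example").getOrCreate()
import spark.implicits._

// Parquet filter pushdown is on by default; set explicitly for clarity.
spark.conf.set("spark.sql.parquet.filterPushdown", "true")

// With correct min/max stats (Parquet 1.10.0), a string predicate like this
// can skip row groups whose stats rule out any match.
spark.read.parquet("/tmp/strings-example").filter($"name" > "m").count()
```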
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21145
I think `ReadTask` is fine. That name does not imply that you can use the
object itself to read, but it does correctly show that it is one task in a
larger operation. I think the name implies
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21145
@arunmahadevan, the problem is that the current naming is misleading. This
is not a factory (it only produces one specific reader) and it does not have
the same lifecycle as the write-side factory.
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21143
Thanks for working on this, @cloud-fan! I was thinking about needing it
just recently so that data sources can delegate to Spark when needed.
I'll have a thorough look at it tomorrow.
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
Thank you @maropu!
What resources does the run require? Is it something we could create a
Jenkins job to run?
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21145#discussion_r184235329
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchReader.scala
---
@@ -299,13 +299,13 @@ private[kafka010
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21118#discussion_r184216025
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaContinuousReader.scala
---
@@ -86,7 +87,7 @@ class
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
The fix for PARQUET-686 was to suppress min/max stats. It is safe to push
filters, but those filters can't be used without the stats. 1.10.0 has the
correct stats and can use those filters.
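To make the stats point concrete, a toy sketch (not Parquet's actual classes) of why a pushed filter is inert when min/max stats are suppressed:
```scala
// A row group can be skipped only when its stats are present. With stats
// suppressed (the PARQUET-686 mitigation), every row group must be read
// even though the filter was pushed down.
object StatsSkipping extends App {
  case class IntStats(min: Option[Int], max: Option[Int])

  // Predicate: value > threshold. Skip the group if no value can match.
  def canSkipRowGroup(stats: IntStats, threshold: Int): Boolean =
    stats.max.exists(_ <= threshold)

  println(canSkipRowGroup(IntStats(Some(1), Some(4)), threshold = 10)) // true: skip
  println(canSkipRowGroup(IntStats(None, None), threshold = 10))       // false: must read
}
```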
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21070#discussion_r184156731
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala
---
@@ -503,7 +503,7 @@ class
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
There are two main reasons to update. First, the problem behind SPARK-17213
is fixed, hence the new min and max fields. Second, this updates the internal
byte array management, which is needed
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21070#discussion_r184139736
--- Diff:
sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/InMemoryColumnarQuerySuite.scala
---
@@ -503,7 +503,7 @@ class
GitHub user rdblue opened a pull request:
https://github.com/apache/spark/pull/21145
SPARK-24073: Rename DataReaderFactory to ReadTask.
## What changes were proposed in this pull request?
This reverses the changes in SPARK-23219, which renamed ReadTask
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
> Based on the previous upgrade (e.g., #13280 (comment)), we hit the
performance regressions when we upgrade Parquet and we did the revert at the
end.
I should point out that the regress
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
Okay, I don't have the time to set up and run benchmarks without a stronger
case for this causing a regression (given the Parquet testing), but other
people should feel free to pick this up.
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
@cloud-fan, is there a Jenkins job to run it?
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21118
Yeah, we should probably add a projection. It's probably only working
because the InternalRows that are produced are all `UnsafeRow`s.
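A hedged sketch of the kind of projection being suggested, using Spark's `UnsafeProjection`; the helper name and wiring are illustrative, not the eventual patch:
```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Attribute, UnsafeProjection}

// Convert arbitrary InternalRows from a v2 reader into UnsafeRows before
// they reach operators that assume the unsafe format.
def toUnsafe(rows: Iterator[InternalRow], output: Seq[Attribute]): Iterator[InternalRow] = {
  val proj = UnsafeProjection.create(output, output)
  // The projection reuses an internal buffer, so copy rows that are retained.
  rows.map(proj)
}
```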
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21118
@jose-torres, @cloud-fan, can you take a look at this? It updates the v2
API to use InternalRow by default.
GitHub user rdblue opened a pull request:
https://github.com/apache/spark/pull/21118
SPARK-23325: Use InternalRow when reading with DataSourceV2.
## What changes were proposed in this pull request?
This updates the DataSourceV2 API to use InternalRow instead of Row.
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
Retest this please.
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/2
Thanks for doing this as a follow-up. I had one minor comment, but
otherwise I'm +1. I see what you mean about using `PhysicalOperation` now. It
is slightly cleaner and I guess `PhysicalOperation`
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/2#discussion_r183101013
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala
---
@@ -114,11 +119,8 @@ case class
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/20988#discussion_r182858087
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala
---
@@ -129,35 +151,41 @@ case class
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21029#discussion_r182596471
--- Diff:
external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaMicroBatchReader.scala
---
@@ -146,7 +146,7 @@ private[kafka010] class
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21029#discussion_r182596392
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceRDD.scala
---
@@ -17,52 +17,85 @@
package
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21029#discussion_r182596274
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/streaming/MicroBatchReader.java
---
@@ -17,12 +17,12 @@
package
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21029#discussion_r182596153
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/DataReaderFactory.java
---
@@ -52,10 +56,46
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/20988
@cloud-fan, I've added the test. Thanks for letting me know about
`HiveCatalogMetrics`; that's exactly what I needed.
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
I backported the Hadoop zstd codec to 2.7.3 without much trouble. But
either way, I think that's a separate concern. This is about getting Parquet
updated, not about worrying whether users can
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/20988#discussion_r182579387
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/OptimizeMetadataOnlyQuery.scala
---
@@ -129,35 +151,41 @@ case class
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
@scottcarey, Parquet will use the compressors if they are available. You
can add them from an external Jar and it will work. LZ4 should also work out of
the box because it is included in Hadoop 2.7.
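For example, a minimal sketch (path and app name illustrative, and assuming a Spark build where `lz4` is an accepted value for this conf) of selecting a codec through the standard SQL conf; any codec works as long as its Hadoop implementation is on the classpath:
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("codec-example").getOrCreate()

// lz4 ships with Hadoop 2.7; zstd needs the backported codec on the classpath.
spark.conf.set("spark.sql.parquet.compression.codec", "lz4")
spark.range(1000).write.parquet("/tmp/lz4-example")
```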
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21070#discussion_r182563063
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java
---
@@ -63,115 +59,159 @@ public final
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21070#discussion_r181899509
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java
---
@@ -63,115 +58,139 @@ public final
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21070#discussion_r181883674
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java
---
@@ -63,115 +58,139 @@ public final
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21071
Some metrics to convince ourselves that using the null scope has no
performance impact would be great.
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21070#discussion_r181864492
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java
---
@@ -63,115 +58,139 @@ public final
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21071
@devaraj-kavali, do you have any measurements to quantify how this impacts
overall performance? We would want to know before releasing this for use
because using HTrace means having it on all
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21060
I agree with what @srowen said.
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21070#discussion_r181826671
--- Diff: pom.xml ---
@@ -129,7 +129,7 @@
    <hive.version.short>1.2.1</hive.version.short>
    <derby.version>10.12.1.1</derby.version>
-   <parquet.version>1.8.2</parquet.version>
+   <parquet.version>1.10.0</parquet.version>
--- End diff --
I excluded
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/21070#discussion_r181535144
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedPlainValuesReader.java
---
@@ -63,115 +58,139 @@ public final
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/21070
Upstream benchmarks for buffer management changes are here:
https://github.com/apache/parquet-mr/pull/390#issuecomment-338505426
That doesn't show the GC benefit for smaller buffers.
GitHub user rdblue opened a pull request:
https://github.com/apache/spark/pull/21070
SPARK-23972: Update Parquet to 1.10.0.
## What changes were proposed in this pull request?
This updates Parquet to 1.10.0 and updates the vectorized path for buffer
management changes.
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/20933#discussion_r181509305
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
---
@@ -368,8 +368,7 @@ case class FileSourceScanExec
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/20988
@cloud-fan or @gatorsmile, could you review this?
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/20933#discussion_r179811255
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
---
@@ -368,8 +368,7 @@ case class FileSourceScanExec
GitHub user rdblue opened a pull request:
https://github.com/apache/spark/pull/20988
[SPARK-23877][SQL]: Use filter predicates to prune partitions in
metadata-only queries
## What changes were proposed in this pull request?
This updates the OptimizeMetadataOnlyQuery rule
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/20933#discussion_r179521697
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -1185,6 +1185,13 @@ object SQLConf {
.stringConf
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/20933#discussion_r179521502
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala
---
@@ -213,6 +215,26 @@ case class
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/20933#discussion_r179521100
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FilePartitionUtil.scala
---
@@ -0,0 +1,225 @@
+/*
+ * Licensed
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/20933#discussion_r179520643
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/orc/OrcDataSourceV2.scala
---
@@ -0,0 +1,194 @@
+/*
+ * Licensed
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/20933#discussion_r179516352
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/FileReaderFactory.scala
---
@@ -0,0 +1,50 @@
+/*
+ * Licensed
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/20933#discussion_r179514000
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala
---
@@ -368,8 +368,7 @@ case class FileSourceScanExec
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/20933#discussion_r179513582
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala ---
@@ -187,6 +189,14 @@ class DataFrameReader private[sql](sparkSession
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/20397
@cloud-fan, can we revert this?
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/20579
I agree, we should probably add a check for storing a DataFrame with no
columns for now. This is normally caught by the pre-insert rules, but since the
table is getting "created" in
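A hedged sketch of the kind of check meant here; the helper is illustrative, not the actual rule:
```scala
import org.apache.spark.sql.DataFrame

// Reject a DataFrame with no columns before it reaches table creation,
// mirroring what the pre-insert rules would normally catch.
def validateHasColumns(df: DataFrame): Unit =
  require(df.schema.nonEmpty, "Cannot store a DataFrame with no columns")
```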
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/20752
> Right now, we're still at the point where we aren't quite sure what a
streaming API needs to look like. We're starting from basically ground zero;
the V1 streaming API just throws a DataFrame
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/20752
Thanks for the clear summary, @cloud-fan. I think we want to make it easy
to support batch, and then easy to reuse those internals to support streaming
by adding new mix-in interfaces. Streaming
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/20752#discussion_r172621472
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/streaming/StreamingDataWriterFactory.java
---
@@ -0,0 +1,51
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/20752#discussion_r172620679
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/streaming/StreamWriter.java
---
@@ -27,6 +28,9 @@
*
* Streaming queries
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/20752#discussion_r172616831
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/WriteToDataSourceV2.scala
---
@@ -54,7 +55,14 @@ case class
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/20752#discussion_r172616601
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/streaming/StreamWriter.java
---
@@ -27,6 +28,9 @@
*
* Streaming queries
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/20752#discussion_r172613993
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/streaming/StreamingDataWriterFactory.java
---
@@ -0,0 +1,51
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/20710
Epoch ID is not a valid part of the logical place in a query for batch. I
think we should separate batch and streaming, as they are already coming from
different interfaces. There's no need to pass
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/20710
> Data source writers need to be able to reason about what progress they've
made, which is impossible in the streaming case if each epoch is its own
disconnected query.
I don't th
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/20710
My question is: why can't we use a batch interface for batch and
micro-batch (which behaves like batch) and add a separate streaming interface
for continuous streaming? I see no reason to have epoch IDs in the batch API.
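A hedged sketch of the separation being argued for (names are illustrative, not Spark's actual v2 API): the batch factory has no epoch parameter, and only a separate streaming variant threads an epoch ID through:
```scala
// Writers produced by either factory look the same; only the streaming
// factory needs to know which epoch a writer belongs to.
trait DataWriter[T] {
  def write(record: T): Unit
  def commit(): Unit
}

trait BatchWriterFactory[T] {
  def createWriter(partitionId: Int, attemptNumber: Int): DataWriter[T]
}

trait StreamingWriterFactory[T] {
  def createWriter(partitionId: Int, attemptNumber: Int, epochId: Long): DataWriter[T]
}
```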
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/20710
Could the non-continuous streaming mode just use the batch interface, since
each write is basically separate?
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/20710
@jose-torres, can you explain that more for me? Why would callers only use
one interface but not the other? Wouldn't streaming use one and batch the
other? Why would batch need to know about
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/20710
@tdas, thanks for letting us know. I'm really wondering if we should be
using the same interfaces between batch and streaming. The epoch id strikes me
as strange for data sources that won't support
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/20726#discussion_r172324207
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala
---
@@ -46,34 +48,46 @@ case class
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/20726#discussion_r172320487
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala
---
@@ -46,34 +48,46 @@ case class
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/20710#discussion_r172319605
--- Diff:
sql/core/src/main/java/org/apache/spark/sql/sources/v2/writer/DataWriterFactory.java
---
@@ -48,6 +48,9 @@
* same
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/20726
Looks good to me other than a minor point on the private `readerFactories`
val.
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/20726#discussion_r172317377
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2ScanExec.scala
---
@@ -46,34 +48,46 @@ case class
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/20692
Good point on nested types. I don't think heavy nesting is the usual case,
but we can definitely improve the explain result in the long term by separating
it out. Maybe just using a high-level type
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/20692
My intent is just to advocate for clear feedback on the content of PRs.
Good to hear your confidence in @mgaido91, and if he wants to work on a better
explain, that's great too.
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/20692
> I think you can add the patch to your fork.
My primary concern isn't the feature, although I do think it is useful. My
concern is how we work with contributors. My worry h
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/20692
> this is not urgent based on our release schedule.
Marco is contributing this right now. It is a bad idea to ask contributors
to show up in 4 months if we don't have a better option.
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/20692
I don't think it is a good idea to block a fix like this on a redesign of
explain plans. If that redesign were currently being worked on and this made it
more difficult, it might be reasonable
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/20692
I don't see a compelling reason to block this work, which is well-defined
and a reasonably small patch. What is here would help users with debugging.
@gatorsmile, if you have an alternative
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/20692
@mgaido91, I'm happy without another option. Let's add it in the future if
we find that we need it, but not assume that we will. This change should give
most of the information needed to debug types.
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/20692#discussion_r171626542
--- Diff:
sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/CreateHiveTableAsSelectCommand.scala
---
@@ -94,7 +94,7 @@ case class
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/20692#discussion_r171624644
--- Diff:
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala
---
@@ -106,7 +106,7 @@ abstract class Attribute
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/20692
I agree with @cloud-fan that the type information doesn't need to be in
every node of the plan, just in the scans and generators. We want enough
information that we can tell what types
Github user rdblue commented on the issue:
https://github.com/apache/spark/pull/20647
Looks like there's an unnecessary redact call in there and I found a couple
of nits, but I think this is ready to go.
+1
Github user rdblue commented on a diff in the pull request:
https://github.com/apache/spark/pull/20647#discussion_r171618852
--- Diff:
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DataSourceV2StringFormat.scala
---
@@ -0,0 +1,98