Thanks for speaking up, Shixiong!
Thanks for sharing the use case of complete mode in practice. I agree
that's a valid use case where complete mode would help, but I'm not sure
enabling complete mode is the only way to address it.
1. Given that it assumes a fairly small cardinality of the
Hey Jungtaek,
I totally agree with the issues of complete mode that you raised
here. However, not every streaming query has unbounded state that
quickly grows to an unmanageable size.
Actually, I have found complete mode pretty useful when the state is
bounded and small, for example an aggregation over a small, fixed set of
keys (a sketch follows below).
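For illustration, here is a minimal sketch of that kind of bounded-state
query. This is my own example, not from the thread; the rate source and
the modulo bucketing are assumptions chosen to keep the key space fixed:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.OutputMode

val spark = SparkSession.builder().appName("complete-mode-sketch").getOrCreate()
import spark.implicits._

// The built-in "rate" test source emits (timestamp, value) rows; bucketing
// value modulo 10 bounds the grouping key to 10 distinct values, so the
// aggregation state stays small no matter how long the query runs.
val counts = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()
  .withColumn("bucket", $"value" % 10)
  .groupBy("bucket")
  .count()

// Complete mode re-emits the whole (tiny) result table on every trigger.
counts.writeStream
  .outputMode(OutputMode.Complete())
  .format("console")
  .start()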
Let me share the effect of removing the incomplete and undocumented code
path. I tried removing it manually, and here is the change:
https://github.com/HeartSaVioR/spark/commit/aa53e9b1b33c0b8aec37704ad290b42ffb2962d8
1,120 lines deleted without breaking any existing streaming tests,
I think those are fair concerns; I was mostly just updating tests for RC2
and adding in "append" everywhere.
Code like

spark.sql(s"SELECT a, b FROM $ks.test1")
  .write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "test_insert1", "keyspace" -> ks))
  .save()
Now fails at
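A hedged workaround sketch (my addition, assuming the failure comes from
relying on the default SaveMode): pinning the mode explicitly keeps the
behavior the same whether the source resolves to v1 or v2:

import org.apache.spark.sql.SaveMode

spark.sql(s"SELECT a, b FROM $ks.test1")
  .write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> "test_insert1", "keyspace" -> ks))
  .mode(SaveMode.Append)  // explicit mode, no reliance on the default
  .save()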
The context on this is that the change in the default mode was confusing:
it introduced different behavior for the same user code when moving from v1
to v2. Burak pointed this out, and I agree that it's weird that if your
dependency changes from v1 to v2, your compiled Spark job starts appending
Hey Russell,
Great catch on the documentation; it seems out of date. I am honestly
against different DataSources having different default SaveModes. Users
will have no clue whether a DataSource implementation is V1 or V2. It
seems weird that the default value can change for something that I
While the ScalaDocs for DataFrameWriter say
/**
* Specifies the behavior when data or table already exists. Options include:
*
* `SaveMode.Overwrite`: overwrite the existing data.
* `SaveMode.Append`: append the data.
* `SaveMode.Ignore`: ignore the operation (i.e. no-op).
*
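For reference, a quick illustrative sketch of those SaveMode options in
use; the parquet path and the DataFrame are assumptions of mine, and
`SaveMode.ErrorIfExists` (the v1 default) presumably sits in the part of
the doc comment elided above:

import org.apache.spark.sql.{DataFrame, SaveMode}

def saveAllModes(df: DataFrame): Unit = {
  df.write.mode(SaveMode.Overwrite).parquet("/tmp/out")     // replace existing data
  df.write.mode(SaveMode.Append).parquet("/tmp/out")        // add to existing data
  df.write.mode(SaveMode.Ignore).parquet("/tmp/out")        // silent no-op if data exists
  df.write.mode(SaveMode.ErrorIfExists).parquet("/tmp/out") // throw if data exists (the v1 default)
}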
Okay, I took a look at the PR and I think it should be okay. The new
classes are unfortunately public, but they live in catalyst, which is
considered private. So this is the approach we discussed.
I'm fine with the commit, other than the fact that it violated ASF norms
Why was https://github.com/apache/spark/pull/28523 merged with a -1? We
discussed this months ago and concluded that it was a bad idea to introduce
a new v2 API that cannot have reliable behavior across sources.
The last time I checked that PR, the approach I discussed with Tathagata
was to not
It seems the priority of SPARK-31706 was marked incorrectly; it is a
blocker now. The fix was merged just a few hours ago.
This should be a -1 for RC2.
On Wed, May 20, 2020 at 2:42 PM rickestcode wrote:
> +1
+1