Re: [DISCUSS] "complete" streaming output mode

2020-05-20 Thread Jungtaek Lim
Thanks for speaking up, Shixiong, and for sharing a use case for complete mode in practice. I agree that's a valid use case where complete mode would help, but I'm not sure enabling complete mode is the only way to deal with it. 1. Given it assumes a pretty small cardinality of the

Re: [DISCUSS] "complete" streaming output mode

2020-05-20 Thread Shixiong(Ryan) Zhu
Hey Jungtaek, I totally agree with you about the issues with complete mode that you raised here. However, not all streaming queries have unbounded state that grows quickly to an unmanageable size. In practice, I have found complete mode pretty useful when the state is bounded and small. For example,
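The trade-off being debated can be illustrated with a toy simulation in plain Python (this is NOT Spark code; the `Aggregator` class and its method names are purely illustrative): complete mode re-emits every key's current aggregate on each trigger, while update mode emits only the keys that changed. That is why complete mode stays cheap only when the key cardinality is bounded and small, as Shixiong notes.

```python
# Toy model of streaming output modes (illustrative only, not Spark internals).
# State maps key -> running count; each "trigger" processes one micro-batch.

class Aggregator:
    def __init__(self):
        self.state = {}  # full aggregation state, kept across triggers

    def trigger(self, batch, mode):
        changed = set()
        for key in batch:
            self.state[key] = self.state.get(key, 0) + 1
            changed.add(key)
        if mode == "complete":
            # Complete mode: emit the ENTIRE state on every trigger,
            # so output size grows with total key cardinality.
            return dict(self.state)
        if mode == "update":
            # Update mode: emit only rows whose aggregate changed this trigger.
            return {k: self.state[k] for k in changed}
        raise ValueError(f"unknown mode: {mode}")

agg = Aggregator()
agg.trigger(["a", "b"], "complete")
out_complete = agg.trigger(["b"], "complete")   # 'a' is re-emitted unchanged

agg2 = Aggregator()
agg2.trigger(["a", "b"], "update")
out_update = agg2.trigger(["b"], "update")      # only the changed key
```

With many distinct keys, the complete-mode output (and retained state) keeps growing each trigger, which is the unbounded-state concern; with a small fixed key set, re-emitting the whole table each trigger is cheap and convenient.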

Re: [DISCUSS] remove the incomplete code path on aggregation for continuous mode

2020-05-20 Thread Jungtaek Lim
Let me share the effect of removing the incomplete and undocumented code path. I manually tried removing it, and here's the change: https://github.com/HeartSaVioR/spark/commit/aa53e9b1b33c0b8aec37704ad290b42ffb2962d8 1,120 lines deleted without breaking any existing streaming tests,

Re: [DatasourceV2] Default Mode for DataFrameWriter not Dependent on DataSource Version

2020-05-20 Thread Russell Spitzer
I think those are fair concerns. I was mostly just updating tests for RC2 and adding "append" everywhere. Code like

    spark.sql(s"SELECT a, b from $ks.test1")
      .write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> "test_insert1", "keyspace" -> ks))
      .save()

now fails at

Re: [DatasourceV2] Default Mode for DataFrameWriter not Dependent on DataSource Version

2020-05-20 Thread Ryan Blue
The context on this is that it was confusing that the mode changed, which introduced different behaviors for the same user code when moving from v1 to v2. Burak pointed this out and I agree that it's weird that if your dependency changes from v1 to v2, your compiled Spark job starts appending

Re: [DatasourceV2] Default Mode for DataFrameWriter not Dependent on DataSource Version

2020-05-20 Thread Burak Yavuz
Hey Russell, Great catch on the documentation; it does seem out of date. I am honestly against different DataSources having different default SaveModes. Users will have no clue whether a DataSource implementation is V1 or V2. It seems weird that the default value can change for something that I

[DatasourceV2] Default Mode for DataFrameWriter not Dependent on DataSource Version

2020-05-20 Thread Russell Spitzer
While the ScalaDocs for DataFrameWriter say:

    /**
     * Specifies the behavior when data or table already exists. Options include:
     *   - `SaveMode.Overwrite`: overwrite the existing data.
     *   - `SaveMode.Append`: append the data.
     *   - `SaveMode.Ignore`: ignore the operation (i.e. no-op).
     *
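The save-mode semantics being discussed can be sketched as a toy Python model (this is not Spark's implementation; the `save_with_mode` function and dict-based "tables" are purely illustrative):

```python
# Toy model of DataFrameWriter SaveMode semantics (illustrative, not Spark code).
# A "table" here is just a list of rows keyed by name.

def save_with_mode(tables, name, rows, mode):
    exists = name in tables
    if mode == "overwrite":
        tables[name] = list(rows)            # replace any existing data
    elif mode == "append":
        tables.setdefault(name, []).extend(rows)
    elif mode == "ignore":
        if not exists:                       # no-op when the table exists
            tables[name] = list(rows)
    elif mode == "errorifexists":
        if exists:
            raise ValueError(f"table {name} already exists")
        tables[name] = list(rows)
    else:
        raise ValueError(f"unknown mode: {mode}")

tables = {"t": [1, 2]}
save_with_mode(tables, "t", [3], "append")      # t -> [1, 2, 3]
save_with_mode(tables, "t", [9], "ignore")      # table exists: unchanged
save_with_mode(tables, "t", [9], "overwrite")   # t -> [9]
```

The thread's real question is which of these behaviors is the implicit default when a user calls `.save()` without `.mode(...)`, and whether that default should depend on whether the underlying source is V1 or V2.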

Re: [VOTE] Apache Spark 3.0 RC2

2020-05-20 Thread Ryan Blue
Okay, I took a look at the PR and I think it should be okay. The new classes are unfortunately public, but are in catalyst which is considered private. So this is the approach we discussed. I'm fine with the commit, other than the fact that it violated ASF norms

Re: [VOTE] Apache Spark 3.0 RC2

2020-05-20 Thread Ryan Blue
Why was https://github.com/apache/spark/pull/28523 merged with a -1? We discussed this months ago and concluded that it was a bad idea to introduce a new v2 API that cannot have reliable behavior across sources. The last time I checked that PR, the approach I discussed with Tathagata was to not

Re: [VOTE] Apache Spark 3.0 RC2

2020-05-20 Thread Wenchen Fan
It seems the priority of SPARK-31706 was marked incorrectly; it is actually a blocker. The fix was merged just a few hours ago. This should be a -1 for RC2.

Re: [VOTE] Apache Spark 3.0 RC2

2020-05-20 Thread rickestcode
+1