Re: [DISCUSS] Disable streaming query with possible correctness issue by default

2020-11-11 Thread Liang-Chi Hsieh


Thanks all for the responses!

Based on these responses, I think we can go forward with the PR. I will put
the new config in the migration guide. Please help review the PR if you have
more comments.
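
As a rough sketch of what opting out could look like from PySpark: this thread does not name the new config, so the key below is an assumption (it is the key I believe the Spark 3.1 Structured Streaming migration guide documents), and the helper name is hypothetical. Please verify both against the PR before relying on them.

```python
# Assumed config key; NOT stated anywhere in this thread.
CHECK_KEY = "spark.sql.streaming.statefulOperator.checkCorrectness.enabled"


def allow_chained_stateful_queries(spark):
    """Disable the correctness check on a session (hypothetical helper).

    `spark` is anything exposing `conf.set(key, value)`, such as a
    pyspark.sql.SparkSession. Only call this once the risk of late rows
    being dropped by downstream stateful operators is understood.
    """
    spark.conf.set(CHECK_KEY, "false")


# Usage with a real session (requires pyspark, so left as a comment):
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.getOrCreate()
#   allow_chained_stateful_queries(spark)
```

Leaving the check enabled keeps the proposed default, where such queries are rejected instead of silently running.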

Thank you!


Yuanjian Li wrote
> Already +1 in the PR. It would be great to mention the new config in the
> SS migration guide.
>
> Ryan Blue wrote on Wed, Nov 11, 2020 at 7:48 AM:
>
>> +1, I agree with Tom.
>>
>> On Tue, Nov 10, 2020 at 3:00 PM Dongjoon Hyun wrote:
>>
>>> +1 for Apache Spark 3.1.0.
>>>
>>> Bests,
>>> Dongjoon.
>>>
>>> On Tue, Nov 10, 2020 at 6:17 AM Tom Graves wrote:
>>>
>>>> +1. Since it's a correctness issue, I think it's OK to change the
>>>> behavior to make sure the user is aware of it and let them decide.
>>>>
>>>> Tom
>>>>
>>>> On Saturday, November 7, 2020, 01:00:11 AM CST, Liang-Chi Hsieh wrote:
>>>>
>>>>> Hi devs,
>>>>>
>>>>> In Spark Structured Streaming, chained stateful operators can produce
>>>>> incorrect results under the global watermark. SPARK-33259
>>>>> (https://issues.apache.org/jira/browse/SPARK-33259) has an example
>>>>> demonstrating what the correctness issue could be.
>>>>>
>>>>> Currently we don't prevent users from running such queries. Because
>>>>> the possible correctness issue in chained stateful operators is not
>>>>> obvious to users, from their perspective it will likely be considered
>>>>> a Spark bug, as in SPARK-33259. In the worse case, users are not aware
>>>>> of the correctness issue and use wrong results.
>>>>>
>>>>> IMO, it is better to disable such queries and let users choose to run
>>>>> the query if they understand the risk, instead of implicitly running
>>>>> the query and letting users find out about the correctness issue by
>>>>> themselves.
>>>>>
>>>>> I would like to propose disabling streaming queries with a possible
>>>>> correctness issue in chained stateful operators. The behavior can be
>>>>> controlled by a SQL config, so if users understand the risk and still
>>>>> want to run the query, they can disable the check.
>>>>>
>>>>> In the PR (https://github.com/apache/spark/pull/30210), the concern I
>>>>> have received so far is that this changes current behavior and by
>>>>> default will break some existing streaming queries. But I think it is
>>>>> pretty easy to disable the check with the new config. In the PR there
>>>>> is currently no objection, only a suggestion to hear more voices.
>>>>> Please let me know if you have any thoughts.
>>>>>
>>>>> Thanks.
>>>>> Liang-Chi Hsieh
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix





--
Sent from: http://apache-spark-developers-list.1001551.n3.nabble.com/

-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org



Re: [DISCUSS] Disable streaming query with possible correctness issue by default

2020-11-11 Thread Yuanjian Li
Already +1 in the PR. It would be great to mention the new config in the SS
migration guide.



Re: [DISCUSS] Disable streaming query with possible correctness issue by default

2020-11-10 Thread Ryan Blue
+1, I agree with Tom.


-- 
Ryan Blue
Software Engineer
Netflix


Re: [DISCUSS] Disable streaming query with possible correctness issue by default

2020-11-10 Thread Dongjoon Hyun
+1 for Apache Spark 3.1.0.

Bests,
Dongjoon.



Re: [DISCUSS] Disable streaming query with possible correctness issue by default

2020-11-10 Thread Tom Graves
+1. Since it's a correctness issue, I think it's OK to change the behavior to
make sure the user is aware of it and let them decide.

Tom

Re: [DISCUSS] Disable streaming query with possible correctness issue by default

2020-11-08 Thread Jungtaek Lim
After the check logic was introduced in Spark 3.0, there's another related
issue I addressed in Spark 3.1, SPARK-24634 [1].

Before SPARK-24634, there was no way to know how many rows were discarded
for being late, or even whether there were any late rows at all. In other
words, this has been a correctness issue "silently" impacting the
result. SPARK-24634 will provide the overall number of late rows in the
streaming listener, as well as the number of late rows "per operator" in
the SQL UI graph. So end users are no longer "blindly" impacted.
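
A minimal sketch of reading such per-operator counts from PySpark, under two assumptions that this thread does not spell out: that the progress dict returned by `StreamingQuery.lastProgress` has a `stateOperators` list, and that each entry carries a `numRowsDroppedByWatermark` field. The helper name is hypothetical.

```python
def late_rows_per_operator(progress):
    """Return the late-row count of each stateful operator in one progress dict.

    `progress` is expected to look like PySpark's StreamingQuery.lastProgress.
    The field name `numRowsDroppedByWatermark` is an assumption; operators
    without it are reported as 0.
    """
    if not progress:
        return []
    return [
        op.get("numRowsDroppedByWatermark", 0)
        for op in progress.get("stateOperators", [])
    ]


# Usage with a running query (requires pyspark, so left as a comment):
#   counts = late_rows_per_operator(query.lastProgress)
#   if any(counts):
#       print("late rows dropped per stateful operator:", counts)
```

A non-zero count on a downstream operator of a chained query is the symptom discussed in this thread.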

Even so, I'd agree that it's pretty hard to construct a query that avoids
correctness issues and still does chained stateful operations. I see two
separate JIRA issues reporting the same correctness behavior, meaning this
is already impacting end users' queries. (Even more end users may not have
noticed the impact, as SPARK-24634 isn't released yet.)

So overall I'm +1 to preventing such queries up front. This change would
possibly break some user queries, but I suspect those queries may already
suffer from the correctness issue without their owners having noticed.

For sure, a better approach would be dropping the global watermark and
implementing an operator-wise watermark properly. This is just a workaround,
but fixing the watermark would require major effort.
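
To make the global versus operator-wise distinction concrete, here is a toy model in plain Python. It is not Spark code and not how Spark implements watermarks. It chains two append-mode tumbling-window counts; every name in it (`WindowCount`, `run`, `BATCHES`) is hypothetical, and the "per-operator" variant simply lets the downstream operator derive its watermark from its own input, which is an assumption made for illustration.

```python
from collections import defaultdict


class WindowCount:
    """Append-mode tumbling-window sum over (timestamp, count) rows."""

    def __init__(self, size):
        self.size = size
        self.state = defaultdict(int)  # window start -> running sum
        self.dropped = 0               # rows discarded as late

    def process(self, rows, watermark):
        for ts, n in rows:
            if ts < watermark:         # late row: discarded
                self.dropped += 1
                continue
            self.state[ts - ts % self.size] += n
        # Emit and evict every window whose end the watermark has passed.
        closed = sorted(w for w in self.state if w + self.size <= watermark)
        return [(w, self.state.pop(w)) for w in closed]


def run(batches, delay, per_operator):
    """Feed micro-batches through two chained aggregations.

    Returns (rows emitted by the second aggregation, rows it dropped as late).
    """
    first, second = WindowCount(10), WindowCount(20)
    results = []
    wm1 = wm2 = 0    # watermark seen by the first / second operator
    max1 = max2 = 0  # max event time observed by each operator
    for batch in batches:
        out1 = first.process([(ts, 1) for ts in batch], wm1)
        # Global watermark: the second operator reuses the first one's.
        results += second.process(out1, wm2 if per_operator else wm1)
        # Watermarks for the next batch never move backwards.
        max1 = max([max1] + batch)
        max2 = max([max2] + [ts for ts, _ in out1])
        wm1 = max(wm1, max1 - delay)
        wm2 = max(wm2, max2 - delay)
    return results, second.dropped


BATCHES = [[1, 2, 3], [12, 13], [25], [41], [55], [70]]
print(run(BATCHES, delay=0, per_operator=False))  # ([], 4)
print(run(BATCHES, delay=0, per_operator=True))   # ([(0, 5)], 0)
```

Under the shared watermark, every row the first aggregation emits is already older than the watermark that closed its window, so the second aggregation discards it as late. With its own watermark, the second aggregation keeps those rows and eventually emits the combined count.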

Thanks,
Jungtaek Lim (HeartSaVioR)

1. https://issues.apache.org/jira/browse/SPARK-24634

