Re: [DISCUSS] Flip the default value of Kafka offset fetching config (spark.sql.streaming.kafka.useDeprecatedOffsetFetching)

2022-10-14 Thread Dongjoon Hyun
+1

I agree with Jungtaek and Gabor on switching the default value of the
configuration along with the migration guide.

Dongjoon

On Thu, Oct 13, 2022 at 12:46 AM Gabor Somogyi 
wrote:

> Hi Jungtaek,
>
> Good to hear that the new approach is working fine. +1 from my side.
>
> BR,
> G
>
>
> On Thu, Oct 13, 2022 at 4:12 AM Jungtaek Lim 
> wrote:
>
>> Hi all,
>>
>> I would like to propose flipping the default value of the Kafka offset
>> fetching config. The context is as follows:
>>
>> Before Spark 3.1, there was only one approach to fetching offsets, using
>> consumer.poll(0). This has been pointed out as a root cause of hangs, since
>> there is no timeout on the metadata fetch.
>>
>> In Spark 3.1, we addressed this by introducing a new approach to fetching
>> offsets, via SPARK-32032. Since the new approach leverages AdminClient and
>> a consumer group is no longer needed for fetching offsets, the required
>> security ACLs are loosened.
>>
>> Reference:
>> https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#offset-fetching
>>
>> There was some concern about the behavioral change in the security model,
>> hence we couldn't make the new approach the default.
>>
>> Since then, we have observed various Kafka connector related issues that
>> came from the old offset fetching (e.g. hangs, issues on rebalance of the
>> consumer group, etc.), and we fixed many of these issues by simply flipping
>> the config.
>>
>> Based on this, I would consider the current default value "incorrect". The
>> security-related behavioral change would be introduced inevitably (users
>> can set a topic-based ACL rule), but most people will benefit. IMHO this is
>> something we can deal with via a release/migration note.
>>
>> Would like to hear the voices on this.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>
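
[Editor's note] The config flip discussed above can be pinned explicitly at submit time, so that upgrading Spark does not silently change a job's offset-fetching behavior. A minimal sketch; the config name comes from the thread, while the application class and jar names are hypothetical:

```shell
# Explicitly select the offset-fetching path regardless of the release default.
# false -> new AdminClient-based fetching (SPARK-32032), no consumer group needed;
# true  -> deprecated consumer.poll(0) path (old security model, may hang).
./bin/spark-submit \
  --conf spark.sql.streaming.kafka.useDeprecatedOffsetFetching=false \
  --class com.example.StreamingJob \
  my-streaming-job.jar
```

Jobs relying on consumer-group ACLs would set `true` until their topic-based ACL rules are in place.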


Re: Enforcing scalafmt on Spark Connect - connector/connect

2022-10-14 Thread Martin Grund
I have prepared the following pull request that enforces scalafmt on the
Spark Connect module. Please feel free to have a look and leave comments.
If we reach consensus on the decision, I will also push a PR for the
previously mentioned website to add a small note on using `dev/lint-scala`
for local style checks.

Thanks
Martin



Re: Enforcing scalafmt on Spark Connect - connector/connect

2022-10-14 Thread Yikun Jiang
+1, I also think it's a good idea.

BTW, we might also consider adding some notes about `lint-scala` in [1],
just like `lint-python` in PySpark [2].

[1] https://spark.apache.org/developer-tools.html
[2]
https://spark.apache.org/docs/latest/api/python/development/contributing.html


Regards,
Yikun




Re: Enforcing scalafmt on Spark Connect - connector/connect

2022-10-14 Thread Hyukjin Kwon
I personally like this idea. At least we now do this in PySpark, and it's
pretty nice that you can just forget about formatting it manually by
yourself.



Enforcing scalafmt on Spark Connect - connector/connect

2022-10-14 Thread Martin Grund
Hi folks,

I'm reaching out to gather input / consensus on the following proposal:
since Spark Connect is effectively new code, I would like to enforce
scalafmt explicitly *only* on this module by adding a check in
`dev/lint-scala` that fails if there is a diff after running

 ./build/mvn -Pscala-2.12 scalafmt:format -Dscalafmt.skip=false -pl
connector/connect

I know that enforcing scalafmt is not desirable on the existing code base,
but since the Spark Connect code is very new, I think it might reduce
friction in code reviews and create a consistent style.

In my previous code reviews where I applied scalafmt, I've received
feedback that scalafmt changes the import grouping differently from our
default style. I've prepared the PR
https://github.com/apache/spark/pull/38252 to address this issue by
explicitly setting the import grouping in the scalafmt options.

Would you be supportive of enforcing scalafmt *only* on the Spark Connect
module?

Thanks
Martin
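

[Editor's note] The check proposed above can be sketched as a small script: run the formatter on the module, then fail if the working tree changed. This is assumed logic only; the actual `dev/lint-scala` implementation may differ:

```shell
#!/usr/bin/env bash
set -e

# Reformat only the Spark Connect module (command taken from the proposal).
./build/mvn -Pscala-2.12 scalafmt:format -Dscalafmt.skip=false -pl connector/connect

# If scalafmt rewrote anything, the module is not style-clean: surface the
# changed files and fail, so CI and local runs behave the same way.
if ! git diff --quiet -- connector/connect; then
  echo "scalafmt check failed; run the command above and commit the diff:" >&2
  git diff --stat -- connector/connect >&2
  exit 1
fi
echo "scalafmt check passed."
```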