Re: [DISCUSS] Migration guide on upgrading Kafka to 3.1 in Spark 3.3

2022-03-24 Thread Gabor Somogyi
I've had a quick talk with the Kafka folks to find out a bit more, and the
oversimplified conclusion is: if the producer version is >= 3.0 and the
broker version is < 0.11.0 (message format version V1), then either
`enable.idempotence = false` must be set or the broker must be upgraded to
0.11.0+ to make it work.
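
To make this concrete, here is a minimal sketch of the workaround on the
Spark side. The broker address, topic names and checkpoint path are made up;
the pass-through relies on the connector forwarding any `kafka.`-prefixed
option to the underlying Kafka client:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-sink-sketch").getOrCreate()

// The Kafka 3.x client enables idempotence by default (KIP-679); force it
// off when the target brokers are pre-0.11.0 or still on message format V1.
spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")  // made-up address
  .option("subscribe", "input")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("kafka.enable.idempotence", "false")        // the workaround
  .option("topic", "output")
  .option("checkpointLocation", "/tmp/checkpoint")    // made-up path
  .start()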

There are also ACL-related changes, but there the error message makes it
quite obvious what needs to be done.

I'm not sure how useful the error message is in the broker case mentioned
above. If the message is obvious, we may not need to add any special
explanation.

AFAIK users are not really running pre-0.11 brokers anymore, since that is an
ancient version, so the problem surface may not be significant.

G


On Wed, Mar 23, 2022 at 11:32 PM Mich Talebzadeh 
wrote:

> Maybe I misunderstood this explanation.
>
> Agreed. Spark relies on Kafka, Google Pub/Sub or any other messaging
> system to deliver the related streaming data via a topic or topics and
> present it to Spark. At this stage, Spark does not care how this
> streaming data is produced; it relies on the appropriate API to read
> data from Kafka or from Google Pub/Sub. For example, if you are processing
> temperature readings, you construct a streaming DataFrame that subscribes
> to a topic, say temperature. As long as you have the correct jar files to
> interface with Kafka, and that takes care of compatibility, this should
> work. In reality Kafka will be running in its own container(s), plus the
> ZooKeeper containers if any. So as far as I can ascertain, the
> discussion is about deploying compatible APIs.
>
> HTH
>
> On Wed, 23 Mar 2022 at 20:12, Jungtaek Lim 
> wrote:
>
>> If it requires a Kafka broker update, we should not simply bump the
>> version of the Kafka client. Probably we should at least provide separate
>> artifacts.
>>
>> We should not enforce an upgrade of another component just because we want
>> to upgrade a dependency. At least that should not happen in a minor version
>> of Spark. Hopefully that’s not the case here.
>>
>> There’s a belief that Kafka client-broker compatibility goes both backwards
>> and forwards. That is true in many cases (it’s something Kafka excels at),
>> but there seem to be cases where it does not hold for a specific
>> configuration and a specific broker setup, e.g. KIP-679.
>>
>> The less-compatible config is turned on by default in 3.0 and will not
>> work correctly with that specific broker setup. So it is us who breaks
>> things for that setup, and my point is how much responsibility we should
>> take for guiding end users to avoid the frustration.
>>
>> On Wed, Mar 23, 2022 at 9:41 PM, Sean Owen wrote:
>>
>>> Well, yes, but if it requires a Kafka server-side update, then it does,
>>> and that is out of scope for us to document.
>>> It is important that we document whether and how (if we know) the client
>>> update will impact existing Kafka installations (does it require a
>>> server-side update or not?), and of course document the change itself
>>> along with any Spark-side migration notes.
>>>
>>> On Fri, Mar 18, 2022 at 8:47 PM Jungtaek Lim <
>>> kabhwan.opensou...@gmail.com> wrote:
>>>
 The thing is, it is “us” who upgrade the Kafka client and make divergence
 between client and broker possible in end users’ production env.

 Someone can claim that end users can downgrade the kafka-clients artifact
 when building their app so that the versions match, but we don’t test the
 Kafka connector against a downgraded kafka-clients version at all. That
 sounds to me like we defer our work to end users.
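
For illustration, the downgrade referred to would look something like this in
an application's build.sbt (version numbers are examples only, and this
pinned combination is exactly what goes untested):

// build.sbt -- illustrative sketch only
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.3.0"
// Force the transitive kafka-clients back to a version matching old brokers:
dependencyOverrides += "org.apache.kafka" % "kafka-clients" % "2.8.1"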

 It sounds to me that “someone” should refer to us, and then it is no longer
 a matter of “help” but a matter of “responsibility”, as you said.

 On Fri, Mar 18, 2022 at 10:15 PM, Sean Owen wrote:

> I think we can assume that someone upgrading Kafka will be responsible
> for thinking through the breaking changes. We can help by listing anything
> we know could affect Spark-Kafka usage and calling those out in a release
> note, for sure. I don't think we need to get into items that would affect
> Kafka usage itself; focus on the connector-related issues.
>
> On Fri, Mar 18, 2022 at 5:15 AM Jungtaek Lim <
> kabhwan.opensou...@gmail.com> wrote:
>
>> CORRECTION: in option 2, we enumerate KIPs which may bring
>> incompatibility with older brokers (not all KIPs).
>>

Probable bug in async commit of Kafka offset in DirectKafkaInputDStream

2022-03-24 Thread Paul, Souvik
Hi Dev,

I added a few debug statements at the following lines and found a few issues.

1. At line 254, inside `override def compute(validTime: Time): Option[KafkaRDD[K, V]]`
in DirectKafkaInputDStream.scala:

System.out.print("Called commitAll at time " + validTime + " " +
  commitQueue.toArray.mkString("Array(", ", ", ")") + "\n")

2. At line 454, inside test("offset recovery from kafka") in
DirectKafkaStreamSuite.scala:

print("Called commitAsync at " + time + " " +
  offsets.mkString("Array(", ", ", ")") + "\n")


This shows that the commitAll call is not handled properly: since it is
called inside the compute function, there is a chance that for the last RDD
we miss the final offset commit. In the log below, the commit for range
8 -> 10 is never performed.
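
For context, here is a simplified, paraphrased sketch of the mechanism (not
the verbatim Spark source): commitAsync only enqueues ranges, and the real
commit happens in commitAll, which is only invoked from compute.

import java.{util => ju}
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.atomic.AtomicReference
import org.apache.kafka.clients.consumer.{Consumer, OffsetAndMetadata, OffsetCommitCallback}
import org.apache.kafka.common.TopicPartition
import org.apache.spark.streaming.kafka010.OffsetRange

class CommitSketch(consumer: Consumer[_, _]) {
  private val commitQueue = new ConcurrentLinkedQueue[OffsetRange]
  private val commitCallback = new AtomicReference[OffsetCommitCallback]

  // The public commitAsync only *enqueues* the ranges; nothing is sent yet.
  def commitAsync(offsetRanges: Array[OffsetRange], cb: OffsetCommitCallback): Unit = {
    commitCallback.set(cb)
    commitQueue.addAll(ju.Arrays.asList(offsetRanges: _*))
  }

  // The real commit: drain the queue and call the consumer once. In Spark
  // this is invoked from compute(validTime), i.e. only when the *next*
  // batch is generated.
  def commitAll(): Unit = {
    val m = new ju.HashMap[TopicPartition, OffsetAndMetadata]()
    var osr = commitQueue.poll()
    while (osr != null) {
      val tp = new TopicPartition(osr.topic, osr.partition)
      val prev = m.get(tp)
      val next = if (prev == null) osr.untilOffset else math.max(prev.offset, osr.untilOffset)
      m.put(tp, new OffsetAndMetadata(next))
      osr = commitQueue.poll()
    }
    if (!m.isEmpty) consumer.commitAsync(m, commitCallback.get)
  }
}
// Because commitAll() only runs inside compute(), a range enqueued after the
// stream's final compute() call (here [8 -> 10]) is never committed.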

Can someone confirm if this is a design choice or a bug?

The resulting log looks like this:

Called commitAll at time 1645548063100 ms Array()
Called commitAll at time 1645548063200 ms Array()
Called commitAll at time 1645548063300 ms Array()
Called commitAll at time 1645548063400 ms Array()
Called commitAll at time 1645548063500 ms Array()
Called commitAll at time 1645548063600 ms Array()
Called commitAll at time 1645548063700 ms Array()
Called commitAll at time 1645548063800 ms Array()
Called commitAll at time 1645548063900 ms Array()
Called commitAll at time 1645548064000 ms Array()
Called commitAll at time 1645548064100 ms Array()
Called commitAll at time 1645548064200 ms Array()
Called commitAsync at 1645548063100 ms Array(OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [0 -> 4]))
Called commitAsync at 1645548063200 ms Array(OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [4 -> 4]))
Called commitAll at time 1645548064300 ms Array(OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [0 -> 4]), OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [4 -> 4]))
Called commitAsync at 1645548063300 ms Array(OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [4 -> 4]))
Called commitAsync at 1645548063400 ms Array(OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [4 -> 4]))
Called commitAsync at 1645548063500 ms Array(OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [4 -> 4]))
Called commitAll at time 1645548064400 ms Array(OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [4 -> 4]), OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [4 -> 4]), OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [4 -> 4]))
Called commitAsync at 1645548063600 ms Array(OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [4 -> 4]))
Called commitAsync at 1645548063700 ms Array(OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [4 -> 4]))
Called commitAsync at 1645548063800 ms Array(OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [4 -> 4]))
Called commitAsync at 1645548063900 ms Array(OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [4 -> 4]))
Called commitAll at time 1645548064500 ms Array(OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [4 -> 4]), OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [4 -> 4]), OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [4 -> 4]), OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [4 -> 4]))
Called commitAsync at 1645548064000 ms Array(OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [4 -> 4]))
Called commitAsync at 1645548064100 ms Array(OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [4 -> 4]))
Called commitAsync at 1645548064200 ms Array(OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [4 -> 4]))
Called commitAsync at 1645548064300 ms Array(OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [4 -> 4]))
Called commitAll at time 1645548064600 ms Array(OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [4 -> 4]), OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [4 -> 4]), OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [4 -> 4]), OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [4 -> 4]))
Called commitAsync at 1645548064400 ms Array(OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [4 -> 8]))
Called commitAsync at 1645548064500 ms Array(OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [8 -> 8]))
Called commitAsync at 1645548064600 ms Array(OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [8 -> 8]))
Called commitAll at time 1645548064700 ms Array(OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [4 -> 8]), OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [8 -> 8]), OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [8 -> 8]))
Called commitAsync at 1645548064700 ms Array(OffsetRange(topic: 
'recoveryfromkafka', partition: 0, range: [8 -> 10]))

Regards,

Souvik Paul
GitHub: @paulsouri




Re: Tools for regression testing

2022-03-24 Thread Bjørn Jørgensen
Yes, Spark uses unit tests.

https://app.codecov.io/gh/apache/spark

https://en.wikipedia.org/wiki/Unit_testing



On Mon, Mar 21, 2022 at 15:46, Mich Talebzadeh <
mich.talebza...@gmail.com> wrote:

> Hi,
>
> As a matter of interest do Spark releases deploy a specific regression
> testing tool?
>
> Thanks
>


-- 
Bjørn Jørgensen
Vestre Aspehaug 4, 6010 Ålesund
Norge

+47 480 94 297


Re: Tools for regression testing

2022-03-24 Thread Mich Talebzadeh
Thanks

I know what unit testing is. The question was not about unit testing; it
was specifically about regression testing artifacts.


Cheers,


Mich


On Thu, 24 Mar 2022 at 19:02, Bjørn Jørgensen 
wrote:

> Yes, Spark uses unit tests.
>
> https://app.codecov.io/gh/apache/spark
>
> https://en.wikipedia.org/wiki/Unit_testing


Re: Tools for regression testing

2022-03-24 Thread Sean Owen
Hm, then what are you looking for besides all the tests in Spark?

On Thu, Mar 24, 2022, 2:34 PM Mich Talebzadeh 
wrote:

> Thanks
>
> I know what unit testing is. The question was not about unit testing; it
> was specifically about regression testing artifacts.


Re: Tools for regression testing

2022-03-24 Thread Mich Talebzadeh
Good point.

I just wanted to know: when we make changes to a release or an RC, is there
some mechanism that ensures the Spark release still functions as expected
after any code changes, updates, etc.?

For example, there was a recent discussion about the Kafka upgrade to 3.x
alongside the Spark upgrade to 3.x and its likely impact. Integration
testing can be achieved through CI/CD, for which I believe Spark relied on
Jenkins until recently.

HTH




On Thu, 24 Mar 2022 at 19:38, Sean Owen  wrote:

> Hm, then what are you looking for besides all the tests in Spark?


Re: Tools for regression testing

2022-03-24 Thread Bjørn Jørgensen
From the Wikipedia regression testing page
(https://en.wikipedia.org/wiki/Regression_testing), under "Use":
"
Regression tests can be broadly categorized as functional tests
 or unit tests
. Functional tests exercise the
complete program with various inputs. Unit tests exercise individual
functions, subroutines , or
object methods. Both functional testing tools and unit-testing tools tend
to be automated and are often third-party products that are not part of the
compiler suite. A functional test may be a scripted series of program
inputs, possibly even involving an automated mechanism for controlling
mouse movements and clicks. A unit test may be a set of separate functions
within the code itself or a driver layer that links to the code without
altering the code being tested.
"

When you change or add anything in Spark, you fork the git repo, make a new
branch, make your changes, and push the branch to your forked repo.

"Push commits to your branch. This will trigger “Build and test” and
“Report test results” workflows on your forked repository and start testing
and validating your changes."
This is step 7 of "Contributing to Spark", under "Pull Request".


So to answer your question: yes, every change is tested before the change
(PR) goes to the master branch. We don't have unit tests for everything,
though, so some things have to be tested manually after building.
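
To illustrate the convention: within Spark's suites a regression test is
just an ordinary unit test named after the JIRA issue it guards, so a future
change that reintroduces the bug fails the "Build and test" workflow. An
invented example (not actual Spark code):

import org.scalatest.funsuite.AnyFunSuite

class ExampleRegressionSuite extends AnyFunSuite {
  // The test name records the ticket whose fix it pins down.
  test("SPARK-00000: halving must round toward negative infinity") {
    def halve(x: Int): Int = math.floorDiv(x, 2) // hypothetical helper under test
    assert(halve(-3) == -2) // the once-buggy case, now pinned
  }
}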




On Thu, Mar 24, 2022 at 20:47, Mich Talebzadeh <
mich.talebza...@gmail.com> wrote:

> Good point.
>
> I just wanted to know: when we make changes to a release or an RC, is
> there some mechanism that ensures the Spark release still functions as
> expected after any code changes, updates, etc.?
>
> For example, there was a recent discussion about the Kafka upgrade to 3.x
> alongside the Spark upgrade to 3.x and its likely impact. Integration
> testing can be achieved through CI/CD, for which I believe Spark relied
> on Jenkins until recently.
>
> HTH



Final recap: SPIP: Support Customized Kubernetes Scheduler

2022-03-24 Thread Yikun Jiang
Last month, on Feb 24, 2022, I synced some progress on "Support Customized
Kubernetes Scheduler" [1].

Another month has passed and, with the cut of the 3.3 branch, there are also
some changes to the SPIP that I'd like to share here.

TL;DR: the changes below were merged in the last month, before the 3.3
branch cut:

1. [Common] SPARK-38383:
Support APP_ID and EXECUTOR_ID placeholders in annotations.
This will help YuniKorn set the app-id annotation.
Thanks @dongjoon-hyun @weiweiyang

2. [Common] SPARK-38561:
Add docs for customized Kubernetes schedulers.

3. [Volcano] SPARK-38455:
Introduce a new configuration,
spark.kubernetes.scheduler.volcano.podGroupTemplateFile, to replace the
original configuration design spark.kubernetes.job.[queue|minRes|priority]
(see the sketch at the end of this note).
Thanks @dongjoon-hyun

4. [Volcano] Add queue/priority/resource reservation (gang) scheduling
integration tests:
Queue scheduling: SPARK-38188
Priority scheduling: SPARK-38423
Resource reservation (gang): SPARK-38187

5. [Volcano] Add doc for the Volcano scheduler: SPARK-38562

6. The Volcano community is adding a Spark + Volcano integration test:
https://github.com/volcano-sh/volcano/pull/2113

I also completed a new slide deck as a final recap, to help you understand
the above and everything we have done:
https://docs.google.com/presentation/d/1itSii7C4gkLhsTwO9aWHLbqSVgxEJkv1zcowS_8ynJc
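
Putting items 1 and 3 together, here is a minimal sketch of how an
application might opt into a customized scheduler after these changes. The
config keys follow the discussion above; the app name, template path, and
annotation key are illustrative only:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("volcano-demo")
  // Select the customized scheduler:
  .config("spark.kubernetes.scheduler.name", "volcano")
  // SPARK-38455: per-job queue/minRes/priority replaced by a PodGroup template:
  .config("spark.kubernetes.scheduler.volcano.podGroupTemplateFile",
    "/path/to/podgroup-template.yaml")
  // SPARK-38383: {{APP_ID}} is substituted into annotations at submit time
  // (the annotation key below is made up):
  .config("spark.kubernetes.driver.annotation.example.com/app-id", "{{APP_ID}}")
  .getOrCreate()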