subject:"Re\: Change parallelism number in Spark Streaming"

Re: Change parallelism number in Spark Streaming

2019-06-27 Thread Jungtaek Lim

Great, thanks! Even better if you could share the slide as well (and if
possible video too), since it would be helpful for other users to
understand about the details.

Thanks again,
Jungtaek Lim (HeartSaVioR)

On Thu, Jun 27, 2019 at 7:33 PM Jacek Laskowski  wrote:

> Hi,
>
> I've got a talk "The internals of stateful stream processing in Spark
> Structured Streaming" at https://dataxday.fr/ today and am going to
> include the tool on the slides to thank you for the work. Thanks.
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://about.me/JacekLaskowski
> The Internals of Spark SQL https://bit.ly/spark-sql-internals
> The Internals of Spark Structured Streaming
> https://bit.ly/spark-structured-streaming
> The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
> Follow me at https://twitter.com/jaceklaskowski
>
>
>
> On Thu, Jun 27, 2019 at 3:32 AM Jungtaek Lim  wrote:
>
>> Glad to help, Jacek.
>>
>> I'm happy you're doing similar thing, which means it could be pretty
>> useful for others as well. Looks like it might be good enough to contribute
>> state source and sink. I'll sort out my code and submit a PR.
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>>
>> On Thu, Jun 27, 2019 at 7:54 AM Jacek Laskowski  wrote:
>>
>>> Hi Jungtaek,
>>>
>>> That's very helpful to have the state source. As a matter of fact I've
>>> just this week been working on a similar tool (!) and have been wondering
>>> how to recreate the schema of the state key and value. You've helped me a
>>> lot. Thanks.
>>>
>>> Jacek
>>>
>>> On Wed, 26 Jun 2019, 23:58 Jungtaek Lim,  wrote:
>>>
>>>> Hi,
>>>>
>>>> you could consider state operator's partition numbers as "max
>>>> parallelism", as parallelism can be reduced via applying coalesce. It would
>>>> be effectively working similar as key groups.
>>>>
>>>> If you're also considering offline query, there's a tool to manipulate
>>>> state which enables reading and writing state in structured streaming,
>>>> achieving rescaling and schema evolution.
>>>>
>>>> https://github.com/HeartSaVioR/spark-state-tools
>>>> (DISCLAIMER: I'm an author of this tool.)
>>>>
>>>> Thanks,
>>>> Jungtaek Lim (HeartSaVioR)
>>>>
>>>> On Thu, Jun 27, 2019 at 4:48 AM Rong, Jialei 
>>>> wrote:
>>>>
>>>>> Thank you for your quick reply!
>>>>>
>>>>> Is there any plan to improve this?
>>>>>
>>>>> I asked this question due to some investigation on comparing those
>>>>> state of art streaming systems, among which Flink and DataFlow allow
>>>>> changing parallelism number, and by my knowledge of Spark Streaming, it
>>>>> seems it is also able to do that: if some “key interval” concept is used,
>>>>> then state can somehow decoupled from partition number by consistent
>>>>> hashing.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Regards
>>>>>
>>>>> Jialei
>>>>>
>>>>>
>>>>>
>>>>> *From: *Jacek Laskowski 
>>>>> *Date: *Wednesday, June 26, 2019 at 11:00 AM
>>>>> *To: *"Rong, Jialei" 
>>>>> *Cc: *"user @spark" 
>>>>> *Subject: *Re: Change parallelism number in Spark Streaming
>>>>>
>>>>>
>>>>>
>>>>> Hi,
>>>>>
>>>>>
>>>>>
>>>>> It's not allowed to change the numer of partitions after your
>>>>> streaming query is started.
>>>>>
>>>>>
>>>>>
>>>>> The reason is exactly the number of state stores which is exactly the
>>>>> number of partitions (perhaps multiplied by the number of stateful
>>>>> operators).
>>>>>
>>>>>
>>>>>
>>>>> I think you'll even get a warning or an exception when you change it
>>>>> after restarting the query.
>>>>>
>>>>>
>>>>>
>>>>> The number of partitions is stored in a checkpoint location.
>>>>>
>>>>>
>>>>>
>>>>> Jacek
>>>>>
>>>>>
>>>>>
>>>>> On Wed, 26 Jun 2019, 19:30 Rong, Jialei, 
>>>>> wrote:
>>>>>
>>>>> Hi Dear Spark Expert
>>>>>
>>>>>
>>>>>
>>>>> I’m curious about a question regarding Spark Streaming/Structured
>>>>> Streaming: whether it allows to change parallelism number(the default one
>>>>> or the one specified in particular operator) in a stream having stateful
>>>>> transform/operator? Whether this will cause my checkpointed state get
>>>>> messed up?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Regards
>>>>>
>>>>> Jialei
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> Name : Jungtaek Lim
>>>> Blog : http://medium.com/@heartsavior
>>>> Twitter : http://twitter.com/heartsavior
>>>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>>>
>>>
>>
>> --
>> Name : Jungtaek Lim
>> Blog : http://medium.com/@heartsavior
>> Twitter : http://twitter.com/heartsavior
>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>
>

-- 
Name : Jungtaek Lim
Blog : http://medium.com/@heartsavior
Twitter : http://twitter.com/heartsavior
LinkedIn : http://www.linkedin.com/in/heartsavior

Re: Change parallelism number in Spark Streaming

2019-06-27 Thread Jacek Laskowski

Hi,

I've got a talk "The internals of stateful stream processing in Spark
Structured Streaming" at https://dataxday.fr/ today and am going to include
the tool on the slides to thank you for the work. Thanks.

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
The Internals of Spark SQL https://bit.ly/spark-sql-internals
The Internals of Spark Structured Streaming
https://bit.ly/spark-structured-streaming
The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
Follow me at https://twitter.com/jaceklaskowski



On Thu, Jun 27, 2019 at 3:32 AM Jungtaek Lim  wrote:

> Glad to help, Jacek.
>
> I'm happy you're doing similar thing, which means it could be pretty
> useful for others as well. Looks like it might be good enough to contribute
> state source and sink. I'll sort out my code and submit a PR.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
>
> On Thu, Jun 27, 2019 at 7:54 AM Jacek Laskowski  wrote:
>
>> Hi Jungtaek,
>>
>> That's very helpful to have the state source. As a matter of fact I've
>> just this week been working on a similar tool (!) and have been wondering
>> how to recreate the schema of the state key and value. You've helped me a
>> lot. Thanks.
>>
>> Jacek
>>
>> On Wed, 26 Jun 2019, 23:58 Jungtaek Lim,  wrote:
>>
>>> Hi,
>>>
>>> you could consider state operator's partition numbers as "max
>>> parallelism", as parallelism can be reduced via applying coalesce. It would
>>> be effectively working similar as key groups.
>>>
>>> If you're also considering offline query, there's a tool to manipulate
>>> state which enables reading and writing state in structured streaming,
>>> achieving rescaling and schema evolution.
>>>
>>> https://github.com/HeartSaVioR/spark-state-tools
>>> (DISCLAIMER: I'm an author of this tool.)
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> On Thu, Jun 27, 2019 at 4:48 AM Rong, Jialei 
>>> wrote:
>>>
>>>> Thank you for your quick reply!
>>>>
>>>> Is there any plan to improve this?
>>>>
>>>> I asked this question due to some investigation on comparing those
>>>> state of art streaming systems, among which Flink and DataFlow allow
>>>> changing parallelism number, and by my knowledge of Spark Streaming, it
>>>> seems it is also able to do that: if some “key interval” concept is used,
>>>> then state can somehow decoupled from partition number by consistent
>>>> hashing.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Regards
>>>>
>>>> Jialei
>>>>
>>>>
>>>>
>>>> *From: *Jacek Laskowski 
>>>> *Date: *Wednesday, June 26, 2019 at 11:00 AM
>>>> *To: *"Rong, Jialei" 
>>>> *Cc: *"user @spark" 
>>>> *Subject: *Re: Change parallelism number in Spark Streaming
>>>>
>>>>
>>>>
>>>> Hi,
>>>>
>>>>
>>>>
>>>> It's not allowed to change the numer of partitions after your streaming
>>>> query is started.
>>>>
>>>>
>>>>
>>>> The reason is exactly the number of state stores which is exactly the
>>>> number of partitions (perhaps multiplied by the number of stateful
>>>> operators).
>>>>
>>>>
>>>>
>>>> I think you'll even get a warning or an exception when you change it
>>>> after restarting the query.
>>>>
>>>>
>>>>
>>>> The number of partitions is stored in a checkpoint location.
>>>>
>>>>
>>>>
>>>> Jacek
>>>>
>>>>
>>>>
>>>> On Wed, 26 Jun 2019, 19:30 Rong, Jialei, 
>>>> wrote:
>>>>
>>>> Hi Dear Spark Expert
>>>>
>>>>
>>>>
>>>> I’m curious about a question regarding Spark Streaming/Structured
>>>> Streaming: whether it allows to change parallelism number(the default one
>>>> or the one specified in particular operator) in a stream having stateful
>>>> transform/operator? Whether this will cause my checkpointed state get
>>>> messed up?
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Regards
>>>>
>>>> Jialei
>>>>
>>>>
>>>>
>>>>
>>>
>>> --
>>> Name : Jungtaek Lim
>>> Blog : http://medium.com/@heartsavior
>>> Twitter : http://twitter.com/heartsavior
>>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>>
>>
>
> --
> Name : Jungtaek Lim
> Blog : http://medium.com/@heartsavior
> Twitter : http://twitter.com/heartsavior
> LinkedIn : http://www.linkedin.com/in/heartsavior
>

Re: Change parallelism number in Spark Streaming

2019-06-26 Thread Jungtaek Lim

Glad to help, Jacek.

I'm happy you're doing similar thing, which means it could be pretty useful
for others as well. Looks like it might be good enough to contribute state
source and sink. I'll sort out my code and submit a PR.

Thanks,
Jungtaek Lim (HeartSaVioR)


On Thu, Jun 27, 2019 at 7:54 AM Jacek Laskowski  wrote:

> Hi Jungtaek,
>
> That's very helpful to have the state source. As a matter of fact I've
> just this week been working on a similar tool (!) and have been wondering
> how to recreate the schema of the state key and value. You've helped me a
> lot. Thanks.
>
> Jacek
>
> On Wed, 26 Jun 2019, 23:58 Jungtaek Lim,  wrote:
>
>> Hi,
>>
>> you could consider state operator's partition numbers as "max
>> parallelism", as parallelism can be reduced via applying coalesce. It would
>> be effectively working similar as key groups.
>>
>> If you're also considering offline query, there's a tool to manipulate
>> state which enables reading and writing state in structured streaming,
>> achieving rescaling and schema evolution.
>>
>> https://github.com/HeartSaVioR/spark-state-tools
>> (DISCLAIMER: I'm an author of this tool.)
>>
>> Thanks,
>> Jungtaek Lim (HeartSaVioR)
>>
>> On Thu, Jun 27, 2019 at 4:48 AM Rong, Jialei 
>> wrote:
>>
>>> Thank you for your quick reply!
>>>
>>> Is there any plan to improve this?
>>>
>>> I asked this question due to some investigation on comparing those state
>>> of art streaming systems, among which Flink and DataFlow allow changing
>>> parallelism number, and by my knowledge of Spark Streaming, it seems it is
>>> also able to do that: if some “key interval” concept is used, then state
>>> can somehow decoupled from partition number by consistent hashing.
>>>
>>>
>>>
>>>
>>>
>>> Regards
>>>
>>> Jialei
>>>
>>>
>>>
>>> *From: *Jacek Laskowski 
>>> *Date: *Wednesday, June 26, 2019 at 11:00 AM
>>> *To: *"Rong, Jialei" 
>>> *Cc: *"user @spark" 
>>> *Subject: *Re: Change parallelism number in Spark Streaming
>>>
>>>
>>>
>>> Hi,
>>>
>>>
>>>
>>> It's not allowed to change the numer of partitions after your streaming
>>> query is started.
>>>
>>>
>>>
>>> The reason is exactly the number of state stores which is exactly the
>>> number of partitions (perhaps multiplied by the number of stateful
>>> operators).
>>>
>>>
>>>
>>> I think you'll even get a warning or an exception when you change it
>>> after restarting the query.
>>>
>>>
>>>
>>> The number of partitions is stored in a checkpoint location.
>>>
>>>
>>>
>>> Jacek
>>>
>>>
>>>
>>> On Wed, 26 Jun 2019, 19:30 Rong, Jialei, 
>>> wrote:
>>>
>>> Hi Dear Spark Expert
>>>
>>>
>>>
>>> I’m curious about a question regarding Spark Streaming/Structured
>>> Streaming: whether it allows to change parallelism number(the default one
>>> or the one specified in particular operator) in a stream having stateful
>>> transform/operator? Whether this will cause my checkpointed state get
>>> messed up?
>>>
>>>
>>>
>>>
>>>
>>> Regards
>>>
>>> Jialei
>>>
>>>
>>>
>>>
>>
>> --
>> Name : Jungtaek Lim
>> Blog : http://medium.com/@heartsavior
>> Twitter : http://twitter.com/heartsavior
>> LinkedIn : http://www.linkedin.com/in/heartsavior
>>
>

-- 
Name : Jungtaek Lim
Blog : http://medium.com/@heartsavior
Twitter : http://twitter.com/heartsavior
LinkedIn : http://www.linkedin.com/in/heartsavior

Re: Change parallelism number in Spark Streaming

2019-06-26 Thread Jacek Laskowski

Hi Jungtaek,

That's very helpful to have the state source. As a matter of fact I've just
this week been working on a similar tool (!) and have been wondering how to
recreate the schema of the state key and value. You've helped me a lot.
Thanks.

Jacek

On Wed, 26 Jun 2019, 23:58 Jungtaek Lim,  wrote:

> Hi,
>
> you could consider state operator's partition numbers as "max
> parallelism", as parallelism can be reduced via applying coalesce. It would
> be effectively working similar as key groups.
>
> If you're also considering offline query, there's a tool to manipulate
> state which enables reading and writing state in structured streaming,
> achieving rescaling and schema evolution.
>
> https://github.com/HeartSaVioR/spark-state-tools
> (DISCLAIMER: I'm an author of this tool.)
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Thu, Jun 27, 2019 at 4:48 AM Rong, Jialei 
> wrote:
>
>> Thank you for your quick reply!
>>
>> Is there any plan to improve this?
>>
>> I asked this question due to some investigation on comparing those state
>> of art streaming systems, among which Flink and DataFlow allow changing
>> parallelism number, and by my knowledge of Spark Streaming, it seems it is
>> also able to do that: if some “key interval” concept is used, then state
>> can somehow decoupled from partition number by consistent hashing.
>>
>>
>>
>>
>>
>> Regards
>>
>> Jialei
>>
>>
>>
>> *From: *Jacek Laskowski 
>> *Date: *Wednesday, June 26, 2019 at 11:00 AM
>> *To: *"Rong, Jialei" 
>> *Cc: *"user @spark" 
>> *Subject: *Re: Change parallelism number in Spark Streaming
>>
>>
>>
>> Hi,
>>
>>
>>
>> It's not allowed to change the numer of partitions after your streaming
>> query is started.
>>
>>
>>
>> The reason is exactly the number of state stores which is exactly the
>> number of partitions (perhaps multiplied by the number of stateful
>> operators).
>>
>>
>>
>> I think you'll even get a warning or an exception when you change it
>> after restarting the query.
>>
>>
>>
>> The number of partitions is stored in a checkpoint location.
>>
>>
>>
>> Jacek
>>
>>
>>
>> On Wed, 26 Jun 2019, 19:30 Rong, Jialei, 
>> wrote:
>>
>> Hi Dear Spark Expert
>>
>>
>>
>> I’m curious about a question regarding Spark Streaming/Structured
>> Streaming: whether it allows to change parallelism number(the default one
>> or the one specified in particular operator) in a stream having stateful
>> transform/operator? Whether this will cause my checkpointed state get
>> messed up?
>>
>>
>>
>>
>>
>> Regards
>>
>> Jialei
>>
>>
>>
>>
>
> --
> Name : Jungtaek Lim
> Blog : http://medium.com/@heartsavior
> Twitter : http://twitter.com/heartsavior
> LinkedIn : http://www.linkedin.com/in/heartsavior
>

Re: Change parallelism number in Spark Streaming

2019-06-26 Thread Jacek Laskowski

Hi,

No idea. I've just begun exploring the current state of state management in
spark structured streaming. I'd not be surprised if what you're after were
not possible. Stateful stream processing in SSS is fairly young.

Jacek

On Wed, 26 Jun 2019, 21:48 Rong, Jialei,  wrote:

> Thank you for your quick reply!
>
> Is there any plan to improve this?
>
> I asked this question due to some investigation on comparing those state
> of art streaming systems, among which Flink and DataFlow allow changing
> parallelism number, and by my knowledge of Spark Streaming, it seems it is
> also able to do that: if some “key interval” concept is used, then state
> can somehow decoupled from partition number by consistent hashing.
>
>
>
>
>
> Regards
>
> Jialei
>
>
>
> *From: *Jacek Laskowski 
> *Date: *Wednesday, June 26, 2019 at 11:00 AM
> *To: *"Rong, Jialei" 
> *Cc: *"user @spark" 
> *Subject: *Re: Change parallelism number in Spark Streaming
>
>
>
> Hi,
>
>
>
> It's not allowed to change the numer of partitions after your streaming
> query is started.
>
>
>
> The reason is exactly the number of state stores which is exactly the
> number of partitions (perhaps multiplied by the number of stateful
> operators).
>
>
>
> I think you'll even get a warning or an exception when you change it after
> restarting the query.
>
>
>
> The number of partitions is stored in a checkpoint location.
>
>
>
> Jacek
>
>
>
> On Wed, 26 Jun 2019, 19:30 Rong, Jialei, 
> wrote:
>
> Hi Dear Spark Expert
>
>
>
> I’m curious about a question regarding Spark Streaming/Structured
> Streaming: whether it allows to change parallelism number(the default one
> or the one specified in particular operator) in a stream having stateful
> transform/operator? Whether this will cause my checkpointed state get
> messed up?
>
>
>
>
>
> Regards
>
> Jialei
>
>
>
>

Re: Change parallelism number in Spark Streaming

2019-06-26 Thread Rong, Jialei

Fantastic, thanks!

From: Jungtaek Lim 
Date: Wednesday, June 26, 2019 at 2:59 PM
To: "Rong, Jialei" 
Cc: Jacek Laskowski , "user @spark" 
Subject: Re: Change parallelism number in Spark Streaming

Hi,

you could consider state operator's partition numbers as "max parallelism", as 
parallelism can be reduced via applying coalesce. It would be effectively 
working similar as key groups.

If you're also considering offline query, there's a tool to manipulate state 
which enables reading and writing state in structured streaming, achieving 
rescaling and schema evolution.

https://github.com/HeartSaVioR/spark-state-tools
(DISCLAIMER: I'm an author of this tool.)

Thanks,
Jungtaek Lim (HeartSaVioR)

On Thu, Jun 27, 2019 at 4:48 AM Rong, Jialei  wrote:
Thank you for your quick reply!
Is there any plan to improve this?
I asked this question due to some investigation on comparing those state of art 
streaming systems, among which Flink and DataFlow allow changing parallelism 
number, and by my knowledge of Spark Streaming, it seems it is also able to do 
that: if some “key interval” concept is used, then state can somehow decoupled 
from partition number by consistent hashing.

Regards
Jialei

From: Jacek Laskowski mailto:ja...@japila.pl>>
Date: Wednesday, June 26, 2019 at 11:00 AM
To: "Rong, Jialei" 
Cc: "user @spark" mailto:user@spark.apache.org>>
Subject: Re: Change parallelism number in Spark Streaming

Hi,

It's not allowed to change the numer of partitions after your streaming query 
is started.

The reason is exactly the number of state stores which is exactly the number of 
partitions (perhaps multiplied by the number of stateful operators).

I think you'll even get a warning or an exception when you change it after 
restarting the query.

The number of partitions is stored in a checkpoint location.

Jacek

On Wed, 26 Jun 2019, 19:30 Rong, Jialei,  wrote:
Hi Dear Spark Expert

I’m curious about a question regarding Spark Streaming/Structured Streaming: 
whether it allows to change parallelism number(the default one or the one 
specified in particular operator) in a stream having stateful 
transform/operator? Whether this will cause my checkpointed state get messed up?

Regards
Jialei

--
Name : Jungtaek Lim
Blog : http://medium.com/@heartsavior
Twitter : http://twitter.com/heartsavior
LinkedIn : http://www.linkedin.com/in/heartsavior

Re: Change parallelism number in Spark Streaming

2019-06-26 Thread Jungtaek Lim

Hi,

you could consider state operator's partition numbers as "max parallelism",
as parallelism can be reduced via applying coalesce. It would be
effectively working similar as key groups.

If you're also considering offline query, there's a tool to manipulate
state which enables reading and writing state in structured streaming,
achieving rescaling and schema evolution.

https://github.com/HeartSaVioR/spark-state-tools
(DISCLAIMER: I'm an author of this tool.)

Thanks,
Jungtaek Lim (HeartSaVioR)

On Thu, Jun 27, 2019 at 4:48 AM Rong, Jialei 
wrote:

> Thank you for your quick reply!
>
> Is there any plan to improve this?
>
> I asked this question due to some investigation on comparing those state
> of art streaming systems, among which Flink and DataFlow allow changing
> parallelism number, and by my knowledge of Spark Streaming, it seems it is
> also able to do that: if some “key interval” concept is used, then state
> can somehow decoupled from partition number by consistent hashing.
>
>
>
>
>
> Regards
>
> Jialei
>
>
>
> *From: *Jacek Laskowski 
> *Date: *Wednesday, June 26, 2019 at 11:00 AM
> *To: *"Rong, Jialei" 
> *Cc: *"user @spark" 
> *Subject: *Re: Change parallelism number in Spark Streaming
>
>
>
> Hi,
>
>
>
> It's not allowed to change the numer of partitions after your streaming
> query is started.
>
>
>
> The reason is exactly the number of state stores which is exactly the
> number of partitions (perhaps multiplied by the number of stateful
> operators).
>
>
>
> I think you'll even get a warning or an exception when you change it after
> restarting the query.
>
>
>
> The number of partitions is stored in a checkpoint location.
>
>
>
> Jacek
>
>
>
> On Wed, 26 Jun 2019, 19:30 Rong, Jialei, 
> wrote:
>
> Hi Dear Spark Expert
>
>
>
> I’m curious about a question regarding Spark Streaming/Structured
> Streaming: whether it allows to change parallelism number(the default one
> or the one specified in particular operator) in a stream having stateful
> transform/operator? Whether this will cause my checkpointed state get
> messed up?
>
>
>
>
>
> Regards
>
> Jialei
>
>
>
>

-- 
Name : Jungtaek Lim
Blog : http://medium.com/@heartsavior
Twitter : http://twitter.com/heartsavior
LinkedIn : http://www.linkedin.com/in/heartsavior

Re: Change parallelism number in Spark Streaming

2019-06-26 Thread Rong, Jialei

Thank you for your quick reply!
Is there any plan to improve this?
I asked this question due to some investigation on comparing those state of art 
streaming systems, among which Flink and DataFlow allow changing parallelism 
number, and by my knowledge of Spark Streaming, it seems it is also able to do 
that: if some “key interval” concept is used, then state can somehow decoupled 
from partition number by consistent hashing.


Regards
Jialei

From: Jacek Laskowski 
Date: Wednesday, June 26, 2019 at 11:00 AM
To: "Rong, Jialei" 
Cc: "user @spark" 
Subject: Re: Change parallelism number in Spark Streaming

Hi,

It's not allowed to change the numer of partitions after your streaming query 
is started.

The reason is exactly the number of state stores which is exactly the number of 
partitions (perhaps multiplied by the number of stateful operators).

I think you'll even get a warning or an exception when you change it after 
restarting the query.

The number of partitions is stored in a checkpoint location.

Jacek

On Wed, 26 Jun 2019, 19:30 Rong, Jialei,  wrote:
Hi Dear Spark Expert

I’m curious about a question regarding Spark Streaming/Structured Streaming: 
whether it allows to change parallelism number(the default one or the one 
specified in particular operator) in a stream having stateful 
transform/operator? Whether this will cause my checkpointed state get messed up?


Regards
Jialei

Re: Change parallelism number in Spark Streaming

2019-06-26 Thread Jacek Laskowski

Hi,

It's not allowed to change the numer of partitions after your streaming
query is started.

The reason is exactly the number of state stores which is exactly the
number of partitions (perhaps multiplied by the number of stateful
operators).

I think you'll even get a warning or an exception when you change it after
restarting the query.

The number of partitions is stored in a checkpoint location.

Jacek

On Wed, 26 Jun 2019, 19:30 Rong, Jialei,  wrote:

> Hi Dear Spark Expert
>
>
>
> I’m curious about a question regarding Spark Streaming/Structured
> Streaming: whether it allows to change parallelism number(the default one
> or the one specified in particular operator) in a stream having stateful
> transform/operator? Whether this will cause my checkpointed state get
> messed up?
>
>
>
>
>
> Regards
>
> Jialei
>
>
>

Re: Change parallelism number in Spark Streaming

Re: Change parallelism number in Spark Streaming

Re: Change parallelism number in Spark Streaming

Re: Change parallelism number in Spark Streaming

Re: Change parallelism number in Spark Streaming

Re: Change parallelism number in Spark Streaming

Re: Change parallelism number in Spark Streaming

Re: Change parallelism number in Spark Streaming

Re: Change parallelism number in Spark Streaming

9 matches

Site Navigation

Mail list logo

Footer information