Re: Change parallelism number in Spark Streaming

Jacek Laskowski Wed, 26 Jun 2019 15:54:23 -0700

Hi Jungtaek,

That's very helpful to have the state source. As a matter of fact I've just
this week been working on a similar tool (!) and have been wondering how to
recreate the schema of the state key and value. You've helped me a lot.
Thanks.


Jacek

On Wed, 26 Jun 2019, 23:58 Jungtaek Lim, <kabh...@gmail.com> wrote:

> Hi,
>
> you could consider state operator's partition numbers as "max
> parallelism", as parallelism can be reduced via applying coalesce. It would
> be effectively working similar as key groups.
>
> If you're also considering offline query, there's a tool to manipulate
> state which enables reading and writing state in structured streaming,
> achieving rescaling and schema evolution.
>
> https://github.com/HeartSaVioR/spark-state-tools
> (DISCLAIMER: I'm an author of this tool.)
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Thu, Jun 27, 2019 at 4:48 AM Rong, Jialei <jia...@amazon.com.invalid>
> wrote:
>
>> Thank you for your quick reply!
>>
>> Is there any plan to improve this?
>>
>> I asked this question due to some investigation on comparing those state
>> of art streaming systems, among which Flink and DataFlow allow changing
>> parallelism number, and by my knowledge of Spark Streaming, it seems it is
>> also able to do that: if some “key interval” concept is used, then state
>> can somehow decoupled from partition number by consistent hashing.
>>
>>
>>
>>
>>
>> Regards
>>
>> Jialei
>>
>>
>>
>> *From: *Jacek Laskowski <ja...@japila.pl>
>> *Date: *Wednesday, June 26, 2019 at 11:00 AM
>> *To: *"Rong, Jialei" <jia...@amazon.com.invalid>
>> *Cc: *"user @spark" <user@spark.apache.org>
>> *Subject: *Re: Change parallelism number in Spark Streaming
>>
>>
>>
>> Hi,
>>
>>
>>
>> It's not allowed to change the numer of partitions after your streaming
>> query is started.
>>
>>
>>
>> The reason is exactly the number of state stores which is exactly the
>> number of partitions (perhaps multiplied by the number of stateful
>> operators).
>>
>>
>>
>> I think you'll even get a warning or an exception when you change it
>> after restarting the query.
>>
>>
>>
>> The number of partitions is stored in a checkpoint location.
>>
>>
>>
>> Jacek
>>
>>
>>
>> On Wed, 26 Jun 2019, 19:30 Rong, Jialei, <jia...@amazon.com.invalid>
>> wrote:
>>
>> Hi Dear Spark Expert
>>
>>
>>
>> I’m curious about a question regarding Spark Streaming/Structured
>> Streaming: whether it allows to change parallelism number(the default one
>> or the one specified in particular operator) in a stream having stateful
>> transform/operator? Whether this will cause my checkpointed state get
>> messed up?
>>
>>
>>
>>
>>
>> Regards
>>
>> Jialei
>>
>>
>>
>>
>
> --
> Name : Jungtaek Lim
> Blog : http://medium.com/@heartsavior
> Twitter : http://twitter.com/heartsavior
> LinkedIn : http://www.linkedin.com/in/heartsavior
>

Re: Change parallelism number in Spark Streaming

Reply via email to