Hi Jungtaek, That's very helpful to have the state source. As a matter of fact I've just this week been working on a similar tool (!) and have been wondering how to recreate the schema of the state key and value. You've helped me a lot. Thanks.
Jacek On Wed, 26 Jun 2019, 23:58 Jungtaek Lim, <kabh...@gmail.com> wrote: > Hi, > > you could consider state operator's partition numbers as "max > parallelism", as parallelism can be reduced via applying coalesce. It would > be effectively working similar as key groups. > > If you're also considering offline query, there's a tool to manipulate > state which enables reading and writing state in structured streaming, > achieving rescaling and schema evolution. > > https://github.com/HeartSaVioR/spark-state-tools > (DISCLAIMER: I'm an author of this tool.) > > Thanks, > Jungtaek Lim (HeartSaVioR) > > On Thu, Jun 27, 2019 at 4:48 AM Rong, Jialei <jia...@amazon.com.invalid> > wrote: > >> Thank you for your quick reply! >> >> Is there any plan to improve this? >> >> I asked this question due to some investigation on comparing those state >> of art streaming systems, among which Flink and DataFlow allow changing >> parallelism number, and by my knowledge of Spark Streaming, it seems it is >> also able to do that: if some “key interval” concept is used, then state >> can somehow decoupled from partition number by consistent hashing. >> >> >> >> >> >> Regards >> >> Jialei >> >> >> >> *From: *Jacek Laskowski <ja...@japila.pl> >> *Date: *Wednesday, June 26, 2019 at 11:00 AM >> *To: *"Rong, Jialei" <jia...@amazon.com.invalid> >> *Cc: *"user @spark" <user@spark.apache.org> >> *Subject: *Re: Change parallelism number in Spark Streaming >> >> >> >> Hi, >> >> >> >> It's not allowed to change the numer of partitions after your streaming >> query is started. >> >> >> >> The reason is exactly the number of state stores which is exactly the >> number of partitions (perhaps multiplied by the number of stateful >> operators). >> >> >> >> I think you'll even get a warning or an exception when you change it >> after restarting the query. >> >> >> >> The number of partitions is stored in a checkpoint location. >> >> >> >> Jacek >> >> >> >> On Wed, 26 Jun 2019, 19:30 Rong, Jialei, <jia...@amazon.com.invalid> >> wrote: >> >> Hi Dear Spark Expert >> >> >> >> I’m curious about a question regarding Spark Streaming/Structured >> Streaming: whether it allows to change parallelism number(the default one >> or the one specified in particular operator) in a stream having stateful >> transform/operator? Whether this will cause my checkpointed state get >> messed up? >> >> >> >> >> >> Regards >> >> Jialei >> >> >> >> > > -- > Name : Jungtaek Lim > Blog : http://medium.com/@heartsavior > Twitter : http://twitter.com/heartsavior > LinkedIn : http://www.linkedin.com/in/heartsavior >