I have a Spark Structured Streaming job that's crunching through a few terabytes
of data.

I'm using the file stream reader and it works flawlessly; I can adjust its
partitioning with spark.default.parallelism.
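For reference, I set it when building the session, roughly like this (the app
name here is just illustrative):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("sessionization")                    // illustrative name
  .config("spark.default.parallelism", "2000")  // this is what lets me adjust the file reader's partitioning
  .getOrCreate()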
 
However, I'm doing sessionization on the data after loading it, and that stage
is currently locked at 200 partitions. I've tried repartitioning before and
after the stateful map, but that just adds a new stage, so it's not very
useful.
 
Changing spark.sql.shuffle.partitions doesn't affect the count either.
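For what it's worth, I'm setting it on the session before starting the query,
roughly like this:

// Set before the query starts; the stateful stage still runs with 200 tasks.
spark.conf.set("spark.sql.shuffle.partitions", "2000")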
 
val sessions = streamingSource          // -> spark.default.parallelism partitions/tasks (i.e. 2000)
  .repartition(n)                       // -> n partitions/tasks
  .groupByKey(event => event.sessid)    // -> stage opens / fixed 200 tasks
  .flatMapGroupsWithState(
    OutputMode.Append,
    GroupStateTimeout.EventTimeTimeout()
  )(SessionState.updateSessionEvents)   // -> fixed 200 tasks / stage closes
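
For context, SessionState.updateSessionEvents has the usual
flatMapGroupsWithState shape. A minimal sketch with illustrative Event/Session
types (my real logic is more involved):

import org.apache.spark.sql.streaming.GroupState

case class Event(sessid: String, ts: java.sql.Timestamp)   // illustrative input type
case class Session(sessid: String, events: Long)           // illustrative state/output type

object SessionState {
  // (key, new events for that key, state handle) => output rows
  def updateSessionEvents(
      sessid: String,
      events: Iterator[Event],
      state: GroupState[Session]): Iterator[Session] = {
    if (state.hasTimedOut) {
      // Event-time timeout fired: emit the finished session and drop its state.
      val finished = state.get
      state.remove()
      Iterator(finished)
    } else {
      val current = state.getOption.getOrElse(Session(sessid, 0L))
      val updated = current.copy(events = current.events + events.size)
      state.update(updated)
      // Expire the session 30 minutes past the current watermark (requires withWatermark upstream).
      state.setTimeoutTimestamp(state.getCurrentWatermarkMs() + 30 * 60 * 1000)
      Iterator.empty
    }
  }
}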


I tried grepping through the Spark source code but couldn't find that parameter anywhere.

