The relevant config is spark.sql.shuffle.partitions.  Note that once a
streaming query has started, this number is fixed: it is recorded in the
query's checkpoint, so changing the config only affects queries started from
an empty checkpoint.
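
As a minimal sketch (the paths, schema, and the simple aggregation below are
placeholders, not taken from your job), set the value before the query is
started and point checkpointLocation at a new, empty directory:

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.types.{StringType, StructType, TimestampType}

  object ShufflePartitionsExample {
    def main(args: Array[String]): Unit = {
      val spark = SparkSession.builder()
        .appName("shuffle-partitions-example")
        // Must be in effect before the query first starts; after that the
        // value recorded in the checkpoint is reused on every restart.
        .config("spark.sql.shuffle.partitions", "2000")
        .getOrCreate()
      import spark.implicits._

      // Hypothetical file source standing in for streamingSource.
      val schema = new StructType()
        .add("sessid", StringType)
        .add("ts", TimestampType)
      val events = spark.readStream.schema(schema).json("/data/events")

      // Any aggregation introduces a shuffle whose task count comes from
      // spark.sql.shuffle.partitions.
      val counts = events.groupBy($"sessid").count()

      counts.writeStream
        .outputMode("complete")
        .format("console")
        // Must point at a NEW, empty directory for the new setting to apply.
        .option("checkpointLocation", "/checkpoints/sessions-v2")
        .start()
        .awaitTermination()
    }
  }

If you restart against the old checkpoint directory, the query keeps the 200
partitions that were recorded when it first ran.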

On Wed, Nov 8, 2017 at 7:34 AM, Teemu Heikkilä <te...@emblica.fi> wrote:

> I have a Spark Structured Streaming job and I'm crunching through a few
> terabytes of data.
>
> I'm using the file stream reader and it works flawlessly; I can adjust its
> partitioning with spark.default.parallelism.
>
> However, I'm doing sessionization on the data after loading it, and that
> stage is currently locked at just 200 partitions. I've tried repartitioning
> before and after the stateful map, but that just adds a new stage, so it's
> not very useful.
>
> Changing spark.sql.shuffle.partitions doesn't affect the count either.
>
> val sessions = streamingSource          // -> spark.default.parallelism partitions/tasks (e.g. 2000)
>   .repartition(n)                       // -> n partitions/tasks
>   .groupByKey(event => event.sessid)    // -> stage opens / fixed 200 tasks
>   .flatMapGroupsWithState(
>     OutputMode.Append,
>     GroupStateTimeout.EventTimeTimeout()
>   )(SessionState.updateSessionEvents)   // -> fixed 200 tasks / stage closes
>
>
> I tried to grep through the Spark source code but couldn't find that
> parameter anywhere.
>
>
