Could you show a sample of the file names? Several things use UUIDs in
their names, so it would be good to see what the hundreds of directories
being generated every second actually are.
If you are checkpointing every 400s, then there shouldn't be checkpoint
directories written every second. They should be written in large bunches
every 400s.
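
One way to collect that sample and check the write pattern is a small
script like the one below. This is only a sketch: it assumes the checkpoint
directory is reachable as a normal mounted path (for example through the
GlusterFS mount), it uses modification time as a stand-in for creation
time, and the function names are made up for illustration. It buckets the
entries by time, so checkpoint output should show up as large counts roughly
400s apart, while anything created continuously shows up in every bucket.

```python
import os
from collections import Counter


def creation_histogram(root, bucket_seconds=20):
    """Count the entries directly under `root`, grouped by mtime bucket.

    Returns a dict mapping bucket start (epoch seconds) to entry count.
    mtime is only an approximation of creation time (assumption).
    """
    counts = Counter()
    with os.scandir(root) as entries:
        for entry in entries:
            mtime = entry.stat().st_mtime
            # Round the timestamp down to the start of its bucket.
            bucket = int(mtime // bucket_seconds) * bucket_seconds
            counts[bucket] += 1
    return dict(sorted(counts.items()))


def sample_names(root, limit=20):
    """Return up to `limit` entry names under `root`, sorted by name."""
    with os.scandir(root) as entries:
        return sorted(entry.name for entry in entries)[:limit]
```

Calling sample_names() and creation_histogram() on the directory where the
new entries appear (and on its parent) should show both what the names look
like and whether they arrive in bursts or continuously.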

On Wed, Jan 6, 2016 at 3:13 PM, Jan Algermissen <algermissen1...@icloud.com>
wrote:

> Hi,
>
> we are running a streaming job that processes about 500 events per 20s
> batch and uses updateStateByKey to accumulate Web sessions (with a
> 30-minute lifetime).
>
> The checkpoint interval is set to 20 x BatchInterval, that is 400s.
>
> Cluster size is 8 nodes.
>
> We are having trouble with the number of files and directories created on
> the shared file system (GlusterFS) - there are about 100 new directories
> per second.
>
> Is that the expected magnitude of number of created directories? Or should
> we expect something different?
>
> What might we be doing wrong?  Can anyone share a pointer to material that
> explains the details of checkpointing?
>
> The checkpoint directories have UUIDs as names - is that correct?
>
> Jan
