It is hard to calculate; it depends very much on the job:

 - Is it running filters that reduce the data volume early?
 - Is it possibly running operations that blow up the size of an
intermediate result?

In general, I would plan for as much temp space as the input data size.
The exceptions are when you have enough RAM for the system to process the
job completely in memory, or when you filter the input data aggressively
right away (for example, a highly selective filter function directly after
the readTextFile(...) or so).
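
As a rough sketch of that rule of thumb, using the numbers from earlier in
this thread: the function name and the "selectivity" parameter are made up
for illustration, and the 2/3 heap fraction is only the approximate figure
mentioned below, not an exact Flink setting.

```python
def estimate_spill_headroom(input_gb, nodes, tmp_gb_per_node, heap_gb_per_node,
                            usable_heap_fraction=2 / 3, selectivity=1.0):
    """Compare a rough temp-space need against what the cluster offers.

    selectivity is a hypothetical knob: the fraction of the input that
    survives an early filter (1.0 means nothing is filtered out).
    """
    tmp_total = nodes * tmp_gb_per_node
    # Assumption: only about 2/3 of the JVM heap is usable for processing.
    usable_heap = nodes * heap_gb_per_node * usable_heap_fraction
    # Rule of thumb: plan for as much temp space as the (filtered) input size.
    needed = input_gb * selectivity
    return {
        "tmp_total_gb": tmp_total,
        "usable_heap_gb": usable_heap,
        "needed_gb": needed,
        # Enough if the job fits in memory or the spill space covers it.
        "enough": usable_heap >= needed or tmp_total >= needed,
    }


# 1 TB input, 10 nodes, 25 GB tmp space and 24 GB heap per node.
report = estimate_spill_headroom(input_gb=1000, nodes=10,
                                 tmp_gb_per_node=25, heap_gb_per_node=24)
print(report)
```

This ignores operations that blow up intermediate results, so treat the
result as a lower bound rather than a guarantee.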

Stephan





On Mon, Nov 10, 2014 at 5:59 PM, Malte Schwarzer <[email protected]> wrote:

> What's the estimated amount of disk space for such a job? Or how can I
> calculate it?
>
> Malte
>
> From: Stephan Ewen <[email protected]>
> Reply-To: <[email protected]>
> Date: Monday, November 10, 2014 11:22
> To: <[email protected]>
> Subject: Re: How to make Flink to write less temporary files?
>
> Hi!
>
> With 10 nodes and 25 GB on each node, you have 250 GB space to spill
> temporary files. You also seem to have roughly the same size in JVM Heap,
> out of which Flink can use roughly 2/3.
>
> When you process 1 TB, 250 GB JVM heap and 250 GB temp file space may not
> be enough, since together they are less than the initial data size.
>
> I think you simply need more disk space for a job like that...
>
> Stephan
>
>
>
>
> On Mon, Nov 10, 2014 at 10:54 AM, Malte Schwarzer <[email protected]> wrote:
>
>> My blobStore files are small, but each *.channel file is around 170 MB.
>> Before I start my Flink job, I have 25 GB of free space available in my
>> tmp-dir, and my taskmanager heap size is currently 24 GB. I'm using a
>> cluster with 10 nodes.
>>
>> Is this enough space to process a 1TB file?
>>
>> From: Stephan Ewen <[email protected]>
>> Reply-To: <[email protected]>
>> Date: Monday, November 10, 2014 10:35
>> To: <[email protected]>
>> Subject: Re: How to make Flink to write less temporary files?
>>
>> I would assume that the blobStore files are rather small (they are only
>> jar files so far).
>>
>> I would look for *.channel files, which are spilled intermediate results.
>> They can get pretty large for large jobs.
>>
>
>
