Re: checkpoint lifecycle

Brock Noland Thu, 30 Jan 2014 08:29:40 -0800

On Thu, Jan 30, 2014 at 9:29 AM, Umesh Telang <[email protected]> wrote:


>  Ah, ok. So 32 bytes is required for each pointer to an event.
>

Yep :)
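
As a rough sketch of the heap arithmetic, using the 32 bytes per event pointer confirmed above and the 150M capacity discussed elsewhere in this thread. The function name and the 500MB default allowance for the rest of Flume are assumptions for illustration, not part of Flume:

```python
BYTES_PER_EVENT_POINTER = 32  # per-event pointer cost confirmed in this thread


def file_channel_heap_bytes(capacity, overhead_bytes=500 * 1024**2):
    """Rough heap estimate: in-memory event pointers plus an assumed
    allowance (default 500 MiB) for the rest of Flume."""
    return capacity * BYTES_PER_EVENT_POINTER + overhead_bytes


capacity = 150_000_000  # channel capacity from this thread
pointer_bytes = capacity * BYTES_PER_EVENT_POINTER

# Show both decimal and binary units, since "GB" is often used for either.
print(f"pointers: {pointer_bytes / 1e9:.1f} GB ({pointer_bytes / 2**30:.2f} GiB)")
print(f"suggested heap: {file_channel_heap_bytes(capacity) / 2**30:.2f} GiB")
```

Treat the result as a lower bound when choosing a heap size rather than an exact figure.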


> We'll amend our heap size accordingly. We may also be able to reduce our
> FileChannel size. We hadn't understood the implications of the capacity
> value of the FileChannel we have been using.
>
>  Regarding the multiple data directories, I hadn't realised that that
> implied distinct disks. Just to confirm, you're saying that each data
> directory has to be on a distinct disk?
>

The recommendation is that you have two data directories per distinct disk.


> Is it that FileChannel can't utilise an entire disk from an IO
> perspective, regardless of how big the disk is?
>

Right, it has nothing to do with size and everything to do with IO
bandwidth. We could optimize this area (and will) but for now specifying
two data directories per disk is a good workaround.
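
A minimal sketch of laying that out, assuming hypothetical mount points `/disk1` and `/disk2`, a hypothetical agent named `agent` and a channel named `c1`. `dataDirs` is the File Channel property that takes a comma-separated list of directories; the helper itself is only illustrative:

```python
def data_dirs_for_disks(mount_points, dirs_per_disk=2, subdir="flume/data"):
    """Build a comma-separated dataDirs value with `dirs_per_disk`
    directories on each disk (directory names here are hypothetical)."""
    dirs = []
    for mount in mount_points:
        for i in range(1, dirs_per_disk + 1):
            dirs.append(f"{mount.rstrip('/')}/{subdir}{i}")
    return ",".join(dirs)


disks = ["/disk1", "/disk2"]  # hypothetical mount points, one per physical disk
print(f"agent.channels.c1.dataDirs = {data_dirs_for_disks(disks)}")
```

The printed line can be pasted into the agent's properties file once the paths match your own disks.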


> Or is this size-dependent? i.e above a certain size, you need a second
> data directory? If the latter, could you let me know what that size is?
> If it's a general point, then I'll follow the earlier advice of 2 data
> dirs per file channel.
>

Doesn't relate to size.


>
>  Apologies for all the questions!
>
>  We had made an estimation of disk space (avg event size (~250 bytes)  *
> channel size (150M)) and have provisioned disks that are significantly
> larger than the required space.
>

Perfect, great to hear!
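
For reference, that estimate works out as below. This is only a sketch of the avg event size * channel size formula from this thread; the function name is made up, and it ignores any per-event overhead in the log files, so leave headroom:

```python
def full_channel_disk_bytes(avg_event_bytes, capacity):
    """Disk needed to hold a completely full channel, by the
    avg event size * channel size rule of thumb (no overhead included)."""
    return avg_event_bytes * capacity


needed = full_channel_disk_bytes(avg_event_bytes=250, capacity=150_000_000)
print(f"full channel: {needed / 1e9:.1f} GB")
```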

>
>  Thanks,
> Umesh
>
>  ------------------------------
> *From:* Brock Noland [[email protected]]
> *Sent:* 30 January 2014 14:38
>
> *To:* [email protected]
> *Subject:* Re: checkpoint lifecycle
>
>    On Thu, Jan 30, 2014 at 8:16 AM, Umesh Telang
> <[email protected]> wrote:
>
>>  Hi Brock,
>>
>>  Our heap size is 2GB.
>>
>
>  That is not enough heap for 150M events. It's 150 million * 32 bytes =
> 4.8GB (about 4.5GiB) + say 100-500MB for the rest of Flume.
>
>
>>
>>  Thanks for the advice on data directories. Could you please let me know
>> the heuristic for that?   (e.g. 1 data directory per N-sized channel where
>> N is...)
>>
>
>  File channel at present cannot utilize an entire disk from an IO
> perspective, that is why I suggest multiple disks. Of course you'll want to
> ensure that you have enough disk to support a full channel, but that is a
> different discussion (avg event size * channel size).
>
>
>>
>>  Thanks also for suggesting back up checkpoints - are these something
>> that increases the integrity of Flume's execution in an automatic fashion,
>> or does it aid in some form of manual recovery?
>>
>
>  Automatic. If Flume is killed or shut down during a checkpoint, that
> checkpoint is invalid and, unless a backup checkpoint exists, a full replay
> will have to take place. Furthermore, without FLUME-2155 full replays are
> very time consuming under certain conditions.
>
>
>>
>>  Re: FLUME-2155, I've scanned through it, and will read it in more
>> detail. I'm not sure about the unit of measurement for some of the metrics
>> (milliseconds?), but is there any guidance as to at which order of
>> magnitude (10^4, 10^6 or 10^8 ?) the channel size causes the replay issue
>> to become apparent?
>>
>
>  It's not purely about channel size. Specifically it's about:
>
>  1) Large channel size
> 2) Having a large number of events in your channel (queue depth)
> 3) Having run the channel for some time such that old WALs were cleaned
> up (causing there to be removes for which no event exists)
> 4) Performing a full replay in these conditions
>
>  Generally I wouldn't go over a 1M channel size without backup
> checkpoints, the FLUME-2155 change, or both. There are more details here:
>
>
> https://issues.apache.org/jira/browse/FLUME-2155?focusedCommentId=13841465&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13841465
>
>  Brock
>



-- 
Apache MRUnit - Unit testing MapReduce - http://mrunit.apache.org
