How often are your events coming in?
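If they only arrive for part of the day, remember that with an
idleTimeout-only config the timeout is the only thing closing files: the
file closes 15 minutes after the last event, and nothing new appears in
the bucket until more events arrive and a new file is opened.

If the requirement is strictly one file per day, a sketch like the one
below may work (untested; the bucket path is a placeholder, and the
agent/channel/sink names follow the earlier examples in this thread).
Putting the date escape in hdfs.path makes the sink open a new file as
soon as the day in the event timestamp changes, while the idle timeout
closes out the previous day's file:

agent.sinks.sink.type = hdfs
agent.sinks.sink.channel = channel
# Placeholder bucket. The %Y-%m-%d escapes resolve from the event's
# 'timestamp' header, which the syslogtcp source sets for you.
agent.sinks.sink.hdfs.path = s3n://my-bucket/logs/%Y-%m-%d
agent.sinks.sink.hdfs.filePrefix = FlumeData
agent.sinks.sink.hdfs.fileType = DataStream
# Zero out every size/count/interval trigger so the day change and
# idleness are the only reasons a file closes.
agent.sinks.sink.hdfs.rollInterval = 0
agent.sinks.sink.hdfs.rollSize = 0
agent.sinks.sink.hdfs.rollCount = 0
# Flush to the file every 1000 events (any positive batch size works).
agent.sinks.sink.hdfs.batchSize = 1000
# Close any file that has received no events for 15 minutes.
agent.sinks.sink.hdfs.idleTimeout = 900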
On Thu, Oct 24, 2013 at 2:21 AM, Martinus m <[email protected]> wrote:

> Hi David,
>
> Thanks for the example. I have set it just like the above, but it only
> generated files for the first 15 minutes. After waiting for more than an
> hour, there is no update at all in the s3 bucket.
>
> Thanks.
>
> Martinus
>
>
> On Wed, Oct 23, 2013 at 8:48 PM, David Sinclair <
> [email protected]> wrote:
>
>> You can set all of the time/size-based rolling policies to zero and set
>> an idle timeout on the sink. The example below has a 15-minute timeout.
>>
>> agent.sinks.sink.hdfs.fileSuffix = FlumeData.%Y-%m-%d
>> agent.sinks.sink.hdfs.fileType = DataStream
>> agent.sinks.sink.hdfs.rollInterval = 0
>> agent.sinks.sink.hdfs.rollSize = 0
>> agent.sinks.sink.hdfs.batchSize = 0
>> agent.sinks.sink.hdfs.rollCount = 0
>> agent.sinks.sink.hdfs.idleTimeout = 900
>>
>>
>> On Tue, Oct 22, 2013 at 10:17 PM, Martinus m <[email protected]> wrote:
>>
>>> Hi David,
>>>
>>> The requirement is only to roll per day, actually.
>>>
>>> Hi Devin,
>>>
>>> Thanks for sharing your experience. I also tried setting the config as
>>> follows:
>>>
>>> agent.sinks.sink.hdfs.fileSuffix = FlumeData.%Y-%m-%d
>>> agent.sinks.sink.hdfs.fileType = DataStream
>>> agent.sinks.sink.hdfs.rollInterval = 0
>>> agent.sinks.sink.hdfs.rollSize = 0
>>> agent.sinks.sink.hdfs.batchSize = 15000
>>> agent.sinks.sink.hdfs.rollCount = 0
>>>
>>> But I didn't see anything in the s3 bucket, so I guess I need to change
>>> rollInterval to 86400. In my understanding, rollInterval = 86400 will
>>> roll the file after 24 hours, like you said, but it will not start a new
>>> file when the day changes before the 24-hour interval has elapsed
>>> (unless we put a date escape in fileSuffix as above).
>>>
>>> Thanks to both of you.
>>>
>>> Best regards,
>>>
>>> Martinus
>>>
>>>
>>> On Tue, Oct 22, 2013 at 11:16 PM, DSuiter RDX <[email protected]> wrote:
>>>
>>>> Martinus, you have to set all the other roll options to 0 explicitly
>>>> in the configuration if you want the sink to roll on only one
>>>> parameter; otherwise it rolls on whichever trigger it reaches first.
>>>> If you want it to roll once a day, you have to specifically disable
>>>> all the other roll triggers - they all take default settings unless
>>>> told not to. When I was experimenting, for example, it kept rolling
>>>> every 30 seconds even though I had hdfs.rollSize set to 64 MB (our
>>>> test data is generated slowly). So I ended up with a pile of small
>>>> (0.2 KB to ~19 KB) files in a bunch of directories sorted by timestamp
>>>> in ten-minute intervals.
>>>>
>>>> So, maybe a conf like this:
>>>>
>>>> agent.sinks.sink.type = hdfs
>>>> agent.sinks.sink.channel = channel
>>>> agent.sinks.sink.hdfs.path = (desired path string; yours looks fine)
>>>> agent.sinks.sink.hdfs.fileSuffix = .avro
>>>> agent.sinks.sink.serializer = avro_event
>>>> agent.sinks.sink.hdfs.fileType = DataStream
>>>> agent.sinks.sink.hdfs.rollInterval = 86400
>>>> agent.sinks.sink.hdfs.rollSize = 134217728
>>>> agent.sinks.sink.hdfs.batchSize = 15000
>>>> agent.sinks.sink.hdfs.rollCount = 0
>>>>
>>>> This one will roll the HDFS file at 24-hour intervals or at a 128 MB
>>>> file size, writing events in batches of 15000. But if the
>>>> hdfs.rollCount line were not set to 0 (or some higher value - I
>>>> probably could have set it to 15000 to match hdfs.batchSize with
>>>> similar results), the file would roll as soon as the default of only
>>>> 10 events had been written to it.
>>>>
>>>> Are you using a 1-tier or 2-tier design for this? For syslog, we
>>>> collect with a syslogTCP source that receives from a remote host. It
>>>> then goes to an avro sink to aggregate the small event entries into
>>>> larger avro files. A second tier then collects that with an avro
>>>> source and an hdfs sink. So we get all the individual events streamed
>>>> into an avro container, and the avro container is put into HDFS every
>>>> 24 hours or when it hits 128 MB. We were getting many small files
>>>> because of the low velocity of our sample set, and we did not want to
>>>> clutter up the FSImage. The avro serializer and DataStream file type
>>>> are also necessary, because the default behavior of the HDFS sink is
>>>> to write SequenceFile format.
>>>>
>>>> Hope this helps you out.
>>>>
>>>> Sincerely,
>>>> Devin Suiter
>>>> Jr. Data Solutions Software Engineer
>>>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>>>> Google Voice: 412-256-8556 | www.rdx.com
>>>>
>>>>
>>>> On Tue, Oct 22, 2013 at 10:07 AM, David Sinclair <
>>>> [email protected]> wrote:
>>>>
>>>>> Do you need to roll based on size as well? Can you tell me the
>>>>> requirements?
>>>>>
>>>>>
>>>>> On Tue, Oct 22, 2013 at 2:15 AM, Martinus m <[email protected]> wrote:
>>>>>
>>>>>> Hi David,
>>>>>>
>>>>>> Thanks for your answer. I already did that, using %Y-%m-%d. But
>>>>>> since the sink still rolls based on size, it keeps generating two or
>>>>>> more FlumeData.%Y-%m-%d files with different suffixes.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> Martinus
>>>>>>
>>>>>>
>>>>>> On Fri, Oct 18, 2013 at 10:35 PM, David Sinclair <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> The SyslogTcpSource will put a header named 'timestamp' on the
>>>>>>> flume event. This timestamp is taken from the syslog entry. You
>>>>>>> could then set the filePrefix in the sink to pick it up.
>>>>>>> For example:
>>>>>>>
>>>>>>> tier1.sinks.hdfsSink.hdfs.filePrefix = FlumeData.%{timestamp}
>>>>>>>
>>>>>>> dave
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Oct 17, 2013 at 10:23 PM, Martinus m <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi David,
>>>>>>>>
>>>>>>>> It's syslogtcp.
>>>>>>>>
>>>>>>>> Thanks.
>>>>>>>>
>>>>>>>> Martinus
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Oct 17, 2013 at 9:17 PM, David Sinclair <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> What type of source are you using?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Oct 16, 2013 at 9:56 PM, Martinus m <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Is there any option in the HDFS sink to start rolling a new file
>>>>>>>>>> whenever the date in the log changes? For example, I have the
>>>>>>>>>> logs below:
>>>>>>>>>>
>>>>>>>>>> Oct 16 23:58:56 test-host : just test
>>>>>>>>>> Oct 16 23:59:51 test-host : test again
>>>>>>>>>> Oct 17 00:00:56 test-host : just test
>>>>>>>>>> Oct 17 00:00:56 test-host : test again
>>>>>>>>>>
>>>>>>>>>> Then I want Flume to write files to an S3 bucket like this:
>>>>>>>>>>
>>>>>>>>>> FlumeData.2013-10-16.1381916293017 <-- all the Oct 16, 2013 logs
>>>>>>>>>> go here, and when Oct 17, 2013 is reached, it starts to sink
>>>>>>>>>> into a new file:
>>>>>>>>>>
>>>>>>>>>> FlumeData.2013-10-17.1381940047117
>>>>>>>>>>
>>>>>>>>>> Thanks.
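For reference, the two-tier layout Devin describes might look roughly
like the pair of agent configs below. This is a sketch, not a tested
setup: host names, ports, and channel settings are illustrative, and
only the source/sink types and the HDFS roll settings come from the
thread.

# Tier 1: receive syslog events and forward them over avro.
tier1.sources = syslogSrc
tier1.channels = ch1
tier1.sinks = avroSink

tier1.sources.syslogSrc.type = syslogtcp
tier1.sources.syslogSrc.host = 0.0.0.0
tier1.sources.syslogSrc.port = 5140
tier1.sources.syslogSrc.channels = ch1

tier1.channels.ch1.type = memory
tier1.channels.ch1.capacity = 10000

tier1.sinks.avroSink.type = avro
tier1.sinks.avroSink.hostname = collector.example.com
tier1.sinks.avroSink.port = 4545
tier1.sinks.avroSink.channel = ch1

# Tier 2: receive the avro stream and land it in HDFS daily or at 128 MB.
tier2.sources = avroSrc
tier2.channels = ch2
tier2.sinks = hdfsSink

tier2.sources.avroSrc.type = avro
tier2.sources.avroSrc.bind = 0.0.0.0
tier2.sources.avroSrc.port = 4545
tier2.sources.avroSrc.channels = ch2

tier2.channels.ch2.type = memory
tier2.channels.ch2.capacity = 100000

tier2.sinks.hdfsSink.type = hdfs
tier2.sinks.hdfsSink.channel = ch2
tier2.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/%Y-%m-%d
tier2.sinks.hdfsSink.hdfs.fileSuffix = .avro
tier2.sinks.hdfsSink.serializer = avro_event
tier2.sinks.hdfsSink.hdfs.fileType = DataStream
tier2.sinks.hdfsSink.hdfs.rollInterval = 86400
tier2.sinks.hdfsSink.hdfs.rollSize = 134217728
tier2.sinks.hdfsSink.hdfs.rollCount = 0

The avro hop exists to buffer many small syslog events into one stream,
so the hdfs sink on tier 2 writes a few large files instead of the many
tiny ones that a low event rate would otherwise produce.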
