OK, I've added hdfs.fileType = DataStream and sink.serializer = header_and_text, but I'm still seeing the logs written in SequenceFile format. Any ideas?
-----
flume@hadoop-t1:~$ flume-ng version
Flume 1.4.0.2.0.11.0-1
Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
Revision: fcdc3d29a1f249bef653b10b149aea2bc5df892e
Compiled by jenkins on Wed Mar 12 05:11:30 PDT 2014
From source with checksum dea9ae30ce2c27486ae7c76ab7aba020
-----
root@hadoop-t1:/etc/flume/conf# cat flume-conf.properties
# Name the components on this agent
hadoop-t1.sources = r1
hadoop-t1.sinks = s1
hadoop-t1.channels = mem1
# Describe/configure the source
hadoop-t1.sources.r1.type = syslogtcp
hadoop-t1.sources.r1.host = localhost
hadoop-t1.sources.r1.port = 10005
hadoop-t1.sources.r1.portHeader = port
hadoop-t1.sources.r1.interceptors = i1 i2
hadoop-t1.sources.r1.interceptors.i1.type = timestamp
hadoop-t1.sources.r1.interceptors.i2.type = host
hadoop-t1.sources.r1.interceptors.i2.hostHeader = hostname
##HDFS Sink
hadoop-t1.sinks.s1.type = hdfs
hadoop-t1.sinks.s1.fileType = DataStream
hadoop-t1.sinks.s1.hdfs.path = hdfs://hadoop-t1.mydomain.org:8020/opt/logs/%{host}/%Y-%m-%d
hadoop-t1.sinks.s1.hdfs.batchSize = 1
hadoop-t1.sinks.s1.serializer = header_and_text
hadoop-t1.sinks.s1.serializer.columns = timestamp hostname
hadoop-t1.sinks.s1.serializer.format = CSV
hadoop-t1.sinks.s1.serializer.appendNewline = true
## MEM Use a channel which buffers events in memory
hadoop-t1.channels.mem1.type = memory
hadoop-t1.channels.mem1.capacity = 1000
hadoop-t1.channels.mem1.transactionCapacity = 100
# Bind the source and sink to the channel
hadoop-t1.sources.r1.channels = mem1
hadoop-t1.sinks.s1.channel = mem1
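One thing worth checking in the paste above: the prose says hdfs.fileType, but the config line reads fileType without the hdfs. prefix. The HDFS sink reads its writer settings under hdfs.-prefixed keys (serializer is the exception), so an unprefixed fileType would be silently ignored and the sink would fall back to its SequenceFile default. A corrected pair of lines, assuming the same agent and sink names:
---
hadoop-t1.sinks.s1.hdfs.fileType = DataStream
hadoop-t1.sinks.s1.serializer = header_and_text
---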
On 14-04-01 12:13 PM, Jeff Lord wrote:
Well, you are writing a sequence file (the default). Is that what you want?
If you want text, use:
hdfs.fileType = DataStream
and for the serializer you should be able to just use:
a1.sinks.k1.sink.serializer = header_and_text
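If I'm reading the Flume user guide right, the sink.serializer spelling belongs to the file_roll sink; on the HDFS sink the property is just serializer, while fileType does take the hdfs. prefix. A minimal sketch using the a1/k1 names from the example above:
---
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.serializer = header_and_text
---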
On Tue, Apr 1, 2014 at 8:02 AM, Ryan Suarez
<[email protected]> wrote:
Thanks for the tip! I was indeed missing the interceptors. I've added them now, but the timestamp and hostname are still not showing up in the HDFS log. Any advice?
------- sample event in HDFS ------
SEQ
!org.apache.hadoop.io.LongWritable”org.apache.hadoop.io.BytesWritable������cc�c��I�[��ڳ\�����`���
�� E � ����Tsu[28432]: pam_unix(su:session): session opened for
user root by myuser(uid=31043)
------ same event in syslog ------
Mar 31 16:18:32 hadoop-t1 su[28432]: pam_unix(su:session): session
opened for user root by myuser(uid=31043)
------- flume-conf.properties --------
# Name the components on this agent
hadoop-t1.sources = r1
hadoop-t1.sinks = s1
hadoop-t1.channels = mem1
# Describe/configure the source
hadoop-t1.sources.r1.type = syslogtcp
hadoop-t1.sources.r1.host = localhost
hadoop-t1.sources.r1.port = 10005
hadoop-t1.sources.r1.portHeader = port
hadoop-t1.sources.r1.interceptors = i1 i2
hadoop-t1.sources.r1.interceptors.i1.type = timestamp
hadoop-t1.sources.r1.interceptors.i2.type = host
hadoop-t1.sources.r1.interceptors.i2.hostHeader = hostname
##HDFS Sink
hadoop-t1.sinks.s1.type = hdfs
hadoop-t1.sinks.s1.hdfs.path = hdfs://hadoop-t1.mydomain.org:8020/opt/logs/%{host}/%Y-%m-%d
hadoop-t1.sinks.s1.hdfs.batchSize = 1
hadoop-t1.sinks.s1.serializer = org.apache.flume.serialization.HeaderAndBodyTextEventSerializer$Builder
hadoop-t1.sinks.s1.serializer.columns = timestamp hostname
hadoop-t1.sinks.s1.serializer.format = CSV
hadoop-t1.sinks.s1.serializer.appendNewline = true
## MEM Use a channel which buffers events in memory
hadoop-t1.channels.mem1.type = memory
hadoop-t1.channels.mem1.capacity = 1000
hadoop-t1.channels.mem1.transactionCapacity = 100
# Bind the source and sink to the channel
hadoop-t1.sources.r1.channels = mem1
hadoop-t1.sinks.s1.channel = mem1
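Note that at this stage the config has no hdfs.fileType line at all, so the sink is using its default file type, SequenceFile, which matches the SEQ/LongWritable/BytesWritable header in the sample event above. Getting plain text out would take something like:
---
hadoop-t1.sinks.s1.hdfs.fileType = DataStream
---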
On 14-03-28 3:37 PM, Jeff Lord wrote:
Do you have the appropriate interceptors configured?
On Fri, Mar 28, 2014 at 12:28 PM, Ryan Suarez
<[email protected]> wrote:
RTFM indicates I need the following sink properties:
---
hadoop-t1.sinks.hdfs1.serializer = org.apache.flume.serialization.HeaderAndBodyTextEventSerializer
hadoop-t1.sinks.hdfs1.serializer.columns = timestamp hostname msg
hadoop-t1.sinks.hdfs1.serializer.format = CSV
hadoop-t1.sinks.hdfs1.serializer.appendNewline = true
---
But I'm still not getting timestamp information. How would I
get hostname and timestamp information in the logs?
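For what it's worth, when the serializer is given as a fully qualified class name, Flume expects the name of an EventSerializer.Builder implementation rather than the serializer class itself, which is presumably why the later message in this thread (above) points at the $Builder inner class:
---
hadoop-t1.sinks.hdfs1.serializer = org.apache.flume.serialization.HeaderAndBodyTextEventSerializer$Builder
---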
On 14-03-26 3:02 PM, Ryan Suarez wrote:
Greetings,
I'm running the Flume that ships with Hortonworks HDP2 to feed syslog to HDFS. The problem is that the timestamp and hostname of the event are not logged to HDFS.
---
flume@hadoop-t1:~$ hadoop fs -cat /opt/logs/hadoop-t1/2014-03-26/FlumeData.1395859766307
SEQ!org.apache.hadoop.io.LongWritable"org.apache.hadoop.io.BytesWritable??Ak?i<??G??`D??$hTsu[22209]:
pam_unix(su:session): session opened for user root by
someuser(uid=11111)
---
How do I configure the sink to add hostname and timestamp info to the event?
Here's my flume-conf.properties:
---
flume@hadoop-t1:/etc/flume/conf$ cat flume-conf.properties
# Name the components on this agent
hadoop-t1.sources = syslog1
hadoop-t1.sinks = hdfs1
hadoop-t1.channels = mem1
# Describe/configure the source
hadoop-t1.sources.syslog1.type = syslogtcp
hadoop-t1.sources.syslog1.host = localhost
hadoop-t1.sources.syslog1.port = 10005
hadoop-t1.sources.syslog1.portHeader = port
##HDFS Sink
hadoop-t1.sinks.hdfs1.type = hdfs
hadoop-t1.sinks.hdfs1.hdfs.path = hdfs://hadoop-t1.mydomain.org:8020/opt/logs/%{host}/%Y-%m-%d
hadoop-t1.sinks.hdfs1.hdfs.batchSize = 1
# Use a channel which buffers events in memory
hadoop-t1.channels.mem1.type = memory
hadoop-t1.channels.mem1.capacity = 1000
hadoop-t1.channels.mem1.transactionCapacity = 100
# Bind the source and sink to the channel
hadoop-t1.sources.syslog1.channels = mem1
hadoop-t1.sinks.hdfs1.channel = mem1
---
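The fix the thread converges on (in the later messages above) is to add timestamp and host interceptors on the source, so each event carries the headers a header-aware serializer can then write out. A sketch against the syslog1 source name used here:
---
hadoop-t1.sources.syslog1.interceptors = i1 i2
hadoop-t1.sources.syslog1.interceptors.i1.type = timestamp
hadoop-t1.sources.syslog1.interceptors.i2.type = host
hadoop-t1.sources.syslog1.interceptors.i2.hostHeader = hostname
---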
---
flume@hadoop-t1:~$ flume-ng version
Flume 1.4.0.2.0.11.0-1
Source code repository: https://git-wip-us.apache.org/repos/asf/flume.git
Revision: fcdc3d29a1f249bef653b10b149aea2bc5df892e
Compiled by jenkins on Wed Mar 12 05:11:30 PDT 2014
From source with checksum dea9ae30ce2c27486ae7c76ab7aba020
---