Hi Bojan,

Sorry for the delay in responding to this.

Your setup is certainly possible, using host headers or custom headers supplied by whatever is feeding the data into Flume.
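
As a minimal sketch of that idea (the agent name, port, and the
"logfile" header are assumptions for illustration), an interceptor
can stamp events with a host header, and the HDFS sink can
substitute header values into its path:

    # hypothetical agent "a1": one flow, fan-out via path escapes
    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    a1.sources.r1.type = avro
    a1.sources.r1.bind = 0.0.0.0
    a1.sources.r1.port = 4141
    a1.sources.r1.channels = c1

    # the host interceptor adds a "host" header to every event
    a1.sources.r1.interceptors = i1
    a1.sources.r1.interceptors.i1.type = host

    a1.channels.c1.type = memory

    # %{...} escapes pull header values into the path, giving
    # e.g. /logs/server0/logstat7/; the "logfile" header would
    # have to be set by whatever feeds the data in
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = /logs/%{host}/%{logfile}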

The issue is that when the HDFS sink has to write to 120 different files, each batch gets split up across those files, so the writes are not particularly efficient.

One approach is to just write everything to the same file and post-process it. Another is to group flows to sinks: an interceptor can set specific header(s), and a multiplexing selector can route events with those headers to a specific channel (a sketch follows). There are multiple strategies here, each with its own benefits and disadvantages, but at the end of the day, writing one big HDFS file is far more efficient than writing lots of small ones.
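
A rough sketch of the grouping idea, assuming a "logtype" header is
set upstream (all names here are illustrative): a multiplexing
selector routes events by header value to dedicated channels, each
drained by its own HDFS sink.

    # r1 must list every channel the selector can route to
    a1.sources.r1.channels = c1 c2 c3
    a1.sources.r1.selector.type = multiplexing
    a1.sources.r1.selector.header = logtype
    # "access" and "error" logs get dedicated channels/sinks,
    # everything else falls through to c1
    a1.sources.r1.selector.mapping.access = c2
    a1.sources.r1.selector.mapping.error = c3
    a1.sources.r1.selector.default = c1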

On 11/06/2013 06:39 PM, Bojan Kostić wrote:

It was late when I wrote the last mail, and my explanation was not clear.
I will illustrate:
20 servers, each with 60 different log files.
I was thinking that I could have this kind of structure on hdfs:
/logs/server0/logstat0.log
/logs/server0/logstat1.log
.
.
.
/logs/server20/logstat0.log
.
.
.

But from your info I see that I can't do that.
I could try to add a server id column to every file and then aggregate the files from all servers into one file per log type:
/logs/logstat0.log
/logs/logstat1.log
.
.
.

But then again I would need 60 sinks.

On Nov 6, 2013 2:02 AM, "Roshan Naik" <[email protected]> wrote:

    I assume you mean you have 120 source files to be streamed into
    HDFS. There is not a 1-1 correspondence between source files and
    destination HDFS files. If they are on the same host, you can
    have them all picked up through one source, one channel and one
    HDFS sink... winding up in a single HDFS file.

    If you have a config with multiple HDFS sinks (part of a single
    agent or spanning multiple agents), you want to ensure that each
    HDFS sink writes to a separate file in HDFS.
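
    A minimal sketch of that single flow (names and paths are made
    up, and the spooling directory source is just one way to pick
    the files up):

        agent.sources = src1
        agent.channels = ch1
        agent.sinks = snk1

        # watch a directory for completed log files
        agent.sources.src1.type = spooldir
        agent.sources.src1.spoolDir = /var/log/flume-spool
        agent.sources.src1.channels = ch1

        agent.channels.ch1.type = memory

        # a single sink, so everything lands in one HDFS file
        # at a time
        agent.sinks.snk1.type = hdfs
        agent.sinks.snk1.channel = ch1
        agent.sinks.snk1.hdfs.path = /flume/logs
        # with multiple HDFS sinks, give each a distinct path or
        # filePrefix so they never write to the same file
        agent.sinks.snk1.hdfs.filePrefix = agent1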


    On Tue, Nov 5, 2013 at 4:41 PM, Bojan Kostić
    <[email protected]> wrote:

        Hello Roshan,

        Thanks for the response.
        But I am now confused. If I have 120 files, do I need to
        configure 120 sources/channels/sinks separately? Or have I
        missed something in the docs?
        Maybe I should use a fan-out flow? But then again I would
        have to set 120 parameters.

        Best regards.

        On Nov 5, 2013 8:47 PM, "Roshan Naik" <[email protected]> wrote:

            Yes, to avoid them clobbering each other's writes.


            On Tue, Nov 5, 2013 at 4:34 AM, Bojan Kostić
            <[email protected]> wrote:

                Sorry for the late response; I lost this email somehow.

                Thanks for the read; it is a nice start even though
                it is old, and the numbers are really promising.

                I'm testing the memory channel; there are about 20
                data sources (log servers) with 60 different files
                each.

                My RPC client app is basic, like in the examples,
                but it has load balancing across two Flume agents
                which write the data to HDFS.
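
                (For reference, the load balancing is just the
                stock SDK client built via
                RpcClientFactory.getInstance(props), configured
                roughly like this; the host names are made up:)

                    client.type = default_loadbalance
                    hosts = h1 h2
                    hosts.h1 = flume-agent1.example.org:41414
                    hosts.h2 = flume-agent2.example.org:41414
                    # round_robin or random
                    host-selector = round_robin
                    backoff = true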

                I think I read somewhere that you should have one sink
                per file. Is that true?

                Best regards, and sorry again for the late response.

                On Oct 22, 2013 8:50 AM, "Juhani Connolly"
                <[email protected]> wrote:

                    Hi Bojan,

                    This is pretty old, but Mike did some testing on
                    performance about a year and a half ago:

                    https://cwiki.apache.org/confluence/display/FLUME/Flume+NG+Syslog+Performance+Test+2012-04-30

                    He was getting a max of 70k events/sec on a single
                    machine.

                    Thing is, this is a result of a huge number of
                    variables:
                    - Parallelization of flows allows better
                    parallel processing.
                    - Use of the memory channel as opposed to a
                    slower, durable channel (like the file channel).
                    - Possibly the source; I have no idea how you
                    wrote your app.
                    - Batching of events is important (see the
                    sketch after this list). Also, are all events
                    written to one file, or are they split over
                    many? Every file is processed separately.
                    - Network congestion and your HDFS setup.
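
                    As a rough illustration (the numbers are made
                    up and need tuning for your hardware), batching
                    and channel sizing live in settings like these:

                        agent.channels.mc.type = memory
                        # events the channel can hold, and the max
                        # taken per transaction
                        agent.channels.mc.capacity = 100000
                        agent.channels.mc.transactionCapacity = 10000

                        agent.sinks.h1.type = hdfs
                        # events written per flush to HDFS
                        agent.sinks.h1.hdfs.batchSize = 10000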

                    Reaching 100k events per second is definitely
                    possible. The resources you need for it will
                    vary significantly depending on your setup. The
                    more HA-type features you use, the slower
                    delivery is likely to become. On the flip side,
                    allowing fairly lax conditions with a small
                    potential for data loss (on a crash, for
                    example, memory channel contents are gone) will
                    allow close to 100k even on a single machine.

                    On 10/14/2013 09:00 PM, Bojan Kostić wrote:

                        Hi, this is my first post here, but I have
                        been playing with Flume for some time now.
                        My question is: how well does Flume scale?
                        Can Flume ingest 100k+ events per second?
                        Has anyone tried something like this?

                        I created a simple test, and the results
                        are really slow.
                        I wrote a simple app with an RPC client
                        with failover, using the Flume SDK, which
                        reads a dummy log file.
                        In the end I have two Flume agents which
                        write to HDFS.
                        rollInterval = 60
                        And in HDFS I get files of ~12MB.
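
                        (For context, these are the sink knobs that
                        control rolling; apart from rollInterval =
                        60 the values are illustrative:)

                            agent.sinks.k1.type = hdfs
                            # roll every 60 seconds (0 disables)
                            agent.sinks.k1.hdfs.rollInterval = 60
                            # size-based roll in bytes (0 disables)
                            agent.sinks.k1.hdfs.rollSize = 0
                            # count-based roll in events (0 disables)
                            agent.sinks.k1.hdfs.rollCount = 0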

                        Do I need to use some complex three-tier
                        topology?
                        How many Flume agents should write to HDFS?

                        Best regards.








