Got it, Ashish! Thanks for your support and suggestions.

Cheers,
JuanFra

On 05/09/2014 13:28, "Ashish" <[email protected]> wrote:
>
> On Fri, Sep 5, 2014 at 4:01 PM, JuanFra Rodriguez Cardoso <
> [email protected]> wrote:
>
>> Thanks, both of you!
>>
>> @Ashish, Javi's thoughts are right. My use case is focused on sources
>> for consuming SNMP traps. I came here from the already-open discussion
>> [1], hoping someone else was facing this problem.
>>
>> Your solution based on an async SNMP walker would help us scale to
>> thousands of agents, but it would reduce every scenario to the same
>> process:
>>
>> 1. Code a custom collector (async or not) that sends data to a Flume
>> spool dir.
>> 2. The agent's sources would consume data from that dir.
>>
> You might not need to code a custom collector; NMS systems do that
> already. So if you have an NMS system in place, maybe it can do this
> polling for you and dump records, which Flume can then pick up.
>
> If this is not an option, you need to write a custom collector,
> standalone or within a Flume Source.
>
> I went through the SNMP Source and have a suggestion: if PDU decoding
> can be avoided, it would save a lot of CPU at the collection tier. No
> action is being taken in the Source, so the raw PDU can be offloaded to
> the channel. I wrote an SNMP ping long back. The problem statement was
> similar: poll SNMP agents for specific OIDs. I didn't use the SNMP lib
> directly; I just used it to encode and decode packets, and managed the
> network layer myself.
>
>> Don't you think it would be more suitable to include an option in
>> flume.conf (path/to/list-of-thousands-sources), as Javi commented
>> above? This way, the agent's configuration would be easier to manage.
>>
> I think I agreed to this option :)
>
>> [1] https://issues.apache.org/jira/browse/FLUME-2039
>>
>> Regards,
>> ---
>> JuanFra Rodriguez Cardoso
>>
>> 2014-09-05 10:20 GMT+02:00 Ashish <[email protected]>:
>>
>>> Javi,
>>>
>>> I have an NMS background, so I understand your concern.
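[Editor's note] The collector pattern discussed above (poll agents in parallel, offload raw results without decoding, and hand files to Flume) could be sketched as below. This is a minimal, hypothetical sketch: `poll_agent` is a stand-in for a real SNMP GET (e.g. via a library such as pysnmp); only the parallel-polling and spool-file handling are shown.

```python
# Sketch: parallel SNMP poller writing batch files into a Flume spool dir.
# poll_agent() is a placeholder for a real SNMP GET; it returns a dummy
# value so the file-handling logic is runnable on its own.
import concurrent.futures
import os
import tempfile

SPOOL_DIR = tempfile.mkdtemp()  # in production: the dir watched by Spool Dir Source

def poll_agent(host, oid):
    """Placeholder for a real SNMP GET; returns (host, oid, value)."""
    return host, oid, "dummy-value"

def write_batch(results):
    # Spool Dir Source expects immutable files: write under a temporary
    # name, then atomically rename into place once the file is complete.
    fd, tmp_path = tempfile.mkstemp(dir=SPOOL_DIR, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        for host, oid, value in results:
            f.write(f"{host}\t{oid}\t{value}\n")
    final_path = tmp_path[:-4]  # drop the ".tmp" suffix
    os.rename(tmp_path, final_path)
    return final_path

def poll_all(hosts, oid, workers=32):
    # Poll all hosts in parallel; results come back in input order.
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda h: poll_agent(h, oid), hosts))
    return write_batch(results)

if __name__ == "__main__":
    print(poll_all(["10.0.0.%d" % i for i in range(1, 6)], "1.3.6.1.2.1.1.3.0"))
```

An async walker (e.g. asyncio-based) would follow the same shape; the key point from the discussion is that the poller never decodes the responses, it only moves them into files.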
>>>
>>> Answers inline.
>>>
>>> On Fri, Sep 5, 2014 at 12:44 PM, Javi Roman <[email protected]>
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> The scenario Juanfra is describing is related to the question I
>>>> asked a few days ago [1].
>>>>
>>> I am not sure about this; Juanfra is a better person to comment.
>>>
>>>> You cannot install Flume agents on the SNMP managed devices, and you
>>>> cannot modify any software on the SNMP managed device to use the
>>>> Flume Client SDK (if I understand your idea correctly, Ashish). There
>>>> are two ways to collect SNMP data from SNMP devices using Flume
>>>> (IMHO):
>>>>
>>> Agreed.
>>>
>>>> 1. Create a custom application that launches the SNMP queries to the
>>>> thousands of devices and logs the answers to a file. In this case
>>>> Flume can tail this file with the "exec source" core plugin.
>>>>
>>> IMHO, this would be the preferred way for me. Create a simple SNMP
>>> walker that polls nodes in parallel and writes responses to files.
>>> Use Flume's Spool Dir Source, and the rest of the flow remains the
>>> same. I would avoid decoding events unless they need to be
>>> interpreted in the Flume chain.
>>>
>>>> 2. Use a flume-snmp-source plugin (similar to [2]); in other words,
>>>> shift the SNMP query custom application into a specialized Flume
>>>> plugin.
>>>>
>>> Possible; it's like running solution #1 inside Flume.
>>> For both, you would need to track which agents have been sent
>>> requests. An async SNMP walker would help you scale to thousands of
>>> agents.
>>>
>>>> Juanfra is talking about a scenario like the second point. In that
>>>> case you have to handle a huge Flume configuration file, with an
>>>> entry for each managed device to query. For this situation I guess
>>>> there are two possible solutions:
>>>>
>>>> 1.
>>>> The flume-snmp-source plugin can use a file with a list of hosts to
>>>> query as a parameter:
>>>>
>>>> agent.sources.source1.host = /path/to/list-of-host-file
>>>>
>>>> However, I guess this breaks the philosophy, or the simplicity, of
>>>> other core Flume plugins.
>>>>
>>>> 2. Create a little program to fill the Flume configuration file from
>>>> a template, or something similar.
>>>>
>>> I would go with #1; it keeps the Flume config file simple. We still
>>> need to distribute the host-list file, but on a small scale.
>>>
>>>> Any other ideas? I guess this is a good discussion about a real-world
>>>> use case.
>>>>
>>>> [1] http://mail-archives.apache.org/mod_mbox/flume-user/201409.mbox/browser
>>>> [2] https://github.com/javiroman/flume-snmp-source
>>>>
>>>> On Fri, Sep 5, 2014 at 4:56 AM, Ashish <[email protected]> wrote:
>>>> >
>>>> > Have a look at the Flume Client SDK. One simple way would be to use
>>>> > Flume client implementations to send Events to Flume Sources; this
>>>> > would significantly reduce the number of Sources you need to manage.
>>>> >
>>>> > HTH!
>>>> >
>>>> > On Thu, Sep 4, 2014 at 9:40 PM, JuanFra Rodriguez Cardoso <
>>>> > [email protected]> wrote:
>>>> >>
>>>> >> Thanks, Andrew, for your quick response.
>>>> >>
>>>> >> My sources (server PDUs) can't put events into an aggregation
>>>> >> point. For this reason I'm following a PollingSource schema, where
>>>> >> my agent needs to be configured with thousands of sources. Any
>>>> >> clues for use cases where data is injected via a polling process?
>>>> >>
>>>> >> Regards!
>>>> >> ---
>>>> >> JuanFra Rodriguez Cardoso
>>>> >>
>>>> >> 2014-09-04 17:41 GMT+02:00 Andrew Ehrlich <[email protected]>:
>>>> >>>
>>>> >>> One way to avoid managing so many sources would be to have an
>>>> >>> aggregation point between the data generators and the Flume
>>>> >>> sources. For example, maybe you could have the data generators
>>>> >>> put events into a message queue (or queues), then have Flume
>>>> >>> consume from there?
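[Editor's note] Solution #1 above, a single source pointed at a host-list file, might look something like the following flume.conf fragment. Note this is purely illustrative: the `hostsFile` property and the `org.example.flume.source.SNMPSource` class are hypothetical names for the option being proposed in this thread; they do not exist in stock Flume.

```properties
# Hypothetical config for the proposed host-list option.
agent.sources = snmp1
agent.channels = ch1
agent.sinks = hdfs1

# One source covers all devices listed in the file, instead of one
# source stanza per device.
agent.sources.snmp1.type = org.example.flume.source.SNMPSource
agent.sources.snmp1.hostsFile = /path/to/list-of-host-file
agent.sources.snmp1.channels = ch1

agent.channels.ch1.type = memory
agent.sinks.hdfs1.type = hdfs
agent.sinks.hdfs1.channel = ch1
agent.sinks.hdfs1.hdfs.path = /flume/snmp/events
```

With this shape, adding or removing a device means editing the host-list file, not the Flume configuration itself.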
>>>> >>>
>>>> >>> Andrew
>>>> >>>
>>>> >>> ---- On Thu, 04 Sep 2014 08:29:04 -0700, JuanFra Rodriguez Cardoso <
>>>> >>> [email protected]> wrote ----
>>>> >>>
>>>> >>> Hi all:
>>>> >>>
>>>> >>> Considering an environment with thousands of sources, what are
>>>> >>> the best practices for managing the agent configuration
>>>> >>> (flume.conf)? Is it recommended to create a multi-layer topology
>>>> >>> where each agent takes control of a subset of sources?
>>>> >>>
>>>> >>> In that case, a config management server (such as Puppet) would
>>>> >>> be responsible for editing flume.conf with 'agent.sources'
>>>> >>> parameters from source1 to source3000 (assuming we have 3000
>>>> >>> source machines).
>>>> >>>
>>>> >>> Are my thoughts aligned with such large-scale data ingest
>>>> >>> scenarios?
>>>> >>>
>>>> >>> Thanks a lot!
>>>> >>> ---
>>>> >>> JuanFra
>>>> >
>>>> > --
>>>> > thanks
>>>> > ashish
>>>> >
>>>> > Blog: http://www.ashishpaliwal.com/blog
>>>> > My Photo Galleries: http://www.pbase.com/ashishpaliwal
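[Editor's note] The "little program to fill the flume configuration file from a template" mentioned above (solution #2), whether run by hand or by a config management tool such as Puppet, could be sketched as below. The source type and property names are illustrative, not real Flume ones.

```python
# Sketch: generate per-host source stanzas for flume.conf from a host
# list, instead of hand-maintaining thousands of entries.

def render_flume_conf(agent, hosts):
    """Render source entries (source1..sourceN) for the given hosts."""
    names = ["source%d" % (i + 1) for i in range(len(hosts))]
    lines = ["%s.sources = %s" % (agent, " ".join(names))]
    for name, host in zip(names, hosts):
        lines.append("%s.sources.%s.type = org.example.flume.source.SNMPSource" % (agent, name))
        lines.append("%s.sources.%s.host = %s" % (agent, name, host))
        lines.append("%s.sources.%s.channels = ch1" % (agent, name))
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    # For 3000 devices, hosts would be read from a file or an inventory.
    print(render_flume_conf("agent", ["10.0.0.1", "10.0.0.2"]))
```

The trade-off raised in the thread still applies: this keeps stock Flume sources unchanged, but the generated flume.conf grows with the device count, whereas the host-list option (#1) keeps flume.conf constant-size.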
