Re: Distributed Deployment Questions

Jay Stricks Sat, 03 Mar 2012 13:37:05 -0800

Really, really appreciate the help, Alex.

1.  Max open files for root is at 65000 for all five collectors, but I'm
not sure what you want me to check with respect to network latency.  I
actually don't have any partitions marked as swap on these machines, as far
as I can tell with a 'swapon -s' command. So I'll look into making that
possible. We do have 7.5 GB of RAM, but I'm not providing any UOPTS like -Xms
or -Xmx when I start Flume on the collectors.


The error that I indicated in the first post was occurring for only one of
the three flows going through the collectors. The translated configs for
that flow are:

*Collector*
rpcSource( 35853 )
collectorSink( "s3n://flume-data/events/month=%Y-%m/dt=%Y-%m-%d/hr=%k",
"ngn-app-events-" )

*Agent Type 1* (8 servers)
Source: tail( "/var/log/httpd/access_log", "true" )
Source: syslogUdp( 5141 )
Source: syslogUdp( 5140 )

*Agent Type 2* (40 servers)
Source: tail( "/var/log/httpd/access_log", "true" )
Source: syslogUdp( 5140 )

Sinks for all of these are autoE2EChain, each with a different value()
decorator. Would it help to spread these over different flows?
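
To make the bucketing in the collectorSink path above concrete, here is a
minimal Python sketch of how the %Y, %m, %d and %k escapes expand for one
event time, and how the percent-encoded key from the 404 warning quoted
below decodes to the same layout. The helper expand_bucket_path is
hypothetical (it is not Flume code), and rendering %k as an unpadded
24-hour value is an assumption.

```python
from datetime import datetime
from urllib.parse import unquote

# Path template copied from the collectorSink config above.
TEMPLATE = "s3n://flume-data/events/month=%Y-%m/dt=%Y-%m-%d/hr=%k"

# Object key copied from the RestS3Service warning in the first post.
WARN_KEY = (
    "/events%2Fmonth%3D2012-03%2Fdt%3D2012-03-01%2Fhr%3D16%2F"
    "ngn-app-events-20120302-154220414-0500.80204348000864.00000026.tmp"
)


def expand_bucket_path(template: str, ts: datetime) -> str:
    """Expand the %Y, %m, %d and %k escapes for one event timestamp.

    Assumption: %k is the 24-hour clock hour without zero padding.
    This is an illustration only, not how Flume implements it.
    """
    replacements = {
        "%Y": f"{ts.year:04d}",
        "%m": f"{ts.month:02d}",
        "%d": f"{ts.day:02d}",
        "%k": str(ts.hour),
    }
    for escape, value in replacements.items():
        template = template.replace(escape, value)
    return template


# Directory for an event stamped 2012-03-01 16:00.
bucket = expand_bucket_path(TEMPLATE, datetime(2012, 3, 1, 16, 0))

# Human-readable form of the key in the warning.
decoded_key = unquote(WARN_KEY)

print(bucket)
print(decoded_key)
```

The decoded key follows the same month=/dt=/hr= layout and ends in .tmp.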

3. Do I tune the max open connections setting in flume-site.xml? I assume I
should change maxClientCnxns, right? I wonder if globalOutstandingLimit
would also help. (Found these at
http://zookeeper.apache.org/doc/r3.3.1/zookeeperAdmin.html#sc_minimumConfiguration
).
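
The staggered start Alex describes in the quoted reply below (deploy one
agent, sleep 3, then the next) could be sketched roughly like this. The
function staggered_submit and its submit callback are hypothetical
stand-ins for whatever actually pushes a config to the master; nothing
here is a Flume API, and the 3 second default simply mirrors the sleep
mentioned in that reply.

```python
import time
from typing import Callable, Iterable, List


def staggered_submit(
    agents: Iterable[str],
    submit: Callable[[str], None],
    delay_seconds: float = 3.0,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Submit one agent config at a time, pausing between submissions.

    `submit` is a hypothetical callback that configures a single agent.
    `sleep` is injectable so the pacing can be checked without waiting.
    Returns the agents that were submitted, in order.
    """
    submitted: List[str] = []
    for index, agent in enumerate(agents):
        if index > 0:
            # Pause before every agent except the first one.
            sleep(delay_seconds)
        submit(agent)
        submitted.append(agent)
    return submitted
```

This only helps if the submissions are driven from one place. With agents
that configure themselves at launch, the same pacing would have to come
from a queue or a per-agent delay instead.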

Thanks again for the advice. I'm sure working through this is helping other
people too!

Jay



On Sat, Mar 3, 2012 at 2:36 AM, alo alt <wget.n...@googlemail.com> wrote:

> Hey Jay,
>
> 1. Please check max open files, network latency, and swap. An example of
> your sinks or flows would be useful.
>
> 2. Here it could be that S3 nodes fall behind and you're hitting different
> servers on S3
>
> 3. The Flume master uses ZooKeeper; there you can tune the max open
> connections. In fact, when you use that feature, deploy one agent, sleep
> 3, then the next, and so on. That's one of the reasons Flume has the 3 sec
> sleep timer in the sysinit restart scripts.
>
> best,
>  Alex
>
> --
> Alexander Lorenz
> http://mapredit.blogspot.com
>
> On Mar 2, 2012, at 11:45 PM, Jay Stricks wrote:
>
> > Hey folks,
> >
> > I'm looking for some advice on a couple of issues I'm having. My setup
> is Flume v.094--cdh3u2, single master, six collectors (three flows, all
> autoCollectorSource), ~80 agents (three flows, autoE2E).
> >
> > 1. I have begun to have collectors fail with "ERROR
> connector.DirectDriver: Exiting driver logicalNode <node_name> in error
> state ThriftEventSource | Collector because null", which looks very similar
> to the issue addressed in FLUME-757 (
> https://issues.apache.org/jira/browse/FLUME-757).  Any update/advice on
> how to address this? Is it an issue of limiting the size of the files being
> transmitted to the collectors, or the frequency of transmission? This never
> happened on 093, and it's a little concerning to see after upgrading.
> >
> > 2. I'm constantly getting "WARN httpclient.RestS3Service: Response
> '/events%2Fmonth%3D2012-03%2Fdt%3D2012-03-01%2Fhr%3D16%2Fngn-app-events-20120302-154220414-0500.80204348000864.00000026.tmp'
> - Unexpected response code 404, expected 200", even though the data is
> being written to S3. I know this has been brought up before, but is there
> any advice on when to determine if it's a valid error?
> >
> > 3. My agents are on machines that are launched and terminated somewhat
> frequently due to maintenance, etc.  I have the user data scripts set up so
> that each agent server, upon being launched, starts a Flume shell, connects
> to the master, and executes its own configuration commands.  Often, my
> master will fail when too many agent configurations are being submitted.
> The number of threads grows exponentially at these times, and then the
> master fails.
> I'm curious if anyone else experiences this over-concurrency problem, or
> how you would recommend avoiding it. Any ideas for how to have the master
> 'notice' a new agent and execute its configuration itself, which seems like
> it would be an effective rate limiter, so to speak?
> >
> > Thanks a ton for the help!
> >
> > Jay S.
> >
> >
>
>
