Why do you need Storm? Are you doing any real-time processing? If not, IMHO, avoid Storm.

You could use something like this: Socket -> Load-Balanced RPC Client -> Flume topology with HA. What application-level protocol are you using at the socket level?

--
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal
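For reference, a minimal sketch of the load-balancing RPC client that ashish's Socket -> Load-Balanced RPC Client -> Flume pipeline (and the Developer Guide link further down the thread) refers to. The agent host names, port, and payload below are placeholders, not values from this thread:

    import java.nio.charset.StandardCharsets;
    import java.util.Properties;

    import org.apache.flume.Event;
    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;
    import org.apache.flume.event.EventBuilder;

    public class LoadBalancedForwarder {
        public static void main(String[] args) throws EventDeliveryException {
            // Two interchangeable agents, each running an Avro source
            // (host names and port are placeholders).
            Properties props = new Properties();
            props.put("client.type", "default_loadbalance");
            props.put("hosts", "h1 h2");
            props.put("hosts.h1", "agent1.example.com:41414");
            props.put("hosts.h2", "agent2.example.com:41414");
            props.put("host-selector", "round_robin"); // or "random"
            props.put("backoff", "true");              // temporarily skip hosts that fail

            RpcClient client = RpcClientFactory.getInstance(props);
            try {
                // In Chen's case the body would be a record read off a socket server.
                Event event = EventBuilder.withBody("sample payload", StandardCharsets.UTF_8);
                client.append(event); // returns once an agent has accepted the event
            } finally {
                client.close();
            }
        }
    }

Because the client load-balances and fails over on its own, the agents behind it stay interchangeable, which is what makes the HA topology possible without source-side failover.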
On Fri, Jan 10, 2014 at 10:50 AM, Chen Wang <[email protected]> wrote:

Jeff, Joao,
Thanks for the pointers! I think I am getting close here:
1. Set up a cluster of Flume agents with redundancy, with an Avro source and an HDFS sink.
2. Use Storm (not strictly necessary) to read from our socket servers; then, in the bolt, use the Flume load-balancing RPC client to send each event to the agents set up in step 1.

That way I get the benefits of both Storm and Flume. Does this setup look right to you?
Thank you very much,
Chen


On Thu, Jan 9, 2014 at 8:58 PM, Joao Salcedo <[email protected]> wrote:

Hi Chen,

Maybe it would be worth checking this:
http://flume.apache.org/FlumeDeveloperGuide.html#loadbalancing-rpc-client

Regards,

Joao


On Fri, Jan 10, 2014 at 3:50 PM, Jeff Lord <[email protected]> wrote:

Have you taken a look at the load-balancing RPC client?


On Thu, Jan 9, 2014 at 8:43 PM, Chen Wang <[email protected]> wrote:

Jeff,
I read that presentation at the beginning, but it didn't answer my use case. To simplify: I have one data source (composed of 5 socket servers), and I am looking for a fault-tolerant Flume deployment that reads from this single data source and sinks to HDFS, such that when one node dies, another Flume node picks up and continues.
Thanks,
Chen


On Thu, Jan 9, 2014 at 7:49 PM, Jeff Lord <[email protected]> wrote:

Chen,

Have you taken a look at this presentation on Planning and Deploying Flume from ApacheCon?

http://archive.apachecon.com/na2013/presentations/27-Wednesday/Big_Data/11:45-Mastering_Sqoop_for_Data_Transfer_for_Big_Data-Arvind_Prabhakar/Arvind%20Prabhakar%20-%20Planning%20and%20Deploying%20Apache%20Flume.pdf

It may have the answers you need.

Best,

Jeff


On Thu, Jan 9, 2014 at 7:24 PM, Chen Wang <[email protected]> wrote:

Thanks Saurabh.
If that is the case, I am actually thinking about using a Storm spout to talk to our socket servers, so that the Storm cluster takes care of the socket-reading part. Then, on each Storm node, start a Flume agent listening on an RPC port and writing to HDFS (with failover). The Storm bolt would simply send the data to the RPC port so that Flume can pick it up.
What do you think of this setup? It covers failover on both the source (by Storm) and the sink (by Flume), but it looks a little complicated to me.
Chen


On Thu, Jan 9, 2014 at 7:18 PM, Saurabh B <[email protected]> wrote:

Hi Chen,

I don't think Flume has a way to configure multiple sources pointing to the same data source. Of course you can do that, but you will end up with duplicate data. Flume offers failover at the sink level.


On Thu, Jan 9, 2014 at 6:56 PM, Chen Wang <[email protected]> wrote:

OK, so after more researching :) it seems that what I need is failover for the agent source (not failover for the sink): if one agent dies, another agent of the same kind should start running. Does Flume support this scenario?
Thanks,
Chen
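The sink-level failover Saurabh mentions is configured through a sink group with a failover processor. A minimal sketch, with the agent name a1 and sink names k1/k2 invented for illustration (the sinks themselves would be defined elsewhere in the same file):

    # Two sinks in one agent; events drain through the higher-priority
    # sink (k1) and fail over to k2 only while k1 is down.
    a1.sinkgroups = g1
    a1.sinkgroups.g1.sinks = k1 k2
    a1.sinkgroups.g1.processor.type = failover
    a1.sinkgroups.g1.processor.priority.k1 = 10
    a1.sinkgroups.g1.processor.priority.k2 = 5
    # Cap, in milliseconds, on how long a failed sink stays blacklisted.
    a1.sinkgroups.g1.processor.maxpenalty = 10000

Note this protects the write path only; it does not give Chen the source-side failover he is asking about, which is what pushes the thread toward a load-balancing client in front of interchangeable agents.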
On Thu, Jan 9, 2014 at 3:12 PM, Chen Wang <[email protected]> wrote:

After reading more docs, it seems that if I want to achieve my goal, I have to do the following:
1. Run one agent with the custom source on a single node. This agent reads from those 5 socket servers and writes to some kind of sink (maybe another socket?).
2. On one or more other machines, set up collectors that read from the agent's sink in step 1 and sink to HDFS.
3. Have a master node managing the nodes in steps 1 and 2.

But this seems overkill in my case: in step 1 I can already sink to HDFS. Since data arrives at the socket servers much faster than the translation part can process it, I want to be able to add more nodes later to do the translation job. So what is the correct setup?
Thanks,
Chen


On Thu, Jan 9, 2014 at 2:38 PM, Chen Wang <[email protected]> wrote:

Guys,
In my environment, the client is 5 socket servers, so I wrote a custom source spawning 5 threads, each reading from one of them indefinitely; the sink is HDFS (a Hive table). This works fine running a flume-ng agent.

But how can I deploy this in distributed mode (as a cluster)? I am confused about the three tiers (agent, collector, storage) mentioned in the docs. Do they apply to my case? How can I separate my agent/collector/storage? Apparently I can only have one agent running, since multiple agents would pull duplicates from the socket servers. But I want another agent to take over if one dies, and I would also like horizontal scalability for writing to HDFS. How can I achieve all this?

Thank you very much for your advice.
Chen
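The setup the thread converges on replaces this single custom-source agent with a tier of identical agents, each exposing an Avro source and writing to HDFS, while the socket-reading client (a Storm bolt or a plain process) load-balances across them. A sketch of what one such agent's configuration might look like; every name, port, and path here is invented for illustration:

    # One of N interchangeable agents. The upstream RPC client
    # load-balances across their Avro sources, so any single agent
    # can die and the others keep receiving events.
    a1.sources  = r1
    a1.channels = c1
    a1.sinks    = k1

    a1.sources.r1.type = avro
    a1.sources.r1.bind = 0.0.0.0
    a1.sources.r1.port = 41414
    a1.sources.r1.channels = c1

    # A file channel buffers events durably until the HDFS sink drains them.
    a1.channels.c1.type = file
    a1.channels.c1.checkpointDir = /var/flume/checkpoint
    a1.channels.c1.dataDirs = /var/flume/data

    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
    a1.sinks.k1.hdfs.useLocalTimeStamp = true

Scaling out then amounts to starting another identical agent and adding it to the hosts list of the RPC client sketched near the top of the thread.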
