Why do you need Storm? Are you doing any real-time processing? If not, IMHO, avoid Storm.

You could use something like this: Socket -> Load-Balanced RPC Client -> Flume topology with HA. What application-level protocol are you using at the socket level?

--
thanks
ashish

Blog: http://www.ashishpaliwal.com/blog
My Photo Galleries: http://www.pbase.com/ashishpaliwal
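For reference, a minimal sketch of the load-balancing RPC client that ashish's Socket -> Load-Balanced RPC Client -> Flume pipeline (and the Developer Guide link further down the thread) refers to. The agent host names, port, and payload below are placeholders, not values from this thread:

    import java.nio.charset.StandardCharsets;
    import java.util.Properties;

    import org.apache.flume.Event;
    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;
    import org.apache.flume.event.EventBuilder;

    public class LoadBalancedForwarder {
        public static void main(String[] args) throws EventDeliveryException {
            // Two interchangeable agents, each running an Avro source
            // (host names and port are placeholders).
            Properties props = new Properties();
            props.put("client.type", "default_loadbalance");
            props.put("hosts", "h1 h2");
            props.put("hosts.h1", "agent1.example.com:41414");
            props.put("hosts.h2", "agent2.example.com:41414");
            props.put("host-selector", "round_robin"); // or "random"
            props.put("backoff", "true");              // temporarily skip hosts that fail

            RpcClient client = RpcClientFactory.getInstance(props);
            try {
                // In Chen's case the body would be a record read off a socket server.
                Event event = EventBuilder.withBody("sample payload", StandardCharsets.UTF_8);
                client.append(event); // returns once an agent has accepted the event
            } finally {
                client.close();
            }
        }
    }

Because the client load-balances and fails over on its own, the agents behind it stay interchangeable, which is what makes the HA topology possible without source-side failover.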
On Fri, Jan 10, 2014 at 10:50 AM, Chen Wang <[email protected]> wrote:

Jeff, Joao,
Thanks for the pointers! I think I am getting close here:
1. Set up a cluster of Flume agents with redundancy, with an Avro source and an HDFS sink.
2. Use Storm (not strictly necessary) to read from our socket servers; then, in the bolt, use the Flume load-balancing RPC client to send each event to the agents set up in step 1.

That way I get the benefits of both Storm and Flume. Does this setup look right to you?
Thank you very much,
Chen


On Thu, Jan 9, 2014 at 8:58 PM, Joao Salcedo <[email protected]> wrote:

Hi Chen,

Maybe it would be worth checking this:
http://flume.apache.org/FlumeDeveloperGuide.html#loadbalancing-rpc-client

Regards,

Joao


On Fri, Jan 10, 2014 at 3:50 PM, Jeff Lord <[email protected]> wrote:

Have you taken a look at the load-balancing RPC client?


On Thu, Jan 9, 2014 at 8:43 PM, Chen Wang <[email protected]> wrote:

Jeff,
I read that presentation at the beginning, but it didn't answer my use case. To simplify: I have one data source (composed of 5 socket servers), and I am looking for a fault-tolerant Flume deployment that reads from this single data source and sinks to HDFS, such that when one node dies, another Flume node picks up and continues.
Thanks,
Chen


On Thu, Jan 9, 2014 at 7:49 PM, Jeff Lord <[email protected]> wrote:

Chen,

Have you taken a look at this presentation on Planning and Deploying Flume from ApacheCon?

http://archive.apachecon.com/na2013/presentations/27-Wednesday/Big_Data/11:45-Mastering_Sqoop_for_Data_Transfer_for_Big_Data-Arvind_Prabhakar/Arvind%20Prabhakar%20-%20Planning%20and%20Deploying%20Apache%20Flume.pdf

It may have the answers you need.

Best,

Jeff


On Thu, Jan 9, 2014 at 7:24 PM, Chen Wang <[email protected]> wrote:

Thanks Saurabh.
If that is the case, I am actually thinking about using a Storm spout to talk to our socket servers, so that the Storm cluster takes care of the socket-reading part. Then, on each Storm node, start a Flume agent listening on an RPC port and writing to HDFS (with failover). The Storm bolt would simply send the data to the RPC port so that Flume can pick it up.
What do you think of this setup? It covers failover on both the source (by Storm) and the sink (by Flume), but it looks a little complicated to me.
Chen


On Thu, Jan 9, 2014 at 7:18 PM, Saurabh B <[email protected]> wrote:

Hi Chen,

I don't think Flume has a way to configure multiple sources pointing to the same data source. Of course you can do that, but you will end up with duplicate data. Flume offers failover at the sink level.


On Thu, Jan 9, 2014 at 6:56 PM, Chen Wang <[email protected]> wrote:

OK, so after more researching :) it seems that what I need is failover for the agent source (not failover for the sink): if one agent dies, another agent of the same kind should start running. Does Flume support this scenario?
Thanks,
Chen
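The sink-level failover Saurabh mentions is configured through a sink group with a failover processor. A minimal sketch, with the agent name a1 and sink names k1/k2 invented for illustration (the sinks themselves would be defined elsewhere in the same file):

    # Two sinks in one agent; events drain through the higher-priority
    # sink (k1) and fail over to k2 only while k1 is down.
    a1.sinkgroups = g1
    a1.sinkgroups.g1.sinks = k1 k2
    a1.sinkgroups.g1.processor.type = failover
    a1.sinkgroups.g1.processor.priority.k1 = 10
    a1.sinkgroups.g1.processor.priority.k2 = 5
    # Cap, in milliseconds, on how long a failed sink stays blacklisted.
    a1.sinkgroups.g1.processor.maxpenalty = 10000

Note this protects the write path only; it does not give Chen the source-side failover he is asking about, which is what pushes the thread toward a load-balancing client in front of interchangeable agents.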
On Thu, Jan 9, 2014 at 3:12 PM, Chen Wang <[email protected]> wrote:

After reading more docs, it seems that if I want to achieve my goal, I have to do the following:
1. Run one agent with the custom source on a single node. This agent reads from those 5 socket servers and writes to some kind of sink (maybe another socket?).
2. On one or more other machines, set up collectors that read from the agent's sink in step 1 and sink to HDFS.
3. Have a master node managing the nodes in steps 1 and 2.

But this seems overkill in my case: in step 1 I can already sink to HDFS. Since data arrives at the socket servers much faster than the translation part can process it, I want to be able to add more nodes later to do the translation job. So what is the correct setup?
Thanks,
Chen


On Thu, Jan 9, 2014 at 2:38 PM, Chen Wang <[email protected]> wrote:

Guys,
In my environment, the client is 5 socket servers, so I wrote a custom source spawning 5 threads, each reading from one of them indefinitely; the sink is HDFS (a Hive table). This works fine running a flume-ng agent.

But how can I deploy this in distributed mode (as a cluster)? I am confused about the three tiers (agent, collector, storage) mentioned in the docs. Do they apply to my case? How can I separate my agent/collector/storage? Apparently I can only have one agent running, since multiple agents would pull duplicates from the socket servers. But I want another agent to take over if one dies, and I would also like horizontal scalability for writing to HDFS. How can I achieve all this?

Thank you very much for your advice.
Chen
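The setup the thread converges on replaces this single custom-source agent with a tier of identical agents, each exposing an Avro source and writing to HDFS, while the socket-reading client (a Storm bolt or a plain process) load-balances across them. A sketch of what one such agent's configuration might look like; every name, port, and path here is invented for illustration:

    # One of N interchangeable agents. The upstream RPC client
    # load-balances across their Avro sources, so any single agent
    # can die and the others keep receiving events.
    a1.sources  = r1
    a1.channels = c1
    a1.sinks    = k1

    a1.sources.r1.type = avro
    a1.sources.r1.bind = 0.0.0.0
    a1.sources.r1.port = 41414
    a1.sources.r1.channels = c1

    # A file channel buffers events durably until the HDFS sink drains them.
    a1.channels.c1.type = file
    a1.channels.c1.checkpointDir = /var/flume/checkpoint
    a1.channels.c1.dataDirs = /var/flume/data

    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
    a1.sinks.k1.hdfs.useLocalTimeStamp = true

Scaling out then amounts to starting another identical agent and adding it to the hosts list of the RPC client sketched near the top of the thread.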
