Ashish,
Interestingly enough, I was initially doing 1 too, and had a working version. But I finally gave it up, because in my bolt I have to flush to HDFS either when the data reaches a certain size or when a timer fires, which is exactly what Flume offers out of the box. It also adds the complexity of grouping entries within the same partition, which is a piece of cake with Flume.

Thank you so much for everyone's input. It helped me a lot!
Chen
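The size- and time-based flushing described above corresponds to the HDFS sink's roll settings. A minimal sketch of the relevant properties; the agent/channel/sink names (a1, c1, k1), the path, and the thresholds are illustrative, not from this thread:

    # Hypothetical agent a1, channel c1, HDFS sink k1.
    a1.sinks.k1.type = hdfs
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hdfs.path = hdfs://namenode/flume/events
    # Roll the open file at ~128 MB or after 10 minutes, whichever
    # comes first. hdfs.rollCount defaults to 10 events, so it is
    # explicitly disabled here.
    a1.sinks.k1.hdfs.rollSize = 134217728
    a1.sinks.k1.hdfs.rollInterval = 600
    a1.sinks.k1.hdfs.rollCount = 0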
On Thu, Jan 9, 2014 at 10:00 PM, Ashish <[email protected]> wrote:

> Got it!
>
> My first reaction was to use an HDFS bolt to write data directly to HDFS,
> but I couldn't find an implementation of one. My knowledge of Storm is
> limited. If the data is already flowing through Storm, you have two options:
> 1. Write a bolt to dump data to HDFS.
> 2. Write a Flume bolt using the RPC client, as recommended in this thread,
> and reuse Flume's capabilities.
>
> If you already have a Flume installation running, #2 is the quickest way to
> get going. Even otherwise, installing and running Flume is like a walk in
> the park :)
>
> You can also follow the related discussion in
> https://issues.apache.org/jira/browse/FLUME-1286. There is some good info
> in that JIRA.
>
> thanks
> ashish
>
>
> On Fri, Jan 10, 2014 at 11:08 AM, Chen Wang <[email protected]> wrote:
>
>> Ashish,
>> Since we already use Storm for other real-time processing, I want to
>> reuse it here. For me, the biggest advantage of using Storm in this case
>> is that I can use a Storm spout to read from our socket servers
>> continuously, and the Storm framework ensures it never stops. Meanwhile,
>> I can also easily filter and translate the data in a bolt before sending
>> it to Flume. For this data stream, my first step is just to get it into
>> HDFS, but I will add real-time processing soon.
>> Does that make sense to you?
>> Thanks,
>> Chen
>>
>>
>> On Thu, Jan 9, 2014 at 9:29 PM, Ashish <[email protected]> wrote:
>>
>>> Why do you need Storm? Are you doing any real-time processing? If not,
>>> IMHO, avoid Storm.
>>>
>>> You can use something like this:
>>>
>>> Socket -> Load Balanced RPC Client -> Flume Topology with HA
>>>
>>> What application-level protocol are you using at the socket level?
>>>
>>>
>>> On Fri, Jan 10, 2014 at 10:50 AM, Chen Wang <[email protected]> wrote:
>>>
>>>> Jeff, Joao,
>>>> Thanks for the pointers!
>>>> I think I am getting close here:
>>>> 1. Set up a cluster of Flume agents with redundancy, using an Avro
>>>> source and an HDFS sink.
>>>> 2. Use Storm (not strictly necessary) to read from our socket servers;
>>>> then, in the bolt, use a Flume client (the load-balancing RPC client)
>>>> to send each event to the agents set up in step 1.
>>>>
>>>> That way I get all the benefits of both Storm and Flume. Does this
>>>> setup look right to you?
>>>> Thank you very much,
>>>> Chen
>>>>
>>>>
>>>> On Thu, Jan 9, 2014 at 8:58 PM, Joao Salcedo <[email protected]> wrote:
>>>>
>>>>> Hi Chen,
>>>>>
>>>>> Maybe it would be worth checking this:
>>>>>
>>>>> http://flume.apache.org/FlumeDeveloperGuide.html#loadbalancing-rpc-client
>>>>>
>>>>> Regards,
>>>>>
>>>>> Joao
>>>>>
>>>>>
>>>>> On Fri, Jan 10, 2014 at 3:50 PM, Jeff Lord <[email protected]> wrote:
>>>>>
>>>>>> Have you taken a look at the load-balancing RPC client?
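The load-balancing RPC client referenced here is part of Flume's client SDK and is configured through properties (see the developer-guide link in Joao's reply). A minimal sketch of its use; the class name, host aliases, and addresses are illustrative, not from this thread:

    import java.nio.charset.StandardCharsets;
    import java.util.Properties;

    import org.apache.flume.Event;
    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;
    import org.apache.flume.event.EventBuilder;

    public class LoadBalancedSender {
      public static void main(String[] args) throws EventDeliveryException {
        // Balance events across two agents that run Avro sources.
        Properties props = new Properties();
        props.put("client.type", "default_loadbalance");
        props.put("hosts", "h1 h2");                        // space-separated aliases
        props.put("hosts.h1", "agent1.example.com:41414");  // hypothetical hosts
        props.put("hosts.h2", "agent2.example.com:41414");
        props.put("backoff", "true");                       // temporarily skip failed hosts

        RpcClient client = RpcClientFactory.getInstance(props);
        try {
          Event event = EventBuilder.withBody(
              "hello flume".getBytes(StandardCharsets.UTF_8));
          client.append(event);  // throws EventDeliveryException on failure
        } finally {
          client.close();
        }
      }
    }

In a Storm bolt, the client would be created once in prepare() and reused across execute() calls, rather than built per event.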
>>>>>>
>>>>>> On Thu, Jan 9, 2014 at 8:43 PM, Chen Wang <[email protected]> wrote:
>>>>>>
>>>>>>> Jeff,
>>>>>>> I had read that presentation at the beginning, but didn't find a
>>>>>>> solution to my use case. To simplify: I have only one data source
>>>>>>> (composed of 5 socket servers), and I am looking for a fault-tolerant
>>>>>>> deployment of Flume that can read from this single data source and
>>>>>>> sink to HDFS, such that when one node dies, another Flume node can
>>>>>>> pick up and continue.
>>>>>>> Thanks,
>>>>>>> Chen
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Jan 9, 2014 at 7:49 PM, Jeff Lord <[email protected]> wrote:
>>>>>>>
>>>>>>>> Chen,
>>>>>>>>
>>>>>>>> Have you taken a look at this presentation on Planning and
>>>>>>>> Deploying Flume from ApacheCon?
>>>>>>>>
>>>>>>>> http://archive.apachecon.com/na2013/presentations/27-Wednesday/Big_Data/11:45-Mastering_Sqoop_for_Data_Transfer_for_Big_Data-Arvind_Prabhakar/Arvind%20Prabhakar%20-%20Planning%20and%20Deploying%20Apache%20Flume.pdf
>>>>>>>>
>>>>>>>> It may have the answers you need.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>>
>>>>>>>> Jeff
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jan 9, 2014 at 7:24 PM, Chen Wang <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Thanks Saurabh.
>>>>>>>>> If that is the case, I am actually thinking about using a Storm
>>>>>>>>> spout to talk to our socket servers, so that the Storm cluster
>>>>>>>>> takes care of the socket-reading part. Then on each Storm node, I
>>>>>>>>> would start a Flume agent listening on an RPC port and writing to
>>>>>>>>> HDFS (with failover), and in the Storm bolt simply send the data
>>>>>>>>> to that RPC port so Flume can pick it up.
>>>>>>>>> What do you think of this setup? It takes care of failover on both
>>>>>>>>> the source (by Storm) and the sink (by Flume), but it looks a
>>>>>>>>> little complicated to me.
>>>>>>>>> Chen
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Jan 9, 2014 at 7:18 PM, Saurabh B <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Chen,
>>>>>>>>>>
>>>>>>>>>> I don't think Flume has a way to configure multiple sources
>>>>>>>>>> pointing at the same data source. Of course you can do that, but
>>>>>>>>>> you will end up with duplicate data. Flume offers failover at the
>>>>>>>>>> sink level.
>>>>>>>>>>
>>>>>>>>>> On Thu, Jan 9, 2014 at 6:56 PM, Chen Wang <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> OK, so after more research :) it seems that what I need is
>>>>>>>>>>> failover for the agent source (not failover for the sink): if
>>>>>>>>>>> one agent dies, another agent of the same kind starts running.
>>>>>>>>>>> Does Flume support this scenario?
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Chen
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jan 9, 2014 at 3:12 PM, Chen Wang <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> After reading more docs, it seems that if I want to achieve my
>>>>>>>>>>>> goal, I have to do the following:
>>>>>>>>>>>> 1. Run one agent with the custom source on one node. This agent
>>>>>>>>>>>> reads from the 5 socket servers and sinks to some kind of sink
>>>>>>>>>>>> (maybe another socket?).
>>>>>>>>>>>> 2. On one or more other machines, set up collectors that read
>>>>>>>>>>>> from the agent's sink in 1 and sink to HDFS.
>>>>>>>>>>>> 3. Have a master node managing the nodes in 1 and 2.
>>>>>>>>>>>>
>>>>>>>>>>>> But this seems to be overkill in my case: in 1, I can already
>>>>>>>>>>>> sink to HDFS. Since data becomes available at the socket
>>>>>>>>>>>> servers much faster than the translation step can consume it, I
>>>>>>>>>>>> want to be able to add more nodes later to do the translation
>>>>>>>>>>>> job. So what is the correct setup?
>>>>>>>>>>>> Thanks,
>>>>>>>>>>>> Chen
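The sink-level failover Saurabh mentions is configured through a sink group with a failover processor: the agent delivers to the highest-priority sink and fails over to the next one when it dies. A sketch with hypothetical agent, sink, and collector names:

    # Agent a1 fans channel c1 out to two downstream collectors;
    # k1 is preferred, k2 takes over if k1 fails.
    a1.sinkgroups = g1
    a1.sinkgroups.g1.sinks = k1 k2
    a1.sinkgroups.g1.processor.type = failover
    a1.sinkgroups.g1.processor.priority.k1 = 10
    a1.sinkgroups.g1.processor.priority.k2 = 5
    a1.sinkgroups.g1.processor.maxpenalty = 10000

    a1.sinks.k1.type = avro
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hostname = collector1.example.com
    a1.sinks.k1.port = 41414

    a1.sinks.k2.type = avro
    a1.sinks.k2.channel = c1
    a1.sinks.k2.hostname = collector2.example.com
    a1.sinks.k2.port = 41414

Note that this protects the path from an agent to its collectors; it does not provide failover for the source itself, which is what the Storm (or load-balanced client) layer ends up supplying in this thread.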
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jan 9, 2014 at 2:38 PM, Chen Wang <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Guys,
>>>>>>>>>>>>> In my environment, the client is 5 socket servers, so I wrote
>>>>>>>>>>>>> a custom source that spawns 5 threads, each reading from one
>>>>>>>>>>>>> of them indefinitely; the sink is HDFS (a Hive table). This
>>>>>>>>>>>>> works fine when run with the flume-ng agent.
>>>>>>>>>>>>>
>>>>>>>>>>>>> But how can I deploy this in distributed mode (as a cluster)?
>>>>>>>>>>>>> I am confused by the three tiers (agent, collector, storage)
>>>>>>>>>>>>> mentioned in the docs. Do they apply to my case? How can I
>>>>>>>>>>>>> separate my agent/collector/storage? Apparently I can only
>>>>>>>>>>>>> have one agent running: multiple agents would produce
>>>>>>>>>>>>> duplicates from the socket servers. But I want another agent
>>>>>>>>>>>>> to take over if one dies, and I would also like horizontal
>>>>>>>>>>>>> scalability for writing to HDFS. How can I achieve all of
>>>>>>>>>>>>> this?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thank you very much for your advice.
>>>>>>>>>>>>> Chen
>>>
>>> --
>>> thanks
>>> ashish
>>>
>>> Blog: http://www.ashishpaliwal.com/blog
>>> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>
>
> --
> thanks
> ashish
>
> Blog: http://www.ashishpaliwal.com/blog
> My Photo Galleries: http://www.pbase.com/ashishpaliwal
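For reference, a custom source along the lines Chen describes at the start of the thread, with one reader thread per upstream socket server handing each record to the channel, might look roughly like the sketch below. The class name, host/port settings, and newline-delimited framing are assumptions; reconnection and error handling are omitted.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import org.apache.flume.Context;
    import org.apache.flume.EventDrivenSource;
    import org.apache.flume.conf.Configurable;
    import org.apache.flume.event.EventBuilder;
    import org.apache.flume.source.AbstractSource;

    public class MultiSocketSource extends AbstractSource
        implements EventDrivenSource, Configurable {

      private String[] hosts;
      private int port;
      private ExecutorService pool;

      @Override
      public void configure(Context context) {
        // e.g. agent.sources.r1.hosts = host1,host2,host3,host4,host5
        hosts = context.getString("hosts", "").split(",");
        port = context.getInteger("port", 9999);
      }

      @Override
      public synchronized void start() {
        pool = Executors.newFixedThreadPool(hosts.length);
        for (final String host : hosts) {
          pool.submit(new Runnable() {
            public void run() {
              try (Socket socket = new Socket(host, port);
                   BufferedReader in = new BufferedReader(
                       new InputStreamReader(socket.getInputStream(),
                           StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                  // Each line becomes one Flume event on the channel.
                  getChannelProcessor().processEvent(
                      EventBuilder.withBody(line.getBytes(StandardCharsets.UTF_8)));
                }
              } catch (IOException e) {
                // A real source would log and reconnect with backoff.
              }
            }
          });
        }
        super.start();
      }

      @Override
      public synchronized void stop() {
        pool.shutdownNow();
        super.stop();
      }
    }

As the thread concludes, running two copies of such a source would read the sockets twice over, which is why the failover story has to live downstream (sink groups) or in the reading layer (Storm).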
