Yeah, I realize that. The reason I think it should be somewhat dependent on FLUME-1491 is that ZooKeeper seems like a pretty heavyweight requirement just to use one particular source. FLUME-1491 would make Flume generally dependent on ZooKeeper, which is a natural transition point to start using ZK for other state that Flume components need. Would you agree?
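To make that concrete: once ZK is in the picture, the bookkeeping I'm imagining for a shared bucket is just a "claim" znode per S3 object, so several sources can list the same bucket without double-processing anything. A rough sketch using Curator as the client library (the class name and znode layout here are made up, nothing is settled):

    import org.apache.curator.framework.CuratorFramework;
    import org.apache.curator.framework.CuratorFrameworkFactory;
    import org.apache.curator.retry.ExponentialBackoffRetry;
    import org.apache.zookeeper.KeeperException;

    /** Sketch: records S3 object keys in ZooKeeper so that several sources
     *  reading the same bucket don't pick up the same object twice. */
    public class ZkObjectTracker {

      private final CuratorFramework client;
      private final String root; // e.g. /flume/s3-source/<bucket>

      public ZkObjectTracker(String zkConnect, String root) {
        this.client = CuratorFrameworkFactory.newClient(
            zkConnect, new ExponentialBackoffRetry(1000, 3));
        this.root = root;
        client.start();
      }

      /** Returns true if this source successfully claimed the object,
       *  false if another source already claimed (or processed) it. */
      public boolean tryClaim(String objectKey) throws Exception {
        String path = root + "/" + objectKey.replace('/', '_');
        try {
          client.create().creatingParentsIfNeeded().forPath(path);
          return true;
        } catch (KeeperException.NodeExistsException e) {
          return false;
        }
      }
    }

The create-or-fail call is what gives us the coordination; everything else is just plumbing.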
On Sun, Aug 10, 2014 at 11:35 PM, Ashish <[email protected]> wrote:

> Seems like a bit of confusion here. FLUME-1491 only deals with the configuration part, nothing else. Even if it gets integrated, you would still need to write/expose an API to store metadata in ZK (FLUME-1491 doesn't bring that in).
>
> HTH!
>
> On Mon, Aug 11, 2014 at 11:39 AM, Jonathan Natkins <[email protected]> wrote:
>
>> Given that FLUME-1491 hasn't been committed yet, and may still be a ways away, does it seem reasonable to punt on having multiple sources working off of a single bucket until ZK is integrated into Flume? The alternative probably requires write access to the S3 bucket to record some shared state, and would likely have to get rewritten once ZK integration happens anyway.
>>
>> On Tue, Aug 5, 2014 at 10:07 PM, Paweł Róg <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I don't think it is possible to simply reuse SpoolDirectorySource. Maybe some of its pieces could be reused, but without touching its code I don't think SpoolDirectorySource is a good base. At the very beginning it does this:
>>>
>>> File directory = new File(spoolDirectory);
>>>
>>> ReliableSpoolingFileEventReader also instantiates the File class directly. There is also a question: how does ReliableSpoolingFileEventReader store information about files that have already been processed in non-deleting mode? What happens after a Flume restart?
>>>
>>> I agree with Jonathan that the S3 source should be able to store the last processed file, e.g. in ZooKeeper.
>>>
>>> Another thing, Jonathan: I don't think you need to worry about multiple buckets being handled by a single S3Source. As you wrote, multiple sources are the solution here. I thought this was already discussed, but maybe I'm wrong.
>>>
>>>> 2. Is it fair to assume that we're dealing with character files, rather than binary objects?
>>>
>>> In my opinion the S3 source can read files as plain text by default, but also accept a configuration parameter with the class name of an "InputStream processor". This processor would be able to, e.g., unzip, deserialize Avro, or read JSON and convert it into log events. What do you think?
>>>
>>> --
>>> Paweł Róg
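Paweł, I like the pluggable-processor idea. Just to make sure we're picturing the same thing, here's a minimal sketch of what that interface could look like (the interface name, method name, and GzipLineProcessor are placeholders, and the gzip/line handling is only one example):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.zip.GZIPInputStream;

    import org.apache.flume.Event;
    import org.apache.flume.event.EventBuilder;

    /** Placeholder interface: turns the raw InputStream of one S3 object into Flume events. */
    public interface S3ObjectProcessor {
      List<Event> process(InputStream in) throws IOException;
    }

    /** Example implementation: gunzip the object and emit one event per line. */
    class GzipLineProcessor implements S3ObjectProcessor {
      @Override
      public List<Event> process(InputStream in) throws IOException {
        List<Event> events = new ArrayList<Event>();
        BufferedReader reader = new BufferedReader(
            new InputStreamReader(new GZIPInputStream(in), StandardCharsets.UTF_8));
        String line;
        while ((line = reader.readLine()) != null) {
          events.add(EventBuilder.withBody(line, StandardCharsets.UTF_8));
        }
        return events;
      }
    }

The source would just instantiate whatever class name is configured (via reflection) and hand it each object's stream, so Avro or JSON support becomes a matter of dropping in another implementation.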
>>> 2014-08-06 5:12 GMT+02:00 Viral Bajaria <[email protected]>:
>>>
>>>> I agree with the feedback provided by Ashish.
>>>>
>>>> I have started writing one which is similar to the ExecSource, but I like the idea of doing something where spooldir takes over most of the hard work of spitting out events to sinks. Let me think more about how to structure that.
>>>>
>>>> Quick thinking out loud: I could create a source which extends the spooldir source and just spins off a thread to manage moving things from S3 to the spooldir via a temporary directory.
>>>>
>>>> Regarding maintaining metadata, there are two ways:
>>>> 1) DB: I currently maintain it in a database because there are a lot of other tools built around it.
>>>> 2) File: just keep the info in memory and in a file to help with crash recovery and/or high memory usage.
>>>>
>>>> Thanks,
>>>> Viral
>>>>
>>>> On Tue, Aug 5, 2014 at 8:04 PM, Ashish <[email protected]> wrote:
>>>>
>>>>> Sharing some random thoughts:
>>>>>
>>>>> 1. Download the file using the S3 SDK and let the SpoolDirectory implementation take care of the rest, like a decorator in front of SpoolDirectory.
>>>>>
>>>>> 2. Use the S3 SDK to create an InputStream of S3 objects directly in code and create events out of it.
>>>>>
>>>>> It would be great to reuse an existing implementation that is based on InputStream and feed it the S3 object's input stream; the concern of metadata storage still remains. Most often S3 objects are stored in compressed form, so this source would need to take care of compression (gz/avro/others).
>>>>>
>>>>> Best is to start with something that works and then start adding more features to it.
>>>>>
>>>>> On Wed, Aug 6, 2014 at 2:27 AM, Jonathan Natkins <[email protected]> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> I started trying to write some code on this, and realized there are a number of issues that need to be discussed in order to really design this feature effectively. The requirements that have been discussed thus far are:
>>>>>>
>>>>>> 1. Fetching data from S3 periodically
>>>>>> 2. Fetching data from multiple S3 buckets -- This may be something that should be punted on until later. For a first implementation, this could be solved just by having multiple sources, each with a single S3 bucket
>>>>>> 3. Associating an S3 bucket with a user/token/key -- *Otis - can you clarify what you mean by this?*
>>>>>> 4. Dynamically reconfiguring the source -- This is blocked by FLUME-1491, so I think this is out of scope for discussion at the moment
>>>>>>
>>>>>> Some questions I want to try to answer:
>>>>>>
>>>>>> 1. How do we identify and track objects that need to be processed versus objects that have been processed already?
>>>>>> 1a. What if we want to have multiple sources working against the same bucket to speed up processing?
>>>>>> 2. Is it fair to assume that we're dealing with character files, rather than binary objects?
>>>>>>
>>>>>> For the first question, if we ignore the multiple-source extension, I think the simplest answer is to do something on the local filesystem, like having a tracking directory that contains a list of to-be-processed objects and a list of already-processed objects. However, if the source goes down, what should the restart semantics be? It seems that the ideal situation is to store this state in a system like ZooKeeper, which would ensure that a number of sources could operate off of the same bucket, but this probably requires FLUME-1491 first.
>>>>>>
>>>>>> For the second question, my feeling was just that we should work with assumptions similar to how the SpoolDirectorySource works, where each line is a separate event. Does that seem reasonable?
>>>>>>
>>>>>> Thanks,
>>>>>> Natty
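To put a little more flesh on the first question: the local-filesystem version I had in mind is nothing fancier than a marker file in a tracking directory. Roughly like this (class name, file layout, and credentials handling are placeholders; listing pagination and restart edge cases are glossed over):

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;
    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.s3.AmazonS3;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.S3ObjectSummary;

    /** Sketch of the "simplest answer": keep a local marker file of already-processed
     *  keys and only hand back bucket objects we haven't seen before. */
    public class LocalObjectTracker {

      private final AmazonS3 s3;
      private final String bucket;
      private final Path processedFile;   // e.g. <trackingDir>/processed.txt
      private final Set<String> processed = new HashSet<String>();

      public LocalObjectTracker(String accessKey, String secretKey,
                                String bucket, String trackingDir) throws IOException {
        this.s3 = new AmazonS3Client(new BasicAWSCredentials(accessKey, secretKey));
        this.bucket = bucket;
        this.processedFile = Paths.get(trackingDir, "processed.txt");
        if (Files.exists(processedFile)) {
          processed.addAll(Files.readAllLines(processedFile, StandardCharsets.UTF_8));
        }
      }

      /** Lists the bucket (ignoring pagination for brevity) and returns unseen keys. */
      public List<String> findNewKeys() {
        List<String> fresh = new ArrayList<String>();
        for (S3ObjectSummary summary : s3.listObjects(bucket).getObjectSummaries()) {
          if (!processed.contains(summary.getKey())) {
            fresh.add(summary.getKey());
          }
        }
        return fresh;
      }

      /** Called after an object has been fully delivered to the channel. */
      public void markProcessed(String key) throws IOException {
        processed.add(key);
        Files.write(processedFile, (key + "\n").getBytes(StandardCharsets.UTF_8),
            StandardOpenOption.CREATE, StandardOpenOption.APPEND);
      }
    }

The obvious weakness is exactly the restart/multi-source problem above, which is why the ZooKeeper variant still looks like the better long-term answer.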
>>>>>> On Fri, Aug 1, 2014 at 11:31 AM, Paweł <[email protected]> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>> Thanks for the explanation, Jonathan. I think I will also start working on it. When you have any patch (even a draft) I'd be glad if you could attach it to the JIRA. I'll do the same. What do you think?
>>>>>>>
>>>>>>> --
>>>>>>> Paweł Róg
>>>>>>>
>>>>>>> 2014-08-01 20:19 GMT+02:00 Hari Shreedharan <[email protected]>:
>>>>>>>
>>>>>>>> +1 on an S3 Source. I would gladly review.
>>>>>>>>
>>>>>>>> Jonathan Natkins wrote:
>>>>>>>>
>>>>>>>> Hey Pawel,
>>>>>>>>
>>>>>>>> My intention is to start working on it, but I don't know exactly how long it will take, and I'm not a committer, so time estimates would have to be taken with a grain of salt regardless. If this is something that you need urgently, it may not be ideal to wait for me instead of building something yourself.
>>>>>>>>
>>>>>>>> That said, as mentioned in the other thread, dynamic configuration can be done by refreshing the configuration files across the set of Flume agents. It's certainly not as great as having a single place to change it (e.g. ZooKeeper), but it's a way to get the job done.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Natty
>>>>>>>>
>>>>>>>> On Fri, Aug 1, 2014 at 1:33 AM, Paweł <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>> Jonathan, how should we interpret your last e-mail? You opened a JIRA issue and want to start implementing this; do you have any estimate of how long it will take?
>>>>>>>>
>>>>>>>> I think the biggest challenge here is to have dynamic configuration of Flume. It doesn't seem to be part of the FLUME-2437 issue. Am I right?
>>>>>>>>
>>>>>>>>> Would you need to be able to pull files from multiple S3 directories with the same source?
>>>>>>>>
>>>>>>>> I don't think we need to track multiple S3 buckets with a single source. I just imagine an approach where each S3 source can be added or deleted on demand and attached to any channel. I'm only worried about the dynamic configuration part. I'll open a new thread about it. It seems we have two totally separate things:
>>>>>>>> * build the S3 source
>>>>>>>> * make Flume configurable dynamically
>>>>>>>>
>>>>>>>> --
>>>>>>>> Paweł
>>>>>>>>
>>>>>>>> 2014-08-01 9:51 GMT+02:00 Otis Gospodnetic <[email protected]>:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On Fri, Aug 1, 2014 at 4:52 AM, Jonathan Natkins <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hey all,
>>>>>>>>>
>>>>>>>>> I created a JIRA for this: https://issues.apache.org/jira/browse/FLUME-2437
>>>>>>>>
>>>>>>>> Thanks! Should Fix Version be set to the next Flume release version?
>>>>>>>>
>>>>>>>>> I thought I'd start working on one myself, which can hopefully be contributed back. I'm curious: do you have particular requirements? Based on the emails in this thread, it sounds like the original goal was to have something that's like a SpoolDirectorySource that just picks up new files from S3. Is that accurate?
>>>>>>>>
>>>>>>>> Yes, I think so. We need to be able to:
>>>>>>>> * fetch data (logs, for pulling them into Logsene <http://sematext.com/logsene/>) from S3 periodically (e.g. every 1 min, every 5 min, etc.)
>>>>>>>> * fetch data from multiple S3 buckets
>>>>>>>> * associate an S3 bucket with a user/token/key
>>>>>>>> * dynamically (i.e. without editing/writing config files stored on disk) add new S3 buckets from which data should be fetched
>>>>>>>> * dynamically (i.e. without editing/writing config files stored on disk) stop fetching data from some S3 buckets
>>>>>>>>
>>>>>>>>> Would you need to be able to pull files from multiple S3 directories with the same source?
>>>>>>>>
>>>>>>>> I think the above addresses this question.
>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Natty
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> Otis
>>>>>>>> --
>>>>>>>> Performance Monitoring * Log Analytics * Search Analytics
>>>>>>>> Solr & Elasticsearch Support * http://sematext.com/
>>>>>>>> On Thu, Jul 31, 2014 at 4:58 PM, Otis Gospodnetic <[email protected]> wrote:
>>>>>>>>
>>>>>>>> +1 for an S3Source, starting with a JIRA issue.
>>>>>>>>
>>>>>>>> But being able to dynamically add/remove S3 buckets from which to pull data seems important. Any suggestions for how to approach that?
>>>>>>>>
>>>>>>>> Otis
>>>>>>>> --
>>>>>>>> Performance Monitoring * Log Analytics * Search Analytics
>>>>>>>> Solr & Elasticsearch Support * http://sematext.com/
>>>>>>>>
>>>>>>>> On Thu, Jul 31, 2014 at 9:14 PM, Hari Shreedharan <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Please go ahead and file a JIRA. If you are willing to submit a patch, you can post it on the JIRA.
>>>>>>>>
>>>>>>>> Viral Bajaria wrote:
>>>>>>>>
>>>>>>>> I have a similar use case that cropped up yesterday. I saw the archive and found that there was a recommendation to build it as Sharninder suggested.
>>>>>>>>
>>>>>>>> For now, I went down the route of writing a Python script which downloads from S3 and puts the files in a directory that is configured to be picked up via a spooldir.
>>>>>>>>
>>>>>>>> I would prefer to have a direct S3 source, and maybe we could collaborate on it and open-source it. Let me know if you prefer that and we can work directly on it by creating a JIRA.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Viral
>>>>>>>>
>>>>>>>> On Thu, Jul 31, 2014 at 10:26 AM, Hari Shreedharan <[email protected]> wrote:
>>>>>>>>
>>>>>>>> In both cases, Sharninder is right :)
>>>>>>>>
>>>>>>>> Sharninder wrote:
>>>>>>>>
>>>>>>>> As far as I know, there is no (open source) implementation of an S3 source, so yes, you'll have to implement your own. You'll have to implement a pollable source, and the dev documentation has an outline that you can use. You can also look at the existing ExecSource and work your way up.
>>>>>>>>
>>>>>>>> As far as I know, there is no way to configure Flume without using the configuration file.
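For anyone following along, the pollable-source outline Sharninder mentions maps to something like the skeleton below. The property names and the S3 plumbing are placeholders; it's only meant to show the shape of process():

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.flume.Context;
    import org.apache.flume.Event;
    import org.apache.flume.EventDeliveryException;
    import org.apache.flume.PollableSource;
    import org.apache.flume.conf.Configurable;
    import org.apache.flume.source.AbstractSource;

    /** Bare-bones pollable S3 source: each process() call checks the bucket for
     *  new objects and hands any resulting events to the channel processor. */
    public class S3Source extends AbstractSource implements Configurable, PollableSource {

      private String bucket;
      private String accessKey;
      private String secretKey;

      @Override
      public void configure(Context context) {
        bucket = context.getString("bucket");        // placeholder property names
        accessKey = context.getString("accessKey");
        secretKey = context.getString("secretKey");
      }

      @Override
      public Status process() throws EventDeliveryException {
        List<Event> events = fetchNewEvents();       // S3 listing/reading omitted here
        if (events.isEmpty()) {
          return Status.BACKOFF;                     // nothing new, back off before the next poll
        }
        getChannelProcessor().processEventBatch(events);
        return Status.READY;
      }

      private List<Event> fetchNewEvents() {
        // The actual S3 listing, "already processed" bookkeeping, and decompression
        // would live here (see the earlier sketches).
        return new ArrayList<Event>();
      }
    }

Everything interesting (tracking state, credentials, decompression) hangs off fetchNewEvents(), which is exactly where the earlier sketches would slot in.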
>>>>>>>> On Thu, Jul 31, 2014 at 7:57 PM, Paweł <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>> I'm wondering if Flume is able to read directly from S3.
>>>>>>>>
>>>>>>>> I'll describe my case. I have log files stored in AWS S3. I have to fetch new S3 objects periodically and read log lines from them. The log lines (events) are then processed in the standard Flume way (as with other sources).
>>>>>>>>
>>>>>>>> *1) Is there any way to fetch S3 objects, or do I have to write my own Source?*
>>>>>>>>
>>>>>>>> There is also a second case. I want the Flume configuration to be dynamic. Flume sources can change over time. A new AWS key and S3 bucket can be added or deleted.
>>>>>>>>
>>>>>>>> *2) Is there any other way to configure Flume than by a static configuration file?*
>>>>>>>>
>>>>>>>> --
>>>>>>>> Paweł Róg
>>>>>
>>>>> --
>>>>> thanks
>>>>> ashish
>>>>>
>>>>> Blog: http://www.ashishpaliwal.com/blog
>>>>> My Photo Galleries: http://www.pbase.com/ashishpaliwal
>
> --
> thanks
> ashish
>
> Blog: http://www.ashishpaliwal.com/blog
> My Photo Galleries: http://www.pbase.com/ashishpaliwal
