Hi Burak,

Hadoop's model is quite different: it is job based, or in simpler terms a batch
model, where a MapReduce job is executed over a batch of data that is already
present.
Given your requirement, the word count example doesn't make sense if the file is
being written to continuously.
However, word count per hour or per minute does make sense for a MapReduce style
program.
I second what Bejoy mentioned: use Flume to aggregate the data and then run
MapReduce.
Hadoop can give you near real time by combining Flume with MapReduce, running a
MapReduce job on the Flume-dumped data every few minutes.
A second option is to check whether your problem can be solved by a Flume decorator
itself for a real-time experience.
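
To make the first option concrete, here is a minimal, untested sketch. It assumes
Flume is dumping files into an HDFS directory such as /flume/events and uses
Hadoop's built-in TokenCounterMapper/IntSumReducer so the code is self-contained;
the paths, class name, and the five-minute interval are placeholders, not a tested
setup.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class FlumeBatchWordCount {
  public static void main(String[] args) throws Exception {
    // Every few minutes, run a word count over whatever Flume has dumped so far,
    // writing each run's result to a fresh, timestamped output directory.
    while (true) {
      Configuration conf = new Configuration();
      Job job = new Job(conf, "wordcount-batch");
      job.setJarByClass(FlumeBatchWordCount.class);
      job.setMapperClass(TokenCounterMapper.class);   // built-in (word, 1) mapper
      job.setCombinerClass(IntSumReducer.class);
      job.setReducerClass(IntSumReducer.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      FileInputFormat.addInputPath(job, new Path("/flume/events"));     // Flume sink dir (placeholder)
      FileOutputFormat.setOutputPath(job,
          new Path("/wordcount/out_" + System.currentTimeMillis()));    // unique output per run
      job.waitForCompletion(true);
      Thread.sleep(5 * 60 * 1000L);                                     // "every few mins"
    }
  }
}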

Regards,
Abhishek

On Mon, Dec 5, 2011 at 2:33 PM, burakkk <burak.isi...@gmail.com> wrote:

> Athanasios Papaoikonomou, a cron job isn't useful for me, because I want to
> execute the MR job with the same algorithm, but different files arrive at
> different velocities.
>
> Both Storm and Facebook's Hadoop are designed for that, but I want to use the
> Apache distribution.
>
> Bejoy Ks, I have a continuous inflow of data, but I think I need a near
> real-time system.
>
> Mike Spreitzer, both the output and the input are continuous. The output isn't
> relevant to the input; all I want is for all the incoming files to be processed
> by the same job and the same algorithm.
> For example, think about the wordcount problem. When you want to run wordcount,
> you implement it like this:
> http://wiki.apache.org/hadoop/WordCount
>
> But when the program reaches "job.waitForCompletion(true);", the job will end.
> If you want to make it run continuously, what do you do in Hadoop without other
> tools?
> One more thing: assume that the input file's name is
> filename_timestamp (filename_20111206_0030).
>
> public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     Job job = new Job(conf, "wordcount");
>     job.setOutputKeyClass(Text.class);
>     job.setOutputValueClass(IntWritable.class);
>     job.setMapperClass(Map.class);
>     job.setReducerClass(Reduce.class);
>     job.setInputFormatClass(TextInputFormat.class);
>     job.setOutputFormatClass(TextOutputFormat.class);
>     FileInputFormat.addInputPath(job, new Path(args[0]));
>     FileOutputFormat.setOutputPath(job, new Path(args[1]));
>     job.waitForCompletion(true);
> }
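>
> Without other tools, the only thing I can come up with is an untested sketch
> like the one below: wrap the driver in a loop that polls the input dir and
> submits one job per new timestamped file (the /input and /output paths are just
> examples, Map/Reduce are the classes from the wiki page, and it also needs
> java.util.HashSet/Set plus org.apache.hadoop.fs.FileSystem and FileStatus):
>
> public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     FileSystem fs = FileSystem.get(conf);
>     Set<String> done = new HashSet<String>();
>     while (true) {
>         // submit one wordcount job for every filename_<timestamp> file
>         // that has not been processed yet
>         for (FileStatus st : fs.listStatus(new Path("/input"))) {
>             String name = st.getPath().getName();
>             if (!name.startsWith("filename_") || done.contains(name)) continue;
>             Job job = new Job(conf, "wordcount_" + name);
>             job.setOutputKeyClass(Text.class);
>             job.setOutputValueClass(IntWritable.class);
>             job.setMapperClass(Map.class);
>             job.setReducerClass(Reduce.class);
>             job.setInputFormatClass(TextInputFormat.class);
>             job.setOutputFormatClass(TextOutputFormat.class);
>             FileInputFormat.addInputPath(job, st.getPath());
>             FileOutputFormat.setOutputPath(job, new Path("/output/" + name));
>             job.waitForCompletion(true);
>             done.add(name);
>         }
>         Thread.sleep(60 * 1000L);   // poll for new files every minute
>     }
> }
>
> But that is still just re-running the batch job over and over, which is exactly
> what I was hoping to avoid.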
>
> On Mon, Dec 5, 2011 at 11:19 PM, Bejoy Ks <bejoy.had...@gmail.com> wrote:
>
> > Burak
> >        If you have a continuous inflow of data, you can choose Flume to
> > aggregate the files into larger sequence files or similar if they are small,
> > and push that data onto HDFS once you have a substantial chunk (roughly equal
> > to the HDFS block size). Based on your SLAs you need to schedule your jobs
> > using Oozie or a simple shell script. In very simple terms (see the sketch
> > below):
> > - push input data (could be from a Flume collector) into a staging HDFS dir
> > - before triggering the job (hadoop jar), copy the input from staging to the
> > main input dir
> > - execute the job
> > - archive the input and output into archive dirs (any other dirs)
> >        - the output archive dir could be the source of output data
> > - delete the output dir and empty the input dir
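> >
> > A very rough, untested sketch of those staging steps using the HDFS FileSystem
> > API (the /data/* directory names are just examples; the actual job submission
> > is left as a placeholder):
> >
> > import org.apache.hadoop.conf.Configuration;
> > import org.apache.hadoop.fs.FileStatus;
> > import org.apache.hadoop.fs.FileSystem;
> > import org.apache.hadoop.fs.Path;
> >
> > public class StagingWorkflow {
> >     public static void main(String[] args) throws Exception {
> >         Configuration conf = new Configuration();
> >         FileSystem fs = FileSystem.get(conf);
> >         Path staging = new Path("/data/staging");   // Flume collector lands files here
> >         Path input   = new Path("/data/input");     // main input dir for the job
> >         Path output  = new Path("/data/output");
> >         Path archive = new Path("/data/archive/" + System.currentTimeMillis());
> >
> >         // 1. move the staged input into the main input dir
> >         fs.mkdirs(input);
> >         for (FileStatus st : fs.listStatus(staging)) {
> >             fs.rename(st.getPath(), new Path(input, st.getPath().getName()));
> >         }
> >
> >         // 2. execute the job (hadoop jar / Job.waitForCompletion) on input -> output
> >
> >         // 3. archive input and output for this run; renaming them away also
> >         //    leaves the live dirs empty for the next run
> >         fs.mkdirs(archive);
> >         fs.rename(input, new Path(archive, "input"));
> >         fs.rename(output, new Path(archive, "output"));
> >     }
> > }
> >
> > (In practice a small shell script with hadoop fs -mv commands, or an Oozie
> > workflow, would do exactly the same thing.)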
> >
> > Hope it helps!...
> >
> > Regards
> > Bejoy.K.S
> >
> > On Tue, Dec 6, 2011 at 2:19 AM, burakkk <burak.isi...@gmail.com> wrote:
> >
> >> Hi everyone,
> >> I want to run an MR job continuously, because I have streaming data and I
> >> try to analyze it all the time with my own algorithm. For example, say you
> >> want to solve the wordcount problem. It's the simplest one :) If you have
> >> multiple files and new files keep coming, how do you handle it?
> >> You could execute an MR job per file, but you would have to do it repeatedly.
> >> So what do you think?
> >>
> >> Thanks
> >> Best regards...
> >>
> >> --
> >>
> >> *BURAK ISIKLI* | http://burakisikli.wordpress.com
> >>
> >
> >
>
>
> --
>
> *BURAK ISIKLI* | http://burakisikli.wordpress.com
>
