Re: Writing small files to one big file in hdfs

Mohit Anchlia Tue, 21 Feb 2012 11:38:48 -0800

Thanks. How does MapReduce work on a sequence file? Is there an example I can
look at?


On Tue, Feb 21, 2012 at 11:34 AM, Arko Provo Mukherjee <arkoprovomukher...@gmail.com> wrote:

> Hi,
>
> Let's say all the smaller files are in the same directory.
>
> Then you can do:
>
> // Needs java.io.* and org.apache.hadoop.fs.{FileSystem, FileStatus, Path}
> BufferedWriter output = new BufferedWriter(
>     new OutputStreamWriter(fs.create(output_path, true)));  // Output path (overwrite)
>
> FileStatus[] input_files = fs.listStatus(new Path(input_path));  // Input directory
>
> for (int i = 0; i < input_files.length; i++)
> {
>     BufferedReader reader = new BufferedReader(
>         new InputStreamReader(fs.open(input_files[i].getPath())));
>
>     String data = reader.readLine();
>     while (data != null)
>     {
>         output.write(data);
>         output.newLine();          // readLine() strips the line terminator
>         data = reader.readLine();  // advance, otherwise the loop never ends
>     }
>     reader.close();
> }
>
> output.close();
>
>
> In case you have the files in multiple directories, call the code for each
> of them with different input paths.
>
> Hope this helps!
>
> Cheers
>
> Arko
>
> On Tue, Feb 21, 2012 at 1:27 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
>
> > I am looking for examples that demonstrate using sequence files,
> > including writing to one and then running MapReduce on it, but I have been
> > unable to find any. Could you please point me to some examples?
> >
> > On Tue, Feb 21, 2012 at 10:25 AM, Bejoy Ks <bejoy.had...@gmail.com> wrote:
> >
> > > Hi Mohit
> > >      AFAIK XMLLoader in Pig isn't suited to sequence files. Please post
> > > the question to the Pig user group for a workaround.
> > >      SequenceFile is the preferred option when small files need to be
> > > stored in HDFS and processed by MapReduce, because it stores data in
> > > key/value format. Since SequenceFileInputFormat is already available,
> > > you don't need a custom input format to process it with MapReduce. It is
> > > a cleaner and better approach than just appending the contents of small
> > > XML files to one big file.
> > >
> > > On Tue, Feb 21, 2012 at 11:00 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
> > >
> > > > On Tue, Feb 21, 2012 at 9:25 AM, Bejoy Ks <bejoy.had...@gmail.com> wrote:
> > > >
> > > > > Mohit
> > > > >       Rather than just appending the content to a normal text file,
> > > > > you can create a sequence file with each smaller file's content as
> > > > > a value.
> > > > >
> > > > >  Thanks. I was planning to use pig's
> > > > org.apache.pig.piggybank.storage.XMLLoader
> > > > for processing. Would it work with sequence file?
> > > >
> > > > This text file that I was referring to would be in hdfs itself. Is it
> > > still
> > > > different than using sequence file?
> > > >
> > > > > Regards
> > > > > Bejoy.K.S
> > > > >
> > > > > > On Tue, Feb 21, 2012 at 10:45 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
> > > > >
> > > > > > > We have small XML files. Currently I am planning to append these
> > > > > > > small files to one file in HDFS so that I can take advantage of
> > > > > > > splits, larger blocks and sequential I/O. What I am unsure about
> > > > > > > is whether it's OK to append one file at a time to this HDFS file.
> > > > > > >
> > > > > > > Could someone suggest whether this is OK? I would like to know how
> > > > > > > others do it.
> > > > > >
> > > > >
> > > >
> > >
> >
>
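
The suggestion quoted above (one container file, with each small file stored as a key/value record) can be illustrated without a cluster. The following is only a rough sketch using plain java.io, and it is NOT the Hadoop SequenceFile on-disk format. The class name SmallFilePacker, the methods pack and forEachRecord, and the sample file names are all made up for illustration. In Hadoop itself the counterparts would presumably be SequenceFile.Writer (for example with Text keys holding the file name and BytesWritable values holding the content) and SequenceFileInputFormat on the job, so that each map() call receives one small file as a (key, value) pair.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical illustration class; not part of Hadoop.
public class SmallFilePacker {

    // Pack each (file name, file bytes) pair as one length-prefixed record.
    public static byte[] pack(Map<String, byte[]> files) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            for (Map.Entry<String, byte[]> file : files.entrySet()) {
                out.writeUTF(file.getKey());           // key: file name
                out.writeInt(file.getValue().length);  // length of the value
                out.write(file.getValue());            // value: raw file content
            }
        }
        return buffer.toByteArray();
    }

    // Hand every record to a callback, similar to how a mapper is called
    // once per (key, value) record of its input.
    public static void forEachRecord(byte[] packed, BiConsumer<String, byte[]> mapper)
            throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(packed))) {
            while (in.available() > 0) {
                String name = in.readUTF();
                byte[] content = new byte[in.readInt()];
                in.readFully(content);
                mapper.accept(name, content);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Sample inputs are invented; in practice they would be the small XML files.
        Map<String, byte[]> files = new LinkedHashMap<>();
        files.put("a.xml", "<doc>first</doc>".getBytes(StandardCharsets.UTF_8));
        files.put("b.xml", "<doc>second</doc>".getBytes(StandardCharsets.UTF_8));

        byte[] packed = pack(files);
        forEachRecord(packed, (name, content) ->
            System.out.println(name + " -> " + new String(content, StandardCharsets.UTF_8)));
    }
}
```

The point of the record boundaries is that every XML document stays intact and addressable by name, which is what is lost when file contents are simply appended line by line into one text file.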
