Hi Mohit,

You are right. If your small XML files are in HDFS, then MapReduce would be the best approach to combine them into a sequence file, since it would do the job in parallel.
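
As a rough illustration of the block rollover described further down in this thread, here is a minimal sketch in plain Java. It does not use any Hadoop API. The class name BlockMath, both helper methods, and the 64 MB block size are assumptions made up for the example, so check dfs.block.size on your own cluster.

```java
public class BlockMath {
    // Number of blocks a file of fileSize bytes occupies; the last block may be partial.
    public static long blockCount(long fileSize, long blockSize) {
        if (fileSize == 0) {
            return 0;
        }
        return (fileSize + blockSize - 1) / blockSize; // ceiling division
    }

    // Zero-based index of the block that holds the byte at the given offset.
    public static long blockIndex(long offset, long blockSize) {
        return offset / blockSize;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // assumed block size of 64 MB
        long written = 200L * 1024 * 1024;  // hypothetical: 200 MB written so far

        // 200 MB spans three full blocks plus one partial block.
        System.out.println(blockCount(written, blockSize)); // prints 4
        // The next byte the writer appends falls into block index 3 (the fourth block).
        System.out.println(blockIndex(written, blockSize)); // prints 3
    }
}
```

With a single standalone client only one of those blocks is being written at any moment. With n map tasks each writing its own file, n blocks would be open at once.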

Regards
Bejoy.K.S

On Thu, Mar 15, 2012 at 8:17 PM, Mohit Anchlia <mohitanch...@gmail.com> wrote:

> Thanks! That helps. I am reading small XML files from an external file
> system and then writing to the SequenceFile. I made it a standalone client
> thinking that MapReduce may not be the best way to do this type of writing.
> My understanding was that MapReduce is best suited for processing data
> within HDFS. Is MapReduce also one of the options I should consider?
>
> On Thu, Mar 15, 2012 at 2:15 AM, Bejoy Ks <bejoy.had...@gmail.com> wrote:
>
> > Hi Mohit
> > If you are using a standalone client application to do the same,
> > there is definitely just one instance of it running, and you'd be
> > writing the sequence file to one HDFS block at a time. Once it reaches
> > the HDFS block size the writing continues to the next block; in the
> > meantime the first block is replicated. If you are doing the same job
> > distributed as MapReduce, you'd be writing to n files at a time, where
> > n is the number of tasks in your MapReduce job.
> > AFAIK the data node where the blocks have to be placed is determined
> > by Hadoop; it is not controlled by the end-user application. But if you
> > are triggering the standalone job on a particular data node and it has
> > space, one replica would be stored on that node. The same applies in
> > the case of MR tasks as well.
> >
> > Regards
> > Bejoy.K.S
> >
> > On Thu, Mar 15, 2012 at 6:17 AM, Mohit Anchlia <mohitanch...@gmail.com> wrote:
> >
> > > I have a client program that creates a SequenceFile, which essentially
> > > merges small files into a big file. I was wondering how the sequence
> > > file splits the data across nodes. When I start, the sequence file is
> > > empty. Does it get split when it reaches the dfs.block size? If so,
> > > does it mean that I am always writing to just one node at a given
> > > point in time?
> > >
> > > If I start a new client writing a new sequence file, is there a way
> > > to select a different data node?