Re: SequenceFile split question

2012-03-15 Thread Bejoy Ks
Hi Mohit
  If you are using a standalone client application to do this, there is
just one instance of it running, so you'd be writing the sequence file to
one HDFS block at a time. Once it reaches the HDFS block size, writing
continues on the next block; in the meantime the first block is
replicated. If you are doing the same job distributed as MapReduce, you'd
be writing to n files at a time, where n is the number of tasks in your
MapReduce job.
 AFAIK, the data node where the blocks are placed is determined by
Hadoop; it is not controlled by the end-user application. But if you are
triggering the standalone job on a particular data node and it has space,
one replica would be stored on that same node. The same applies to MR
tasks as well.

Regards
Bejoy.K.S


Re: SequenceFile split question

2012-03-15 Thread Mohit Anchlia
Thanks! That helps. I am reading small XML files from an external file
system and then writing them to the SequenceFile. I made it a standalone
client, thinking that MapReduce may not be the best way to do this type of
writing. My understanding was that MapReduce is best suited for processing
data already within HDFS. Is MapReduce also one of the options I should
consider?


Re: SequenceFile split question

2012-03-15 Thread Bejoy Ks
Hi Mohit
 You are right. If your smaller XML files are already in HDFS, then MR
would be the best approach to combine them into a sequence file; it would
do the job in parallel.
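
As a sketch of that approach (an illustration, not a drop-in
implementation): a map-only job whose input is a text file listing one
HDFS path per line; each mapper reads the listed files whole and emits
(file name, file bytes) records, and SequenceFileOutputFormat writes one
sequence file per task, in parallel. The class names and the path-list
convention here are assumptions:

import java.io.ByteArrayOutputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class XmlToSequenceFile {

  // Each input record is one line holding the HDFS path of a small XML
  // file; the mapper reads that file whole and emits (name, bytes).
  public static class MergeMapper
      extends Mapper<LongWritable, Text, Text, BytesWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      Path src = new Path(line.toString().trim());
      FileSystem fs = src.getFileSystem(ctx.getConfiguration());
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      FSDataInputStream in = fs.open(src);
      try {
        IOUtils.copyBytes(in, buf, 4096, false);
      } finally {
        in.close();
      }
      ctx.write(new Text(src.getName()), new BytesWritable(buf.toByteArray()));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "xml-to-seqfile");
    job.setJarByClass(XmlToSequenceFile.class);
    job.setMapperClass(MergeMapper.class);
    job.setNumReduceTasks(0); // map-only: n tasks write n files in parallel

    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 1000); // ~1000 files per task
    NLineInputFormat.addInputPath(job, new Path(args[0])); // list of XML paths

    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    SequenceFileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(BytesWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}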

Regards
Bejoy.K.S


SequenceFile split question

2012-03-14 Thread Mohit Anchlia
I have a client program that creates a SequenceFile, which essentially
merges small files into a big file. I was wondering how the sequence file
splits the data across nodes. When I start, the sequence file is empty.
Does it get split when it reaches the dfs.block size? If so, does that
mean I am always writing to just one node at any given point in time?

If I start a new client writing a new sequence file, is there a way to
select a different data node?