Are hadoop fs commands serial or parallel
Hi, My question: when I run a command from the HDFS client, e.g. hadoop fs -copyFromLocal, or create a sequence file writer in Java code and append key/values to it through the Hadoop APIs, does it internally transfer/write data to HDFS serially or in parallel? Thanks in advance, -JJ
Re: Are hadoop fs commands serial or parallel
The sequence file writer definitely writes serially, since in Hadoop you can only ever write to the end of a file. copyFromLocal could write multiple files in parallel (I'm not sure whether it does or not), but a single file would be written serially.
-Joey
On Tue, May 17, 2011 at 5:44 PM, Mapred Learn wrote:
> does it internally transfer/write data to HDFS serially or in parallel ?
--
Joseph Echeverria
Cloudera, Inc.
443.305.9434
Re: Are hadoop fs commands serial or parallel
Thanks Joey! I will try to find out about copyFromLocal. It looks like the Hadoop APIs write serially, as you pointed out.
Thanks,
-JJ
On May 17, 2011, at 8:32 PM, Joey Echeverria wrote:
> The sequence file writer definitely does it serially as you can only
> ever write to the end of a file in Hadoop.
Re: Are hadoop fs commands serial or parallel
Hello, Adding to Joey's response: copyFromLocal's current implementation is serial when given a list of files.
On Wed, May 18, 2011 at 9:57 AM, Mapred Learn wrote:
> I will try to find out abt copyFromLocal.
--
Harsh J
Re: Are hadoop fs commands serial or parallel
Thanks Harsh! That means that basically both the APIs and the hadoop client commands allow only serial writes. I was wondering what other ways there are to write data to HDFS in parallel, other than using multiple parallel threads.
Thanks,
JJ
On May 17, 2011, at 10:59 PM, Harsh J wrote:
> Adding to Joey's response, copyFromLocal's current implementation is serial
> given a list of files.
Re: Are hadoop fs commands serial or parallel
Kind of clunky, but you could do this via shell:

    # Start one background copy per file (note: "for FILE", not "for $FILE")
    for FILE in $LIST_OF_FILES ; do
      hadoop fs -copyFromLocal "$FILE" "$DEST_PATH" &
    done

If doing this via the Java API then, yes, you will have to use multiple threads.
On Wed, May 18, 2011 at 1:04 AM, Mapred Learn wrote:
> I was wondering what could be other ways to write data in parallel to HDFS
> other than using multiple parallel threads.
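A possible refinement of the background-job loop suggested in this message, offered as a sketch rather than a tested recipe: the loop starts every copy in the background but never waits for the copies or checks whether any of them failed. The version below records each background PID and waits on it. The function name parallel_put and the COPY_CMD override are hypothetical, introduced here only so the sketch can be tried without a cluster; by default it runs the same hadoop fs -copyFromLocal command.

```shell
# parallel_put DEST FILE... : copy every FILE to DEST concurrently.
# The body is a subshell (parentheses), so its variables and
# background jobs stay private to a single call.
parallel_put() (
  dest=$1
  shift
  pids=""
  for f in "$@"; do
    # COPY_CMD is a hypothetical hook for dry runs, not a Hadoop setting.
    ${COPY_CMD:-hadoop fs -copyFromLocal} "$f" "$dest" &
    pids="$pids $!"
  done
  failed=0
  for pid in $pids; do
    # wait returns the exit status of that particular background copy.
    wait "$pid" || failed=$((failed + 1))
  done
  # Succeed only if every copy succeeded.
  [ "$failed" -eq 0 ]
)
```

Like the original loop, this starts one process per file with no upper limit, so it is only reasonable for short file lists.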
Re: Are hadoop fs commands serial or parallel
Thanks Patrick! This would work if a directory is to be uploaded, but I guess it would not work for streaming.
On May 18, 2011, at 9:39 AM, Patrick Angeles wrote:
> kinda clunky but you could do this via shell:
Re: Are hadoop fs commands serial or parallel
What do you mean, clunky? IMHO this is quite an elegant, simple, working solution. Sure, it spawns multiple processes, but it beats any API overcomplication, IMHO.
Dieter
On Wed, 18 May 2011 11:39:36 -0500 Patrick Angeles wrote:
> kinda clunky but you could do this via shell:
Re: Are hadoop fs commands serial or parallel
On May 20, 2011, at 6:10 AM, Dieter Plaetinck wrote:
> What do you mean clunky?
> IMHO this is a quite elegant, simple, working solution.
Try giving it to a user; watch them feed it a list of 10,000 files; watch the machine swap to death and the disks uselessly thrash.
> Sure this spawns multiple processes, but it beats any
> api-overcomplications, imho.
Simple doesn't imply scalable, unfortunately.
Brian
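One way to address the concern about thousands of simultaneous processes, sketched under stated assumptions: cap the number of copies that run at once. The sketch uses xargs -P, which GNU and BSD xargs provide but POSIX does not require. The function name bounded_put and the COPY_CMD override are hypothetical, added only for illustration and dry runs; by default the command is the same hadoop fs -copyFromLocal from the thread.

```shell
# bounded_put MAX_JOBS DEST FILE... : copy the FILEs to DEST,
# running at most MAX_JOBS copies at any one time.
bounded_put() (
  max_jobs=$1
  dest=$2
  shift 2
  # Nothing to copy: succeed without starting xargs.
  [ "$#" -gt 0 ] || exit 0
  # NUL-separate the names so spaces and quotes in file names survive.
  printf '%s\0' "$@" |
    # -P caps concurrency; -I {} runs one copy command per file name.
    # COPY_CMD is a hypothetical hook for dry runs, not a Hadoop setting.
    xargs -0 -P "$max_jobs" -I {} ${COPY_CMD:-hadoop fs -copyFromLocal} {} "$dest"
)
```

A call might look like `bounded_put 4 "$DEST_PATH" $LIST_OF_FILES`. The limit of 4 is only an illustration; a suitable value depends on the local disks and the network, and xargs exits with a non-zero status if any copy fails.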
Re: Are hadoop fs commands serial or parallel
On Fri, 20 May 2011 10:11:13 -0500 Brian Bockelman wrote:
> Try giving it to a user; watch them feed it a list of 10,000 files;
> watch the machine swap to death and the disks uselessly thrash.
> Simple doesn't imply scalable, unfortunately.
True, I assumed that anyone who wants this knows what they're doing (i.e. the files could be small and already in the Linux block cache). After all, why would anyone read files in parallel if that causes disk seeks all over the place? Ideally, you should tune for one sequential read per disk at a time. In that respect, I definitely agree that some clever logic in userspace to optimize disk reads (across a range of possible hardware setups) would be beneficial.
Dieter
Re: Are hadoop fs commands serial or parallel
Hi guys, Another related question: when you run hadoop fs -copyFromLocal or use the API to call fs.write(), does it write to the local filesystem first, before writing to HDFS? I read that it writes to the local filesystem until the block size is reached and only then writes to HDFS. Wouldn't the HDFS client choke on those local filesystem writes if multiple such fs -copyFromLocal commands are running? I thought that at least with fs.write(), if you provide a byte array, it should not write to the local filesystem. Could somebody explain how fs -copyFromLocal and fs.write() work? Do they write to the local filesystem until the block size is reached and then write to HDFS, or do they write directly to HDFS?
Thanks in advance,
-JJ
On Wed, May 18, 2011 at 9:39 AM, Patrick Angeles wrote:
> kinda clunky but you could do this via shell: