Are hadoop fs commands serial or parallel

2011-05-17 Thread Mapred Learn
Hi,
When I run a command from the HDFS client, e.g. hadoop fs -copyFromLocal, or
create a sequence file writer in Java code and append key/values to it through
the Hadoop APIs, is the data transferred/written to HDFS serially or in
parallel?

Thanks in advance,
-JJ


Re: Are hadoop fs commands serial or parallel

2011-05-17 Thread Joey Echeverria
The sequence file writer definitely does it serially as you can only
ever write to the end of a file in Hadoop.

Doing copyFromLocal could write multiple files in parallel (I'm not
sure if it does or not), but a single file would be written serially.

-Joey


-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434


Re: Are hadoop fs commands serial or parallel

2011-05-17 Thread Mapred Learn
Thanks, Joey!
I will try to find out about copyFromLocal. It looks like the Hadoop APIs write
serially, as you pointed out.

Thanks,
-JJ



Re: Are hadoop fs commands serial or parallel

2011-05-17 Thread Harsh J
Hello,

Adding to Joey's response, copyFromLocal's current implementation is serial
given a list of files.


-- 
Harsh J


Re: Are hadoop fs commands serial or parallel

2011-05-17 Thread Mapred Learn
Thanks, Harsh!
So basically both the APIs and the hadoop client commands allow only serial
writes.
Are there other ways to write data to HDFS in parallel, besides using multiple
parallel threads?

Thanks,
JJ



Re: Are hadoop fs commands serial or parallel

2011-05-18 Thread Patrick Angeles
Kind of clunky, but you could do this via the shell:

# start one background copy per file
for FILE in $LIST_OF_FILES ; do
  hadoop fs -copyFromLocal "$FILE" "$DEST_PATH" &
done
wait  # block until every copy has finished

If doing this via the Java API, then, yes you will have to use multiple
threads.



Re: Are hadoop fs commands serial or parallel

2011-05-18 Thread Mapred Learn
Thanks, Patrick!
This would work for uploading a directory, but I guess it would not work for
streaming.



Re: Are hadoop fs commands serial or parallel

2011-05-20 Thread Dieter Plaetinck
What do you mean clunky?
IMHO this is a quite elegant, simple, working solution.
Sure this spawns multiple processes, but it beats any
api-overcomplications, imho.

Dieter





Re: Are hadoop fs commands serial or parallel

2011-05-20 Thread Brian Bockelman

On May 20, 2011, at 6:10 AM, Dieter Plaetinck wrote:

> What do you mean clunky?
> IMHO this is a quite elegant, simple, working solution.

Try giving it to a user; watch them feed it a list of 10,000 files; watch the 
machine swap to death and the disks uselessly thrash.

> Sure this spawns multiple processes, but it beats any
> api-overcomplications, imho.
> 

Simple doesn't imply scalable, unfortunately.

Brian
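
One way to address the concern about thousands of simultaneous copies is to
cap how many run at once. The following is a minimal sketch, not a tested
production script: the function name parallel_put, its argument order, and the
COPY_CMD override are assumptions made for illustration; only
hadoop fs -copyFromLocal comes from the thread. It starts copies in batches of
at most MAX_JOBS and waits for each batch before starting the next.

```shell
# Hypothetical helper: parallel_put DEST MAX_JOBS FILE...
# Runs at most MAX_JOBS copies at a time, in batches.
# COPY_CMD is an assumed override hook (e.g. COPY_CMD=echo for a dry run).
parallel_put() (
  dest=$1
  max_jobs=$2
  shift 2
  running=0
  for f in "$@"; do
    ${COPY_CMD:-hadoop fs -copyFromLocal} "$f" "$dest" &
    running=$((running + 1))
    if [ "$running" -ge "$max_jobs" ]; then
      wait          # let the whole batch finish before starting more
      running=0
    fi
  done
  wait              # wait for the final, possibly partial, batch
)
```

For example, parallel_put "$DEST_PATH" 4 $LIST_OF_FILES would keep at most four
uploads in flight. Batching is simple but leaves slots idle while the slowest
copy in a batch finishes; a job pool such as xargs -P is a common alternative.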



Re: Are hadoop fs commands serial or parallel

2011-05-23 Thread Dieter Plaetinck
On Fri, 20 May 2011 10:11:13 -0500
Brian Bockelman  wrote:

> 
> On May 20, 2011, at 6:10 AM, Dieter Plaetinck wrote:
> 
> > What do you mean clunky?
> > IMHO this is a quite elegant, simple, working solution.
> 
> Try giving it to a user; watch them feed it a list of 10,000 files;
> watch the machine swap to death and the disks uselessly thrash.
> 
> > Sure this spawns multiple processes, but it beats any
> > api-overcomplications, imho.
> > 
> 
> Simple doesn't imply scalable, unfortunately.
> 
> Brian

True, I assumed that anyone who wants this knows what they are doing (for
example, the files could be small and already in the Linux block cache).
After all, why would anyone read files in parallel if that causes disk
seeks all over the place? Ideally, you should tune for one sequential read
per disk at a time. In that respect, I definitely agree that some
clever logic in userspace to optimize disk reads (across a bunch of
different possible hardware setups) would be beneficial.

Dieter
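
The "one sequential read per disk" idea can be roughly approximated in shell by
grouping the input files by the filesystem they live on and running one serial
upload stream per group. This is only a sketch under several assumptions:
per_disk_put and COPY_CMD are hypothetical names, the filesystem reported by
df -P is used as a stand-in for a physical disk (which is wrong for RAID, LVM,
or several partitions on one disk), device names are assumed to contain no
whitespace, and file names are assumed to contain no tabs or newlines.

```shell
# Hypothetical helper: per_disk_put DEST FILE...
# One background worker per source filesystem; each worker copies serially.
# COPY_CMD is an assumed override hook (e.g. COPY_CMD=echo for a dry run).
per_disk_put() (
  dest=$1
  shift
  list=$(mktemp) || exit 1
  # Record "device<TAB>file" for every input file.
  for f in "$@"; do
    dev=$(df -P "$f" 2>/dev/null | awk 'NR==2 {print $1}')
    [ -n "$dev" ] || dev=unknown   # undetermined devices share one worker
    printf '%s\t%s\n' "$dev" "$f" >> "$list"
  done
  # Start one serial copy loop per distinct device.
  for dev in $(cut -f1 "$list" | sort -u); do
    awk -F'\t' -v d="$dev" '$1 == d {print $2}' "$list" |
      while IFS= read -r f; do
        # stdin is redirected so the copy command cannot eat the file list
        ${COPY_CMD:-hadoop fs -copyFromLocal} "$f" "$dest" < /dev/null
      done &
  done
  wait
  rm -f "$list"
)
```

Files on the same filesystem are copied one after another, while files on
different filesystems are copied concurrently, so the number of parallel
readers never exceeds the number of distinct source filesystems.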


Re: Are hadoop fs commands serial or parallel

2011-05-26 Thread Mapred Learn
Hi guys,
A related question: when you run hadoop fs -copyFromLocal or call fs.write()
through the API, is the data written to the local filesystem first, before it
is written to HDFS? From what I have read, the client writes to the local
filesystem until a block's worth of data has accumulated and only then writes
to HDFS.
If so, wouldn't the HDFS client choke on local filesystem writes when several
fs -copyFromLocal commands are running at once? I would have thought that at
least fs.write(), when given a byte array, would not touch the local
filesystem.

Could somebody explain how fs -copyFromLocal and fs.write() work? Do they
buffer on the local filesystem until the block size is reached and then write
to HDFS, or do they write directly to HDFS?

Thanks in advance,
-JJ
