Back to your original question: most likely the HDFS transfer is faster because
HDFS reads blocks in parallel from multiple datanodes across the cluster,
whereas it sounds like your SCP transfer comes from a single server over a
single connection.
There may be additional benefit to be gained by encouraging data-local
scheduling of tasks, but we'd need to know more about your application.
john
From: rab ra [mailto:[email protected]]
Sent: Saturday, January 25, 2014 7:29 AM
To: [email protected]
Subject: RE: HDFS data transfer is faster than SCP based transfer?


The input files are provided as arguments to a binary executed by the map
process. This binary cannot read from HDFS, and I can't rewrite it.
On 25 Jan 2014 19:47, "John Lilley" <[email protected]> wrote:
There are no short-circuit writes, only reads, AFAIK.
Is it necessary to transfer from HDFS to local disk?  Can you read from HDFS 
directly using the FileSystem interface?
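To make the suggestion concrete, here is a minimal sketch of reading a file directly through the Hadoop FileSystem API instead of copying it to local disk first. The input path is a placeholder, and the Configuration is assumed to pick up your cluster's core-site.xml/hdfs-site.xml from the classpath:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsDirectRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // reads cluster config from classpath
        FileSystem fs = FileSystem.get(conf);     // default filesystem, e.g. hdfs://
        Path input = new Path("/user/rab/input/part-00000"); // placeholder path
        try (FSDataInputStream in = fs.open(input)) {
            // Stream the file's contents to stdout, 4 KB at a time
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```

Of course, if the binary run by the map task can only take local file paths, this doesn't apply directly; in that case `fs.copyToLocalFile()` is the programmatic equivalent of what you're doing now.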
john

From: Shekhar Sharma [mailto:[email protected]]
Sent: Saturday, January 25, 2014 3:44 AM
To: [email protected]<mailto:[email protected]>
Subject: Re: HDFS data transfer is faster than SCP based transfer?


We have the concept of short-circuit reads, which read directly from the
datanode's local disk and improve read performance. Do we have a similar
concept, like short-circuit writes?
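For reference, short-circuit reads are enabled via hdfs-site.xml configuration along these lines (property names as in Hadoop 2.x; the domain socket path is site-specific and must be creatable by the datanode):

```xml
<!-- hdfs-site.xml: let clients co-located with a datanode read block
     files directly from local disk, bypassing the datanode's TCP path. -->
<property>
  <name>dfs.client.read.shortcircuit</name>
  <value>true</value>
</property>
<property>
  <name>dfs.domain.socket.path</name>
  <value>/var/lib/hadoop-hdfs/dn_socket</value>
</property>
```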
On 25 Jan 2014 16:10, "Harsh J" <[email protected]> wrote:
There's a lot of difference here. Although both use TCP underneath, do
note that SCP encrypts data in transit, while a stock HDFS
configuration does not.

You can also ask SCP to compress the data transfer via the "-C" argument
btw - unsure if you already applied that in your test - it may account
for some of the difference. Also, the encryption algorithm can be
changed to a cheaper one if security is not a concern during the
transfer, via "-c arcfour".
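For an apples-to-apples comparison, the three variants look like this (host and paths are placeholders; note that recent OpenSSH releases disable the arcfour cipher by default):

```shell
# Default: encrypted with the negotiated cipher, no compression
scp user@remote:/data/input.bin /tmp/input.bin

# -C: compress data in transit (helps on slow links, costs CPU)
scp -C user@remote:/data/input.bin /tmp/input.bin

# -c arcfour: cheaper cipher, less CPU per byte (weaker security)
scp -C -c arcfour user@remote:/data/input.bin /tmp/input.bin
```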

On Fri, Jan 24, 2014 at 10:55 AM, rab ra <[email protected]> wrote:
> Hello
>
> I have a use case that requires transferring input files from remote storage
> using the SCP protocol (via the jSch jar).  To optimize this use case, I have
> pre-loaded all my input files into HDFS and modified my use case so that it
> copies the required files from HDFS. So, when the tasktrackers work, each copies
> the required input files from HDFS to its local directory. All my
> tasktrackers are also datanodes. I could see that my use case ran faster. The
> only modification in my application is that files are copied from HDFS instead of
> transferred using SCP. Also, my use case involves parallel operations (run in
> tasktrackers) that do a lot of file transfer. Now all these transfers are
> replaced with HDFS copies.
>
> Can anyone tell me why the HDFS transfer is faster, as I witnessed? Is it
> because it uses TCP/IP? Can anyone give me reasonable reasons to explain the
> decrease in time?
>
>
> with thanks and regards
> rab



--
Harsh J
