Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

2013-04-01 Thread Himanish Kushary
h Kushary [mailto:himan...@gmail.com] > Sent: Friday, March 29, 2013 9:57 PM > To: user@hadoop.apache.org > Subject: Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput > Yes you are right CDH4 is the 2.x line, but I even checked in the javadocs &

RE: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

2013-03-30 Thread David Parks
. Dave From: Himanish Kushary [mailto:himan...@gmail.com] Sent: Friday, March 29, 2013 9:57 PM To: user@hadoop.apache.org Subject: Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput Yes you are right CDH4 is the 2.x line, but I even checked in the javadocs for 1.0.4 branch

Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

2013-03-29 Thread Jens Scheidtmann
Hi Himanish, 2013/3/29 Himanish Kushary > [...] > But the real issue is the throughput. You mentioned that you had transferred 1.5 TB in 45 mins which comes to around 583 MB/s. I am barely getting 4 MB/s upload speed > How large is your outgoing link? Can you expect 500 MB/s with it?
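The 583 MB/s figure quoted above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, assuming binary units (1 TB = 1024 × 1024 MB); with decimal units the result would be closer to 556 MB/s. The class and method names are illustrative and do not come from the thread. It also estimates how long the same 1.5 TB would take at the reported 4 MB/s.

```java
public class ThroughputEstimate {
    // Assumption: binary units, 1 TB = 1024 * 1024 MB.
    static final double MB_PER_TB = 1024.0 * 1024.0;

    // Average rate in MB/s for a transfer of the given size and duration.
    public static double megabytesPerSecond(double terabytes, double minutes) {
        return terabytes * MB_PER_TB / (minutes * 60.0);
    }

    // Hours needed to move the given volume at a sustained rate in MB/s.
    public static double hoursToTransfer(double terabytes, double mbPerSecond) {
        return terabytes * MB_PER_TB / mbPerSecond / 3600.0;
    }

    public static void main(String[] args) {
        // Figures taken from the messages: 1.5 TB in 45 minutes, versus 4 MB/s.
        System.out.printf("1.5 TB in 45 min: %.0f MB/s%n", megabytesPerSecond(1.5, 45));
        System.out.printf("1.5 TB at 4 MB/s: %.0f hours%n", hoursToTransfer(1.5, 4));
    }
}
```

Under these assumptions the first figure comes out at roughly 583 MB/s and the second at about 109 hours, which is why the size of the outgoing link matters more than any distcp tuning.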

Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

2013-03-29 Thread Himanish Kushary
;/" + JobUtils.isoDate + "/output/itemtable/", >> "--s3Endpoint", "s3.amazonaws.com" }); >> Watch the “srcPattern”, make sure you have that leading `.*`, that one threw me for a

Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

2013-03-29 Thread David Parks
".*part.*", > >  "--dest",    "s3n://fruggmapreduce/results-"+env+"/" + >JobUtils.isoDate + "/output/itemtable/", > >  "--s3Endpoint",      "s3.amazonaws.com"     }); > >  >

Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

2013-03-29 Thread Himanish Kushary
zonaws.com" }); > Watch the “srcPattern”, make sure you have that leading `.*`, that one threw me for a loop once. > Dave > From: Himanish Kushary [mailto:himan...@gmail.co
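The advice about the leading `.*` makes sense if the pattern has to match the entire path rather than a substring of it. The following is a minimal sketch using `java.util.regex`, assuming whole-string matching; that is an assumption about how the `srcPattern` option is applied, not something the thread confirms. The class name and the sample path are hypothetical.

```java
import java.util.regex.Pattern;

public class SrcPatternCheck {
    // Assumption: the pattern must match the whole path, not just a part of it.
    public static boolean matchesWholePath(String regex, String path) {
        return Pattern.compile(regex).matcher(path).matches();
    }

    public static void main(String[] args) {
        // Hypothetical path, used only for illustration.
        String path = "hdfs:///output/itemtable/part-00000";

        // Without the leading ".*" the pattern cannot cover the directory prefix.
        System.out.println(matchesWholePath("part.*", path));   // false

        // With the leading ".*" the whole path is matched.
        System.out.println(matchesWholePath(".*part.*", path)); // true
    }
}
```

If the option turned out to use substring matching instead, both patterns would select the file and the leading `.*` would be harmless.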

RE: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

2013-03-28 Thread David Parks
threw me for a loop once. Dave From: Himanish Kushary [mailto:himan...@gmail.com] Sent: Thursday, March 28, 2013 5:51 PM To: user@hadoop.apache.org Subject: Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput Hi Dave, Thanks for your reply. Our hadoop instance is inside ou

Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

2013-03-28 Thread Himanish Kushary
Hi Dave, Thanks for your reply. Our hadoop instance is inside our corporate LAN. Could you please provide some details on how I could use the s3distcp from amazon to transfer data from our on-premises hadoop to amazon s3. Wouldn't some kind of VPN be needed between the Amazon EMR instance and our o

RE: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

2013-03-28 Thread David Parks
Have you tried using s3distcp from amazon? I used it many times to transfer 1.5TB between S3 and Hadoop instances. The process took 45 min, well over the 10min timeout period you're running into a problem on. Dave From: Himanish Kushary [mailto:himan...@gmail.com] Sent: Thursday, March

Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

2013-03-28 Thread Ted Dunning
The EMR distributions have special versions of the s3 file system. They might be helpful here. Of course, you likely aren't running those if you are seeing 5MB/s. An extreme alternative would be to light up an EMR cluster, copy to it, then to S3. On Thu, Mar 28, 2013 at 4:54 AM, Himanish Kusha