Re: Passing files and directory structures to the map reduce cluster via hadoop streaming?

2011-06-28 Thread Guang-Nan Cheng
Well, my bad. I made a simple test and confirmed that -files already works that way. To the two guys who "answered" my question: sorry I asked the question unclearly... I don't see how those two projects are related to the question, but thank you. :D On Wed, Jun 29, 2011 at 12:35 AM, Abhinay M

Re: Sanity check re: value of 10GbE NICs for Hadoop?

2011-06-28 Thread Bharath Mundlapudi
One could argue that it's too early for 10Gb NICs in a Hadoop cluster. Certainly having extra bandwidth is good, but at what price? Please note that all the points you mentioned can work with 1Gb NICs today, unless you can back them up with price/performance data. Typically, map output is compressed. If
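
On the map-output compression point, that is a per-job switch in the old (0.20-era) API. A minimal sketch of a driver that turns it on (the class name below is illustrative, not from this thread):

    import org.apache.hadoop.io.compress.DefaultCodec;
    import org.apache.hadoop.mapred.JobConf;

    public class CompressedShuffleJob {
      public static void main(String[] args) {
        JobConf conf = new JobConf(CompressedShuffleJob.class);
        // Compress the intermediate map output before it is shuffled to the
        // reducers, trading some CPU for network and disk bandwidth.
        conf.setCompressMapOutput(true);
        conf.setMapOutputCompressorClass(DefaultCodec.class);
        // ... set mapper, reducer, input/output paths, then JobClient.runJob(conf).
      }
    }

Since the shuffle is the traffic a faster NIC would mostly help with, compressing it first is the cheaper experiment.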

Re: Help for upgrading my hadoop-0.19.1 version to hadoop-0.20.2

2011-06-28 Thread Amareshwari Sri Ramadasu
Rajesh, We don't encourage users to migrate to the new API in branch 0.20, as it is not stable. Thanks, Amareshwari On 6/29/11 10:24 AM, "rajesh putta" wrote: Hi, Currently I am running Hadoop-0.19.1. I want to migrate from Hadoop-0.19.1 to Hadoop-0.20.2. Can anyone suggest how to go a
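
In practice that means 0.19-style code written against org.apache.hadoop.mapred keeps compiling on 0.20 (the classes are deprecated but still supported), so the API part of the upgrade need not be a rewrite. A minimal sketch of such an old-API mapper (the class itself is illustrative, not from this thread):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // An old-style Mapper: it builds and runs unchanged on both 0.19.1 and 0.20.2.
    public class IdentityLineMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, Text> {
      public void map(LongWritable key, Text value,
                      OutputCollector<LongWritable, Text> output, Reporter reporter)
          throws IOException {
        // Pass each input record through unchanged.
        output.collect(key, value);
      }
    }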

Help for upgrading my hadoop-0.19.1 version to hadoop-0.20.2

2011-06-28 Thread rajesh putta
Hi, Currently I am running Hadoop-0.19.1. I want to migrate from Hadoop-0.19.1 to Hadoop-0.20.2. Can anyone suggest how to go ahead? The main tasks are API migration and DFS migration. Thanks in advance. Thanks & Regards Rajesh Putta
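
For the DFS-migration half of that, the usual sequence (my summary, not something spelled out in this thread) is: back up the NameNode metadata (the dfs.name.dir contents), install the 0.20.2 binaries with the same configuration, start HDFS once with bin/start-dfs.sh -upgrade so the NameNode converts the on-disk layout, check the data with hadoop fsck, and only then run bin/hadoop dfsadmin -finalizeUpgrade; until you finalize, bin/start-dfs.sh -rollback can still take you back to 0.19.1.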

Re: Sanity check re: value of 10GbE NICs for Hadoop?

2011-06-28 Thread Matt Davies
I would say this is quite a difficult choice. I've seen that our cluster could use more bandwidth, but it wasn't more bandwidth to the nodes that made the big difference; it was getting better switches with better backplanes - the fabric made the difference. I've also seen some workloads where job design is

Re: Hadoop Summit - Poster 49

2011-06-28 Thread Bharath Mundlapudi
Hi Mark, Most probably the session material will be online after the summit, but I am not sure about that. I am just a presenter there :). If there is sufficient interest from users, I don't see why authors wouldn't put their sessions online. Thanks for asking though. -Bharath

Re: How do I do a reduce-side Join on values with different serialization types?

2011-06-28 Thread Dhruv Kumar
Can you pre-process the data to adhere to a uniform serialization scheme first? Dir 1: to to Dir 2: to or Dir 1: to Dir 2: to to Next, do a reduce side join. To the best of my knowledge, Hadoop does not allow multiple types for values in the reduce side. On Tue, Jun 28, 2011 at 5:53
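
One concrete way to get that uniform value type (a sketch of mine, not from the thread; the two wrapped classes below are placeholders since the real types were stripped from the quoted mail) is a GenericWritable wrapper, so every reduce-side value arrives as the same class:

    import org.apache.hadoop.io.GenericWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    // Wraps either of the two per-directory value types in one Writable class.
    public class JoinValue extends GenericWritable {
      @SuppressWarnings("unchecked")
      private static final Class<? extends Writable>[] TYPES =
          (Class<? extends Writable>[]) new Class[] { Text.class, LongWritable.class };

      @Override
      protected Class<? extends Writable>[] getTypes() {
        return TYPES;
      }
    }

Each mapper would emit its native value wrapped via joinValue.set(value), and the reducer unwraps with joinValue.get() and tells the two sides apart with an instanceof check.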

Re: Sanity check re: value of 10GbE NICs for Hadoop?

2011-06-28 Thread Russell Jurney
Price the cost of 1GbE->10GbE vs. more nodes, using data from monitoring your cluster during peak load. It should be clear which is a better value. Russ On Tue, Jun 28, 2011 at 4:05 PM, Mathias Herberts < mathias.herbe...@gmail.com> wrote: > On Wed, Jun 29, 2011 at 01:02, Matei Zaharia > wrote

Re: Sanity check re: value of 10GbE NICs for Hadoop?

2011-06-28 Thread Mathias Herberts
On Wed, Jun 29, 2011 at 01:02, Matei Zaharia wrote: > Ideally, to evaluate whether you want to go for 10GbE NICs, you would profile > your target Hadoop workload and see whether it's communication-bound. Hadoop > jobs can definitely be communication-bound if you shuffle a lot of data > between

Re: Sanity check re: value of 10GbE NICs for Hadoop?

2011-06-28 Thread James Seigel
If you are very adhoc-y, the more bandwidth the merrier! James Sent from my mobile. Please excuse the typos. On 2011-06-28, at 5:03 PM, Matei Zaharia wrote: > Ideally, to evaluate whether you want to go for 10GbE NICs, you would profile > your target Hadoop workload and see whether it's communi

Re: Sanity check re: value of 10GbE NICs for Hadoop?

2011-06-28 Thread Matei Zaharia
Ideally, to evaluate whether you want to go for 10GbE NICs, you would profile your target Hadoop workload and see whether it's communication-bound. Hadoop jobs can definitely be communication-bound if you shuffle a lot of data between map and reduce, but I've also seen a lot of clusters that are

RE: Sanity check re: value of 10GbE NICs for Hadoop?

2011-06-28 Thread Saqib Jang -- Margalla Communications
Matt, Thanks, this is helpful. I was wondering if you have any thoughts on the list of other potential benefits of 10GbE NICs for Hadoop (listed in my original e-mail to the list)? Regards, Saqib -Original Message- From: Matthew Foley [mailto:ma...@yahoo-inc.com] Sent: Tuesday, June

How do I do a reduce-side Join on values with different serialization types?

2011-06-28 Thread W.P. McNeill
I have two directories. Directory 1 contains values of the form and directory 2 contains values of the form . The key values are the same in the two directories. I want to take them as input and produce output of the form . A reasonable strategy is to do a reduce-side Join as described in section

hadoop pipes

2011-06-28 Thread jitter
Hi, I am confused about the execution of a Hadoop Pipes program: what happens when we run the Hadoop Pipes command, e.g. bin/hadoop pipes -D hadoop.pipes.java.recordreader=true etc.? I don't know how the program runs or what the control flow does. I know we compile the C++ program with g++ and run it by ./

Re: Hadoop Summit - Poster 49

2011-06-28 Thread Mark Kerzner
Ah, I just came from Santa Clara! Will there be sessions online? Thank you, Mark On Tue, Jun 28, 2011 at 2:43 PM, Bharath Mundlapudi wrote: > Hello All, > > As you all know, tomorrow is the Hadoop Summit 2011. There will be many > interesting talks tomorrow. Don't miss any talk if you want to se

Re: Sanity check re: value of 10GbE NICs for Hadoop?

2011-06-28 Thread Matthew Foley
Hadoop common provides an abstract FileSystem class, and Hadoop applications should be designed to run on that. HDFS is just one implementation of a valid Hadoop filesystem, and ports to S3 and KFS as well as OS-supported LocalFileSystem are provided in Hadoop common. Use of NFS-mounted storag
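
From application code the abstraction looks the same regardless of the backing store; a small sketch (the demo class and sample URIs are illustrative, not from this message):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class FsAbstractionDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Works with e.g. hdfs://namenode:8020/data, file:///tmp/data, s3n://bucket/data;
        // the URI scheme picks the FileSystem implementation at runtime.
        Path p = new Path(args[0]);
        FileSystem fs = p.getFileSystem(conf);
        System.out.println(p + " has length " + fs.getFileStatus(p).getLen());
      }
    }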

RE: Performance Tunning

2011-06-28 Thread GOEKE, MATTHEW (AG/1000)
Mike, Somewhat of a tangent but it is actually very informative to hear that you are getting bound by I/O with a 2:1 core to disk ratio. Could you share what you used to make those calls? We have been using both a local ganglia daemon as well as the Hadoop ganglia daemon to get an overall look

RE: Performance Tunning

2011-06-28 Thread Michael Segel
Matthew, I understood that Juan was talking about a 2-socket quad-core box. We run boxes with the E5500 (Xeon quad-core) chips. Linux sees these as 16 cores. Our data nodes are 32GB RAM with 4 x 2TB SATA. It's a pretty basic configuration. What I was saying was that if you consider 1 core for

Re: Sanity check re: value of 10GbE NICs for Hadoop?

2011-06-28 Thread Darren Govoni
I see. However, Hadoop is designed to operate best with HDFS because of its inherent striping and blocking strategy - which is tracked by Hadoop. Going outside of that mechanism will probably yield poor results and/or confuse Hadoop. Just my thoughts. On 06/28/2011 01:27 PM, Saqib Jang -- Margal

RE: Sanity check re: value of 10GbE NICs for Hadoop?

2011-06-28 Thread Saqib Jang -- Margalla Communications
Darren, Thanks, the last point was basically about 10GbE potentially allowing the use of a network file system, e.g. via NFS, as an alternative to HDFS; the question is whether there is any merit in this. Basically, I was exploring whether the commercial clustered NAS products offer any high-availability or data manage

Re: Sanity check re: value of 10GbE NICs for Hadoop?

2011-06-28 Thread Darren Govoni
Hadoop, like other parallel networked computation architectures, is predominantly I/O-bound. This means any increase in network bandwidth is "A Good Thing" and can have drastic positive effects on performance. All your points stem from this simple realization. I am confused by your #6, though.

Sanity check re: value of 10GbE NICs for Hadoop?

2011-06-28 Thread Saqib Jang -- Margalla Communications
Folks, I've been digging into the potential benefits of using 10 Gigabit Ethernet (10GbE) NIC server connections for Hadoop, and I wanted to run what I've come up with in my initial research by the list for 'sanity check' feedback. I'd very much appreciate your input on the importance (or lac

Re: Passing files and directory structures to the map reduce cluster via hadoop streaming?

2011-06-28 Thread Abhinay Mehta
We use Mandy: https://github.com/forward/mandy for this. On 28 June 2011 17:26, Nick Jones wrote: > Take a look at Wukong from the guys at Infochimps: > https://github.com/mrflip/wukong > > > On 06/28/2011 11:19 AM, Guang-Nan Cheng wrote: > >> I'm fancied ab

Re: Passing files and directory structures to the map reduce cluster via hadoop streaming?

2011-06-28 Thread Nick Jones
Take a look at Wukong from the guys at Infochimps: https://github.com/mrflip/wukong On 06/28/2011 11:19 AM, Guang-Nan Cheng wrote: I like the idea of passing a whole Ruby app to streaming, so I don't need to bother with Ruby file dependencies. For example, ./streaming ... -mapper 'ruby aaa/bbb/ccc' -files aaa <--- pass the folder Is this supported already? If not, any tips on how to make this work? I'm willing to a

Passing files and directory structures to the map reduce cluster via hadoop streaming?

2011-06-28 Thread Guang-Nan Cheng
I like the idea of passing a whole Ruby app to streaming, so I don't need to bother with Ruby file dependencies. For example, ./streaming ... -mapper 'ruby aaa/bbb/ccc' -files aaa <--- pass the folder Is this supported already? If not, any tips on how to make this work? I'm willing to a

Re: Api migration from 0.19.1 to 0.20.20

2011-06-28 Thread Shi Yu
On 6/28/2011 7:12 AM, Prashant Sharma wrote: Hi, I have my source code written against the 0.19.1 Hadoop API and want to move it to the newer 0.20.20 API. Any pointers to good documentation on migrating from the older version to the newer one would be very helpful. Thanks. Prashant ---

RE: Performance Tunning

2011-06-28 Thread GOEKE, MATTHEW (AG/1000)
Mike, I'm not really sure I have seen a community consensus around how to handle hyper-threading within Hadoop (although I have seen quite a few articles that discuss it). I was assuming that when Juan mentioned they were 4-core boxes that he meant 4 physical cores and not HT cores. I was more

RE: Why I cannot see live nodes in a LAN-based cluster setup?

2011-06-28 Thread Jeff.Schmitz
You may also try removing the hadoop-"yourname" directory from /tmp - and reformatting HDFS - it may be corrupted -Original Message- From: GOEKE, MATTHEW (AG/1000) [mailto:matthew.go...@monsanto.com] Sent: Monday, June 27, 2011 10:57 PM To: common-user@hadoop.apache.org Subject: RE: Why

Re: Performance Tunning

2011-06-28 Thread Michel Segel
Matt, You have 2 threads per core, so your Linux box thinks an 8-core box has 16 cores. In my calcs, I tend to take a whole core each for the TT, DN, and RS, and then a thread per slot, so you end up with 10 slots per node. Of course memory is also a factor. Note this is only a starting point. You can always tun
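
Worked through (my reading of those numbers, not spelled out in the mail): 8 physical cores x 2 hardware threads = 16 threads; reserving one core (2 threads) each for the TaskTracker, DataNode, and (presumably) the HBase RegionServer consumes 6 threads, leaving 10 threads and hence 10 task slots, which you would then split between mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum on each tasktracker.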

Api migration from 0.19.1 to 0.20.20

2011-06-28 Thread Prashant Sharma
Hi, I have my source code written against the 0.19.1 Hadoop API and want to move it to the newer 0.20.20 API. Any pointers to good documentation on migrating from the older version to the newer one would be very helpful. Thanks. Prashant This m

HARs without Map/Reduce

2011-06-28 Thread Vitalii Tymchyshyn
Hello. In my application I am using HDFS without MapReduce. Yesterday on this list I learned about HAR archives. This is a great solution for me for handling archived data, so I decided to create such an archive and test it. The creation worked in local MapReduce mode, but each file took ~3
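
For what it's worth, only creating the archive (hadoop archive -archiveName ...) runs a MapReduce job; once the .har exists it can be read through the ordinary FileSystem API via the har:// scheme, with no MapReduce involved. A minimal sketch (the host, port, and archive path are made-up placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HarListing {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // har:// wraps the underlying filesystem URI; adjust to your NameNode and path.
        Path har = new Path("har://hdfs-namenode:8020/archives/logs.har");
        FileSystem fs = har.getFileSystem(conf);
        // List the files stored inside the archive.
        for (FileStatus st : fs.listStatus(har)) {
          System.out.println(st.getPath() + "\t" + st.getLen());
        }
      }
    }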