Hardware performance from HADOOP cluster

2009-10-14 Thread Usman Waheed
Hi, Is there a way to tell what kind of performance numbers one can expect out of their cluster given a certain set of specs? For example, I have 5 nodes in my cluster that all have the following hardware configuration: Quad Core 2.0GHz, 8GB RAM, 4x1TB disks, and are all on the same rack.

Re: Hardware performance from HADOOP cluster

2009-10-14 Thread tim robertson
Might it be worth running the http://wiki.apache.org/hadoop/Sort and posting your results for comment? Tim On Wed, Oct 14, 2009 at 10:48 AM, Usman Waheed wrote: > Hi, > > Is there a way to tell what kind of performance numbers one can expect out > of their cluster given a certain set of specs.

Error register getProtocolVersion

2009-10-14 Thread tim robertson
Hi all, Using hadoop 0.20.1 I am seeing the following on my namenode startup. 2009-10-14 11:09:54,232 INFO org.apache.hadoop.ipc.Server: Error register getProtocolVersion java.lang.IllegalArgumentException: Duplicate metricsName:getProtocolVersion Could someone please point me in the right direc

Re: Hardware performance from HADOOP cluster

2009-10-14 Thread Usman Waheed
Thanks Tim, I will check it out and post my results for comments. -Usman Might it be worth running the http://wiki.apache.org/hadoop/Sort and posting your results for comment? Tim On Wed, Oct 14, 2009 at 10:48 AM, Usman Waheed wrote: Hi, Is there a way to tell what kind of performance nu

Re: Hardware performance from HADOOP cluster

2009-10-14 Thread tim robertson
I am setting up a new cluster of 10 nodes of 2.83G Quadcore (2x6MB cache), 8G RAM and 2x500G drives, and will do the same soon. Got some issues though so it won't start up... Tim On Wed, Oct 14, 2009 at 11:36 AM, Usman Waheed wrote: > Thanks Tim, i will check it out and post my results for com

Project ideas !

2009-10-14 Thread Siddu
Hello Hadoop Users, A friend of mine and I are looking for some project ideas based on Hadoop as part of our curriculum. Can you give us some pointers, please? Thanks in advance! Regards, ~Sid~

Re: Hardware performance from HADOOP cluster

2009-10-14 Thread Usman Waheed
Here are the results I got from my 4-node cluster (correction: I noted 5 earlier). One of my 4 nodes is both a namenode and a datanode. GENERATE RANDOM DATA Wrote out 40GB of random binary data: Map output records=4088301 The job took 358 seconds. (approximately: 6 minutes). SORT RANDOM GEN

Re: Project ideas !

2009-10-14 Thread sudha sadhasivam
Some of the projects include: 1) Categorise URLS based on domains 2) Content based searching 3) P2P information retrieval 4) Performance enhancements in map-reduce. 5) Sort and shuffle optimisations in MR framework. 6) Enhancements of scheduling strategies in hadoop 7) Document classification 8) Do

Re: Hardware performance from HADOOP cluster

2009-10-14 Thread sudha sadhasivam
You can use terabyte sort to check performance. G Sudha Sadasivam --- On Wed, 10/14/09, tim robertson wrote: From: tim robertson Subject: Re: Hardware performance from HADOOP cluster To: common-user@hadoop.apache.org Date: Wednesday, October 14, 2009, 3:16 PM I am setting up a new cluster of

Re: DataNode is shutting down

2009-10-14 Thread sudha sadhasivam
Maybe the master / slave file in the conf directory was overwritten with the new data nodes' addresses. Instead, they should be appended. G Sudha Sadasivam --- On Wed, 10/14/09, yibo820217 wrote: From: yibo820217 Subject: DataNode is shutting down To: core-u...@hadoop.apache.org Date: Wednesday, October 14, 2

Re: Project ideas !

2009-10-14 Thread tim robertson
I am interested to see more spatial processing carried out on hadoop. I have done basic spatial joins intersecting hundreds of millions of points with hundreds of thousands of polygons, but that is all. It's something I'd like to spend time researching, but don't have that time... could be a nice piece of resear

java.lang.IllegalArgumentException: Duplicate metricsName:getProtocolVersion

2009-10-14 Thread tim robertson
Hi all, Using hadoop 0.20.1 I am seeing the following on my namenode startup. 2009-10-14 11:09:54,232 INFO org.apache.hadoop.ipc.Server: Error register getProtocolVersion java.lang.IllegalArgumentException: Duplicate metricsName:getProtocolVersion Could someone please point me in the rig

Re: DataNode is shutting down

2009-10-14 Thread Brian Bockelman
Hey, Another possibility is that you have inadvertently put some of the datanode files on a shared directory, such as an NFS mount. I've seen the same problem before on the mailing list (did you search the list archives?) Brian On Oct 14, 2009, at 6:30 AM, sudha sadhasivam wrote: May b

Re: Error register getProtocolVersion

2009-10-14 Thread Brian Bockelman
Hey Tim, Can you see if that goes away if you use the null metrics context (the error appears to be coming from metrics - have you touched them?)? Brian On Oct 14, 2009, at 4:33 AM, tim robertson wrote: Hi all, Using hadoop 0.20.1 I am seeing the following on my namenode startup. 2009-10

Hadoop Job Opportunity - NYC

2009-10-14 Thread David Moccia
Hello All- My name is David Moccia and I am with the HR department here at the Gilt Groupe. We are HUGE supporters of the open source community. We currently have contributors that are employed here. I have attached a job description of a position that we are looking to fill with someone who

Re: Optimization of cpu and i/o usage / other bottlenecks?

2009-10-14 Thread Chris Seline
No, there doesn't seem to be all that much network traffic. Most of the time traffic (measured with nethogs) is about 15-30K/s on the master and slaves during map; sometimes it bursts up to 5-10 MB/s on a slave for maybe 5-10 seconds on a query that takes 10 minutes, but that is still less than wh

Re: Error register getProtocolVersion

2009-10-14 Thread tim robertson
Hi Brian, Thanks for replying. I don't even know what the metrics really are, so I have never knowingly touched them. I have only done the basic core and site configuration along with master / slaves list and then a namenode format and started the cluster to get this error. All based on the curr

Re: Optimization of cpu and i/o usage / other bottlenecks?

2009-10-14 Thread Jason Venner
I remember having a problem like this at one point, it was related to the mean run time of my tasks, and the rate that the jobtracker could start new tasks. By increasing the split size until the mean run time of my tasks was in the minutes, I was able to drive up the utilization. On Wed, Oct 14

RE: Project ideas !

2009-10-14 Thread Patterson, Josh
Siddu, If this is for an undergraduate class, I would suggest something that allows you to get some work in with basic data structures such as building an inverted index over a few million documents (maybe Wikipedia pages?). You will also need to get a general feel for Hadoop. The University of Wa

Re: Error register getProtocolVersion

2009-10-14 Thread Brian Bockelman
Hey Tim, That was my best guess - looks like you've got the right metrics config. Anyone else have an idea? Brian On Oct 14, 2009, at 9:37 AM, tim robertson wrote: Hi Brian, Thanks for replying. I don't even know what the metrics really are so have never knowingly touched them. I have o

NullPointer on starting NameNode

2009-10-14 Thread Bryn Divey
Hi all, I'm getting the following on initializing my NameNode. The actual line throwing the exception is if (atime != -1) { -> long inodeTime = inode.getAccessTime(); Have I corrupted the fsimage or something? This is on the Cloudera packaging of Hadoop 0.20.1+133. Regards, Bryn

RE: Hadoop User Group (Bay Area) - next Wednesday (Oct 21st) at Yahoo!

2009-10-14 Thread Dekel Tankel
Hi all, RSVP is still open for the next monthly Bay Area Hadoop user group at the Yahoo! Sunnyvale Campus, next Wednesday (Oct 21st), 6PM Registration and Agenda are available here http://www.meetup.com/hadoop/calendar/11532125/ Looking forward to seeing you next week! Dekel

Re: NullPointer on starting NameNode

2009-10-14 Thread Hairong Kuang
This might be caused by https://issues.apache.org/jira/browse/HDFS-686. I will upload a patch there for you to start your NameNode. Hairong On 10/14/09 9:15 AM, "Bryn Divey" wrote: > Hi all, > > I'm getting the following on initializing my NameNode. The actual line > throwing the exception i

Re: NullPointer on starting NameNode

2009-10-14 Thread Todd Lipcon
Hi Bryn, Just to let you know, we've queued the patch Hairong mentioned for the next update to our distribution, due out around the end of this month. Thanks! -Todd On Wed, Oct 14, 2009 at 9:15 AM, Bryn Divey wrote: > Hi all, > > I'm getting the following on initializing my NameNode. The actua

Re: Hardware performance from HADOOP cluster

2009-10-14 Thread Todd Lipcon
This seems a bit slow for that setup (4-5 MB/sec/node sorting). Have you changed the configurations at all? There are some notes on this blog post that might help your performance a bit: http://www.cloudera.com/blog/2009/03/30/configuration-parameters-what-can-you-just-ignore/ How many map and re

Re: Error register getProtocolVersion

2009-10-14 Thread Todd Lipcon
Hi Tim, I have also seen this error, but it's not fatal. Is this log from just the NameNode or did you tail multiple logs? It seems odd that your namenode would be trying to make an IPC client to itself (port 8020). After you see these messages, does your namenode shut down? Does jps show it run

Re: Hardware performance from HADOOP cluster

2009-10-14 Thread Chris Seline
Unless my calcs are off, that is right in line with the terabyte sort record: 1TB = roughly 1,000,000 MB; 1460 nodes; 62 seconds; 1,000,000/1460/62 = 11MB/s per machine. But they used 2.5GHz quad core DUAL CPU servers, so they have roughly 2.5x the horsepower of Usman's setup. http://developer.yah
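The per-machine figure above can be checked with a short calculation. This is only a sketch of the arithmetic in the message (total data divided by node count and by elapsed seconds); the class and method names are made up for illustration, and the inputs are the rounded figures quoted in the thread.

```java
public class SortThroughput {
    /**
     * Average throughput per node: total data divided evenly across
     * the nodes and across the wall-clock duration of the job.
     */
    static double mbPerSecondPerNode(double totalMb, int nodes, double seconds) {
        return totalMb / nodes / seconds;
    }

    public static void main(String[] args) {
        // Figures quoted in the thread: ~1,000,000 MB, 1460 nodes, 62 seconds.
        double record = mbPerSecondPerNode(1_000_000.0, 1460, 62.0);
        System.out.printf("terabyte sort record: %.1f MB/s per node%n", record);
    }
}
```

The same method can be applied to any benchmark run once its data size, node count, and elapsed time are known.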

Re: Optimization of cpu and i/o usage / other bottlenecks?

2009-10-14 Thread Chris Seline
That definitely helps a lot! I saw a few people talking about it on the webs, and they say to set the value to Long.MAX_VALUE, but that is not what I have found to be best. I see about 25% improvement at 300MB (3), CPU utilization is up to about 50-70%+, but I am still fine tuning. th

[ANN] HBase-Writer 0.20.2 - Heritrix2 Processor for HBase

2009-10-14 Thread Ryan Smith
Hello web crawlers! The latest release for HBase-Writer is now available for download: http://code.google.com/p/hbase-writer/ This latest version supports HBase/Hadoop 0.20.x and has a few changes from the previous 0.19.x version. Please check out the project site and README for details. Thanks

Re: How to get IP address of the machine where map task runs

2009-10-14 Thread Long Van Nguyen Dinh
Hello again, Could you give me any hint to start with? I have no idea how to get that information. Many thanks, Van On Tue, Oct 13, 2009 at 9:22 PM, Long Van Nguyen Dinh wrote: > Hi all, > > Given a map task, I need to know the IP address of the machine where > that task is running. Is there an

Re: NullPointer on starting NameNode

2009-10-14 Thread bryn
On Wed, 14 Oct 2009 09:53:11 -0700, Hairong Kuang wrote: > This might be caused by > https://issues.apache.org/jira/browse/HDFS-686. I will upload a patch there > for you to start your NameNode. Thanks, Hairong - I'll patch and give it a go.

Re: How to get IP address of the machine where map task runs

2009-10-14 Thread Amogh Vasekar
For starters look at any monitoring tool like vaidya, hadoop UI ( ganglia too, haven't read much on it though ). Not sure if you need this for debugging purposes or for some other real-time app.. You should be able to get info on localhost of each of your map tasks in a pretty straightforward wa

Need Info

2009-10-14 Thread shwitzu
Hello Sir! I am new to hadoop. I have a project based on webservices. I have my information in 4 databases with different files in each one of them. Say, images in one, video, documents etc. My task is to develop a web service which accepts the keyword from the client and process the request and

Re: How to get IP address of the machine where map task runs

2009-10-14 Thread Long Van Nguyen Dinh
Thanks Amogh. For my application, I want each map task to report to me where it's running. However, I have no idea how to use the Java InetAddress APIs to get that info. Could you explain more? Van On Wed, Oct 14, 2009 at 2:16 PM, Amogh Vasekar wrote: > For starters look at any monitoring tool like va

Re: How to get IP address of the machine where map task runs

2009-10-14 Thread Huy Phan
Hi Long, 1. You can make it with a hardcore way: parsing the log file of job tracker to get the information about `task id` and hostname of its node. 2. This is a sample code to get IP Address from hostname using InetAddress: |java.net.InetAddress inetAdd = java.net.InetAddress.getByName("www

Fwd: help with Hadoop custom Writable types implementation

2009-10-14 Thread z3r0c001
I'm trying to implement the Writable interface, but I'm not sure how to serialize/write/read data from nested objects in:
    public class StorageClass implements Writable {
        public String xStr;
        public String yStr;
        public List sStor;
        //omitted ctors
        @Override
        public void write(DataOutput out) throws IOExcept

Re: Optimization of cpu and i/o usage / other bottlenecks?

2009-10-14 Thread Jason Venner
The value really varies by job and by cluster, the larger the split, the more chance there is that a small number of splits will take much longer to complete than the rest resulting in a long job tail where very little of your cluster is utilized while they complete. The flip side is with very sma

Re: help with Hadoop custom Writable types implementation

2009-10-14 Thread Jeff Zhang
Write: first write the length of your list, then write the items in the list one by one. Read: read the length of your list, initialize your list, then read the items one by one and append each to the list. I suggest you use an array instead of a list. On Thu, Oct 15, 2009 at 12:01 PM, z3r0c001 wrote:
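The length-prefix approach described above can be sketched as follows. This is a minimal illustration, not the original poster's code: it assumes the list holds String items (the element type is not visible in the question), and it uses only java.io so it runs without Hadoop on the classpath. In a real job the class would declare `implements Writable`, whose `write(DataOutput)` and `readFields(DataInput)` methods have the same shape. The `roundTrip` helper is hypothetical and exists only to exercise the two methods.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class StorageClass {
    public String xStr = "";
    public String yStr = "";
    // Assumption: the nested items are Strings.
    public List<String> sStor = new ArrayList<String>();

    /** Write the scalar fields, then the list length, then each item. */
    public void write(DataOutput out) throws IOException {
        out.writeUTF(xStr);
        out.writeUTF(yStr);
        out.writeInt(sStor.size());
        for (String item : sStor) {
            out.writeUTF(item);
        }
    }

    /** Read fields back in exactly the order write() emitted them. */
    public void readFields(DataInput in) throws IOException {
        xStr = in.readUTF();
        yStr = in.readUTF();
        int count = in.readInt();
        // Replace the list rather than appending: the object may be reused.
        sStor = new ArrayList<String>(count);
        for (int i = 0; i < count; i++) {
            sStor.add(in.readUTF());
        }
    }

    /** Hypothetical helper: serialize to bytes and read into a fresh object. */
    static StorageClass roundTrip(StorageClass source) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            source.write(new DataOutputStream(bytes));
            StorageClass copy = new StorageClass();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Note that `writeUTF` is limited to strings whose encoded form fits in 65535 bytes, so longer values would need a different encoding.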

Re: help with Hadoop custom Writable types implementation

2009-10-14 Thread Amogh Vasekar
Hi, AFAIK readLine is not recommended on DataInput types. Also, look into WritableUtils to see if something there may be used. Hope this helps. Amogh On 10/15/09 9:31 AM, "z3r0c001" wrote: I'm trying to implement Writable interface. but not sure how to serialize/write/read data from nested ob

Re: How to get IP address of the machine where map task runs

2009-10-14 Thread Amogh Vasekar
InetAddress.getLocalHost() should give you that. If you are planning to make some decisions based on this, please do account for conditions arising from speculative executions ( they caused me some amount of trouble when I was designing my app ) Thanks, Amogh On 10/15/09 8:15 AM, "Long Van Ng
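A minimal sketch of the `InetAddress.getLocalHost()` suggestion follows. The class and method names are invented for illustration; how the result gets reported back (log line, counter, output record) is left open, and calling this from inside a mapper is an assumption about how it would be used rather than something the thread shows.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class LocalHostInfo {
    /**
     * Returns "hostname/ip" for the machine this JVM runs on,
     * or "unknown" if the local host name cannot be resolved.
     */
    static String describeLocalHost() {
        try {
            InetAddress local = InetAddress.getLocalHost();
            return local.getHostName() + "/" + local.getHostAddress();
        } catch (UnknownHostException e) {
            return "unknown";
        }
    }

    public static void main(String[] args) {
        System.out.println("running on " + describeLocalHost());
    }
}
```

As the message above warns, with speculative execution the same task may run on more than one machine, so a single task can report several different hosts.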

Re: Need Info

2009-10-14 Thread sudha sadhasivam
Dear Shwitzu, The steps are listed below: Kindly go through wordcount and multifile word count for your project. Modify the program to list the files containing the keywords along with file names. Use file names as keys. Store the files in 4 different input directories – one for each file ty