Re: help with Hadoop custom Writable types implementation

2009-10-15 Thread Jeff Zhang
Write:
First write the length of your list, and then write the items in the list one
by one.

Read:
Read the length of your list, initialize your list, then read the items one by
one, appending each item to the list.

I suggest you use an array instead of a list.
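A minimal sketch of that pattern for the StorageClass in the quoted question
(this assumes SubStorage is also made a Writable; writeUTF/readUTF are used
here so the write and read sides stay symmetric):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.Writable;

public class StorageClass implements Writable {

    public String xStr;
    public String yStr;
    public List<SubStorage> sStor = new ArrayList<SubStorage>();

    public void write(DataOutput out) throws IOException {
        out.writeUTF(xStr);              // length-prefixed; symmetric with readUTF()
        out.writeUTF(yStr);
        out.writeInt(sStor.size());      // 1. write the length of the list
        for (SubStorage s : sStor) {     // 2. write the items one by one
            s.write(out);
        }
    }

    public void readFields(DataInput in) throws IOException {
        xStr = in.readUTF();
        yStr = in.readUTF();
        int n = in.readInt();            // 1. read the length
        sStor = new ArrayList<SubStorage>(n);
        for (int i = 0; i < n; i++) {    // 2. read and append the items one by one
            SubStorage s = new SubStorage();
            s.readFields(in);
            sStor.add(s);
        }
    }

    // Nested type is itself a Writable, so it can serialize its own fields.
    public static class SubStorage implements Writable {
        public String x;
        public String y;

        public SubStorage() {}           // no-arg constructor needed before readFields()

        public void write(DataOutput out) throws IOException {
            out.writeUTF(x);
            out.writeUTF(y);
        }

        public void readFields(DataInput in) throws IOException {
            x = in.readUTF();
            y = in.readUTF();
        }
    }
}

If you switch to an array as suggested, the same length-then-items pattern
applies.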



On Thu, Oct 15, 2009 at 12:01 PM, z3r0c001 ceo.co...@gmail.com wrote:

 I'm trying to implement the Writable interface, but I'm not sure how to
 serialize/write/read data from nested objects in:

 public class StorageClass implements Writable {

 public String xStr;
 public String yStr;

 public List<SubStorage> sStor;

 //omitted ctors


 @Override
 public void write(DataOutput out) throws IOException {
 out.writeChars(xStr);
 out.writeChars(yStr);

 //WHAT SHOULD I DO FOR List<SubStorage>?

 }

 @Override
 public void readFields(DataInput in) throws IOException {
 xStr = in.readLine();
 yStr = in.readLine();

 //WHAT SHOULD I DO FOR List<SubStorage>?
 }

 }

 public class SubStorage {
 public String x;
 public String y;
 }



Re: How to get IP address of the machine where map task runs

2009-10-15 Thread Amogh Vasekar
InetAddress.getLocalHost() should give you that. If you are planning to make
some decisions based on this, please do account for conditions arising from
speculative execution (it caused me some amount of trouble when I was
designing my app).
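
A minimal sketch using the old 0.20 mapred API (the class name is only
illustrative; it just reports the host, it doesn't act on it):

import java.io.IOException;
import java.net.InetAddress;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class HostReportingMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private String host = "unknown";

    public void configure(JobConf job) {
        try {
            // Hostname/IP of the node this map task was scheduled on.
            InetAddress addr = InetAddress.getLocalHost();
            host = addr.getHostName() + "/" + addr.getHostAddress();
        } catch (IOException e) {
            // fall back to "unknown"
        }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Emit (host, record) so you can see where each record was processed.
        output.collect(new Text(host), value);
    }
}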

Thanks,
Amogh


On 10/15/09 8:15 AM, Long Van Nguyen Dinh munt...@gmail.com wrote:

Thanks Amogh. For my application, I want each map task to report to me
where it's running. However, I have no idea how to use the Java
InetAddress APIs to get that info. Could you explain more?

Van

On Wed, Oct 14, 2009 at 2:16 PM, Amogh Vasekar am...@yahoo-inc.com wrote:
 For starters, look at any monitoring tool like Vaidya, the Hadoop UI (Ganglia
 too, though I haven't read much on it). I'm not sure if you need this for
 debugging purposes or for some other real-time app. You should be able to
 get info on the localhost of each of your map tasks in a pretty straightforward
 way using the Java InetAddress APIs (and use that info for search?)

 Thanks,
 Amogh


 On 10/15/09 12:11 AM, Long Van Nguyen Dinh munt...@gmail.com wrote:

 Hello again,

 Could you give me any hint to start with? I have no idea how to get
 that information.

 Many thanks,
 Van

 On Tue, Oct 13, 2009 at 9:22 PM, Long Van Nguyen Dinh munt...@gmail.com 
 wrote:
 Hi all,

 Given a map task, I need to know the IP address of the machine where
 that task is running. Is there any existing method to get that
 information?

 Thank you,
 Van






Re: Need Info

2009-10-15 Thread sudha sadhasivam
Dear Shwitzu,

The steps are listed below:

Kindly go through the wordcount and multi-file word count examples for your project.

Modify the program to list the files containing the keywords along with the file
names. Use file names as keys.

Store the files in 4 different input directories (one for each file type) if
needed. Else you can also keep them in a single input directory.

Use the word count example, with the extensions suggested, to retrieve the file
names containing the keywords, and store the result in the output directory or
display the links.

Map: parallelized reading of multiple files.
Input key-value pair is filename - file contents.
Output key-value pair is filename - keyword and count.

Reduce: combining the output key-value pairs of the map function.

Input key-value pair is filename - keyword and count.
Output key-value pair is keyword - filenames containing the keyword.
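
A minimal sketch of the map side under those assumptions (a simplified
variant that emits keyword - file name directly; the class name and keyword
set are only illustrative, and the file name comes from the old-API
map.input.file property):

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Emits (keyword, file name) for every keyword found, so the reducer can
// collect the list of files per keyword.
public class KeywordFileMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private final Set<String> keywords =
            new HashSet<String>(Arrays.asList("image", "video", "document"));
    private final Text fileName = new Text();

    public void configure(JobConf job) {
        // Name of the file this split comes from (old-API convention).
        fileName.set(job.get("map.input.file", "unknown"));
    }

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        StringTokenizer tok = new StringTokenizer(line.toString());
        while (tok.hasMoreTokens()) {
            String word = tok.nextToken().toLowerCase();
            if (keywords.contains(word)) {
                output.collect(new Text(word), fileName);
            }
        }
    }
}

The reduce side then only has to concatenate the distinct file names it
receives for each keyword.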

The answers to your questions are:
1) How should I start with the design?
Identify the files to be saved in the HDFS input directory.
Go through the word count example.
2) Should I upload all the files and create Map, Reduce and Driver code, and
once I run my application will it automatically go to the file system and get
the results back to me?
Move all the files from the local file system to HDFS, or save them directly to
HDFS by using a suitable DFS command like copyFromLocal. Go through the DFS
commands.
3) How do I handle the binary data? I want to store binary format data using
MTOM in my database.
It can be handled in the same way as a conventional file.
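
A minimal sketch of doing the copy programmatically, equivalent to
hadoop fs -copyFromLocal (the paths shown are only illustrative; binary
files are copied exactly like text files):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up the cluster's site config
        FileSystem fs = FileSystem.get(conf);
        // Binary files (images, video) are copied the same way as text files.
        fs.copyFromLocalFile(new Path("/local/data/images/cat.jpg"),
                             new Path("/user/hadoop/input/images/cat.jpg"));
        fs.close();
    }
}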
 
G Sudha Sadasivam


t...@gmail.com wrote:


From: shwitzu shwi...@gmail.com
Subject: Need Info
To: core-u...@hadoop.apache.org
Date: Thursday, October 15, 2009, 7:19 AM



Hello Sir!

I am new to Hadoop. I have a project based on web services. I have my
information in 4 databases with different files in each one of them - say,
images in one, video, documents, etc. My task is to develop a web service
which accepts the keyword from the client, processes the request, and sends
the requested file back to the user. Now I have to use the Hadoop
distributed file system in this project.

I have the following questions:

1) How should I start with the design?
2) Should I upload all the files and create Map, Reduce and Driver code, and
once I run my application will it automatically go to the file system and get
the results back to me?
3) How do I handle the binary data? I want to store binary format data using
MTOM in my database.

Please let me know how I should proceed. I don't know much about Hadoop
and am searching for some help. It would be great if you could assist me.
Thanks again.

-- 
View this message in context: 
http://www.nabble.com/Need-Info-tp25901902p25901902.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.




  

Re: Error register getProtocolVersion

2009-10-15 Thread tim robertson
Thanks for the info Todd,

 I have also seen this error, but it's not fatal.
Ok, I'll ignore it for the time being.

 Is this log from just the NameNode or did you tail multiple logs? It
 seems odd that your namenode would be trying to make an IPC client to
 itself (port 8020).

Just the name node.

 After you see these messages, does your namenode shut down?

It kind of just hangs, really ("Already tried 9 times", "Already tried 10 times", etc.)

 Does jps show it running? Is the web interface available at port 50070?

No, it never is available.

I'll keep hunting for some config error.

Cheers
Tim



 -Todd


 On Wed, Oct 14, 2009 at 2:33 AM, tim robertson
 timrobertson...@gmail.com wrote:
 Hi all,

 Using hadoop 0.20.1 I am seeing the following on my namenode startup.

 2009-10-14 11:09:54,232 INFO org.apache.hadoop.ipc.Server: Error
 register getProtocolVersion
 java.lang.IllegalArgumentException: Duplicate metricsName:getProtocolVersion

 Could someone please point me in the right direction for diagnosing
 where I have gone wrong?

 Thanks
 Tim



 /
 STARTUP_MSG: Starting NameNode
 STARTUP_MSG:   host = cluma.gbif.org/192.38.28.77
 STARTUP_MSG:   args = []
 STARTUP_MSG:   version = 0.20.1
 STARTUP_MSG:   build =
 http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.20.1-rc1
 -r 810220; compiled by 'oom' on Tue Sep  1 20:55:56 UTC 2009
 /
 2009-10-14 11:09:53,010 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
 Initializing RPC Metrics with hostName=NameNode, port=8020
 2009-10-14 11:09:53,013 INFO
 org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at:
 cluma.gbif.clu/192.168.76.254:8020
 2009-10-14 11:09:53,015 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
 Initializing JVM Metrics with processName=NameNode, sessionId=null
 2009-10-14 11:09:53,019 INFO
 org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics:
 Initializing NameNodeMeterics using context
 object:org.apache.hadoop.metrics.spi.NullContext
 2009-10-14 11:09:53,056 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
 fsOwner=root,root,bin,daemon,sys,adm,disk,wheel
 2009-10-14 11:09:53,057 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
 supergroup=supergroup
 2009-10-14 11:09:53,057 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
 isPermissionEnabled=true
 2009-10-14 11:09:53,064 INFO
 org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics:
 Initializing FSNamesystemMetrics using context
 object:org.apache.hadoop.metrics.spi.NullContext
 2009-10-14 11:09:53,065 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered
 FSNamesystemStatusMBean
 2009-10-14 11:09:53,090 INFO
 org.apache.hadoop.hdfs.server.common.Storage: Number of files = 1
 2009-10-14 11:09:53,094 INFO
 org.apache.hadoop.hdfs.server.common.Storage: Number of files under
 construction = 0
 2009-10-14 11:09:53,094 INFO
 org.apache.hadoop.hdfs.server.common.Storage: Image file of size 94
 loaded in 0 seconds.
 2009-10-14 11:09:53,094 INFO
 org.apache.hadoop.hdfs.server.common.Storage: Edits file
 /hadoop/name/current/edits of size 4 edits # 0 loaded in 0 seconds.
 2009-10-14 11:09:53,098 INFO
 org.apache.hadoop.hdfs.server.common.Storage: Image file of size 94
 saved in 0 seconds.
 2009-10-14 11:09:53,113 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Finished loading
 FSImage in 72 msecs
 2009-10-14 11:09:53,114 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Total number of
 blocks = 0
 2009-10-14 11:09:53,114 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of invalid
 blocks = 0
 2009-10-14 11:09:53,114 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of
 under-replicated blocks = 0
 2009-10-14 11:09:53,114 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of
 over-replicated blocks = 0
 2009-10-14 11:09:53,114 INFO org.apache.hadoop.hdfs.StateChange:
 STATE* Leaving safe mode after 0 secs.
 2009-10-14 11:09:53,114 INFO org.apache.hadoop.hdfs.StateChange:
 STATE* Network topology has 0 racks and 0 datanodes
 2009-10-14 11:09:53,114 INFO org.apache.hadoop.hdfs.StateChange:
 STATE* UnderReplicatedBlocks has 0 blocks
 2009-10-14 11:09:53,198 INFO org.mortbay.log: Logging to
 org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
 org.mortbay.log.Slf4jLog
 2009-10-14 11:09:53,242 INFO org.apache.hadoop.http.HttpServer: Port
 returned by webServer.getConnectors()[0].getLocalPort() before open()
 is -1. Opening the listener on 50070
 2009-10-14 11:09:53,243 INFO org.apache.hadoop.http.HttpServer:
 listener.getLocalPort() returned 50070
 webServer.getConnectors()[0].getLocalPort() returned 50070
 2009-10-14 11:09:53,243 INFO org.apache.hadoop.http.HttpServer: Jetty
 bound to port 50070
 2009-10-14 11:09:53,243 INFO org.mortbay.log: jetty-6.1.14
 2009-10-14 11:09:53,433 INFO org.mortbay.log: Started
 

Re: Error register getProtocolVersion

2009-10-15 Thread Todd Lipcon
Hi Tim,

jps should show it running if it's hung.

In that case, see if you can get a stack trace using jstack pid.
Paste that trace and we may be able to figure out where it's hung up.

-Todd

On Thu, Oct 15, 2009 at 12:12 AM, tim robertson
timrobertson...@gmail.com wrote:
 Thanks for the info Todd,

 I have also seen this error, but it's not fatal.
 Ok, I'll ignore it for the time being.

 Is this log from just the NameNode or did you tail multiple logs? It
 seems odd that your namenode would be trying to make an IPC client to
 itself (port 8020).

 Just the name node.

 After you see these messages, does your namenode shut down?

 It kind of just hangs really (Already tried 9 times, Already tried 10 times 
 etc)

 Does jps show it running? Is the web interface available at port 50070?

 No, it never is available.

 I'll keep hunting for some config error.

 Cheers
 Tim



 -Todd


 On Wed, Oct 14, 2009 at 2:33 AM, tim robertson
 timrobertson...@gmail.com wrote:
 Hi all,

 Using hadoop 0.20.1 I am seeing the following on my namenode startup.

 2009-10-14 11:09:54,232 INFO org.apache.hadoop.ipc.Server: Error
 register getProtocolVersion
 java.lang.IllegalArgumentException: Duplicate metricsName:getProtocolVersion

 Could someone please point me in the right direction for diagnosing
 where I have gone wrong?

 Thanks
 Tim



 /
 STARTUP_MSG: Starting NameNode
 STARTUP_MSG:   host = cluma.gbif.org/192.38.28.77
 STARTUP_MSG:   args = []
 STARTUP_MSG:   version = 0.20.1
 STARTUP_MSG:   build =
 http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.20.1-rc1
 -r 810220; compiled by 'oom' on Tue Sep  1 20:55:56 UTC 2009
 /
 2009-10-14 11:09:53,010 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
 Initializing RPC Metrics with hostName=NameNode, port=8020
 2009-10-14 11:09:53,013 INFO
 org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at:
 cluma.gbif.clu/192.168.76.254:8020
 2009-10-14 11:09:53,015 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
 Initializing JVM Metrics with processName=NameNode, sessionId=null
 2009-10-14 11:09:53,019 INFO
 org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics:
 Initializing NameNodeMeterics using context
 object:org.apache.hadoop.metrics.spi.NullContext
 2009-10-14 11:09:53,056 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
 fsOwner=root,root,bin,daemon,sys,adm,disk,wheel
 2009-10-14 11:09:53,057 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
 supergroup=supergroup
 2009-10-14 11:09:53,057 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
 isPermissionEnabled=true
 2009-10-14 11:09:53,064 INFO
 org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics:
 Initializing FSNamesystemMetrics using context
 object:org.apache.hadoop.metrics.spi.NullContext
 2009-10-14 11:09:53,065 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered
 FSNamesystemStatusMBean
 2009-10-14 11:09:53,090 INFO
 org.apache.hadoop.hdfs.server.common.Storage: Number of files = 1
 2009-10-14 11:09:53,094 INFO
 org.apache.hadoop.hdfs.server.common.Storage: Number of files under
 construction = 0
 2009-10-14 11:09:53,094 INFO
 org.apache.hadoop.hdfs.server.common.Storage: Image file of size 94
 loaded in 0 seconds.
 2009-10-14 11:09:53,094 INFO
 org.apache.hadoop.hdfs.server.common.Storage: Edits file
 /hadoop/name/current/edits of size 4 edits # 0 loaded in 0 seconds.
 2009-10-14 11:09:53,098 INFO
 org.apache.hadoop.hdfs.server.common.Storage: Image file of size 94
 saved in 0 seconds.
 2009-10-14 11:09:53,113 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Finished loading
 FSImage in 72 msecs
 2009-10-14 11:09:53,114 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Total number of
 blocks = 0
 2009-10-14 11:09:53,114 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of invalid
 blocks = 0
 2009-10-14 11:09:53,114 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of
 under-replicated blocks = 0
 2009-10-14 11:09:53,114 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of
 over-replicated blocks = 0
 2009-10-14 11:09:53,114 INFO org.apache.hadoop.hdfs.StateChange:
 STATE* Leaving safe mode after 0 secs.
 2009-10-14 11:09:53,114 INFO org.apache.hadoop.hdfs.StateChange:
 STATE* Network topology has 0 racks and 0 datanodes
 2009-10-14 11:09:53,114 INFO org.apache.hadoop.hdfs.StateChange:
 STATE* UnderReplicatedBlocks has 0 blocks
 2009-10-14 11:09:53,198 INFO org.mortbay.log: Logging to
 org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via
 org.mortbay.log.Slf4jLog
 2009-10-14 11:09:53,242 INFO org.apache.hadoop.http.HttpServer: Port
 returned by webServer.getConnectors()[0].getLocalPort() before open()
 is -1. Opening the listener on 50070
 2009-10-14 11:09:53,243 INFO org.apache.hadoop.http.HttpServer:
 listener.getLocalPort() returned 50070
 

Re: Error register getProtocolVersion

2009-10-15 Thread Eason.Lee
2009-10-14 11:09:54,232 INFO org.apache.hadoop.ipc.Server: Error
register getProtocolVersion
java.lang.IllegalArgumentException: Duplicate metricsName:getProtocolVersion
   at
org.apache.hadoop.metrics.util.MetricsRegistry.add(MetricsRegistry.java:53)
   at
org.apache.hadoop.metrics.util.MetricsTimeVaryingRate.<init>(MetricsTimeVaryingRate.java:89)
   at
org.apache.hadoop.metrics.util.MetricsTimeVaryingRate.<init>(MetricsTimeVaryingRate.java:99)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)

I have got this error when I start my MapReduce job,
but everything seems OK
and I can still run my MapReduce jobs.
Any suggestions on how to deal with this error?

2009/10/15 Todd Lipcon t...@cloudera.com

 Hi Tim,

 jps should show it running if it's hung.

 In that case, see if you can get a stack trace using jstack pid.
 Paste that trace and we may be able to figure out where it's hung up.

 -Todd

 On Thu, Oct 15, 2009 at 12:12 AM, tim robertson
 timrobertson...@gmail.com wrote:
  Thanks for the info Todd,
 
  I have also seen this error, but it's not fatal.
  Ok, I'll ignore it for the time being.
 
  Is this log from just the NameNode or did you tail multiple logs? It
  seems odd that your namenode would be trying to make an IPC client to
  itself (port 8020).
 
  Just the name node.
 
  After you see these messages, does your namenode shut down?
 
  It kind of just hangs really (Already tried 9 times, Already tried 10
 times etc)
 
  Does jps show it running? Is the web interface available at port 50070?
 
  No, it never is available.
 
  I'll keep hunting for some config error.
 
  Cheers
  Tim
 
 
 
  -Todd
 
 
  On Wed, Oct 14, 2009 at 2:33 AM, tim robertson
  timrobertson...@gmail.com wrote:
  Hi all,
 
  Using hadoop 0.20.1 I am seeing the following on my namenode startup.
 
  2009-10-14 11:09:54,232 INFO org.apache.hadoop.ipc.Server: Error
  register getProtocolVersion
  java.lang.IllegalArgumentException: Duplicate
 metricsName:getProtocolVersion
 
  Could someone please point me in the right direction for diagnosing
  where I have gone wrong?
 
  Thanks
  Tim
 
 
 
  /
  STARTUP_MSG: Starting NameNode
  STARTUP_MSG:   host = cluma.gbif.org/192.38.28.77
  STARTUP_MSG:   args = []
  STARTUP_MSG:   version = 0.20.1
  STARTUP_MSG:   build =
  http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.20.1-rc1
  -r 810220; compiled by 'oom' on Tue Sep  1 20:55:56 UTC 2009
  /
  2009-10-14 11:09:53,010 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
  Initializing RPC Metrics with hostName=NameNode, port=8020
  2009-10-14 11:09:53,013 INFO
  org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at:
  cluma.gbif.clu/192.168.76.254:8020
  2009-10-14 11:09:53,015 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
  Initializing JVM Metrics with processName=NameNode, sessionId=null
  2009-10-14 11:09:53,019 INFO
  org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics:
  Initializing NameNodeMeterics using context
  object:org.apache.hadoop.metrics.spi.NullContext
  2009-10-14 11:09:53,056 INFO
  org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
  fsOwner=root,root,bin,daemon,sys,adm,disk,wheel
  2009-10-14 11:09:53,057 INFO
  org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
  supergroup=supergroup
  2009-10-14 11:09:53,057 INFO
  org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
  isPermissionEnabled=true
  2009-10-14 11:09:53,064 INFO
  org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics:
  Initializing FSNamesystemMetrics using context
  object:org.apache.hadoop.metrics.spi.NullContext
  2009-10-14 11:09:53,065 INFO
  org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered
  FSNamesystemStatusMBean
  2009-10-14 11:09:53,090 INFO
  org.apache.hadoop.hdfs.server.common.Storage: Number of files = 1
  2009-10-14 11:09:53,094 INFO
  org.apache.hadoop.hdfs.server.common.Storage: Number of files under
  construction = 0
  2009-10-14 11:09:53,094 INFO
  org.apache.hadoop.hdfs.server.common.Storage: Image file of size 94
  loaded in 0 seconds.
  2009-10-14 11:09:53,094 INFO
  org.apache.hadoop.hdfs.server.common.Storage: Edits file
  /hadoop/name/current/edits of size 4 edits # 0 loaded in 0 seconds.
  2009-10-14 11:09:53,098 INFO
  org.apache.hadoop.hdfs.server.common.Storage: Image file of size 94
  saved in 0 seconds.
  2009-10-14 11:09:53,113 INFO
  org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Finished loading
  FSImage in 72 msecs
  2009-10-14 11:09:53,114 INFO
  

Re: Hardware performance from HADOOP cluster

2009-10-15 Thread Usman Waheed

Hi Todd,

Some changes have been applied to the cluster based on the documentation 
(URL) you noted below,
like file descriptor settings and io.file.buffer.size. I will check out 
the other settings which I haven't applied yet.


Here are my map/reduce slot settings from hadoop-site.xml and
hadoop-default.xml on all nodes in the cluster:


hadoop-site.xml:
mapred.tasktracker.task.maximum = 2
mapred.tasktracker.map.tasks.maximum = 8
mapred.tasktracker.reduce.tasks.maximum = 8

hadoop-default.xml:
mapred.map.tasks = 2
mapred.reduce.tasks = 1

Thanks,
Usman



This seems a bit slow for that setup (4-5 MB/sec/node sorting). Have
you changed the configurations at all? There are some notes on this
blog post that might help your performance a bit:

http://www.cloudera.com/blog/2009/03/30/configuration-parameters-what-can-you-just-ignore/

How many map and reduce slots did you configure for the daemons? If
you have Ganglia installed you can usually get a good idea of whether
you're using your resources well by looking at the graphs while
running a job like this sort.

-Todd

On Wed, Oct 14, 2009 at 4:04 AM, Usman Waheed usm...@opera.com wrote:
  

Here are the results I got from my 4-node cluster (correction: I noted 5
earlier). One of the 4 nodes is both a namenode and a datanode.

GENERATE RANDOM DATA
Wrote out 40GB of random binary data:
Map output records=4088301
The job took 358 seconds. (approximately: 6 minutes).

SORT RANDOM GENERATED DATA
Map output records=4088301
Reduce input records=4088301
The job took 2136 seconds. (approximately: 35 minutes).

VALIDATION OF SORTED DATA
The job took 183 seconds.
SUCCESS! Validated the MapReduce framework's 'sort' successfully.

It would be interesting to see what performance numbers others with a
similar setup have obtained.

Thanks,
Usman



I am setting up a new cluster of 10 nodes of 2.83G Quadcore (2x6MB
cache), 8G RAM and 2x500G drives, and will do the same soon.  Got some
issues though so it won't start up...

Tim


On Wed, Oct 14, 2009 at 11:36 AM, Usman Waheed usm...@opera.com wrote:

  

Thanks Tim, i will check it out and post my results for comments.
-Usman



Might it be worth running the http://wiki.apache.org/hadoop/Sort and
posting your results for comment?

Tim


On Wed, Oct 14, 2009 at 10:48 AM, Usman Waheed usm...@opera.com wrote:


  

Hi,

Is there a way to tell what kind of performance numbers one can expect
out
of their cluster given a certain set of specs.

For example i have 5 nodes in my cluster that all have the following
hardware configuration(s):
Quad Core 2.0GHz, 8GB RAM, 4x1TB disks and are all on the same rack.

Thanks,
Usman





  

  



  




Re: detecting stalled daemons?

2009-10-15 Thread Steve Loughran

Edward Capriolo wrote:


I know there is a Jira open to add life-cycle methods to each Hadoop
component that can be polled for progress. I don't know the # off hand.



HDFS-326 https://issues.apache.org/jira/browse/HDFS-326 the code has its 
own branch.


This is still something I'm working on. The code works and all the tests
pass, but there are some quirks I'm not happy with now that JobTracker startup
blocks waiting for the filesystem to come up; I need to add some new
tests/mechanisms to shut down a service while it is still starting up, which
includes interrupting the JT and TT.


You can get RPMs with all this stuff packaged up for use from 
http://smartfrog.org/ , with the caveat that it's still fairly unstable.


I am currently working on the other side of the equation, integration with
multiple cloud infrastructures, with all the fun testing issues that follow:

http://www.1060.org/blogxter/entry?publicid=12CE2B62F71239349F3E9903EAE9D1F0


* The simplest liveness test for any of the workers right now is to hit
their HTTP pages; it's the classic happy test. We can and should extend
this with more self-tests, some equivalent of Axis's happy.jsp. The nice
thing about these is that they integrate well with all the existing web page
monitoring tools, though I should warn that the same tooling that tracks
and reports the health of a four-way app server doesn't really scale to
keeping an eye on 3000 task trackers. It's not the monitoring, but the
reporting.
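
A minimal sketch of that kind of happy-page probe (host, port and timeout
values are illustrative; 50060 is the default TaskTracker HTTP port):

import java.net.HttpURLConnection;
import java.net.URL;

public class HttpLivenessCheck {

    // Returns true if the daemon's web UI answers with HTTP 200 within the timeout.
    public static boolean isAlive(String host, int port, int timeoutMillis) {
        try {
            HttpURLConnection conn = (HttpURLConnection)
                    new URL("http://" + host + ":" + port + "/").openConnection();
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            int code = conn.getResponseCode();
            conn.disconnect();
            return code == 200;
        } catch (Exception e) {
            // Connection refused, timeout, DNS failure... all count as "not happy".
            return false;
        }
    }

    public static void main(String[] args) {
        // Example: probe a TaskTracker's web UI (default port 50060).
        System.out.println(isAlive("worker01.example.org", 50060, 2000) ? "happy" : "unhappy");
    }
}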


* Detecting failures of TTs and DNs is kind of tricky too; it's really
the namenode and jobtracker that know best. We need to get some
reporting in there so that when either of the masters thinks that one of
its workers is playing up, it reports that to whatever plugin wants to
know.


* Handling failures of VMs is very different from physical machines. You 
just kill the VM and restart a new one. We don't need all the 
blacklisting stuff, just some infrastructure operations and various 
notifications to the ops team.


-steve







Can't I use hive? What does this exception mean?

2009-10-15 Thread 杨卓荦
I have already installed Hadoop correctly,
but what does this mean?

~/hive/build/dist/bin$ ./hive
Hive history 
file=/tmp/yangzhuoluo/hive_job_log_yangzhuoluo_200910151919_995767046.txt
hive> show tables;
FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Unexpected 
exception caught.
NestedThrowables:
java.lang.reflect.InvocationTargetException
FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.DDLTask


Can anybody tell me what this exception means? Thank you all very much.




My Hadoop World Report

2009-10-15 Thread Mikio Uzawa
Hi all,

My Hadoop World report has reached 4th position in the page-view ranking of
ITmedia, which is a very large Japanese media outlet here, larger than CNET.
Please scroll down and view the right pane of the URL below. You can find the
keyword Hadoop in the list.
However, it is Japanese content, sorry.

http://www.itmedia.co.jp/enterprise/articles/0910/15/news011.html 

/mikio uzawa





Re: Error register getProtocolVersion

2009-10-15 Thread tim robertson
Hi Todd, Brian

Thanks for your comments.
I got to the bottom of the problem. A netstat showed me that all the
DNs could connect to the NN, but the firewall was stopping the NN from
completing the HTTP handshake to itself.

It's all up and running now.

Thanks,
Tim


On Thu, Oct 15, 2009 at 10:23 AM, Todd Lipcon t...@cloudera.com wrote:
 Hi Tim,

 jps should show it running if it's hung.

 In that case, see if you can get a stack trace using jstack pid.
 Paste that trace and we may be able to figure out where it's hung up.

 -Todd

 On Thu, Oct 15, 2009 at 12:12 AM, tim robertson
 timrobertson...@gmail.com wrote:
 Thanks for the info Todd,

 I have also seen this error, but it's not fatal.
 Ok, I'll ignore it for the time being.

 Is this log from just the NameNode or did you tail multiple logs? It
 seems odd that your namenode would be trying to make an IPC client to
 itself (port 8020).

 Just the name node.

 After you see these messages, does your namenode shut down?

 It kind of just hangs really (Already tried 9 times, Already tried 10 times 
 etc)

 Does jps show it running? Is the web interface available at port 50070?

 No, it never is available.

 I'll keep hunting for some config error.

 Cheers
 Tim



 -Todd


 On Wed, Oct 14, 2009 at 2:33 AM, tim robertson
 timrobertson...@gmail.com wrote:
 Hi all,

 Using hadoop 0.20.1 I am seeing the following on my namenode startup.

 2009-10-14 11:09:54,232 INFO org.apache.hadoop.ipc.Server: Error
 register getProtocolVersion
 java.lang.IllegalArgumentException: Duplicate 
 metricsName:getProtocolVersion

 Could someone please point me in the right direction for diagnosing
 where I have gone wrong?

 Thanks
 Tim



 /
 STARTUP_MSG: Starting NameNode
 STARTUP_MSG:   host = cluma.gbif.org/192.38.28.77
 STARTUP_MSG:   args = []
 STARTUP_MSG:   version = 0.20.1
 STARTUP_MSG:   build =
 http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.20.1-rc1
 -r 810220; compiled by 'oom' on Tue Sep  1 20:55:56 UTC 2009
 /
 2009-10-14 11:09:53,010 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
 Initializing RPC Metrics with hostName=NameNode, port=8020
 2009-10-14 11:09:53,013 INFO
 org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at:
 cluma.gbif.clu/192.168.76.254:8020
 2009-10-14 11:09:53,015 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
 Initializing JVM Metrics with processName=NameNode, sessionId=null
 2009-10-14 11:09:53,019 INFO
 org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics:
 Initializing NameNodeMeterics using context
 object:org.apache.hadoop.metrics.spi.NullContext
 2009-10-14 11:09:53,056 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
 fsOwner=root,root,bin,daemon,sys,adm,disk,wheel
 2009-10-14 11:09:53,057 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
 supergroup=supergroup
 2009-10-14 11:09:53,057 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem:
 isPermissionEnabled=true
 2009-10-14 11:09:53,064 INFO
 org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics:
 Initializing FSNamesystemMetrics using context
 object:org.apache.hadoop.metrics.spi.NullContext
 2009-10-14 11:09:53,065 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered
 FSNamesystemStatusMBean
 2009-10-14 11:09:53,090 INFO
 org.apache.hadoop.hdfs.server.common.Storage: Number of files = 1
 2009-10-14 11:09:53,094 INFO
 org.apache.hadoop.hdfs.server.common.Storage: Number of files under
 construction = 0
 2009-10-14 11:09:53,094 INFO
 org.apache.hadoop.hdfs.server.common.Storage: Image file of size 94
 loaded in 0 seconds.
 2009-10-14 11:09:53,094 INFO
 org.apache.hadoop.hdfs.server.common.Storage: Edits file
 /hadoop/name/current/edits of size 4 edits # 0 loaded in 0 seconds.
 2009-10-14 11:09:53,098 INFO
 org.apache.hadoop.hdfs.server.common.Storage: Image file of size 94
 saved in 0 seconds.
 2009-10-14 11:09:53,113 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Finished loading
 FSImage in 72 msecs
 2009-10-14 11:09:53,114 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Total number of
 blocks = 0
 2009-10-14 11:09:53,114 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of invalid
 blocks = 0
 2009-10-14 11:09:53,114 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of
 under-replicated blocks = 0
 2009-10-14 11:09:53,114 INFO
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Number of
 over-replicated blocks = 0
 2009-10-14 11:09:53,114 INFO org.apache.hadoop.hdfs.StateChange:
 STATE* Leaving safe mode after 0 secs.
 2009-10-14 11:09:53,114 INFO org.apache.hadoop.hdfs.StateChange:
 STATE* Network topology has 0 racks and 0 datanodes
 2009-10-14 11:09:53,114 INFO org.apache.hadoop.hdfs.StateChange:
 STATE* UnderReplicatedBlocks has 0 blocks
 2009-10-14 11:09:53,198 INFO org.mortbay.log: Logging to
 

normal hadoop errors?

2009-10-15 Thread Patrick Angeles
I got the following error while running the example sort program (Hadoop
0.20) on a brand new Hadoop cluster (using the Cloudera distro). The job
seems to have recovered. However, I'm wondering whether this is normal or
whether I should be checking for something.



attempt_200910051513_0009_r_05_0: 09/10/15 09:53:52 INFO hdfs.DFSClient:
Exception in createBlockOutputStream java.io.IOException: Bad connect ack
with firstBadLink 10.10.10.52:50010

attempt_200910051513_0009_r_05_0: 09/10/15 09:53:52 INFO hdfs.DFSClient:
Abandoning block blk_-7778196938228518311_13172

attempt_200910051513_0009_r_05_0: 09/10/15 09:53:52 INFO hdfs.DFSClient:
Waiting to find target node: 10.10.10.56:50010

attempt_200910051513_0009_r_05_0: 09/10/15 09:54:01 INFO hdfs.DFSClient:
Exception in createBlockOutputStream java.io.IOException: Bad connect ack
with firstBadLink 10.10.10.55:50010

attempt_200910051513_0009_r_05_0: 09/10/15 09:54:01 INFO hdfs.DFSClient:
Abandoning block blk_-7309503129247220072_13172

attempt_200910051513_0009_r_05_0: 09/10/15 09:54:09 INFO hdfs.DFSClient:
Exception in createBlockOutputStream java.io.IOException: Bad connect ack
with firstBadLink 10.10.10.52:50010

attempt_200910051513_0009_r_05_0: 09/10/15 09:54:09 INFO hdfs.DFSClient:
Abandoning block blk_3948102363851753370_13172

attempt_200910051513_0009_r_05_0: 09/10/15 09:54:15 INFO hdfs.DFSClient:
Exception in createBlockOutputStream java.io.IOException: Bad connect ack
with firstBadLink 10.10.10.55:50010

attempt_200910051513_0009_r_05_0: 09/10/15 09:54:15 INFO hdfs.DFSClient:
Abandoning block blk_-9105283762069697302_13172

attempt_200910051513_0009_r_05_0: 09/10/15 09:54:21 WARN hdfs.DFSClient:
DataStreamer Exception: java.io.IOException: Unable to create new block.

attempt_200910051513_0009_r_05_0: at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2813)

attempt_200910051513_0009_r_05_0: at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077)

attempt_200910051513_0009_r_05_0: at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263)

attempt_200910051513_0009_r_05_0: 09/10/15 09:54:21 WARN hdfs.DFSClient:
Error Recovery for block blk_-9105283762069697302_13172 bad datanode[1]
nodes == null

attempt_200910051513_0009_r_05_0: 09/10/15 09:54:21 WARN hdfs.DFSClient:
Could not get block locations. Source file
/user/hadoop/sort_out/_temporary/_attempt_200910051513_0009_r_05_0/part-5
- Aborting...

attempt_200910051513_0009_r_05_0: 09/10/15 09:54:21 WARN
mapred.TaskTracker: Error running child

attempt_200910051513_0009_r_05_0: java.io.IOException: Bad connect ack
with firstBadLink 10.10.10.55:50010

attempt_200910051513_0009_r_05_0: at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2871)

attempt_200910051513_0009_r_05_0: at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794)

attempt_200910051513_0009_r_05_0: at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077)

attempt_200910051513_0009_r_05_0: at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263)

attempt_200910051513_0009_r_05_0: 09/10/15 09:54:21 INFO
mapred.TaskRunner: Runnning cleanup for the task

09/10/15 09:54:37 INFO mapred.JobClient:  map 100% reduce 82%

09/10/15 09:54:41 INFO mapred.JobClient:  map 100% reduce 83%

09/10/15 09:54:43 INFO mapred.JobClient: Task Id :
attempt_200910051513_0009_r_17_0, Status : FAILED

java.io.IOException: Bad connect ack with firstBadLink 10.10.10.56:50010

at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2871)

at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794)

at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077)

at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263)


attempt_200910051513_0009_r_17_0: 09/10/15 09:47:48 WARN
conf.Configuration:
/h3/hadoop/mapred/hadoop/taskTracker/jobcache/job_200910051513_0009/attempt_200910051513_0009_r_17_0/job.xml:a
attempt to override final parameter: hadoop.tmp.dir;  Ignoring.

attempt_200910051513_0009_r_17_0: 09/10/15 09:47:48 WARN
conf.Configuration:
/h3/hadoop/mapred/hadoop/taskTracker/jobcache/job_200910051513_0009/attempt_200910051513_0009_r_17_0/job.xml:a
attempt to override final parameter:
hadoop.rpc.socket.factory.class.default;  Ignoring.

attempt_200910051513_0009_r_17_0: 09/10/15 09:47:48 WARN
conf.Configuration:
/h3/hadoop/mapred/hadoop/taskTracker/jobcache/job_200910051513_0009/attempt_200910051513_0009_r_17_0/job.xml:a
attempt to override final parameter: tasktracker.http.threads;  Ignoring.

attempt_200910051513_0009_r_17_0: 09/10/15 09:47:48 WARN
conf.Configuration:

Re: Optimization of cpu and i/o usage / other bottlenecks?

2009-10-15 Thread Chris Seline
Thanks for your help, Jason. I actually did reduce the heap size to 400M
and it sped things up a few percent. From my experience with JVMs, if
you can get by with a smaller heap your app will run faster because GC
is more efficient for smaller garbage collections (which is also why
incremental garbage collection is often desired in web apps). The flip
side is that if you give it too little heap it can choke your
app and run the garbage collection too often, sort of like it is
drowning and gasping for breath.



Jason Venner wrote:

The value really varies by job and by cluster: the larger the split, the
more chance there is that a small number of splits will take much longer to
complete than the rest, resulting in a long job tail where very little of
your cluster is utilized while they complete.

The flip side is that with very small tasks the overhead, startup time and
co-ordination latency (which has been improved) can cause inefficient
utilization of your cluster resources.

If you really want to drive up your CPU utilization, reduce your per-task
memory size to the bare minimum, and your JVMs will consume massive amounts
of CPU doing garbage collection :) It happened at a place I worked where ~60%
of the job CPU was garbage collection.


On Wed, Oct 14, 2009 at 11:26 AM, Chris Seline ch...@searchles.com wrote:

  

That definitely helps a lot! I saw a few people talking about it on the
webs, and they say to set the value to Long.MAX_VALUE, but that is not what
I have found to be best. I see about 25% improvement at 300MB (3),
CPU utilization is up to about 50-70%+, but I am still fine tuning.


thanks!

Chris

Jason Venner wrote:



I remember having a problem like this at one point, it was related to the
mean run time of my tasks, and the rate that the jobtracker could start
new
tasks.

By increasing the split size until the mean run time of my tasks was in
the
minutes, I was able to drive up the utilization.


On Wed, Oct 14, 2009 at 7:31 AM, Chris Seline ch...@searchles.com
wrote:



  

No, there doesn't seem to be all that much network traffic. Most of the
time traffic (measured with nethogs) is about 15-30K/s on the master and
slaves during map, sometimes it bursts up 5-10 MB/s on a slave for maybe
5-10 seconds on a query that takes 10 minutes, but that is still less
than
what I see in scp transfers on EC2, which is typically about 30 MB/s.

thanks

Chris


Jason Venner wrote:





are your network interface or the namenode/jobtracker/datanodes
saturated


On Tue, Oct 13, 2009 at 9:05 AM, Chris Seline ch...@searchles.com
wrote:





  

I am using the 0.3 Cloudera scripts to start a Hadoop cluster on EC2 of
11
c1.xlarge instances (1 master, 10 slaves), that is the biggest instance
available with 20 compute units and 4x 400gb disks.

I wrote some scripts to test many (100's) of configurations running a
particular Hive query to try to make it as fast as possible, but no
matter
what I don't seem to be able to get above roughly 45% cpu utilization
on
the
slaves, and not more than about 1.5% wait state. I have also measured
network traffic and there don't seem to be bottlenecks there at all.

Here are some typical CPU utilization lines from top on a slave when
running a query:
Cpu(s): 33.9%us,  7.4%sy,  0.0%ni, 56.8%id,  0.6%wa,  0.0%hi,  0.5%si,
 0.7%st
Cpu(s): 33.6%us,  5.9%sy,  0.0%ni, 58.7%id,  0.9%wa,  0.0%hi,  0.4%si,
 0.5%st
Cpu(s): 33.9%us,  7.2%sy,  0.0%ni, 56.8%id,  0.5%wa,  0.0%hi,  0.6%si,
 1.0%st
Cpu(s): 38.6%us,  8.7%sy,  0.0%ni, 50.8%id,  0.5%wa,  0.0%hi,  0.7%si,
 0.7%st
Cpu(s): 36.8%us,  7.4%sy,  0.0%ni, 53.6%id,  0.4%wa,  0.0%hi,  0.5%si,
 1.3%st

It seems like if tuned properly, I should be able to max out my cpu (or
my
disk) and get roughly twice the performance I am seeing now. None of
the
parameters I am tuning seem to be able to achieve this. Adjusting
mapred.map.tasks and mapred.reduce.tasks does help somewhat, and
setting
the
io.file.buffer.size to 4096 does better than the default, but the rest
of
the values I am testing seem to have little positive  effect.

These are the parameters I am testing, and the values tried:

io.sort.factor=2,3,4,5,10,15,20,25,30,50,100



mapred.job.shuffle.merge.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.97,0.98,0.99
io.bytes.per.checksum=256,512,1024,2048,4192
mapred.output.compress=true,false
hive.exec.compress.intermediate=true,false



hive.map.aggr.hash.min.reduction=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.97,0.98,0.99

mapred.map.tasks=1,2,3,4,5,6,8,10,12,15,20,25,30,40,50,60,75,100,150,200



mapred.child.java.opts=-Xmx400m,-Xmx500m,-Xmx600m,-Xmx700m,-Xmx800m,-Xmx900m,-Xmx1000m,-Xmx1200m,-Xmx1400m,-Xmx1600m,-Xmx2000m
mapred.reduce.tasks=5,10,15,20,25,30,35,40,50,60,70,80,100,125,150,200
mapred.merge.recordsBeforeProgress=5000,1,2,3



mapred.job.shuffle.input.buffer.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95,0.99




Re: Hardware performance from HADOOP cluster

2009-10-15 Thread tim robertson
Hi Usman,

So on my 10 node cluster (9 DN) with 4 maps and 4 reduces (I plan on
high memory jobs so picked 4 only)
[9 DN of Dell R300: 2.83G Quadcore (2x6MB cache), 8G RAM and 2x500G SATA drives]

Using your template for stats, I get the following with no tuning:

GENERATE RANDOM DATA
Wrote out 90GB of random binary data:
Map output records=9198009
The job took 350 seconds. (approximately: 6 minutes)

SORT RANDOM GENERATED DATA
Map output records= 9197821
Reduce input records=9197821
The job took 2176 seconds. (approximately: 36mins).

So pretty similar to your initial benchmark.  I will tune it a bit
tomorrow and rerun.

If you spent time tuning your cluster and it was successful, please
can you share your config?

Cheers,
Tim





On Thu, Oct 15, 2009 at 11:32 AM, Usman Waheed usm...@opera.com wrote:
 Hi Todd,

 Some changes have been applied to the cluster based on the documentation
 (URL) you noted below,
 like file descriptor settings and io.file.buffer.size. I will check out the
 other settings which I haven't applied yet.

 My map/reduce slot settings from my hadoop-site.xml and hadoop-default.xml
 on all nodes in the cluster.

 hadoop-site.xml:
 mapred.tasktracker.task.maximum = 2
 mapred.tasktracker.map.tasks.maximum = 8
 mapred.tasktracker.reduce.tasks.maximum = 8

 hadoop-default.xml:
 mapred.map.tasks = 2
 mapred.reduce.tasks = 1

 Thanks,
 Usman


 This seems a bit slow for that setup (4-5 MB/sec/node sorting). Have
 you changed the configurations at all? There are some notes on this
 blog post that might help your performance a bit:


 http://www.cloudera.com/blog/2009/03/30/configuration-parameters-what-can-you-just-ignore/

 How many map and reduce slots did you configure for the daemons? If
 you have Ganglia installed you can usually get a good idea of whether
 you're using your resources well by looking at the graphs while
 running a job like this sort.

 -Todd

 On Wed, Oct 14, 2009 at 4:04 AM, Usman Waheed usm...@opera.com wrote:


 Here are the results i got from my 4 node cluster (correction i noted 5
 earlier). One of my nodes out of the 4 is a namenode+datanode both.

 GENERATE RANDOM DATA
 Wrote out 40GB of random binary data:
 Map output records=4088301
 The job took 358 seconds. (approximately: 6 minutes).

 SORT RANDOM GENERATED DATA
 Map output records=4088301
 Reduce input records=4088301
 The job took 2136 seconds. (approximately: 35 minutes).

 VALIDATION OF SORTED DATA
 The job took 183 seconds.
 SUCCESS! Validated the MapReduce framework's 'sort' successfully.

 It would be interesting to see what performance numbers others with a
 similar setup have obtained.

 Thanks,
 Usman



 I am setting up a new cluster of 10 nodes of 2.83G Quadcore (2x6MB
 cache), 8G RAM and 2x500G drives, and will do the same soon.  Got some
 issues though so it won't start up...

 Tim


 On Wed, Oct 14, 2009 at 11:36 AM, Usman Waheed usm...@opera.com wrote:



 Thanks Tim, i will check it out and post my results for comments.
 -Usman



 Might it be worth running the http://wiki.apache.org/hadoop/Sort and
 posting your results for comment?

 Tim


 On Wed, Oct 14, 2009 at 10:48 AM, Usman Waheed usm...@opera.com
 wrote:




 Hi,

 Is there a way to tell what kind of performance numbers one can
 expect
 out
 of their cluster given a certain set of specs.

 For example i have 5 nodes in my cluster that all have the following
 hardware configuration(s):
 Quad Core 2.0GHz, 8GB RAM, 4x1TB disks and are all on the same rack.

 Thanks,
 Usman



















Re: Hardware performance from HADOOP cluster

2009-10-15 Thread Usman Waheed

Hi Tim,

Thanks much for sharing the info.
I will most certainly share my configuration settings after applying 
some tuning at my end.

Will let you know the results on this email list.

Thanks,
Usman



Hi Usman,

So on my 10 node cluster (9 DN) with 4 maps and 4 reduces (I plan on
high memory jobs so picked 4 only)
[9 DN of Dell R300: 2.83G Quadcore (2x6MB cache), 8G RAM and 2x500G SATA drives]

Using your template for stats, I get the following with no tuning:

GENERATE RANDOM DATA
Wrote out 90GB of random binary data:
Map output records=9198009
The job took 350 seconds. (approximately: 6 minutes)

SORT RANDOM GENERATED DATA
Map output records= 9197821
Reduce input records=9197821
The job took 2176 seconds. (approximately: 36mins).

So pretty similar to your initial benchmark.  I will tune it a bit
tomorrow and rerun.

If you spent time tuning your cluster and it was successful, please
can you share your config?

Cheers,
Tim





On Thu, Oct 15, 2009 at 11:32 AM, Usman Waheed usm...@opera.com wrote:
  

Hi Todd,

Some changes have been applied to the cluster based on the documentation
(URL) you noted below,
like file descriptor settings and io.file.buffer.size. I will check out the
other settings which I haven't applied yet.

My map/reduce slot settings from my hadoop-site.xml and hadoop-default.xml
on all nodes in the cluster.

hadoop-site.xml:
mapred.tasktracker.task.maximum = 2
mapred.tasktracker.map.tasks.maximum = 8
mapred.tasktracker.reduce.tasks.maximum = 8

hadoop-default.xml:
mapred.map.tasks = 2
mapred.reduce.tasks = 1

Thanks,
Usman




This seems a bit slow for that setup (4-5 MB/sec/node sorting). Have
you changed the configurations at all? There are some notes on this
blog post that might help your performance a bit:


http://www.cloudera.com/blog/2009/03/30/configuration-parameters-what-can-you-just-ignore/

How many map and reduce slots did you configure for the daemons? If
you have Ganglia installed you can usually get a good idea of whether
you're using your resources well by looking at the graphs while
running a job like this sort.

-Todd

On Wed, Oct 14, 2009 at 4:04 AM, Usman Waheed usm...@opera.com wrote:

  

Here are the results i got from my 4 node cluster (correction i noted 5
earlier). One of my nodes out of the 4 is a namenode+datanode both.

GENERATE RANDOM DATA
Wrote out 40GB of random binary data:
Map output records=4088301
The job took 358 seconds. (approximately: 6 minutes).

SORT RANDOM GENERATED DATA
Map output records=4088301
Reduce input records=4088301
The job took 2136 seconds. (approximately: 35 minutes).

VALIDATION OF SORTED DATA
The job took 183 seconds.
SUCCESS! Validated the MapReduce framework's 'sort' successfully.

It would be interesting to see what performance numbers others with a
similar setup have obtained.

Thanks,
Usman




I am setting up a new cluster of 10 nodes of 2.83G Quadcore (2x6MB
cache), 8G RAM and 2x500G drives, and will do the same soon.  Got some
issues though so it won't start up...

Tim


On Wed, Oct 14, 2009 at 11:36 AM, Usman Waheed usm...@opera.com wrote:


  

Thanks Tim, i will check it out and post my results for comments.
-Usman




Might it be worth running the http://wiki.apache.org/hadoop/Sort and
posting your results for comment?

Tim


On Wed, Oct 14, 2009 at 10:48 AM, Usman Waheed usm...@opera.com
wrote:



  

Hi,

Is there a way to tell what kind of performance numbers one can
expect
out
of their cluster given a certain set of specs.

For example i have 5 nodes in my cluster that all have the following
hardware configuration(s):
Quad Core 2.0GHz, 8GB RAM, 4x1TB disks and are all on the same rack.

Thanks,
Usman






  

  

  



  




Hardware Setup

2009-10-15 Thread Alex Newman
So my company is looking at only using Dell or HP for our
Hadoop cluster and a Sun Thumper to back up the data. The prices are
OK after a 40% discount, but realistically I am paying twice as much
as if I went to Silicon Mechanics, and with a much slower machine. It
seems as though the big expense is the disks. Even with a 40%
discount, $550 per 1TB disk seems crazy expensive. Also, they are
pushing me to build a smaller cluster (6 nodes) and I am pushing back
for nodes half the size but twice as many. So how much of a
performance difference can I expect between 12 nodes with one Xeon 5-series
running at 2.26GHz, 8GB of RAM and 4x1TB disks, and a 6-node
cluster with two Xeon 5-series running at 2.26GHz, 16GB of RAM and 8x1TB
disks? Both setups will also have 2 very small SATA drives in RAID
1 for the OS. I will be doing some stuff with Hadoop and a lot of
stuff with HBase. What are the considerations with HDFS performance
with a low number of nodes, etc.?


Re: Hardware performance from HADOOP cluster

2009-10-15 Thread Patrick Angeles
Hi Tim,
I assume those are single proc machines?

I got 649 secs on 70GB of data for our 7-node cluster (~11 mins), but we
have dual quad Nehalems (2.26Ghz).

On Thu, Oct 15, 2009 at 11:34 AM, tim robertson
timrobertson...@gmail.comwrote:

 Hi Usman,

 So on my 10 node cluster (9 DN) with 4 maps and 4 reduces (I plan on
 high memory jobs so picked 4 only)
 [9 DN of Dell R300: 2.83G Quadcore (2x6MB cache), 8G RAM and 2x500G SATA
 drives]

 Using your template for stats, I get the following with no tuning:

 GENERATE RANDOM DATA
 Wrote out 90GB of random binary data:
 Map output records=9198009
 The job took 350 seconds. (approximately: 6 minutes)

 SORT RANDOM GENERATED DATA
 Map output records= 9197821
 Reduce input records=9197821
 The job took 2176 seconds. (approximately: 36mins).

 So pretty similar to your initial benchmark.  I will tune it a bit
 tomorrow and rerun.

 If you spent time tuning your cluster and it was successful, please
 can you share your config?

 Cheers,
 Tim





 On Thu, Oct 15, 2009 at 11:32 AM, Usman Waheed usm...@opera.com wrote:
  Hi Todd,
 
  Some changes have been applied to the cluster based on the documentation
  (URL) you noted below,
  like file descriptor settings and io.file.buffer.size. I will check out
 the
  other settings which I haven't applied yet.
 
  My map/reduce slot settings from my hadoop-site.xml and
 hadoop-default.xml
  on all nodes in the cluster.
 
  hadoop-site.xml:
  mapred.tasktracker.task.maximum = 2
  mapred.tasktracker.map.tasks.maximum = 8
  mapred.tasktracker.reduce.tasks.maximum = 8

  hadoop-default.xml:
  mapred.map.tasks = 2
  mapred.reduce.tasks = 1
 
  Thanks,
  Usman
 
 
  This seems a bit slow for that setup (4-5 MB/sec/node sorting). Have
  you changed the configurations at all? There are some notes on this
  blog post that might help your performance a bit:
 
 
 
 http://www.cloudera.com/blog/2009/03/30/configuration-parameters-what-can-you-just-ignore/
 
  How many map and reduce slots did you configure for the daemons? If
  you have Ganglia installed you can usually get a good idea of whether
  you're using your resources well by looking at the graphs while
  running a job like this sort.
 
  -Todd
 
  On Wed, Oct 14, 2009 at 4:04 AM, Usman Waheed usm...@opera.com wrote:
 
 
  Here are the results i got from my 4 node cluster (correction i noted 5
  earlier). One of my nodes out of the 4 is a namenode+datanode both.
 
  GENERATE RANDOM DATA
  Wrote out 40GB of random binary data:
  Map output records=4088301
  The job took 358 seconds. (approximately: 6 minutes).
 
  SORT RANDOM GENERATED DATA
  Map output records=4088301
  Reduce input records=4088301
  The job took 2136 seconds. (approximately: 35 minutes).
 
  VALIDATION OF SORTED DATA
  The job took 183 seconds.
  SUCCESS! Validated the MapReduce framework's 'sort' successfully.
 
  It would be interesting to see what performance numbers others with a
  similar setup have obtained.
 
  Thanks,
  Usman
 
 
 
  I am setting up a new cluster of 10 nodes of 2.83G Quadcore (2x6MB
  cache), 8G RAM and 2x500G drives, and will do the same soon.  Got some
  issues though so it won't start up...
 
  Tim
 
 
  On Wed, Oct 14, 2009 at 11:36 AM, Usman Waheed usm...@opera.com
 wrote:
 
 
 
  Thanks Tim, i will check it out and post my results for comments.
  -Usman
 
 
 
  Might it be worth running the http://wiki.apache.org/hadoop/Sortand
  posting your results for comment?
 
  Tim
 
 
  On Wed, Oct 14, 2009 at 10:48 AM, Usman Waheed usm...@opera.com
  wrote:
 
 
 
 
  Hi,
 
  Is there a way to tell what kind of performance numbers one can
  expect
  out
  of their cluster given a certain set of specs.
 
  For example i have 5 nodes in my cluster that all have the
 following
  hardware configuration(s):
  Quad Core 2.0GHz, 8GB RAM, 4x1TB disks and are all on the same
 rack.
 
  Thanks,
  Usman
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 



job during datanode decommisioning?

2009-10-15 Thread Murali Krishna. P
Hi, 
Is it safe to run jobs during datanode decommissioning? Around 10% of
the total datanodes are being decommissioned and I want to run a job
meanwhile. I wanted to confirm whether it is safe to do this.

 Thanks,
Murali Krishna


Re: Hardware Setup

2009-10-15 Thread Brian Bockelman

Hey Alex,

In order to lower cost, you'll probably want to order the worker nodes
without hard drives and then buy them separately. HDFS provides
software-level RAID, so most of the reasons behind buying hard
drives from Dell/HP are irrelevant - you are just paying an extra $400
per hard drive. I know Dell sells the R410, which has 4 SATA bays; I'm
sure Steve knows an HP model that has something similar.


However, BE VERY CAREFUL when you do this.  From experience, a certain  
large manufacturer (I don't know about Dell/HP) will refuse to ship  
(or sell separately) hard drive trays if you order their machine  
without hard drives.  When this happened to us, we were not able to  
return the machines because they were custom orders.  Eventually, we  
had to get someone to go to the machine shop and build 72 hard drive  
trays for us.


Worst. Experience. Ever.

So, ALWAYS ASK and make sure that you can buy empty hard drive trays  
for that specific model (or at least that it ships with them).


Brian

On Oct 15, 2009, at 10:48 AM, Alex Newman wrote:


 So my company is looking at only using dell or hp for our
hadoop cluster and a sun thumper to backup the data. The prices are
ok, after a 40% discount, but realistically I am paying twice as much
as if I went to silicon mechanics, and with a much slower machine. It
seems as though the big expense are the disks. Even with a 40%
discount 550$ per 1tb disk seems crazy expensive. Also, they are
pushing me to build a smaller cluster (6 nodes) and I am pushing back
for nodes half the size but having twice as many. So how much of a
performance difference can I expect btwn 12 nodes with 1 xeon 5 series
running at 2.26 ghz 8 gigs of ram with 4 1 tb disks and a 6 node
cluster with 2 xeon 5 series running at 2.26 16 gigs of ram with 8 1
tb disks. Both setups will also have 2 very small sata drives in raid
1 for the OS. I will be doing some stuff with hadoop and a lot of
stuff with HBase. What are the considerations with HDFS performance
with a low number of nodes,etc.




smime.p7s
Description: S/MIME cryptographic signature


Re: Hardware Setup

2009-10-15 Thread Patrick Angeles
After the discount, an equivalently configured Dell comes in about 10-20% over
the Silicon Mechanics price. It's close enough that unless you're spending
$100k it won't make that much of a difference. Talk to a rep, call them out
on the ridiculous drive pricing, and buy at the end of their fiscal quarter.
Strip down the machines (no RAID cards, no DVD/CD drive, non-redundant power
supply, etc.) to get the price lower. There is no need for dedicated SATA drives
with RAID for your OS; most of that is accessed during boot time, so it won't
contend much with HDFS.

We just got a bunch of Dell R410s with 24GB ram, 2x2.26Ghz procs and 4x1TB
drives.

I would go for beefier nodes in smaller quantity. Of course, some of this
depends on the volume of data and the type of processing you do. If you're
running HBase, you will benefit from lots of RAM. You also have to remember
that dual-socket configs are more power efficient, so you can fit more in a
single rack.

Cheers,

- P

On Thu, Oct 15, 2009 at 11:48 AM, Alex Newman posi...@gmail.com wrote:

  So my company is looking at only using dell or hp for our
 hadoop cluster and a sun thumper to backup the data. The prices are
 ok, after a 40% discount, but realistically I am paying twice as much
 as if I went to silicon mechanics, and with a much slower machine. It
 seems as though the big expense are the disks. Even with a 40%
 discount 550$ per 1tb disk seems crazy expensive. Also, they are
 pushing me to build a smaller cluster (6 nodes) and I am pushing back
 for nodes half the size but having twice as many. So how much of a
 performance difference can I expect btwn 12 nodes with 1 xeon 5 series
 running at 2.26 ghz 8 gigs of ram with 4 1 tb disks and a 6 node
 cluster with 2 xeon 5 series running at 2.26 16 gigs of ram with 8 1
 tb disks. Both setups will also have 2 very small sata drives in raid
 1 for the OS. I will be doing some stuff with hadoop and a lot of
 stuff with HBase. What are the considerations with HDFS performance
 with a low number of nodes,etc.



Re: Hardware Setup

2009-10-15 Thread Steve Loughran

Brian Bockelman wrote:

Hey Alex,

In order to lower cost, you'll probably want to order the worker nodes 
without hard drives then buy them separately.  HDFS provides a 
software-level RAID, so most of the reasonings behind buying hard drives 
from Dell/HP are irrelevant - you are just paying an extra $400 per hard 
drive.  I know Dell sells the R410 which has 4 SATA bays; I'm sure Steve 
knows an HP model that has something similar.


I will start with the official disclaimer: I make no recommendations about 
hardware here, so as not to get into trouble. Talk to your reseller or 
account team.


You can get servers with lots of drives in them; the DL180 and SL170z are 
model names that spring to mind.

Big issues to consider
 * server:CPU ratio
 * power budget
 * rack weight
 * do you ever plan to stick in more CPUs? Some systems take this, 
others don't.

 * Intel vs AMD
 * How much ECC RAM can you afford. And yes, it must be ECC.

Server disks are higher RPM and specced for more hours than consumer 
disks; I don't know what that means in terms of lifespan, but the higher RPM 
translates into more bandwidth off the disk.




However, BE VERY CAREFUL when you do this.  From experience, a certain 
large manufacturer (I don't know about Dell/HP) will refuse to ship (or 
sell separately) hard drive trays if you order their machine without 
hard drives.  When this happened to us, we were not able to return the 
machines because they were custom orders.  Eventually, we had to get 
someone to go to the machine shop and build 72 hard drive trays for us.


That is what physics PhD students are for; at least they didn't get a 
lifetime's radiation dose for this job.




Worst. Experience. Ever.

So, ALWAYS ASK and make sure that you can buy empty hard drive trays for 
that specific model (or at least that it ships with them).


Brian




On Oct 15, 2009, at 10:48 AM, Alex Newman wrote:


 So my company is looking at only using dell or hp for our
hadoop cluster and a sun thumper to backup the data. The prices are
ok, after a 40% discount, but realistically I am paying twice as much
as if I went to silicon mechanics, and with a much slower machine. It
seems as though the big expense are the disks. Even with a 40%
discount 550$ per 1tb disk seems crazy expensive. Also, they are
pushing me to build a smaller cluster (6 nodes) and I am pushing back
for nodes half the size but having twice as many. So how much of a
performance difference can I expect btwn 12 nodes with 1 xeon 5 series
running at 2.26 ghz 8 gigs of ram with 4 1 tb disks and a 6 node
cluster with 2 xeon 5 series running at 2.26 16 gigs of ram with 8 1
tb disks. Both setups will also have 2 very small sata drives in raid
1 for the OS. I will be doing some stuff with hadoop and a lot of
stuff with HBase. What are the considerations with HDFS performance
with a low number of nodes,etc.





It's an interesting question as to what is better: fewer nodes with more 
storage/CPU, or more, smaller nodes.


Bigger servers
 * more chance of running code near the data
 * less data moved over the LAN at shuffle time
 * RAM consumption can be more agile across tasks.
 * increased chance of disk failure on a node; hadoop handles that very 
badly right now (pre 0.20 -datanode goes offline)


Smaller servers
 * easier to place data redundantly across machines
 * less RAM taken up by other people's jobs
 * more nodes stay up when a disk fails (less important on 0.20 onwards)
 * when a node goes down, less data to re-replicate across the other 
machines


1. I would like to hear other people's opinions,

2. The gridmix 2 benchmarking stuff tries to create synthetic benchmarks 
from your real data runs. Try that, collect some data, then go to your 
suppliers.


-Steve

COI disclaimer signature:
---
Hewlett-Packard Limited
Registered Office: Cain Road, Bracknell, Berks RG12 1HN
Registered No: 690597 England


Re: Hardware Setup

2009-10-15 Thread Patrick Angeles
On Thu, Oct 15, 2009 at 12:32 PM, Edward Capriolo edlinuxg...@gmail.com wrote:


 No need for dedicated SATA drives with
 RAID for your OS. Most of that is accessed during boot time so it won't
 contend that much with HDFS.

 You may want to RAID your OS. If you lose a datanode with a large
 volume of data say (8 TB) Hadoop will begin the process of
 re-replicating that data elsewhere, that can use cluster resources.

 You MIGHT want to avoid that, or maybe you do not care.

 Having 2 disks for the OS is a waste of bays, so we got clever. Take a
 system with 8 drives @ 1TB. Slice off ~30 GB from two of the disks and
 use Linux software RAID-1 MIRROR for the OS + swap.

 Now you don't need separate disks for the OS, and you don't run the
 risk of losing that one disk that takes down the entire DataNode.


Forgot to mention it, but that is exactly what we do. We considered
net-booting as an option, but we were time-constrained and so didn't look
that deeply into it. I'd be interested in hearing others that have used
networked boot...


RE: Can't I use hive? What does this exception mean?

2009-10-15 Thread Li, Tan
Clark,
I think the error shows that Hive failed when trying to run some DDL against the 
metastore database.
How did you set up your Hive metastore database?
If it is Derby, make sure you have only one connection in one directory. 
(Sometimes the error may be caused when you have another connection under the 
same directory and forget to close it.)
If it is MySQL, make sure you have downloaded the JDBC driver, which is not 
included in the Hive package if you downloaded it from Apache. (It is included 
in the Cloudera version.)
Let me know if you have any further questions.
Tan Li (from shopping.com) 

-Original Message-
From: Clark Yang (杨卓荦) [mailto:clarkyzl-h...@yahoo.com.cn] 
Sent: Thursday, October 15, 2009 4:24 AM
To: common-user@hadoop.apache.org
Subject: Can't I use hive? What does this exception mean?

I have already installed Hadoop correctly,
but what does this mean?

~/hive/build/dist/bin$ ./hive
Hive history 
file=/tmp/yangzhuoluo/hive_job_log_yangzhuoluo_200910151919_995767046.txt
hive> show tables;
FAILED: Error in metadata: javax.jdo.JDOFatalInternalException: Unexpected 
exception caught.
NestedThrowables:
java.lang.reflect.InvocationTargetException
FAILED: Execution Error, return code 1 from 
org.apache.hadoop.hive.ql.exec.DDLTask


Can anybody tell me what does the exception mean? Thank you all very much.





Re: job during datanode decommissioning?

2009-10-15 Thread Amandeep Khurana
If you have replication set up, it shouldn't be a problem...

On 10/15/09, Murali Krishna. P muralikpb...@yahoo.com wrote:
 Hi,
 Is it safe to run the jobs during the datanode decommissioning? Around 10%
 of the total datanodes are being decommissioned and i want to run a job mean
 while. Wanted to confirm whether it is safe to do this.

  Thanks,
 Murali Krishna



-- 


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


Re: Hardware Setup

2009-10-15 Thread Edward Capriolo
On Thu, Oct 15, 2009 at 12:51 PM, Patrick Angeles
patrickange...@gmail.com wrote:
 On Thu, Oct 15, 2009 at 12:32 PM, Edward Capriolo 
 edlinuxg...@gmail.com wrote:


 No need for dedicated SATA drives with
 RAID for your OS. Most of that is accessed during boot time so it won't
 contend that much with HDFS.

 You may want to RAID your OS. If you lose a datanode with a large
 volume of data say (8 TB) Hadoop will begin the process of
 re-replicating that data elsewhere, that can use cluster resources.

 You MIGHT want to avoid that, or maybe you do not care.

 Having 2 disks for the OS is a waste of bays, so we got clever. Take a
 system with 8 drives @ 1TB. Slice off ~30 GB from two of the disks and
 use Linux software RAID-1 MIRROR for the OS + swap.

 Now you don't need separate disks for the OS, and you don't run the
 risk of losing that one disk that takes down the entire DataNode.


 Forgot to mention it, but that is exactly what we do. We considered
 net-booting as an option, but we were time-constrained and so didn't look
 that deeply into it. I'd be interested in hearing others that have used
 networked boot...


Patrick,

I saw someone troubleshooting netboot recently on the list. There was some
minor hangup, but I think they got it working:

http://mail-archives.apache.org/mod_mbox/hadoop-common-user/200909.mbox/%3c4ac22618.5080...@sci.utah.edu%3e

Ed


Re: NullPointer on starting NameNode

2009-10-15 Thread bryn
Thanks Todd - it's all working perfectly now. By the way, where is the
Cloudera repository?

On Wed, 14 Oct 2009 10:02:22 -0700, Todd Lipcon t...@cloudera.com wrote:
 Hi Bryn,
 
 Just to let you know, we've queued the patch Hairong mentioned for the
 next update to our distribution, due out around the end of this month.
 
 Thanks!
 
 -Todd
 
 On Wed, Oct 14, 2009 at 9:15 AM, Bryn Divey b...@bengueladev.com wrote:
 Hi all,

 I'm getting the following on initializing my NameNode. The actual line
 throwing the exception is

      if (atime != -1) {
 -      long inodeTime = inode.getAccessTime();


 Have I corrupted the fsimage or something? This is on the Cloudera
 packaging of Hadoop 0.20.1+133.

 Regards,
 Bryn

 09/10/14 18:12:02 INFO metrics.RpcMetrics: Initializing RPC Metrics with
 hostName=NameNode, port=8020
 09/10/14 18:12:02 INFO namenode.NameNode: Namenode up at:
 10.23.4.172/10.23.4.172:8020
 09/10/14 18:12:02 INFO jvm.JvmMetrics: Initializing JVM Metrics with
 processName=NameNode, sessionId=null
 09/10/14 18:12:02 INFO metrics.NameNodeMetrics: Initializing
 NameNodeMeterics using context
 object:org.apache.hadoop.metrics.spi.NoEmitMetricsContext
 09/10/14 18:12:03 INFO namenode.FSNamesystem: fsOwner=hadoop,hadoop
 09/10/14 18:12:03 INFO namenode.FSNamesystem: supergroup=supergroup
 09/10/14 18:12:03 INFO namenode.FSNamesystem: isPermissionEnabled=false
 09/10/14 18:12:03 INFO metrics.FSNamesystemMetrics: Initializing
 FSNamesystemMetrics using context
 object:org.apache.hadoop.metrics.spi.NoEmitMetricsContext
 09/10/14 18:12:03 INFO namenode.FSNamesystem: Registered
 FSNamesystemStatusMBean
 09/10/14 18:12:03 INFO common.Storage: Number of files = 80
 09/10/14 18:12:03 INFO common.Storage: Number of files under
 construction = 0
 09/10/14 18:12:03 INFO common.Storage: Image file of size 19567 loaded
 in 0 seconds.
 09/10/14 18:12:03 ERROR namenode.NameNode:
 java.lang.NullPointerException
        at

org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1232)
        at

org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1221)
        at

org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:776)
        at

org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:992)
        at

org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:812)
        at

org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
        at

org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
        at

org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
        at

org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:292)
        at

org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:206)
        at

org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:288)
        at

org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:968)
        at
 org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:977)






Re: Hardware Setup

2009-10-15 Thread Allen Wittenauer
On 10/15/09 9:42 AM, Steve Loughran ste...@apache.org wrote:
 It's an interesting Q as to what is better, fewer nodes with more
 storage/CPU or more, smaller nodes.
 
 Bigger servers
   * more chance of running code near the data
   * less data moved over the LAN at shuffle time
   * RAM consumption can be more agile across tasks.
   * increased chance of disk failure on a node; hadoop handles that very
 badly right now (pre 0.20 -datanode goes offline)
 
 Smaller servers
   * easier to place data redundantly across machines
   * less RAM taken up by other people's jobs
   * more nodes stay up when a disk fails (less important on 0.20 onwards)
   * when a node goes down, less data to re-replicate across the other
 machines
 
 1. I would like to hear other people's opinions,

- Don't forget about the more obvious things:  if you go with more disks
per server, that likely also means fewer controllers doing IO.

- Keep in mind that fewer CPUs/less RAM=less task slots available.  While
your workflow may not be CPU-bound in the traditional sense, if you are
spawning 5000 maps, you're going to need quite a few slots to get your work
done in a reasonable time.

- To counter that, it seems we can run more tasks-per-node in LI's 2U config
than Y!'s 1U config.  But this might be an apples/oranges comparison (LI
uses Solaris+ZFS, Y! uses Linux+ext3).

 2. The gridmix 2 benchmarking stuff tries to create synthetic benchmarks
 from your real data runs. Try that, collect some data, then go to your
 suppliers.

+1



Re: NullPointer on starting NameNode

2009-10-15 Thread Todd Lipcon
On Thu, Oct 15, 2009 at 10:14 AM, b...@bengueladev.com wrote:

 Thanks Todd - it's all working perfectly now. By the way, where is the
 Cloudera repository?


http://archive.cloudera.com/

If you have any questions that are Cloudera-specific (around packaging,
etc), please use our GetSatisfaction page:
http://getsatisfaction.com/cloudera/products/cloudera_cloudera_s_distribution_for_hadoop

(we don't want to confuse people on this list if anything is specific to our
distro)

-Todd

 On Wed, 14 Oct 2009 10:02:22 -0700, Todd Lipcon t...@cloudera.com wrote:
  Hi Bryn,
 
  Just to let you know, we've queued the patch Hairong mentioned for the
  next update to our distribution, due out around the end of this month.
 
  Thanks!
 
  -Todd
 
  On Wed, Oct 14, 2009 at 9:15 AM, Bryn Divey b...@bengueladev.com
 wrote:
  Hi all,
 
  I'm getting the following on initializing my NameNode. The actual line
  throwing the exception is
 
   if (atime != -1) {
  -  long inodeTime = inode.getAccessTime();
 
 
  Have I corrupted the fsimage or something? This is on the Cloudera
  packaging of Hadoop 0.20.1+133.
 
  Regards,
  Bryn
 
  09/10/14 18:12:02 INFO metrics.RpcMetrics: Initializing RPC Metrics with
  hostName=NameNode, port=8020
  09/10/14 18:12:02 INFO namenode.NameNode: Namenode up at:
  10.23.4.172/10.23.4.172:8020
  09/10/14 18:12:02 INFO jvm.JvmMetrics: Initializing JVM Metrics with
  processName=NameNode, sessionId=null
  09/10/14 18:12:02 INFO metrics.NameNodeMetrics: Initializing
  NameNodeMeterics using context
  object:org.apache.hadoop.metrics.spi.NoEmitMetricsContext
  09/10/14 18:12:03 INFO namenode.FSNamesystem: fsOwner=hadoop,hadoop
  09/10/14 18:12:03 INFO namenode.FSNamesystem: supergroup=supergroup
  09/10/14 18:12:03 INFO namenode.FSNamesystem: isPermissionEnabled=false
  09/10/14 18:12:03 INFO metrics.FSNamesystemMetrics: Initializing
  FSNamesystemMetrics using context
  object:org.apache.hadoop.metrics.spi.NoEmitMetricsContext
  09/10/14 18:12:03 INFO namenode.FSNamesystem: Registered
  FSNamesystemStatusMBean
  09/10/14 18:12:03 INFO common.Storage: Number of files = 80
  09/10/14 18:12:03 INFO common.Storage: Number of files under
  construction = 0
  09/10/14 18:12:03 INFO common.Storage: Image file of size 19567 loaded
  in 0 seconds.
  09/10/14 18:12:03 ERROR namenode.NameNode:
  java.lang.NullPointerException
 at
 

 org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1232)
 at
 

 org.apache.hadoop.hdfs.server.namenode.FSDirectory.unprotectedSetTimes(FSDirectory.java:1221)
 at
 

 org.apache.hadoop.hdfs.server.namenode.FSEditLog.loadFSEdits(FSEditLog.java:776)
 at
 

 org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSEdits(FSImage.java:992)
 at
 

 org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:812)
 at
 

 org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:364)
 at
 

 org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
 at
 

 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:311)
 at
 

 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:292)
 at
 

 org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:206)
 at
 
 org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:288)
 at
 

 org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:968)
 at
  org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:977)
 
 
 
 



How can I deploy 100 blocks onto 10 datanodes with each node have 10 blocks?

2009-10-15 Thread Huang Qian
Hi everyone. I am working on a project with Hadoop and now I have come across
a problem. How can I deploy 100 files, each occupying a single block (by
setting the block size and controlling the file size), onto 10 datanodes, and
make sure that each datanode ends up with 10 blocks? I know the file system can
place the blocks automatically, but I want to make sure that for these files
the blocks are distributed evenly. How can I do this with the Hadoop tools or
API?
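
To be concrete, for the block-size part I have in mind something like the sketch
below (the /blocktest path, payload size and replication factor are made-up
values, and I understand that which datanode each block lands on is still
decided by the NameNode, not by the client):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SingleBlockFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Pick a per-file block size at least as large as the data written,
    // so each file occupies exactly one block.
    long blockSize = 64L * 1024 * 1024;          // 64 MB block size
    byte[] payload = new byte[16 * 1024 * 1024]; // 16 MB of dummy content

    for (int i = 0; i < 100; i++) {
      Path p = new Path("/blocktest/file-" + i);
      FSDataOutputStream out = fs.create(p, true,
          conf.getInt("io.file.buffer.size", 4096),
          (short) 3,   // replication factor
          blockSize);  // per-file block size
      out.write(payload);
      out.close();
    }
    fs.close();
  }
}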

Huang Qian(黄骞)
Institute of Remote Sensing and GIS,Peking University
Phone: (86-10) 5276-3109
Mobile: (86) 1590-126-8883
Address:Rm.554,Building 1,ChangChunXinYuan,Peking
Univ.,Beijing(100871),CHINA


Re: Hardware performance from HADOOP cluster

2009-10-15 Thread tim robertson
Yeah they are single proc machines and other than setting to 4
map/reduces, completely 0.20.1 vanilla installation.

I will tune it up in the morning based on what I can find on the web
(e.g. cloudera guidelines) and post the results.  I am going to be
running HBase on top of this, but want to make sure the HDFS/MR is
running sound before continuing.

Seems there are a few people at the moment setting up clusters - might
it be worth adding our config and results to
http://wiki.apache.org/hadoop/HardwareBenchmarks ?

For people like me (first cluster set up from scratch - previously
used the EC2 scripts) it is nice to sanity check things look about
right.  The mailing lists suggest there are a few small clusters of
medium spec machines springing up.

Cheers,
Tim




On Thu, Oct 15, 2009 at 5:52 PM, Patrick Angeles
patrickange...@gmail.com wrote:
 Hi Tim,
 I assume those are single proc machines?

 I got 649 secs on 70GB of data for our 7-node cluster (~11 mins), but we
 have dual quad Nehalems (2.26Ghz).

 On Thu, Oct 15, 2009 at 11:34 AM, tim robertson
 timrobertson...@gmail.com wrote:

 Hi Usman,

 So on my 10 node cluster (9 DN) with 4 maps and 4 reduces (I plan on
 high memory jobs so picked 4 only)
 [9 DN of Dell R300: 2.83G Quadcore (2x6MB cache), 8G RAM and 2x500G SATA
 drives]

 Using your template for stats, I get the following with no tuning:

 GENERATE RANDOM DATA
 Wrote out 90GB of random binary data:
 Map output records=9198009
 The job took 350 seconds. (approximately: 6 minutes)

 SORT RANDOM GENERATED DATA
 Map output records= 9197821
 Reduce input records=9197821
 The job took 2176 seconds. (approximately: 36mins).

 So pretty similar to your initial benchmark.  I will tune it a bit
 tomorrow and rerun.

 If you spent time tuning your cluster and it was successful, please
 can you share your config?

 Cheers,
 Tim





 On Thu, Oct 15, 2009 at 11:32 AM, Usman Waheed usm...@opera.com wrote:
  Hi Todd,
 
  Some changes have been applied to the cluster based on the documentation
  (URL) you noted below,
  like file descriptor settings and io.file.buffer.size. I will check out
 the
  other settings which I haven't applied yet.
 
  My map/reduce slot settings from my hadoop-site.xml and
 hadoop-default.xml
  on all nodes in the cluster.
 
  hadoop-site.xml
  mapred.tasktracker.task.maximum = 2
  mapred.tasktracker.map.tasks.maximum = 8
  mapred.tasktracker.reduce.tasks.maximum = 8

  hadoop-default.xml
  mapred.map.tasks = 2
  mapred.reduce.tasks = 1
 
  Thanks,
  Usman
 
 
  This seems a bit slow for that setup (4-5 MB/sec/node sorting). Have
  you changed the configurations at all? There are some notes on this
  blog post that might help your performance a bit:
 
 
 
 http://www.cloudera.com/blog/2009/03/30/configuration-parameters-what-can-you-just-ignore/
 
  How many map and reduce slots did you configure for the daemons? If
  you have Ganglia installed you can usually get a good idea of whether
  you're using your resources well by looking at the graphs while
  running a job like this sort.
 
  -Todd
 
  On Wed, Oct 14, 2009 at 4:04 AM, Usman Waheed usm...@opera.com wrote:
 
 
  Here are the results I got from my 4 node cluster (correction: I noted 5
  earlier). One of my nodes out of the 4 is both a namenode and a datanode.
 
  GENERATE RANDOM DATA
  Wrote out 40GB of random binary data:
  Map output records=4088301
  The job took 358 seconds. (approximately: 6 minutes).
 
  SORT RANDOM GENERATED DATA
  Map output records=4088301
  Reduce input records=4088301
  The job took 2136 seconds. (approximately: 35 minutes).
 
  VALIDATION OF SORTED DATA
  The job took 183 seconds.
  SUCCESS! Validated the MapReduce framework's 'sort' successfully.
 
  It would be interesting to see what performance numbers others with a
  similar setup have obtained.
 
  Thanks,
  Usman
 
 
 
  I am setting up a new cluster of 10 nodes of 2.83G Quadcore (2x6MB
  cache), 8G RAM and 2x500G drives, and will do the same soon.  Got some
  issues though so it won't start up...
 
  Tim
 
 
  On Wed, Oct 14, 2009 at 11:36 AM, Usman Waheed usm...@opera.com
 wrote:
 
 
 
  Thanks Tim, i will check it out and post my results for comments.
  -Usman
 
 
 
  Might it be worth running the http://wiki.apache.org/hadoop/Sort and
  posting your results for comment?
 
  Tim
 
 
  On Wed, Oct 14, 2009 at 10:48 AM, Usman Waheed usm...@opera.com
  wrote:
 
 
 
 
  Hi,
 
  Is there a way to tell what kind of performance numbers one can
  expect
  out
  of their cluster given a certain set of specs.
 
  For example i have 5 nodes in my cluster that all have the
 following
  hardware configuration(s):
  Quad Core 2.0GHz, 8GB RAM, 4x1TB disks and are all on the same
 rack.
 
  Thanks,
  Usman
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 




Re: Hardware Setup

2009-10-15 Thread tim robertson
Hi Alex,

I am also doing a little on HBase - I think I have heard that a few
higher memory machines with more spindles and cores per machine beat
more smaller machines with similar total capacity (I *guess* this is
due to memory buffers and data locality).  Please don't take my word
for it, but I recommend posting the same to the HBase list -
http://hadoop.apache.org/hbase/mailing_lists.html#Users.

Cheers,
Tim



On Thu, Oct 15, 2009 at 7:54 PM, Allen Wittenauer
awittena...@linkedin.com wrote:
 On 10/15/09 9:42 AM, Steve Loughran ste...@apache.org wrote:
 It's an interesting Q as to what is better, fewer nodes with more
 storage/CPU or more, smaller nodes.

 Bigger servers
   * more chance of running code near the data
   * less data moved over the LAN at shuffle time
   * RAM consumption can be more agile across tasks.
   * increased chance of disk failure on a node; hadoop handles that very
 badly right now (pre 0.20 -datanode goes offline)

 Smaller servers
   * easier to place data redundantly across machines
   * less RAM taken up by other people's jobs
   * more nodes stay up when a disk fails (less important on 0.20 onwards)
   * when a node goes down, less data to re-replicate across the other
 machines

 1. I would like to hear other people's opinions,

 - Don't forget the about the more obvious things:  if you go with more disks
 per server, that also means likely means less controllers doing IO.

 - Keep in mind that fewer CPUs/less RAM=less task slots available.  While
 your workflow may not be CPU-bound in the traditional sense, if you are
 spawning 5000 maps, you're going to need quite a few slots to get your work
 done in a reasonable time.

 - To counter that, it seems we can run more tasks-per-node in LI's 2U config
 than Y!'s 1U config.  But this might be an apples/oranges comparison (LI
 uses Solaris+ZFS, Y! uses Linux+ext3).

 2. The gridmix 2 benchmarking stuff tries to create synthetic benchmarks
 from your real data runs. Try that, collect some data, then go to your
 suppliers.

 +1




Hadoop Developer Needed

2009-10-15 Thread alevin

Overview 

The SQL Data Migration Specialist plays a crucial role in converting new
Client's data onto Brilig's service platforms. We are looking for a talented
and energetic full-time freelance programmer to work both remotely and
onsite at our midtown Manhattan location. The Specialist will work with our
clients' technical teams to determine the optimal formats and requirements
to create files for subsequent import into a Brilig remote database during
the implementation process. This process involves extracting, scrubbing,
combining, transforming, validating and importing large data tables into
final data sets suitable for loading into Brilig's defined databases. The
Specialist will be responsible for creating/editing the database structure
and writing of all import scripts and programs. The ability to work on
multiple projects simultaneously while meeting tight deadlines is critical.
This project will last for 3 months but may be extended or may eventually
lead to a full time position in our fun and exciting startup. Must be able
to travel to client meetings and work independently. 

Responsibilities: 

- Subject Matter Expert on software tools used in the entire data migration
process from extraction to validation and load 
- Design, develop and execute quality data movement processes that are
consistent, repeatable and scalable 
- Streamline testing, audit and validation processes through data scrubbing
routines and presentation of audit reports prior to load 
- Roll out newly developed processes via documentation and training 
- Maintain and manage a template library of executed solutions to leverage
against future opportunities 
- Identify, clarify, and resolve issues and risks, escalating them as needed 
- Build and nourish strong business relationships with external clients 

Please include: 

- Salary Requirements 
- Availability 

Experience 

- At least 3-5 years experience in the development of java applications 
- Use of XML and other protocols for data exchange between systems 
- SQL database design and implementation 
- Experience with Eclipse, Maven, and SVN a plus 
- Experience with Htable and Hadoop a big plus 
- Excellent communication skills with both technical and non-technical
colleagues 
- Upper management and client facing skills 
- Interest in keeping up with technology advances 

PLEASE NOTE: 
US citizens and Green Card Holders and those authorized to work in the US
only. We are unable to sponsor or transfer H-1B candidates. 

Contact: 

Alex Levin, COO 
Brilig 
ale...@brilig.com 





newbie help, type mismatch in mapper

2009-10-15 Thread yz5od2

Hi,
I am new to Hadoop so this might be an easy question for someone to  
help me with.


I continually am getting this exception (my code follows below)

java.io.IOException: Type mismatch in key from map: expected  
org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
	at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:807)
	at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:504)
	at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)

at org.apache.hadoop.mapreduce.Mapper.map(Mapper.java:124)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)


Running 0.20.1

I have a text file with lines of data separated by carriage returns.  
This is properly stored in a directory within HDFS. I only have a  
mapping task for processing this file; after the mapping is done it  
should go straight to output, with no reduce or combiner functions.


I am just trying to test to see if this will run. The mapper just  
takes the data sent to it and adds it back to the collector as 2 text  
values.



My Mapper:


public class MyTestMapper extends Mapper<LongWritable,Text,Text,Text> {

	public void map(Object key, Text value, Context context) throws  
IOException, InterruptedException {

String key = key.toString();
String line = value.toString();

String id = extractId(line);
String reformattedLine = reformatLine(line);

context.write(new Text(id), new Text(reformattedLine));
}
}


My job submission code:
-

Job job = new Job(conf);
job.setJarByClass(MyTestMapper.class);
job.setMapperClass(MyTestMapper.class); 

FileInputFormat.addInputPath(job, new Path("/myDir/sample.txt"));
FileOutputFormat.setOutputPath(job, new Path("/myDir/output/" +
    "results-" + System.currentTimeMillis() + ".txt"));


job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(Text.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

job.submit();


Re: Hadoop Developer Needed

2009-10-15 Thread Last-chance Architect

Hi Alex,

I'm a senior Java/J2EE developer/architect. I am American, but live 
abroad. I've been using Hadoop for the past year, but have many years 
experience in the field of large-scale data analysis, warehousing, etc.


Let me know what you think.

Cheers,

Lajos

alevin wrote:
Overview 


The SQL Data Migration Specialist plays a crucial role in converting new
Client's data onto Brilig's service platforms. We are looking for a talented
and energetic full-time freelance programmer to work both remotely and
onsite at our midtown Manhattan location. The Specialist will work with our
clients' technical teams to determine the optimal formats and requirements
to create files for subsequent import into a Brilig remote database during
the implementation process. This process involves extracting, scrubbing,
combining, transforming, validating and importing large data tables into
final data sets suitable for loading into Brilig's defined databases. The
Specialist will be responsible for creating/editing the database structure
and writing of all import scripts and programs. The ability to work on
multiple projects simultaneously while meeting tight deadlines is critical.
This project will last for 3 months but may be extended or may eventually
lead to a full time position in our fun and exciting startup. Must be able
to travel to client meetings and work independently. 

Responsibilities: 


- Subject Matter Expert on software tools used in the entire data migration
process from extraction to validation and load 
- Design, develop and execute quality data movement processes that are
consistent, repeatable and scalable 
- Streamline testing, audit and validation processes through data scrubbing
routines and presentation of audit reports prior to load 
- Roll out newly developed processes via documentation and training 
- Maintain and manage a template library of executed solutions to leverage
against future opportunities 
- Identify, clarify, and resolve issues and risks, escalating them as needed 
- Build and nourish strong business relationships with external clients 

Please include: 

- Salary Requirements 
- Availability 

Experience 

- At least 3-5 years experience in the development of java applications 
- Use of XML and other protocols for data exchange between systems 
- SQL database design and implementation 
- Experience with Eclipse, Maven, and SVN a plus 
- Experience with Htable and Hadoop a big plus 
- Excellent communication skills with both technical and non-technical
colleagues 
- Upper management and client facing skills 
- Interest in keeping up with technology advances 

PLEASE NOTE: 
US citizens and Green Card Holders and those authorized to work in the US
only. We are unable to sponsor or transfer H-1B candidates. 

Contact: 

Alex Levin, COO 
Brilig 
ale...@brilig.com 





--
***
The 'Last-Chance' Architect
www.galatea.com
(US) +1 303 731 3116
(UK) +44 20 8144 4367
***


Re: Optimization of cpu and i/o usage / other bottlenecks?

2009-10-15 Thread Scott Carey
What Hadoop version?

On a cluster this size there are two things to check right away:

1.  In the Hadoop UI, during the job, are the reduce and map slots close to
being filled up  most of the time, or are tasks completing faster than the
scheduler can keep up so that there are often many empty slots?

For 0.19.x and 0.20.x on a small cluster like this, use the Fair Scheduler
and make sure the configuration parameter that allows it to schedule more
than one task per heartbeat is on (at least one map and one reduce per,
which is supported in 0.19.x).  This alone will cut times down if the number
of map and reduce tasks is at least 2x the number of nodes.


2. Your CPU, Disk and Network aren't saturated -- take a look at the logs of
the reduce tasks and look for long delays in the shuffle.
Utilization is throttled by a bug in the reducer shuffle phase, not fixed
until 0.21.  Simply put, a single reduce task won't fetch more than one map
output from another node every 2 seconds (though it can fetch from multiple
nodes at once).  Fix this by commenting out one line in 0.18.x, 0.19.x or
0.20.x -- see my comment here:
http://issues.apache.org/jira/browse/MAPREDUCE-318
From June 10 2009.

I saw shuffle times on small clusters with large map task count per node
ratio decrease by a factor of 30 from that one line fix.  It was the only
way to get the network to ever be close to saturation on any node.

The delays for low latency jobs on smaller clusters are predominantly
artificial due to the nature of most RPC being ping-response and most design
and testing done for large clusters of machines that only run a couple maps
or reduces per TaskTracker.


Do the above, and you won't be nearly as sensitive to the size of data per
task for low latency jobs as out-of-the-box Hadoop.  Your overall
utilization will go up quite a bit.

-Scott


On 10/14/09 7:31 AM, Chris Seline ch...@searchles.com wrote:

 No, there doesn't seem to be all that much network traffic. Most of the
 time traffic (measured with nethogs) is about 15-30K/s on the master and
 slaves during map, sometimes it bursts up 5-10 MB/s on a slave for maybe
 5-10 seconds on a query that takes 10 minutes, but that is still less
 than what I see in scp transfers on EC2, which is typically about 30 MB/s.
 
 thanks
 
 Chris
 
 Jason Venner wrote:
 are your network interface or the namenode/jobtracker/datanodes saturated
 
 
 On Tue, Oct 13, 2009 at 9:05 AM, Chris Seline ch...@searchles.com wrote:
 
  
 I am using the 0.3 Cloudera scripts to start a Hadoop cluster on EC2 of 11
 c1.xlarge instances (1 master, 10 slaves), that is the biggest instance
 available with 20 compute units and 4x 400gb disks.
 
 I wrote some scripts to test many (100's) of configurations running a
 particular Hive query to try to make it as fast as possible, but no matter
 what I don't seem to be able to get above roughly 45% cpu utilization on the
 slaves, and not more than about 1.5% wait state. I have also measured
 network traffic and there don't seem to be bottlenecks there at all.
 
 Here are some typical CPU utilization lines from top on a slave when
 running a query:
 Cpu(s): 33.9%us,  7.4%sy,  0.0%ni, 56.8%id,  0.6%wa,  0.0%hi,  0.5%si,
  0.7%st
 Cpu(s): 33.6%us,  5.9%sy,  0.0%ni, 58.7%id,  0.9%wa,  0.0%hi,  0.4%si,
  0.5%st
 Cpu(s): 33.9%us,  7.2%sy,  0.0%ni, 56.8%id,  0.5%wa,  0.0%hi,  0.6%si,
  1.0%st
 Cpu(s): 38.6%us,  8.7%sy,  0.0%ni, 50.8%id,  0.5%wa,  0.0%hi,  0.7%si,
  0.7%st
 Cpu(s): 36.8%us,  7.4%sy,  0.0%ni, 53.6%id,  0.4%wa,  0.0%hi,  0.5%si,
  1.3%st
 
 It seems like if tuned properly, I should be able to max out my cpu (or my
 disk) and get roughly twice the performance I am seeing now. None of the
 parameters I am tuning seem to be able to achieve this. Adjusting
 mapred.map.tasks and mapred.reduce.tasks does help somewhat, and setting the
 io.file.buffer.size to 4096 does better than the default, but the rest of
 the values I am testing seem to have little positive  effect.
 
 These are the parameters I am testing, and the values tried:
 
 io.sort.factor=2,3,4,5,10,15,20,25,30,50,100
 
 mapred.job.shuffle.merge.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.9
 0,0.93,0.95,0.97,0.98,0.99
 io.bytes.per.checksum=256,512,1024,2048,4192
 mapred.output.compress=true,false
 hive.exec.compress.intermediate=true,false
 
 hive.map.aggr.hash.min.reduction=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.9
 0,0.93,0.95,0.97,0.98,0.99
 mapred.map.tasks=1,2,3,4,5,6,8,10,12,15,20,25,30,40,50,60,75,100,150,200
 
 mapred.child.java.opts=-Xmx400m,-Xmx500m,-Xmx600m,-Xmx700m,-Xmx800m,-Xmx900m
 ,-Xmx1000m,-Xmx1200m,-Xmx1400m,-Xmx1600m,-Xmx2000m
 mapred.reduce.tasks=5,10,15,20,25,30,35,40,50,60,70,80,100,125,150,200
 mapred.merge.recordsBeforeProgress=5000,1,2,3
 
 mapred.job.shuffle.input.buffer.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0
 .80,0.90,0.93,0.95,0.99
 
 io.sort.spill.percent=0.10,0.20,0.30,0.40,0.50,0.60,0.70,0.80,0.90,0.93,0.95
 ,0.99
 

Re: Hadoop Developer Needed

2009-10-15 Thread Last-chance Architect

Whoops guys, sorry. Hit the reply too soon ;)

Lajos


Last-chance Architect wrote:

Hi Alex,

I'm a senior Java/J2EE developer/architect. I am American, but live 
abroad. I've been using Hadoop for the past year, but have many years 
experience in the field of large-scale data analysis, warehousing, etc.


Let me know what you think.

Cheers,

Lajos

alevin wrote:

Overview
The SQL Data Migration Specialist plays a crucial role in converting new
Client's data onto Brilig's service platforms. We are looking for a 
talented

and energetic full-time freelance programmer to work both remotely and
onsite at our midtown Manhattan location. The Specialist will work 
with our
clients' technical teams to determine the optimal formats and 
requirements
to create files for subsequent import into a Brilig remote database 
during

the implementation process. This process involves extracting, scrubbing,
combining, transforming, validating and importing large data tables into
final data sets suitable for loading into Brilig's defined databases. The
Specialist will be responsible for creating/editing the database 
structure

and writing of all import scripts and programs. The ability to work on
multiple projects simultaneously while meeting tight deadlines is 
critical.

This project will last for 3 months but may be extended or may eventually
lead to a full time position in our fun and exciting startup. Must be 
able

to travel to client meetings and work independently.
Responsibilities:
- Subject Matter Expert on software tools used in the entire data 
migration
process from extraction to validation and load - Design, develop and 
execute quality data movement processes that are
consistent, repeatable and scalable - Streamline testing, audit and 
validation processes through data scrubbing
routines and presentation of audit reports prior to load - Roll out 
newly developed processes via documentation and training - Maintain 
and manage a template library of executed solutions to leverage
against future opportunities - Identify, clarify, and resolve issues 
and risks, escalating them as needed - Build and nourish strong 
business relationships with external clients

Please include:
- Salary Requirements - Availability
Experience
- At least 3-5 years experience in the development of java 
applications - Use of XML and other protocols for data exchange 
between systems - SQL database design and implementation - Experience 
with Eclipse, Maven, and SVN a plus - Experience with Htable and 
Hadoop a big plus - Excellent communication skills with both technical 
and non-technical
colleagues - Upper management and client facing skills - Interest in 
keeping up with technology advances
PLEASE NOTE: US citizens and Green Card Holders and those authorized 
to work in the US

only. We are unable to sponsor or transfer H-1B candidates.
Contact:
Alex Levin, COO Brilig ale...@brilig.com





--
***
The 'Last-Chance' Architect
www.galatea.com
(US) +1 303 731 3116
(UK) +44 20 8144 4367
***


Error in FileSystem.get()

2009-10-15 Thread Bhupesh Bansal
Hey Folks, 

I am seeing a very weird problem in FileSystem.get(Configuration).

I want to get a FileSystem given the configuration, so I am using

  Configuration conf = new Configuration();

 _fs = FileSystem.get(conf);


The problem is I am getting LocalFileSystem on some machines and DistributedFileSystem
on others. I am printing conf.get("fs.default.name") in all places and
it returns the right HDFS value 'hdfs://dummy:9000'.

My expectation is that, looking at fs.default.name, if it is hdfs:// it should give
me a DistributedFileSystem always.

Best
Bhupesh







Re: Error in FileSystem.get()

2009-10-15 Thread Ashutosh Chauhan
Each node reads its own conf files (mapred-site.xml, hdfs-site.xml etc.)
Make sure your configs are consistent on all nodes across entire cluster and
are pointing to correct fs.

Hope it helps,
Ashutosh

On Thu, Oct 15, 2009 at 16:36, Bhupesh Bansal bban...@linkedin.com wrote:

 Hey Folks,

 I am seeing a very weird problem in FileSystem.get(Configuration).

 I want to get a FileSystem given the configuration, so I am using

  Configuration conf = new Configuration();

 _fs = FileSystem.get(conf);


 The problem is I am getting LocalFileSystem on some machines and
 Distributed
 on others. I am printing conf.get(fs.default.name) at all places and
 It returns the right HDFS value 'hdfs://dummy:9000'

 My expectation is looking at fs.default.name if it is hdfs:// it should
 give
 me a DistributedFileSystem always.

 Best
 Bhupesh








Re: Error in FileSystem.get()

2009-10-15 Thread Bhupesh Bansal
This code is not map/reduce code and runs only on a single machine.
Also, each node prints the right value for fs.default.name, so it is
reading the right configuration file too. The issue looks like the use of
the CACHE in FileSystem, and some place my code is setting up a wrong value,
if that is possible.

Best
Bhupesh




On 10/15/09 1:46 PM, Ashutosh Chauhan ashutosh.chau...@gmail.com wrote:

 Each node reads its own conf files (mapred-site.xml, hdfs-site.xml etc.)
 Make sure your configs are consistent on all nodes across entire cluster and
 are pointing to correct fs.
 
 Hope it helps,
 Ashutosh
 
 On Thu, Oct 15, 2009 at 16:36, Bhupesh Bansal bban...@linkedin.com wrote:
 
 Hey Folks,
 
 I am seeing a very weird problem in FileSystem.get(Configuration).
 
 I want to get a FileSystem given the configuration, so I am using
 
  Configuration conf = new Configuration();
 
 _fs = FileSystem.get(conf);
 
 
 The problem is I am getting LocalFileSystem on some machines and
 Distributed
 on others. I am printing conf.get(fs.default.name) at all places and
 It returns the right HDFS value 'hdfs://dummy:9000'
 
 My expectation is looking at fs.default.name if it is hdfs:// it should
 give
 me a DistributedFileSystem always.
 
 Best
 Bhupesh
 
 
 
 
 
 



Map Reduce code doubt

2009-10-15 Thread shwitzu

Hello All,

I was wondering whether our map reduce code can just return the location of a
file, or place the actual file in a given output directory by searching
based on a keyword.

Let me make myself clear.

Say there are 100 image files in my HDFS and I want to extract one image
file. If I give a keyword which matches the name of a file in my HDFS, will
my mapper and reducer code be able to locate that file and put the
original file back in a given location?

Please let me know if you need more information.

Thanks



Re: Map Reduce code doubt

2009-10-15 Thread Last-chance Architect

Shwitzu,

Why can't you just query through a FileSystem object and find the file
you want?
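
Something along these lines, for illustration (just a sketch; the /images
directory, the keyword and the local destination are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FindAndFetch {
  public static void main(String[] args) throws Exception {
    String keyword = "sunset";   // keyword to match against file names
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // List the image directory and look for files whose name contains the keyword.
    FileStatus[] entries = fs.listStatus(new Path("/images"));
    if (entries != null) {
      for (FileStatus status : entries) {
        String name = status.getPath().getName();
        if (!status.isDir() && name.contains(keyword)) {
          // Copy the matching file out of HDFS to a local destination.
          fs.copyToLocalFile(status.getPath(), new Path("/tmp/" + name));
          System.out.println("Fetched " + status.getPath());
        }
      }
    }
    fs.close();
  }
}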


Lajos


shwitzu wrote:

Hello All,

I was wondering if our map reduce code can just return the location of the
file? Or place the actual file in a given output directory by searching
based on a keyword.

Let me make myself clear

If there are 100 image files in my HDFS and I want to extract one image
file. If I give a keyword which matches the name of a file in my HDFS will
my mapper and  reducer code be able to locate that file and put back the 
original file in a given location??


Please let me know if you have need more information.

Thanks


--
***
The 'Last-Chance' Architect
www.galatea.com
(US) +1 303 731 3116
(UK) +44 20 8144 4367
***


Re: Error in FileSystem.get()

2009-10-15 Thread Aaron Kimball
Bhupesh: If you use FileSystem.newInstance(), does that return the correct
object type? This sidesteps CACHE.
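
For illustration, a quick check along these lines (just a sketch; it only prints
which FileSystem implementation each lookup resolves to) should show whether it
is the cache or the configuration that is at fault:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class WhichFileSystem {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    System.out.println("fs.default.name = " + conf.get("fs.default.name"));

    // Cached lookup: returns whatever instance was cached for the default scheme.
    FileSystem cached = FileSystem.get(conf);
    System.out.println("FileSystem.get(conf)      -> " + cached.getClass().getName());

    // Lookup keyed explicitly on the configured URI, as a cross-check
    // against the cached default.
    FileSystem byUri = FileSystem.get(
        URI.create(conf.get("fs.default.name")), conf);
    System.out.println("FileSystem.get(uri, conf) -> " + byUri.getClass().getName());
  }
}
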
- A

On Thu, Oct 15, 2009 at 3:07 PM, Bhupesh Bansal bban...@linkedin.com wrote:

 This code is not map/reduce code and run only on single machine and
 Also each node prints the right value for fs.default.name so it is
 Reading the right configuration file too .. The issue looks like use of
 CACHE in filesystem and someplace my code is setting up a wrong value
 If that is possible.

 Best
 Bhupesh




 On 10/15/09 1:46 PM, Ashutosh Chauhan ashutosh.chau...@gmail.com
 wrote:

  Each node reads its own conf files (mapred-site.xml, hdfs-site.xml etc.)
  Make sure your configs are consistent on all nodes across entire cluster
 and
  are pointing to correct fs.
 
  Hope it helps,
  Ashutosh
 
  On Thu, Oct 15, 2009 at 16:36, Bhupesh Bansal bban...@linkedin.com
 wrote:
 
  Hey Folks,
 
  I am seeing a very weird problem in FileSystem.get(Configuration).
 
  I want to get a FileSystem given the configuration, so I am using
 
   Configuration conf = new Configuration();
 
  _fs = FileSystem.get(conf);
 
 
  The problem is I am getting LocalFileSystem on some machines and
  Distributed
  on others. I am printing conf.get(fs.default.name) at all places and
  It returns the right HDFS value 'hdfs://dummy:9000'
 
  My expectation is looking at fs.default.name if it is hdfs:// it should
  give
  me a DistributedFileSystem always.
 
  Best
  Bhupesh
 
 
 
 
 
 




Re: type mismatch, newbie help

2009-10-15 Thread Ahad Rana
Hi,
Unfortunately Mapper is now a class and from your call stack it seems that
Mapper's default  map implementation is being called (instead of the one you
defined in your class), which is passing the LongWritable key to the
collector. You should use @Override to have the compiler help you figure out
why your map function doesn't have the same signature as the one defined in
the base class.
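
For illustration, a map() whose signature matches the base class would look
roughly like this (just a sketch; extractId/reformatLine below are placeholder
stand-ins for your own helpers):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MyTestMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String id = extractId(line);                  // your own helper
        String reformattedLine = reformatLine(line);  // your own helper
        context.write(new Text(id), new Text(reformattedLine));
    }

    private String extractId(String line) {           // placeholder implementation
        return line.split("\t", 2)[0];
    }

    private String reformatLine(String line) {        // placeholder implementation
        return line.trim();
    }
}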

Best of luck.

Ahad.

On Thu, Oct 15, 2009 at 10:29 AM, yz5od2 woods5242-outdo...@yahoo.com wrote:

 Hi,
 I am new to Hadoop so this might be an easy question for someone to help me
 with.

 I continually am getting this exception (my code follows below)

 java.io.IOException: Type mismatch in key from map: expected
 org.apache.hadoop.io.Text, recieved org.apache.hadoop.io.LongWritable
at
 org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:807)
at
 org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:504)
at
 org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at org.apache.hadoop.mapreduce.Mapper.map(Mapper.java:124)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)


 Running 0.20.1

 I have a text file with lines of data separated by carriage returns. This
 is properly stored in a directory within HDFS. I only have a Mapping task
 for processing this file, after the mapping is done it should go straight to
 output, No reduce or combiner functions.

 I am just trying to test to see if this will run. The mapper just takes the
 data sent to it and adds it back to the collector as 2 text values.

 
 My Mapper:
 

 public class MyTestMapper extends Mapper<LongWritable,Text,Text,Text> {

public void map(Object key, Text value, Context context) throws
 IOException, InterruptedException {
String key = key.toString();
String line = value.toString();

String id = extractId(line);
String reformattedLine = reformatLine(line);

context.write(new Text(id), new Text(reformattedLine));
}
 }

 
 My job submission code:
 -

 Job job = new Job(conf);
 job.setJarByClass(MyTestMapper.class);
 job.setMapperClass(MyTestMapper.class);

 FileInputFormat.addInputPath(job, new Path("/myDir/sample.txt"));
 FileOutputFormat.setOutputPath(job, new
 Path("/myDir/output/results-" + System.currentTimeMillis() + ".txt"));

 job.setMapOutputKeyClass(Text.class);
 job.setMapOutputValueClass(Text.class);

 job.setOutputKeyClass(Text.class);
 job.setOutputValueClass(Text.class);

 job.submit();