Fwd: hadoop submit job very slow

2011-08-12 Thread air
-- Forwarded message -- From: air Date: 2011/8/11 Subject: hadoop submit job very slow To: CDH Users I am using CDH3 update1 hadoop 0.20.2; the cluster is composed of 9 nodes (1 NN + 8 DN/TT), and I found that it's very slow to submit a job to the jobtracker (using hive or hado

Fwd: hadoop submit job very slow

2011-08-12 Thread air
-- Forwarded message -- From: air Date: 2011/8/11 Subject: Re: hadoop submit job very slow To: CDH Users when I execute: hadoop fsck /, it shows: Total size: 2246829645313 B (Total open files size: 1144403881 B), Total dirs: 22494, Total files: 47853 (Files currently

Hadoop--store a sequence file in distributed cache?

2011-08-12 Thread Sofia Georgiakaki
Good morning, I would like to store some files in the distributed cache, in order to be opened and read by the mappers. The files are produced by another job and are sequence files. I am not sure if that format is suitable for the distributed cache, as the files in distr.cache are stored and re

Re: Hadoop--store a sequence file in distributed cache?

2011-08-12 Thread Dino Kečo
Hi Sofia, I assume that the output of the first job is stored on HDFS. In that case I would read the file directly from the mappers without using the distributed cache. If you put the file into the distributed cache, that would add one more copy operation to your process. Thanks, dino On Fri, Aug 12, 2011 at 9:53 AM, Sof
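
A minimal sketch of the direct-from-HDFS approach Dino describes, using the new (mapreduce) API; the path and the Text key/value types are assumptions, not Sofia's actual job:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class DirectHdfsReadMapper extends Mapper<Text, Text, Text, Text> {
      @Override
      protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical output file of the first job; replace with the real path.
        Path sideFile = new Path("/user/sofia/job1-output/part-00000");
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, sideFile, conf);
        try {
          Text key = new Text();
          Text value = new Text();
          while (reader.next(key, value)) {
            // keep whatever is needed for the checks done later in map()
          }
        } finally {
          reader.close();
        }
      }
    }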

Re: Hadoop--store a sequence file in distributed cache?

2011-08-12 Thread Sofia Georgiakaki
Thank you for the reply! In each map(), I need to open-read-close these files (more than 2 in the general case, and maybe up to 20 or more), in order to make some checks. Considering the huge amount of data in the input, performing all these file operations on HDFS will kill performance! So I

Re: Hadoop--store a sequence file in distributed cache?

2011-08-12 Thread Joey Echeverria
You can use any kind of format for files in the distributed cache, so yes you can use sequence files. They should be faster to parse than most text formats. -Joey On Fri, Aug 12, 2011 at 4:56 AM, Sofia Georgiakaki wrote: > Thank you for the reply! > In each map(), I need to open-read-close these
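
A hedged sketch of wiring Joey's answer into a driver: the HDFS paths below are hypothetical, and on 0.20 the DistributedCache API is the usual way to register the files before submitting the second job.

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;

    public class CacheFilesDriver {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical sequence-file outputs written by the first job.
        DistributedCache.addCacheFile(new URI("/user/sofia/job1-output/part-00000"), conf);
        DistributedCache.addCacheFile(new URI("/user/sofia/job1-output/part-00001"), conf);
        // ... configure and submit the second job with this conf ...
      }
    }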

RE: Hadoop--store a sequence file in distributed cache?

2011-08-12 Thread Adam Shook
If you are looking for performance gains, then reading these files once during the setup() call in your Mapper and storing them in a data structure like a Map or a List should give you benefits. Having to open/close the files during each map call incurs a lot of unneeded I/O. Yo
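
A sketch of Adam's suggestion, assuming the files were registered with DistributedCache and hold Text/Text pairs (both assumptions); the data is loaded once per task in setup() rather than once per map() call:

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class SideDataMapper extends Mapper<Text, Text, Text, Text> {
      private final Map<String, String> sideData = new HashMap<String, String>();

      @Override
      protected void setup(Context context) throws IOException {
        Configuration conf = context.getConfiguration();
        FileSystem localFs = FileSystem.getLocal(conf);
        // Local copies of the files registered via DistributedCache.addCacheFile().
        Path[] cached = DistributedCache.getLocalCacheFiles(conf);
        if (cached == null) return;
        for (Path p : cached) {
          SequenceFile.Reader reader = new SequenceFile.Reader(localFs, p, conf);
          try {
            Text k = new Text();
            Text v = new Text();
            while (reader.next(k, v)) {
              sideData.put(k.toString(), v.toString()); // loaded once per task
            }
          } finally {
            reader.close();
          }
        }
      }

      @Override
      protected void map(Text key, Text value, Context context)
          throws IOException, InterruptedException {
        // consult sideData here instead of reopening files per record
        context.write(key, value);
      }
    }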

RE: Hadoop--store a sequence file in distributed cache?

2011-08-12 Thread Ian Michael Gumby
This whole thread doesn't make a lot of sense. If your first m/r job creates the sequence files, which you then use as input files to your second job, you don't need to use distributed cache since the output of the first m/r job is going to be in HDFS. (Dino is correct on that account.) Sofia

Speed up node under-replicated block during decommission

2011-08-12 Thread jonathan.hwang
Hi All, I'm trying to decommission a data node from my cluster. I put the data node in the /usr/lib/hadoop/conf/dfs.hosts.exclude list and restarted the name nodes. The under-replicated blocks are starting to replicate, but the count is going down at a very slow pace. For 1 TB of data it takes over 1
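
For reference, the exclude-file route normally does not need a namenode restart; a sketch, assuming dfs.hosts.exclude already points at the file Jonathan mentions:

    # after adding the datanode's hostname to the exclude file,
    # tell the namenode to re-read it instead of restarting:
    $ hadoop dfsadmin -refreshNodes
    # watch the decommission progress:
    $ hadoop dfsadmin -report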

RE: Hadoop--store a sequence file in distributed cache?

2011-08-12 Thread GOEKE, MATTHEW (AG/1000)
Sofia, correct me if I am wrong, but Mike, I think this thread was about using the output of a previous job, in this case already in sequence-file format, as in-memory join data for another job. Side note: does anyone know what the rule of thumb on file size is when using the distributed cache vs

Re: Speed up node under-replicated block during decommission

2011-08-12 Thread Charles Wimmer
The balancer bandwidth setting does not affect decommissioning nodes. Decommissioning nodes replicate as fast as the cluster is capable. The replication pace depends on many variables: the number of nodes participating in the replication, the amount of network bandwidth each has, the amount of

Re: Speed up node under-replicated block during decommission

2011-08-12 Thread sridhar basam
On Fri, Aug 12, 2011 at 11:58 AM, wrote: > Hi All, > > I'm trying to decommission data node from my cluster. I put the data node > in the /usr/lib/hadoop/conf/dfs.hosts.exclude list and restarted the name > nodes. The under-replicated blocks are starting to replicate, but it's > going down in a

Re: Speed up node under-replicated block during decommission

2011-08-12 Thread Joey Echeverria
You can configure the undocumented variable dfs.max-repl-streams to increase the number of replications a data-node is allowed to handle at one time. The default value is 2. [1] -Joey [1] https://issues.apache.org/jira/browse/HADOOP-2606?focusedCommentId=12578700&page=com.atlassian.jira.plugin.s
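
If you want to try it, the setting would go in hdfs-site.xml on the namenode, roughly as below; the value is only an example, and the property is undocumented per the JIRA above:

    <property>
      <name>dfs.max-repl-streams</name>
      <value>10</value>
    </property>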

RE: Speed up node under-replicated block during decommission

2011-08-12 Thread jonathan.hwang
I did have these settings in hdfs-site.xml on all nodes: dfs.balance.bandwidthPerSec = 131072000, dfs.max-repl-streams = 50. It is still taking over 1 day or longer for 1 TB of under-replicated blocks to replicate. Thanks! Jonathan -Original Message- From: Joey Echeverria [mai

Re: Speed up node under-replicated block during decommission

2011-08-12 Thread Harsh J
It could be that your process has hung because a particular resident block (file) requires a very large replication factor, and your remaining # of nodes is less than that value. This is a genuine reason for a hang (but must be fixed). The process usually waits until there are no under-replicated block
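
One way to check for Harsh's scenario; a command sketch, where the grep pattern assumes the wording fsck typically uses for under-replicated blocks and the path/factor are placeholders:

    # list under-replicated blocks and their target replication
    $ hadoop fsck / | grep -i "Target Replicas"
    # if a file's replication factor exceeds the number of remaining nodes, lower it
    $ hadoop fs -setrep -w 3 /path/to/that/file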

What is the most efficient way to copy a large number of .gz files into HDFS?

2011-08-12 Thread W.P. McNeill
I have a large number of gzipped web server logs on NFS that I need to pull into HDFS for analysis by MapReduce. What is the most efficient way to do this? It seems like what I should do is: hadoop fs -copyFromLocal *.gz /my/HDFS/directory A couple of questions: 1. Is this single process, o
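
On the first question, -copyFromLocal runs as a single client process. A hedged alternative is distcp with a file:// source, which only works if the NFS mount is visible at the same path on every tasktracker node; the paths below are examples:

    # single-process copy from the NFS mount
    $ hadoop fs -copyFromLocal /mnt/nfs/logs/*.gz /my/HDFS/directory

    # parallel copy via MapReduce, assuming /mnt/nfs/logs is mounted on all nodes
    $ hadoop distcp file:///mnt/nfs/logs /my/HDFS/directory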

RE: Speed up node under-replicated block during decommission

2011-08-12 Thread Michael Segel
Just a thought... A really quick and dirty thing to do is to turn off the node. Within 10 minutes the node looks down to the JT and NN, so it gets marked as down. Run an fsck and it will show the files as under-replicated, and the cluster will then do the replication at a faster speed to rebalance the clust

Re: What is the most efficient way to copy a large number of .gz files into HDFS?

2011-08-12 Thread sridhar basam
On Fri, Aug 12, 2011 at 1:29 PM, W.P. McNeill wrote: > I have a large number of gzipped web server logs on NFS that I need to pull > into HDFS for analysis by MapReduce. What is the most efficient way to do > this? > > It seems like what I should do is: > > hadoop fs -copyFromLocal *.gz /my/HDFS

basic usage map/reduce error

2011-08-12 Thread Brown, Berlin [GCG-PFS]
I am getting this error with a mostly out-of-the-box configuration of version 0.20.203.0 when I try to run the wordcount example: $ hadoop jar hadoop-examples-0.20.203.0.jar wordcount /user/hduser/gutenberg /user/hduser/gutenberg-output6 2011-08-12 15:45:38,299 WARN org.apache.hadoop.mapred.

How do I add Hadoop dependency to a Maven project?

2011-08-12 Thread W.P. McNeill
I'm building a Hadoop project using Maven. I want to add Hadoop as a Maven dependency to my project. What do I do? I think the answer is that I add a section to my POM file, but I'm not sure what the contents of this section (groupId, artifactId, etc.) should be. Googling does not turn up a clear answer. Is the

Re: How do I add Hadoop dependency to a Maven project?

2011-08-12 Thread Luke Lu
Pre-0.21 (sustaining releases, large-scale tested) hadoop: org.apache.hadoop : hadoop-core : 0.20.203.0. Pre-0.23 (small-scale tested) hadoop: org.apache.hadoop : hadoop-mapred : ... Trunk (currently targeting 0.23.0, large-scale tested) hadoop WILL be: org.apache.hadoop : hadoop-mapre
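
As a pom.xml fragment, the pre-0.21 coordinates Luke lists would look like this:

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>0.20.203.0</version>
    </dependency>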

Re: How do I add Hadoop dependency to a Maven project?

2011-08-12 Thread W.P. McNeill
I want the latest version of Hadoop (with the new API). I guess that's the trunk version, but I don't see the hadoop-mapreduce artifact listed on https://repository.apache.org/index.html#nexus-search;quick~hadoop On Fri, Aug 12, 2011 at 2:47 PM, Luke Lu wrote: > Pre-0.21 (sustaining releases, la

Where is the hadoop-examples source code for the Sort example mapper/reducer?

2011-08-12 Thread Sean Hogan
Hi all, I was interested in learning how Hadoop implements its sort algorithm in the map/reduce framework. Could someone point me to the directory of the source code that has the mapper/reducer that the Sort example uses by default when I invoke: $ hadoop jar hadoop-*-examples.jar sort inp

Hadoop Cluster setup - no datanode

2011-08-12 Thread A Df
Hello Mates: Thanks to everyone for their help so far. I have learnt a lot and have now done single-node and pseudo-distributed mode. I have a Hadoop cluster, but when I ran jps on the master node and a slave node, not all processes were started. master: 22160 NameNode, 22716 Jps, 22458 JobTracker; slave: 32195 Jps. I als

Incompatible buildVersion jobtracker/tasktracker

2011-08-12 Thread Matt Matson
I'm getting this "Incompatible buildVersion" error even though they look to be the same build version (see below). The only difference is that the tasktracker is running on Debian-6.0 and the jobtracker on Centos-5.3. Are there known problems with this setup? 2011-08-12 23:27:10,362 WARN mapre

Re: Where is the hadoop-examples source code for the Sort example mapper/reducer?

2011-08-12 Thread Arun C Murthy
Sean, The sort impl is spread out over many files. I'd start with MapTask and ReduceTask and follow from there on. LMK if you need more info. thanks, Arun On Aug 12, 2011, at 12:48 PM, Sean Hogan wrote: > Hi all, > > I was interested in learning from how Hadoop implements their sort algori

Re: How do I add Hadoop dependency to a Maven project?

2011-08-12 Thread Luke Lu
There is a reason I capitalized WILL (SHALL) :) The current trunk mapreduce code is in flux. Once mr2 (MAPREDUCE-279) is merged into trunk (soon!), we'll be producing hadoop-mapreduce-0.23.0-SNAPSHOT, which depends on hadoop-hdfs-0.23.0-SNAPSHOT, which depends on hadoop-common-0.23.0-SNAPSHOT. If

Re: Hadoop Cluster setup - no datanode

2011-08-12 Thread A Df
Hello: I did more tests, and now I noticed that only 3 nodes have datanodes while the others do not. I ran the admin report tool and the result is below. Where do I configure the capacity? bin/hadoop dfsadmin -report: Configured Capacity: 0 (0 KB), Present Capacity: 0 (0 KB), DFS Remaining:
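
A note on the question: capacity isn't set directly. dfsadmin -report shows 0 when no datanodes have registered with the namenode; otherwise the figure is the sum of the space under each datanode's dfs.data.dir directories. A sketch of that property in hdfs-site.xml, with an example path:

    <property>
      <name>dfs.data.dir</name>
      <value>/var/lib/hadoop/dfs/data</value>
    </property>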

API Conflicts for Join Methods

2011-08-12 Thread Varad Meru
Hi All, Recently I was working with the CompositeInputFormat for a map-side join and then tried MultipleInputs for a reduce-side join. The problem I found was that both of these methods need you to use the JobConf class (@deprecated), and the implementation hasn't been provided in the 0.20.203 versio
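
A minimal old-API (JobConf) sketch of the MultipleInputs wiring Varad mentions; the paths are placeholders, and the identity mapper/reducer stand in for real per-source tagging mappers and a join reducer:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;
    import org.apache.hadoop.mapred.lib.MultipleInputs;

    public class ReduceSideJoinDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ReduceSideJoinDriver.class);
        conf.setJobName("reduce-side-join-sketch");
        // one call per data source; real code would pass a tagging Mapper per source
        MultipleInputs.addInputPath(conf, new Path("/data/source-a"),
            TextInputFormat.class, IdentityMapper.class);
        MultipleInputs.addInputPath(conf, new Path("/data/source-b"),
            TextInputFormat.class, IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class); // the actual join logic would live here
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        FileOutputFormat.setOutputPath(conf, new Path("/data/join-output"));
        JobClient.runJob(conf);
      }
    }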

Re: Hadoop Cluster setup - no datanode

2011-08-12 Thread A Df
Hello: After doing some more changes, the report shows only one datanode, and it is different from the ones that I selected. The problem seems to be the datanodes, but I am not sure why. Does Hadoop need to run as root? If a shared filesystem is used for the nodes, how can I specify that the data

RE: basic usage map/reduce error

2011-08-12 Thread Brown, Berlin [GCG-PFS]
OK, that wasn't the real error; it looks like this was it (when working with Cygwin). I am guessing that the task failed. Does the mapred task runner launch a new JVM process? That seems to be failing: MapAttempt TASK_TYPE="SETUP" TASKID="task_201108130149_0001_m_02" TASK_ATTEMPT_ID="attempt_2011081