How to select random n records using mapreduce ?

2011-06-27 Thread Jeff Zhang
Hi all,

I'd like to select N random records from a large amount of data using
hadoop, just wondering how I can achieve this? Currently my idea is to let
each mapper task select N / mapper_number records. Does anyone have such
experience?


-- 
Best Regards

Jeff Zhang


Re: Comparing two logs, finding missing records

2011-06-27 Thread Rajesh Balamohan
I believe you meant:

SELECT * FROM LOG1 LEFT OUTER JOIN LOG2 ON LOG1.recordid = LOG2.recordid
WHERE LOG2.recordid IS NULL;

(this produces the set of records in LOG1 which are not present in LOG2).

In Pig, we have to add an additional filter with an IS NULL condition after
the outer join, something like: missing = FILTER data BY LOG2::recordid IS NULL;

~Rajesh.B

On Mon, Jun 27, 2011 at 6:34 AM, Bharath Mundlapudi
bharathw...@yahoo.com wrote:

 SQL:

 SELECT * FROM LOG1 LEFT OUTER JOIN LOG2 ON LOG1.recordid = LOG2.recordid;


 PIG:
 data = JOIN LOG1 BY recordid LEFT OUTER, LOG2 BY recordid;
 DUMP data;


 If you need more Pig help, please post to the Pig mailing list.

 -Bharath


 
 From: Mark Kerzner markkerz...@gmail.com
 To: common-user@hadoop.apache.org; Bharath Mundlapudi 
 bharathw...@yahoo.com
 Sent: Sunday, June 26, 2011 5:50 PM
 Subject: Re: Comparing two logs, finding missing records


 Bharath,

 what would a Pig query look like?

 Thank you,
 Mark


 On Sun, Jun 26, 2011 at 5:12 PM, Bharath Mundlapudi bharathw...@yahoo.com
 wrote:

 If you have a SerDe or a PigLoader for your log format, Pig or Hive will
 probably be a quicker solution, with the join.
 
 -Bharath
 
 
 
 
 From: Mark Kerzner markkerz...@gmail.com
 To: Hadoop Discussion Group core-u...@hadoop.apache.org
 Sent: Saturday, June 25, 2011 9:39 PM
 Subject: Comparing two logs, finding missing records
 
 
 Hi,
 
 I have two logs which should have all the records for the same record_id;
 in other words, if a record_id is found in the first log, it should also be
 found in the second one. However, I suspect that the second log has been
 filtered, and I need to find the missing records. Anything is allowed: a
 MapReduce job, Hive, Pig, or even a NoSQL database.
 
 Thank you.
 
 It is also a good time to express my thanks to all the members of the
 group, who are always very helpful.
 
 Sincerely,
 Mark




-- 
~Rajesh.B


tar or hadoop archive

2011-06-27 Thread Rita
We use hadoop/hdfs to archive data. I archive a lot of files by creating one
large tar file and then placing it in HDFS. Is it better to use a hadoop archive
for this, or is it essentially the same thing?

-- 
--- Get your facts first, then you can distort them as you please.--


RE: Queue support from HDFS

2011-06-27 Thread GOEKE, MATTHEW (AG/1000)
Saumitra,

Two questions come to mind that could help you narrow down a solution:

1) How quickly do the downstream processes need the transformed data?
Reason: If you can delay the processing for a period of time, enough to 
batch the data into a blob that is a multiple of your block size, then you are 
obviously going to be working more towards the strong suit of vanilla MR.

2) What else will be running on the cluster?
Reason: If the cluster is primarily set up for this use case, then how often 
the job runs / what resources it consumes only needs to be optimized if it 
can't process the data fast enough. If it is not, then you could always set up a 
separate pool for this in the fair scheduler and allow it to use a certain 
amount of overhead on the cluster when these events are being generated.

Outside of the fact that you would have a lot of small files on the cluster 
(which can be resolved by running a nightly job to blob them and then delete the 
originals), I am not sure I would be too concerned about at least trying out 
this method. It would be helpful to know the size and type of data coming in, as 
well as what type of operation you are looking to do, if you would like a more 
concrete suggestion. Log data is a prime example of this type of workflow, and 
there are many suggestions out there as well as projects that attempt to 
address it (e.g. Chukwa). 

HTH,
Matt

-Original Message-
From: saumitra.shahap...@gmail.com [mailto:saumitra.shahap...@gmail.com] On 
Behalf Of Saumitra Shahapure
Sent: Friday, June 24, 2011 12:12 PM
To: common-user@hadoop.apache.org
Subject: Queue support from HDFS

Hi,

Is a queue-like structure supported on top of HDFS, where a stream of data is
processed as it is generated?
Specifically, I will have a stream of data coming in, and a data-independent
operation needs to be applied to it (so only a map function; the reducer is
the identity).
I wish to distribute the data among nodes using HDFS and start processing it as
it arrives, preferably in a single MR job.

I agree that it can be done by starting a new MR job for each batch of data,
but is starting many MR jobs frequently for small data chunks a good idea?
(Consider that a new batch arrives every few seconds and processing one batch
takes a few minutes.)

Thanks,
-- 
Saumitra S. Shahapure



Re: error in reduce task

2011-06-27 Thread Steve Loughran

On 24/06/11 18:16, Niels Boldt wrote:

Hi,

I'm running Nutch in a pseudo-distributed cluster, i.e. all daemons are running
on the same server. I'm writing to the hadoop list as it looks like a problem
related to hadoop.

Some of my jobs partially fail, and in the error log I get output like

2011-06-24 08:45:05,765 INFO org.apache.hadoop.mapred.ReduceTask:
attempt_201106231520_0190_r_00_0 Scheduled 1 outputs (0 slow hosts and 0
dup hosts)

2011-06-24 08:45:05,771 WARN org.apache.hadoop.mapred.ReduceTask:
attempt_201106231520_0190_r_00_0 copy failed:
attempt_201106231520_0190_m_00_0 from worker1
2011-06-24 08:45:05,772 WARN org.apache.hadoop.mapred.ReduceTask:
java.net.UnknownHostException: worker1




The above basically says that my worker is unknown, but I can't really make
any sense of it. Other jobs running before, at the same time, or after
complete fine without any error messages and without any changes on the
server. Also, other reduce tasks in the same run have succeeded. So it looks
like my worker sometimes 'disappears' and can't be reached.


If the worker had disappeared off the net, you'd be more likely to see 
a NoRouteToHost



My current theory is that it only happens when there are a couple of jobs
running at the same time. Is that a plausible explanation?

Would anybody have some suggestions for how I could get more information from
the system, or point me in a direction where I should look? (I'm also quite
new to hadoop.)


I'd assume that one machine in the cluster doesn't have an /etc/hosts 
entry for worker1, or that the DNS server is suffering under load. If you 
can, put the host list into the /etc/hosts table instead of relying on 
DNS. If you do it on all machines, it avoids having to work out which 
one is playing up. That said, some better logging of which host is 
trying to make the connection would be nice.
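
For example, an entry like the one below on every node (the address is purely
illustrative; use whatever IP worker1 actually has):

192.168.0.101   worker1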


Re: Reading HDFS files via Spring

2011-06-27 Thread John Armstrong
On Sun, 26 Jun 2011 17:34:34 -0700, Mark static.void@gmail.com
wrote:
 Hello all,
 
 We have a recommendation system that reads in similarity data via a 
 Spring context.xml as follows:
 
 <bean id="similarity"
       class="org.apache.mahout.cf.taste.impl.similarity.file.FileItemSimilarity">
   <constructor-arg value="/var/data/similarity.data"/>
 </bean>
 
 Is it possible to use Hadoop/HDFS with Spring? We would love to be able 
 to use something like:
 
 <constructor-arg value="hdfs://user/mark/similarity.data"/>
 
 Can this (easily) be accomplished?


I didn't have to do quite the same thing, but I was trying to load an
ApplicationContext using Spring bean files kept in HDFS.  It was pretty
straightforward to throw together an HDFSXMLApplicationContext class (and
some necessary supporting classes), so I'd be surprised if it were hard
to tweak other Spring classes similarly.

In this case, though, it looks like your problem isn't actually with
Spring so much as with the FileItemSimilarity class; it doesn't have
a constructor which takes a Path argument.  You might be able to extend
that class and add the kind of constructor you want to use, though.
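
If extending the class is awkward, another workaround is to pull the data out
of HDFS into a local temp file and hand that to the existing File-based
constructor. A rough sketch (class and method names here are mine, purely
illustrative):

import java.io.File;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.mahout.cf.taste.impl.similarity.file.FileItemSimilarity;

public class HdfsFileItemSimilarityFactory {
    public static FileItemSimilarity create(String hdfsUri) throws Exception {
        // Copy the similarity data out of HDFS to a local temp file.
        FileSystem fs = FileSystem.get(URI.create(hdfsUri), new Configuration());
        File local = File.createTempFile("similarity", ".data");
        fs.copyToLocalFile(new Path(hdfsUri), new Path(local.getAbsolutePath()));
        // Hand the local copy to the File-based Mahout constructor.
        return new FileItemSimilarity(local);
    }
}

Something like this could then be wired into the context.xml via Spring's
factory-method support.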


Re: Computing overlap of two files with hadoop

2011-06-27 Thread Claus Stadler

Hi,

I have posted the question to stackoverflow, where I have also 
clarified my problem a bit.


If you have a solution, please respond there (if it's not too much of a 
hassle):


http://stackoverflow.com/questions/6469171/computing-set-intersection-and-set-difference-of-the-records-of-two-files-with-ha

Best regards,
Claus

On 06/24/2011 12:44 PM, Claus Stadler wrote:

Hi,

My problem is as follows:
I have two input files, and I want to determine

a) The number of lines which only occur in file 1
b) The number of lines which only occur in file 2
c) The number of lines common to both (e.g. in regard to string equality)

Example:
File 1:
a
b
c

File 2:
a
d

Desired output for each case:
lines_only_in_1: 2 (b, c)
lines_only_in_2: 1 (d)
lines_in_both: 1 (a)

Basically my approach is as follows:
I wrote my own LineRecordReader, so that the mapper receives a pair 
consisting of the line (text) and a byte indicating the source file 
(either 0 or 1).

The mapper only returns the pair again, so actually it does nothing.
However, the side effect is that the combiner receives a
Map<Line, Iterable<SourceId>> (where SourceId is either 0 or 1).

Now, for each line I can get the set of sources it appears in. 
Therefore, I could write a combiner that counts, for each case (a, b, 
c), the number of lines (Listing 1).


The combiner then outputs a 'summary' only on cleanup (is that safe?).
So this summary looks like:

in_a_distinct_count_total   7531
in_b_distinct_count_total   3190
out_common_distinct_count_total 901

In the reducer I then only sum up the values for these summaries.


However, the main problem is that I need to treat both source files 
as a single virtual file which yields records of the form

(line, sourceId)  // sourceId either 0 or 1

And I am not sure how to achieve that.
So the question is whether I can avoid preprocessing and merging the 
files beforehand, and do that on the fly with something like a 
virtually-merged-file reader and custom record reader.

Any code example is much appreciated.

Best regards,
Claus


Listing 1:
public static class SourceCombiner
extends ReducerText, ByteWritable, Text, LongWritable
{
private long countA = 0;
private long countB = 0;
private long countC = 0; // C = lines (c)ommon to both sources

@Override
public void reduce(Text key, IterableByteWritable values, 
Context context) throws IOException, InterruptedException {

SetByte fileIds = new HashSetByte();
for (ByteWritable val : values) {
byte fileId = val.get();

fileIds.add(fileId);
}

if(fileIds.contains((byte)0)) { ++countA; }
if(fileIds.contains((byte)1)) { ++countB; }
if(fileIds.size() = 2) { ++countC; }
}

protected void cleanup(Context context)
throws java.io.IOException, java.lang.InterruptedException
{
context.write(new Text(in_a_distinct_count_total), new 
LongWritable(countA));
context.write(new Text(in_b_distinct_count_total), new 
LongWritable(countB));
context.write(new Text(out_common_distinct_count_total), new 
LongWritable(countC));

}
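
For reference, a minimal sketch of one way to get the (line, sourceId) records
without a custom record reader, assuming both files are simply added as inputs
to the same job and the new-style API is used (the file name checked in setup()
is illustrative):

import java.io.IOException;

import org.apache.hadoop.io.ByteWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SourceTaggingMapper extends Mapper<LongWritable, Text, Text, ByteWritable> {

    private ByteWritable sourceId;

    @Override
    protected void setup(Context context) {
        // Decide the source id once per split, from the name of the file it came from.
        String name = ((FileSplit) context.getInputSplit()).getPath().getName();
        sourceId = new ByteWritable((byte) (name.equals("file1.txt") ? 0 : 1));
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        context.write(line, sourceId);
    }
}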










RE: How to select random n records using mapreduce ?

2011-06-27 Thread Habermaas, William
I did something similar.  Basically I had a random sampling algorithm that I 
called from the mapper. If it returned true I would collect the data; otherwise 
I would discard it. 
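
A minimal sketch of that idea (the 1% rate is illustrative, and java.util.Random
stands in for the real sampling algorithm):

import java.io.IOException;
import java.util.Random;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SamplingMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

    private static final double SAMPLE_RATE = 0.01; // illustrative 1% sample
    private final Random random = new Random();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Keep each record with probability SAMPLE_RATE, discard the rest.
        if (random.nextDouble() < SAMPLE_RATE) {
            context.write(key, value);
        }
    }
}

Note that this yields approximately rate * total records rather than exactly N;
getting exactly N would need something like per-mapper reservoir sampling.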

Bill 

-Original Message-
From: ni...@basj.es [mailto:ni...@basj.es] On Behalf Of Niels Basjes
Sent: Monday, June 27, 2011 3:29 PM
To: mapreduce-u...@hadoop.apache.org
Cc: core-u...@hadoop.apache.org
Subject: Re: How to select random n records using mapreduce ?

The only solution I can think of is creating a counter in Hadoop
that is incremented each time a mapper lets a record through.
As soon as the value reaches a preselected threshold, the mappers simply
discard the additional input they receive.

Note that this will not be random at all, yet it's the best I can
come up with right now.

HTH

On Mon, Jun 27, 2011 at 09:11, Jeff Zhang zjf...@gmail.com wrote:

 Hi all,
 I'd like to select N random records from a large amount of data using
 hadoop, just wondering how I can achieve this? Currently my idea is to let
 each mapper task select N / mapper_number records. Does anyone have such
 experience?

 --
 Best Regards

 Jeff Zhang




-- 
Best regards / Met vriendelijke groeten,

Niels Basjes



RE: Performance Tuning

2011-06-27 Thread GOEKE, MATTHEW (AG/1000)
If you are running default configurations then you are only getting 2 mappers 
and 1 reducer per node. The rule of thumb I have gone on (and backed up by the 
definitive guide) is 2 processes per core, so on a 4-core box: tasktracker/datanode 
and 6 slots left. How you break it up from there is your call but I would suggest 
either 4 mappers / 2 reducers or 5 mappers / 1 reducer.
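
As a sketch, the 4 mapper / 2 reducer split would look something like this in
mapred-site.xml on each node (property names per the 0.20 docs linked below):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>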

Check out the below configs for details on what you are *most likely* running 
currently:
http://hadoop.apache.org/common/docs/r0.20.2/mapred-default.html
http://hadoop.apache.org/common/docs/r0.20.2/hdfs-default.html
http://hadoop.apache.org/common/docs/r0.20.2/core-default.html

HTH,
Matt

-Original Message-
From: Juan P. [mailto:gordoslo...@gmail.com] 
Sent: Monday, June 27, 2011 2:50 PM
To: common-user@hadoop.apache.org
Subject: Performance Tuning

I'm trying to run a MapReduce task against a cluster of 4 DataNodes with 4
cores each.
My input data is 4GB in size and it's split into 100MB files. Current
configuration is default so block size is 64MB.

If I understand it correctly, Hadoop should be running 64 mappers to process
the data.

I'm running a simple data counting MapReduce and it's taking about 30mins to
complete. This seems like way too much, doesn't it?
Is there any tuning you guys would recommend to try and see an improvement
in performance?

Thanks,
Pony



RE: How to select random n records using mapreduce ?

2011-06-27 Thread Jeff.Schmitz
Wait - Habermaas, like in Critical Theory?

-Original Message-
From: Habermaas, William [mailto:william.haberm...@fatwire.com] 
Sent: Monday, June 27, 2011 2:55 PM
To: common-user@hadoop.apache.org
Subject: RE: How to select random n records using mapreduce ?

I did something similar.  Basically I had a random sampling algorithm
that I called from the mapper. If it returned true I would collect the
data; otherwise I would discard it. 

Bill 

-Original Message-
From: ni...@basj.es [mailto:ni...@basj.es] On Behalf Of Niels Basjes
Sent: Monday, June 27, 2011 3:29 PM
To: mapreduce-u...@hadoop.apache.org
Cc: core-u...@hadoop.apache.org
Subject: Re: How to select random n records using mapreduce ?

The only solution I can think of is creating a counter in Hadoop
that is incremented each time a mapper lets a record through.
As soon as the value reaches a preselected threshold, the mappers simply
discard the additional input they receive.

Note that this will not be random at all, yet it's the best I can
come up with right now.

HTH

On Mon, Jun 27, 2011 at 09:11, Jeff Zhang zjf...@gmail.com wrote:

 Hi all,
  I'd like to select N random records from a large amount of data using
  hadoop, just wondering how I can achieve this? Currently my idea is to let
  each mapper task select N / mapper_number records. Does anyone have such
  experience?

 --
 Best Regards

 Jeff Zhang




-- 
Best regards / Met vriendelijke groeten,

Niels Basjes




Re: How to select random n records using mapreduce ?

2011-06-27 Thread Matt Pouttu-Clarke
If the incoming data is unique, you can create a hash of the data and then take
a modulus of the hash to select a random set.  So if you wanted 10% of the
data randomly:

hash % 10 == 0

Gives a random 10%
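
A minimal sketch of that predicate (meant to be called from a mapper; Text is
org.apache.hadoop.io.Text, and the mask just guards against negative hash codes):

// Deterministically keep roughly 10% of the distinct records, by content hash.
private boolean keep(Text value) {
    int hash = value.toString().hashCode();
    return (hash & Integer.MAX_VALUE) % 10 == 0;
}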


On 6/27/11 12:54 PM, Habermaas, William william.haberm...@fatwire.com
wrote:

 I did something similar.  Basically I had a random sampling algorithm that I
 called from the mapper. If it returned true I would collect the data;
 otherwise I would discard it.
 
 Bill 
 
 -Original Message-
 From: ni...@basj.es [mailto:ni...@basj.es] On Behalf Of Niels Basjes
 Sent: Monday, June 27, 2011 3:29 PM
 To: mapreduce-u...@hadoop.apache.org
 Cc: core-u...@hadoop.apache.org
 Subject: Re: How to select random n records using mapreduce ?
 
 The only solution I can think of is creating a counter in Hadoop
 that is incremented each time a mapper lets a record through.
 As soon as the value reaches a preselected threshold, the mappers simply
 discard the additional input they receive.
 
 Note that this will not be random at all, yet it's the best I can
 come up with right now.
 
 HTH
 
 On Mon, Jun 27, 2011 at 09:11, Jeff Zhang zjf...@gmail.com wrote:
 
 Hi all,
 I'd like to select N random records from a large amount of data using
 hadoop, just wondering how I can achieve this? Currently my idea is to let
 each mapper task select N / mapper_number records. Does anyone have such
 experience?
 
 --
 Best Regards
 
 Jeff Zhang
 
 
 






Why I cannot see live nodes in a LAN-based cluster setup?

2011-06-27 Thread Jingwei Lu
Hi Everyone:

I am quite new to hadoop here. I am attempting to set up Hadoop locally on
two machines, connected by LAN. Both of them pass the single-node test.
However, I failed in the two-node cluster setup, as shown in the 2 cases below:

1) set one as a dedicated namenode and the other as a dedicated datanode
2) set one as both name- and data-node, and the other as just a datanode

I launch start-dfs.sh on the namenode. Since I have all the ssh issues
cleared, I can always observe the startup of the daemons on every datanode.
However, the web page at http://(URI of namenode):50070 shows only 0 live
nodes for (1) and 1 live node for (2), which is the same as the output of
the command line hadoop dfsadmin -report.

Generally it appears that from the namenode you cannot observe the remote
datanode as alive, let alone a normal across-node MapReduce execution.

Could anyone give some hints / instructions at this point? I really
appreciate it!

Thanks.

Best Regards
Yours Sincerely

Jingwei Lu


RE: Why I cannot see live nodes in a LAN-based cluster setup?

2011-06-27 Thread GOEKE, MATTHEW (AG/1000)
Did you make sure to define the datanode/tasktracker in the slaves file in your 
conf directory and push that to both machines? Also, have you checked the logs 
on either machine to see if there are any errors?

Matt

-Original Message-
From: Jingwei Lu [mailto:j...@ucsd.edu] 
Sent: Monday, June 27, 2011 3:24 PM
To: HADOOP MLIST
Subject: Why I cannot see live nodes in a LAN-based cluster setup?

Hi Everyone:

I am quite new to hadoop here. I am attempting to set up Hadoop locally on
two machines, connected by LAN. Both of them pass the single-node test.
However, I failed in the two-node cluster setup, as shown in the 2 cases below:

1) set one as a dedicated namenode and the other as a dedicated datanode
2) set one as both name- and data-node, and the other as just a datanode

I launch start-dfs.sh on the namenode. Since I have all the ssh issues
cleared, I can always observe the startup of the daemons on every datanode.
However, the web page at http://(URI of namenode):50070 shows only 0 live
nodes for (1) and 1 live node for (2), which is the same as the output of
the command line hadoop dfsadmin -report.

Generally it appears that from the namenode you cannot observe the remote
datanode as alive, let alone a normal across-node MapReduce execution.

Could anyone give some hints / instructions at this point? I really
appreciate it!

Thanks.

Best Regards
Yours Sincerely

Jingwei Lu


Re: Why I cannot see live nodes in a LAN-based cluster setup?

2011-06-27 Thread Jingwei Lu
Hi,

I just manually modified the masters & slaves files on both machines.

I found something wrong in the log files, as shown below:

-- Master :
namenode.log:


2011-06-27 13:44:47,055 INFO org.mortbay.log: jetty-6.1.14
2011-06-27 13:44:47,394 INFO org.mortbay.log: Started
SelectChannelConnector@0.0.0.0:50070
2011-06-27 13:44:47,395 INFO
org.apache.hadoop.hdfs.server.namenode.NameNode: Web-server up at:
0.0.0.0:50070
2011-06-27 13:44:47,395 INFO org.apache.hadoop.ipc.Server: IPC Server
Responder: starting
2011-06-27 13:44:47,395 INFO org.apache.hadoop.ipc.Server: IPC Server
listener on 54310: starting
2011-06-27 13:44:47,396 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 0 on 54310: starting
2011-06-27 13:44:47,397 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 1 on 54310: starting
2011-06-27 13:44:47,397 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 2 on 54310: starting
2011-06-27 13:44:47,397 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 3 on 54310: starting
2011-06-27 13:44:47,402 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 4 on 54310: starting
2011-06-27 13:44:47,404 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 5 on 54310: starting
2011-06-27 13:44:47,406 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 6 on 54310: starting
2011-06-27 13:44:47,406 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 7 on 54310: starting
2011-06-27 13:44:47,406 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 8 on 54310: starting
2011-06-27 13:44:47,408 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 9 on 54310: starting
2011-06-27 13:44:47,500 INFO org.apache.hadoop.ipc.Server: Error register
getProtocolVersion
java.lang.IllegalArgumentException: Duplicate metricsName:getProtocolVersion
at
org.apache.hadoop.metrics.util.MetricsRegistry.add(MetricsRegistry.java:53)
at
org.apache.hadoop.metrics.util.MetricsTimeVaryingRate.<init>(MetricsTimeVaryingRate.java:89)
at
org.apache.hadoop.metrics.util.MetricsTimeVaryingRate.<init>(MetricsTimeVaryingRate.java:99)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:416)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
2011-06-27 13:45:02,572 INFO org.apache.hadoop.hdfs.StateChange: BLOCK*
NameSystem.registerDatanode: node registration from 127.0.0.1:50010 storage
DS-87816363-127.0.0.1-50010-1309207502566



-- slave:
datanode.log:


  1 2011-06-27 13:45:00,335 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
  2 /
  3 STARTUP_MSG: Starting DataNode
  4 STARTUP_MSG:   host = hdl.ucsd.edu/127.0.0.1
  5 STARTUP_MSG:   args = []
  6 STARTUP_MSG:   version = 0.20.2
  7 STARTUP_MSG:   build =
https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20 -r
911707; compiled by 'chrisdo' on Fri Feb 19 08:07:34 UTC 2010
  8 /
  9 2011-06-27 13:45:02,476 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: hdl.ucsd.edu/127.0.0.1:54310. Already tried 0 time(s).
 10 2011-06-27 13:45:03,549 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: hdl.ucsd.edu/127.0.0.1:54310. Already tried 1 time(s).
 11 2011-06-27 13:45:04,552 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: hdl.ucsd.edu/127.0.0.1:54310. Already tried 2 time(s).
 12 2011-06-27 13:45:05,609 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: hdl.ucsd.edu/127.0.0.1:54310. Already tried 3 time(s).
 13 2011-06-27 13:45:06,640 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: hdl.ucsd.edu/127.0.0.1:54310. Already tried 4 time(s).
 14 2011-06-27 13:45:07,643 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: hdl.ucsd.edu/127.0.0.1:54310. Already tried 5 time(s).
 15 2011-06-27 13:45:08,646 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: hdl.ucsd.edu/127.0.0.1:54310. Already tried 6 time(s).
 16 2011-06-27 13:45:09,661 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: hdl.ucsd.edu/127.0.0.1:54310. Already tried 7 time(s).
 17 2011-06-27 13:45:10,664 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: hdl.ucsd.edu/127.0.0.1:54310. Already tried 8 time(s).
 18 2011-06-27 13:45:11,678 INFO org.apache.hadoop.ipc.Client: Retrying
connect to server: hdl.ucsd.edu/127.0.0.1:54310. Already tried 9 time(s).
 19 2011-06-27 13:45:11,679 INFO org.apache.hadoop.ipc.RPC: Server at
hdl.ucsd.edu/127.0.0.1:54310 not available yet, Zzzzz...


(just guess, is this 

RE: Why I cannot see live nodes in a LAN-based cluster setup?

2011-06-27 Thread Jeff.Schmitz
http://www.mentby.com/tim-robertson/error-register-getprotocolversion.html



-Original Message-
From: Jingwei Lu [mailto:j...@ucsd.edu] 
Sent: Monday, June 27, 2011 3:58 PM
To: common-user@hadoop.apache.org
Subject: Re: Why I cannot see live nodes in a LAN-based cluster setup?

Hi,

I just manually modified the masters & slaves files on both machines.

I found something wrong in the log files, as shown below:

[...]

RE: Why I cannot see live nodes in a LAN-based cluster setup?

2011-06-27 Thread GOEKE, MATTHEW (AG/1000)
As a follow-up to what Jeff posted: go ahead and ignore the message you got on 
the NN for now.

If you look at the address that the DN log shows, it is 127.0.0.1, and the 
ip:port it is trying to connect to for the NN is 127.0.0.1:54310 --- it is 
trying to connect to itself as if it were still in single-machine mode. Make sure 
that you have correctly pushed the URI for the NN into the config files on both 
machines and then bounce DFS.
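
For reference, the property to check is fs.default.name in core-site.xml;
something like the sketch below, where the host name is whatever the NN box
actually resolves to:

<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode-host:54310</value>
</property>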

Matt

-Original Message-
From: jeff.schm...@shell.com [mailto:jeff.schm...@shell.com] 
Sent: Monday, June 27, 2011 4:08 PM
To: common-user@hadoop.apache.org
Subject: RE: Why I cannot see live nodes in a LAN-based cluster setup?

http://www.mentby.com/tim-robertson/error-register-getprotocolversion.html



-Original Message-
From: Jingwei Lu [mailto:j...@ucsd.edu] 
Sent: Monday, June 27, 2011 3:58 PM
To: common-user@hadoop.apache.org
Subject: Re: Why I cannot see live nodes in a LAN-based cluster setup?

Hi,

I just manually modified the masters & slaves files on both machines.

I found something wrong in the log files, as shown below:

[...]

Re: Performance Tuning

2011-06-27 Thread Juan P.
Matt,
Thanks for your help!
I think I get it now, but this part is a bit confusing:

"so: tasktracker/datanode and 6 slots left. How you break it up from there
is your call but I would suggest either 4 mappers / 2 reducers or 5 mappers
/ 1 reducer."
If it's 2 processes per core, then it's: 4 Nodes * 4 Cores/Node * 2
Processes/Core = 32 Processes Total

So my configuration mapred-site.xml should include these props:

<property>
  <name>mapred.map.tasks</name>
  <value>28</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>
</property>

Is that correct?

On Mon, Jun 27, 2011 at 4:59 PM, GOEKE, MATTHEW (AG/1000) 
matthew.go...@monsanto.com wrote:

 If you are running default configurations then you are only getting 2
 mappers and 1 reducer per node. The rule of thumb I have gone on (and backed
 up by the definitive guide) is 2 processes per core so: tasktracker/datanode
 and 6 slots left. How you break it up from there is your call but I would
 suggest either 4 mappers / 2 reducers or 5 mappers / 1 reducer.

 Check out the below configs for details on what you are *most likely*
 running currently:
 http://hadoop.apache.org/common/docs/r0.20.2/mapred-default.html
 http://hadoop.apache.org/common/docs/r0.20.2/hdfs-default.html
 http://hadoop.apache.org/common/docs/r0.20.2/core-default.html

 HTH,
 Matt

 -Original Message-
 From: Juan P. [mailto:gordoslo...@gmail.com]
 Sent: Monday, June 27, 2011 2:50 PM
 To: common-user@hadoop.apache.org
 Subject: Performance Tuning

 I'm trying to run a MapReduce task against a cluster of 4 DataNodes with 4
 cores each.
 My input data is 4GB in size and it's split into 100MB files. Current
 configuration is default so block size is 64MB.

 If I understand it correctly Hadoop should be running 64 Mappers to process
 the data.

 I'm running a simple data counting MapReduce and it's taking about 30mins
 to
 complete. This seems like way too much, doesn't it?
 Is there any tuning you guys would recommend to try and see an improvement
 in performance?

 Thanks,
 Pony




Re: Why I cannot see live nodes in a LAN-based cluster setup?

2011-06-27 Thread Jingwei Lu
Hi Matt and Jeff:

Thanks a lot for your instructions. I corrected the mistakes in the conf files
of the DN, and now the log on the DN becomes:

2011-06-27 15:32:36,025 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 0 time(s).
2011-06-27 15:32:37,028 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 1 time(s).
2011-06-27 15:32:38,031 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 2 time(s).
2011-06-27 15:32:39,034 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 3 time(s).
2011-06-27 15:32:40,037 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 4 time(s).
2011-06-27 15:32:41,040 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 5 time(s).
2011-06-27 15:32:42,043 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 6 time(s).
2011-06-27 15:32:43,046 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 7 time(s).
2011-06-27 15:32:44,049 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 8 time(s).
2011-06-27 15:32:45,052 INFO org.apache.hadoop.ipc.Client: Retrying connect
to server: clock.ucsd.edu/132.239.95.91:54310. Already tried 9 time(s).
2011-06-27 15:32:45,053 INFO org.apache.hadoop.ipc.RPC: Server at
clock.ucsd.edu/132.239.95.91:54310 not available yet, Zzzzz...

Seems the DN is trying to connect to the NN but always fails...



Best Regards
Yours Sincerely

Jingwei Lu



On Mon, Jun 27, 2011 at 2:22 PM, GOEKE, MATTHEW (AG/1000) 
matthew.go...@monsanto.com wrote:

 As a follow-up to what Jeff posted: go ahead and ignore the message you got
 on the NN for now.

  If you look at the address that the DN log shows, it is 127.0.0.1, and the
  ip:port it is trying to connect to for the NN is 127.0.0.1:54310 --- it
  is trying to connect to itself as if it were still in single-machine mode. Make
  sure that you have correctly pushed the URI for the NN into the config files
  on both machines and then bounce DFS.

 Matt

 [...]

Re: tar or hadoop archive

2011-06-27 Thread Joey Echeverria
Yes, you can see a picture describing HAR files in this old blog post:

http://www.cloudera.com/blog/2009/02/the-small-files-problem/

-Joey

On Mon, Jun 27, 2011 at 4:36 PM, Rita rmorgan...@gmail.com wrote:
 So, it does an index of the file?



 On Mon, Jun 27, 2011 at 10:10 AM, Joey Echeverria j...@cloudera.com wrote:

 The advantage of a hadoop archive file is that it lets you access the
 files stored in it directly. For example, if you archived three files
 (a.txt, b.txt, c.txt) in an archive called foo.har, you could cat one
 of the three files using the hadoop command line:

 hadoop fs -cat har:///user/joey/out/foo.har/a.txt

 You can also copy files out of the archive or use files in the archive
 as input to map reduce jobs.
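
 Creating the archive looks roughly like this (paths follow the example above
 but are otherwise illustrative):

 hadoop archive -archiveName foo.har -p /user/joey a.txt b.txt c.txt /user/joey/out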

 -Joey

 On Mon, Jun 27, 2011 at 3:06 AM, Rita rmorgan...@gmail.com wrote:
  We use hadoop/hdfs to archive data. I archive a lot of files by creating one
  large tar file and then placing it in HDFS. Is it better to use a hadoop
  archive for this, or is it essentially the same thing?
 
  --
  --- Get your facts first, then you can distort them as you please.--
 



 --
 Joseph Echeverria
 Cloudera, Inc.
 443.305.9434




 --
 --- Get your facts first, then you can distort them as you please.--




-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434


Re: Performance Tuning

2011-06-27 Thread Juan P.
Ok,
So I tried putting the following config in the mapred-site.xml of all of my
nodes:

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>name-node:54311</value>
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>7</value>
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>1</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>7</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>1</value>
  </property>
</configuration>

but when I start a new job it gets stuck at

11/06/28 03:04:47 INFO mapred.JobClient:  map 0% reduce 0%

Any thoughts?
Thanks for your help guys!

On Mon, Jun 27, 2011 at 7:33 PM, Juan P. gordoslo...@gmail.com wrote:

 Matt,
 Thanks for your help!
 I think I get it now, but this part is a bit confusing:
 "so: tasktracker/datanode and 6 slots left. How you break it up from there
 is your call but I would suggest either 4 mappers / 2 reducers or 5 mappers
 / 1 reducer."
 If it's 2 processes per core, then it's: 4 Nodes * 4 Cores/Node * 2
 Processes/Core = 32 Processes Total

 So my configuration mapred-site.xml should include these props:

 <property>
   <name>mapred.map.tasks</name>
   <value>28</value>
 </property>
 <property>
   <name>mapred.reduce.tasks</name>
   <value>4</value>
 </property>

 Is that correct?

 On Mon, Jun 27, 2011 at 4:59 PM, GOEKE, MATTHEW (AG/1000) 
 matthew.go...@monsanto.com wrote:

 If you are running default configurations then you are only getting 2
 mappers and 1 reducer per node. The rule of thumb I have gone on (and backed
 up by the definitive guide) is 2 processes per core so: tasktracker/datanode
 and 6 slots left. How you break it up from there is your call but I would
 suggest either 4 mappers / 2 reducers or 5 mappers / 1 reducer.

 Check out the below configs for details on what you are *most likely*
 running currently:
 http://hadoop.apache.org/common/docs/r0.20.2/mapred-default.html
 http://hadoop.apache.org/common/docs/r0.20.2/hdfs-default.html
 http://hadoop.apache.org/common/docs/r0.20.2/core-default.html

 HTH,
 Matt

 -Original Message-
 From: Juan P. [mailto:gordoslo...@gmail.com]
 Sent: Monday, June 27, 2011 2:50 PM
 To: common-user@hadoop.apache.org
 Subject: Performance Tuning

 I'm trying to run a MapReduce task against a cluster of 4 DataNodes with 4
 cores each.
 My input data is 4GB in size and it's split into 100MB files. Current
 configuration is default so block size is 64MB.

 If I understand it correctly Hadoop should be running 64 Mappers to
 process
 the data.

 I'm running a simple data counting MapReduce and it's taking about 30mins
 to
 complete. This seems like way too much, doesn't it?
 Is there any tuning you guys would recommend to try and see an
 improvement
 in performance?

 Thanks,
 Pony





RE: Why I cannot see live nodes in a LAN-based cluster setup?

2011-06-27 Thread GOEKE, MATTHEW (AG/1000)
At this point, if that is the correct IP, then I would see if you can actually 
ssh from the DN to the NN to make sure it can actually connect to the other 
box. If you can successfully connect through ssh then it's just a matter of 
figuring out why that port is having issues (netstat is your friend in this 
case). If you see it listening on 54310 then just power-cycle the box and try 
again.
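
Something like the following on the NN box would confirm whether anything is
actually listening on that port (a sketch; adjust for your OS):

netstat -an | grep 54310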

Matt

-Original Message-
From: Jingwei Lu [mailto:j...@ucsd.edu] 
Sent: Monday, June 27, 2011 5:38 PM
To: common-user@hadoop.apache.org
Subject: Re: Why I cannot see live nodes in a LAN-based cluster setup?

Hi Matt and Jeff:

Thanks a lot for your instructions. I corrected the mistakes in the conf files
of the DN, and now the log on the DN becomes:

[...]