Re: wrong value class error

2010-11-16 Thread Alex Baranau
The message refers to the value not being an IntWritable, which is the
*input* value type of your reducer (and the output value type of your
mapper). It looks like the problem is on the map side, not in the reducer.
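
To make the type contract concrete, here is a minimal sketch (the mapper is
hypothetical, since the original mapper was not posted, and the imports are
assumed): whatever the mapper, and any combiner, declares and actually writes
as its output value type has to agree with setMapOutputValueClass().

  // Hypothetical mapper, for illustration only.
  // (imports assumed: org.apache.hadoop.io.*, org.apache.hadoop.mapreduce.Mapper,
  //  java.io.IOException)
  public static class BFIDAMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final IntWritable one = new IntWritable(1);

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Writing anything other than an IntWritable here (or from a combiner
      // that reuses the reducer class below) produces the "wrong value class"
      // error at runtime.
      context.write(value, one);
    }
  }

  // Matching driver settings:
  // job.setMapOutputKeyClass(Text.class);
  // job.setMapOutputValueClass(IntWritable.class);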

Alex Baranau

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase

On Mon, Nov 15, 2010 at 11:50 PM, Arindam Khaled  wrote:

> Hello,
>
> I am new to Hadoop. I am getting the following error in my reducer.
>
> 10/11/15 15:29:11 WARN mapred.LocalJobRunner: job_local_0001
> java.io.IOException: wrong value class: class org.apache.hadoop.io.Text is
> not class org.apache.hadoop.io.IntWritable
>
> Here is my reduce class:
>
>  public static class BFIDAReducer
>      extends Reducer<Text, IntWritable, Text, Text> {
>    private Text result = new Text();
>
>    public void reduce(Text key, Iterable<IntWritable> values,
>                       Context context
>                       ) throws IOException, InterruptedException {
>      Text result = new Text();
>      GameFunctions gf = GameFunctions.getInstance();
>
>      String line = "";
>
>      for (IntWritable val : values) {
>        line = line + val.toString() + ",";
>      }
>
>      if (line.length() > 1)
>        line = (String) line.subSequence(0, line.length() - 1);
>
>      if (gf.isSolved(key.toString(), size))
>        solved = true;
>
>      result.set(line);
>      context.write(key, result);
>    }
>  }
>
> And here is my partial code from job configuration:
>
>job.setOutputKeyClass(Text.class);
>job.setOutputValueClass(Text.class);
>job.setMapOutputKeyClass(Text.class);
>job.setMapOutputValueClass(IntWritable.class);
>
> Can anyone help me?
>
> Thanks in advance.
>
> Arindam
>


Re: Problem identifying cause of a failed job

2010-11-16 Thread Sudhir Vallamkondu
Try upgrading to JVM 6.0_21. We have had JVM issues with 6.0.18 and Hadoop.


On 11/16/10 4:58 PM, "common-user-digest-h...@hadoop.apache.org"
 wrote:

> From: Greg Langmead 
> Date: Tue, 16 Nov 2010 17:50:17 -0500
> To: 
> Subject: Problem identifying cause of a failed job
> 
> Newbie alert.
> 
> I have a Pig script I tested on small data and am now running it on a larger
> data set (85GB). My cluster is two machines right now, each with 16 cores
> and 32G of ram. I configured Hadoop to have 15 tasktrackers on each of these
> nodes. One of them is the namenode, one is the secondary name node. I'm
> using Pig 0.7.0 and Hadoop 0.20.2 with Java 1.6.0_18 on Linux Fedora Core
> 12, 64-bit.
> 
> My Pig job starts, and eventually a reduce task fails. I'd like to find out
> why. Here's what I know:
> 
> The webUI lists the failed reduce tasks and indicates this error:
> 
> java.io.IOException: Task process exit with nonzero status of 134.
> at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)
> 
> The userlog userlogs/attempt_201011151350_0001_r_63_0/stdout says this:
> 
> #
> # A fatal error has been detected by the Java Runtime Environment:
> #
> #  SIGSEGV (0xb) at pc=0x7ff74158463c, pid=27109, tid=140699912791824
> #
> # JRE version: 6.0_18-b07
> # Java VM: Java HotSpot(TM) 64-Bit Server VM (16.0-b13 mixed mode
> linux-amd64 )
> [thread 140699484784400 also had an error]# Problematic frame:
> 
> # V  [libjvm.so+0x62263c]
> #
> # An error report file with more information is saved as:
> # 
> /tmp/hadoop-hadoop/mapred/local/taskTracker/jobcache/job_201011151350_0001/a
> ttempt_201011151350_0001_r_63_0/work/hs_err_pid27109.log
> #
> # If you would like to submit a bug report, please visit:
> #   http://java.sun.com/webapps/bugreport/crash.jsp
> #
> 
> My mapred-site.xml already includes this:
> 
> 
> keep.failed.task.files
> true
> 
> 
> So I was hoping that the file hs_err_pid27109.log would exist but it
> doesn't. I was sure to check the /tmp dir on both tasktrackers. In fact
> there is no dir  
> 
>   jobcache/job_201011151350_0001/attempt_201011151350_0001_r_63_0
> 
> only
> 
>   
> jobcache/job_201011151350_0001/attempt_201011151350_0001_r_63_0.cleanup
> 
> I'd like to find the source of the segfault, can anyone point me in the
> right direction? 
> 
> Of course let me know if you need more information!






Announcing HPC Instances on Amazon Elastic MapReduce

2010-11-16 Thread Gray, Adam
We are excited to announce that Amazon Elastic MapReduce can now take
advantage of Cluster Compute (cc1) and Cluster GPU (cg1) instances, giving you
the ability to combine Hadoop's massively parallelized
architecture with high performance computing. You can focus on developing HPC 
applications and Elastic MapReduce will handle workload parallelization, node 
configuration and scaling, and cluster management. Further, Elastic MapReduce 
applications that are I/O intensive have the opportunity to realize performance 
improvement by leveraging the low latency, full bisection bandwidth 10 Gbps 
Ethernet network between the instances in the cluster.

Cluster Compute Quadruple Extra Large (cc1.4xlarge) instances have the 
following configuration:

 *   A pair of quad-core Intel "Nehalem" X5570 processors with a total of
     33.5 ECU (EC2 Compute Units)
 *   23 GB of RAM
 *   1690 GB of local instance storage
 *   10 Gbps networking with network locality between nodes

Cluster GPU Quadruple Extra Large (cg1.4xlarge) instances add a pair of NVIDIA
Tesla® M2050 GPUs and when launched with Elastic MapReduce have the NVIDIA GPU
driver and CUDA runtime installed.

Please refer to the High Performance Computing AWS Solutions page for more
details on how HPC instances can be leveraged.

Sincerely,

The Amazon Elastic MapReduce Team



Problem identifying cause of a failed job

2010-11-16 Thread Greg Langmead
Newbie alert.

I have a Pig script I tested on small data and am now running it on a larger
data set (85GB). My cluster is two machines right now, each with 16 cores
and 32G of ram. I configured Hadoop to have 15 tasktrackers on each of these
nodes. One of them is the namenode, one is the secondary name node. I'm
using Pig 0.7.0 and Hadoop 0.20.2 with Java 1.6.0_18 on Linux Fedora Core
12, 64-bit.

My Pig job starts, and eventually a reduce task fails. I'd like to find out
why. Here's what I know:

The webUI lists the failed reduce tasks and indicates this error:

java.io.IOException: Task process exit with nonzero status of 134.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)

The userlog userlogs/attempt_201011151350_0001_r_63_0/stdout says this:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x7ff74158463c, pid=27109, tid=140699912791824
#
# JRE version: 6.0_18-b07
# Java VM: Java HotSpot(TM) 64-Bit Server VM (16.0-b13 mixed mode
linux-amd64 )
[thread 140699484784400 also had an error]# Problematic frame:

# V  [libjvm.so+0x62263c]
#
# An error report file with more information is saved as:
# 
/tmp/hadoop-hadoop/mapred/local/taskTracker/jobcache/job_201011151350_0001/a
ttempt_201011151350_0001_r_63_0/work/hs_err_pid27109.log
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
#

My mapred-site.xml already includes this:


keep.failed.task.files
true


So I was hoping that the file hs_err_pid27109.log would exist but it
doesn't. I was sure to check the /tmp dir on both tasktrackers. In fact
there is no dir  

  jobcache/job_201011151350_0001/attempt_201011151350_0001_r_63_0

only

  
jobcache/job_201011151350_0001/attempt_201011151350_0001_r_63_0.cleanup

I'd like to find the source of the segfault, can anyone point me in the
right direction? 

Of course let me know if you need more information!

Greg Langmead | Senior Research Scientist | SDL Language Weaver | (t) +1 310
437 7300
  


RE: Generic Performance Tuning of MapReduce

2010-11-16 Thread Michael Segel

Yeah... I would want to add one thing...

It's important to understand what each of the parameters does and how to
apply them to your cluster.
You may have fewer cores/CPUs, less/more memory, etc ...

And if you're running HBase on top of Hadoop...

So your mileage will definitely vary.
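
As a quick sanity check (a sketch, not a recommendation; it only dumps the
properties already discussed in this thread), you can print what a
Configuration actually resolves to on a given box, which makes it easier to
reason about each parameter before changing it:

import org.apache.hadoop.conf.Configuration;

public class ShowTuning {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // mapred-site.xml is not loaded by a bare Configuration, so pull it in
    // explicitly (assumes it is on the classpath of this tool).
    conf.addResource("mapred-site.xml");

    String[] keys = {
        "mapred.tasktracker.map.tasks.maximum",
        "mapred.tasktracker.reduce.tasks.maximum",
        "mapred.child.java.opts"
    };
    for (String key : keys) {
      // null means the property is not set anywhere this Configuration can see
      System.out.println(key + " = " + conf.get(key));
    }
  }
}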

> From: sanjay.sha...@impetus.co.in
> To: common-user@hadoop.apache.org
> Date: Tue, 16 Nov 2010 22:43:47 +0530
> Subject: RE: Generic Performance Tuning of MapReduce
> 
> You could look at one of the old papers here- 
> http://code.google.com/p/hadoop-toolkit/downloads/detail?name=White%20paper-HadoopPerformanceTuning.pdf&can=2&q=
> 
> 
> Regards,
> Sanjay Sharma
> 
> 
> -Original Message-
> From: bichonfrise74 [mailto:bichonfris...@gmail.com]
> Sent: Tuesday, November 16, 2010 1:07 AM
> To: common-user@hadoop.apache.org
> Subject: Generic Performance Tuning of MapReduce
> 
> I have been looking around on some configuration parameters to improve the
> performance of MapReduce.
> 
> Basically, I'm looking at the mapred-site.xml and so far I have set the
> following values:
> 
> mapred.tasktracker.map.tasks.maximum = 40
> mapred.tasktracker.reduce.tasks.maximum = 8
> mapred.child.java.opts = -Xmx300m
> 
> Are there any generic values that I can place inside the mapred-site.xml to
> improve the overall performance of MapReduce?
> 
> Thanks.
> 

Re: How does hadoop use ssh

2010-11-16 Thread Brian Bockelman
To be clear,

You only need to use SSH if you don't have any other way to start processes on 
your worker nodes.  Lots of larger "production" sites have ways to manage this 
without SSH, but this really gets down to whatever the site prefers (and their 
security team allows).

Brian

On Nov 16, 2010, at 7:15 PM, Arun C Murthy wrote:

> 
> On Nov 16, 2010, at 10:04 AM, rahul wrote:
> 
>> Hi,
>> 
>> I have one question regarding the use of passwordless ssh login by the
>> hadoop user across the hosts.
>> 
>> I want to understand when hadoop does passwordless ssh, is it once or can
>> happen any time, and is there a defined way to track that.
>> 
>> As this raises security concerns, how should it be dealt with?
>> 
>> Is it true that Hadoop itself does not use SSH keys other than for this
>> startup?
>> 
> 
> ssh is used only by the helper scripts to start the daemons (DataNodes and 
> TaskTrackers). The Hadoop software framework by itself doesn't use ssh.
> 
> Arun





Re: Generic Performance Tuning of MapReduce

2010-11-16 Thread bichonfrise74
Thank you for the documentation. It really helped.

I have this setup:

5 nodes (1 master, 4 slaves) each with 4 CPU (Xeon 2.4 GHz) and 4 GB memory.

Based on the documentation that was provided, it looks like I can set the
following parameters.

mapred-site.xml,

mapred.job.reuse.jvm.num.tasks = 5 (no basis, I am just increasing it)
mapreduce.jobtracker.handler.count = 32 (no basis, I am just increasing it)
mapred.tasktracker.map.tasks.maximum = 4
mapred.tasktracker.reduce.tasks.maximum = 4
mapreduce.task.io.sort.factor = 100
mapreduce.map.output.compress = true
mapreduce.compress.map.output = true

hdfs-site.xml,

dfs.namenode.handler.count = 64
dfs.block.size = 128

Any comments on the above parameters? How do you know if the above
parameters are improving the MapReduce job? Will it be enough if I just
base it on the elapsed time that it takes for the job to finish?
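
One low-tech way to answer that (a sketch only; the job setup is elided and
the class name is made up, not from this thread) is to time each run from the
driver and compare the counter dump that waitForCompletion(true) prints, e.g.
spilled records and shuffle bytes, across runs with different settings:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TuningRun {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "tuning-test");
    // ... setJarByClass / setMapperClass / setReducerClass / input and output
    //     paths exactly as in the existing driver ...

    long start = System.currentTimeMillis();
    boolean ok = job.waitForCompletion(true);  // 'true' prints progress + counters
    long elapsedMs = System.currentTimeMillis() - start;

    // Comparing counters (spilled records, shuffled bytes, GC time) between
    // runs is usually more telling than wall-clock time alone.
    System.out.println("success=" + ok + " elapsedMs=" + elapsedMs);
    System.exit(ok ? 0 : 1);
  }
}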

Thanks.



On Tue, Nov 16, 2010 at 9:13 AM, Sanjay Sharma
wrote:

> You could look at one of the old papers here-
> http://code.google.com/p/hadoop-toolkit/downloads/detail?name=White%20paper-HadoopPerformanceTuning.pdf&can=2&q=
>
>
> Regards,
> Sanjay Sharma
>
>
> -Original Message-
> From: bichonfrise74 [mailto:bichonfris...@gmail.com]
> Sent: Tuesday, November 16, 2010 1:07 AM
> To: common-user@hadoop.apache.org
> Subject: Generic Performance Tuning of MapReduce
>
> I have been looking around on some configuration parameters to improve the
> performance of MapReduce.
>
> Basically, I'm looking at the mapred-site.xml and so far I have set the
> following values:
>
> mapred.tasktracker.map.tasks.maximum = 40
> mapred.tasktracker.reduce.tasks.maximum = 8
> mapred.child.java.opts = -Xmx300m
>
> Are there any generic values that I can place inside the mapred-site.xml to
> improve the overall performance of MapReduce?
>
> Thanks.
>


Re: How does hadoop use ssh

2010-11-16 Thread Harsh J
Since this is being asked often, I've also added it to the wiki's FAQ:
http://wiki.apache.org/hadoop/FAQ#Does_Hadoop_require_SSH.3F

On Tue, Nov 16, 2010 at 11:45 PM, Arun C Murthy  wrote:
>
> ssh is used only by the helper scripts to start the daemons (DataNodes and
> TaskTrackers). The Hadoop software framework by itself doesn't use ssh.
>
> Arun
>
>

-- 
Harsh J
www.harshj.com


Re: How does hadoop use ssh

2010-11-16 Thread Arun C Murthy


On Nov 16, 2010, at 10:04 AM, rahul wrote:


> Hi,
>
> I have one question regarding the use of passwordless ssh login by the
> hadoop user across the hosts.
>
> I want to understand when hadoop does passwordless ssh, is it once or can
> happen any time, and is there a defined way to track that.
>
> As this raises security concerns, how should it be dealt with?
>
> Is it true that Hadoop itself does not use SSH keys other than for this
> startup?

ssh is used only by the helper scripts to start the daemons (DataNodes
and TaskTrackers). The Hadoop software framework by itself doesn't use
ssh.


Arun



Resources on building Hadoop with Apache Harmony Select

2010-11-16 Thread Guillermo Cabrera


Hello:

For the past few months, we have been working on getting Hadoop (common,
hdfs and mapreduce) working on Apache Harmony Select. We have documented
the steps we followed, issues we encountered and scripts we used in this
process. Please refer to the following link for further information.

http://wiki.apache.org/hadoop/HadoopOnHarmony

Regards,
Guillermo
--
IBM Emerging Internet Technology Group

How does hadoop use ssh

2010-11-16 Thread rahul
Hi,

I have one question regarding the use of passwordless ssh login by the hadoop
user across the hosts.

I want to understand when hadoop does passwordless ssh, is it once or can
happen any time, and is there a defined way to track that.

As this raises security concerns, how should it be dealt with?

Is it true that Hadoop itself does not use SSH keys other than for this
startup?

Any input with respect to the ssh use by hadoop would be helpful.

Thanks,
Rahul

Re: Caution using Hadoop 0.21

2010-11-16 Thread Steve Lewis
Two reasons -
1) we want a unit test to log whenever a write occurs
2) I want the keys generated by a write in a subsection of the app to be
augmented with added data before being sent to Hadoop


On Mon, Nov 15, 2010 at 11:21 PM, Owen O'Malley  wrote:

> I'm very sorry that you got burned by the change. Most MapReduce
> applications don't extend the Context classes since those are objects that
> are provided by the framework. In 0.21, we've marked which interfaces are
> stable and which are still evolving. We try and hold all of the interfaces
> stable, but evolving ones do change as we figure out what they should look
> like.
>
> Can I ask why you were extending the Context classes?
>
> -- Owen
>



-- 
Steven M. Lewis PhD
4221 105th Ave Ne
Kirkland, WA 98033
206-384-1340 (cell)
Institute for Systems Biology
Seattle WA


RE: Generic Performance Tuning of MapReduce

2010-11-16 Thread Sanjay Sharma
You could look at one of the old papers here- 
http://code.google.com/p/hadoop-toolkit/downloads/detail?name=White%20paper-HadoopPerformanceTuning.pdf&can=2&q=


Regards,
Sanjay Sharma


-Original Message-
From: bichonfrise74 [mailto:bichonfris...@gmail.com]
Sent: Tuesday, November 16, 2010 1:07 AM
To: common-user@hadoop.apache.org
Subject: Generic Performance Tuning of MapReduce

I have been looking around on some configuration parameters to improve the
performance of MapReduce.

Basically, I'm looking at the mapred-site.xml and so far I have set the
following values:

mapred.tasktracker.map.tasks.maximum = 40
mapred.tasktracker.reduce.tasks.maximum = 8
mapred.child.java.opts = -Xmx300m

Are there any generic values that I can place inside the mapred-site.xml to
improve the overall performance of MapReduce?

Thanks.



Re: wrong value class error

2010-11-16 Thread Harsh J
Hi,

On Tue, Nov 16, 2010 at 9:39 PM, Arindam Khaled  wrote:
> When I comment out the combiner class, it seems to work fine. Thanks.
>

That isn't a solution, but it does avoid the error. You need to
implement a proper Combiner class that emits the same key and value
types as your Mapper does. Your Reducer logic emits <Text, Text>,
which was the issue if you used the same class for the Combiner too.

But do know that the Combiner may be called 0...N times per Mapper.
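
For illustration, a minimal sketch of a type-correct combiner (the class name
and the summing logic are placeholders, nothing here comes from the original
job): it must consume and produce the mapper's <Text, IntWritable> pairs. And
since the reducer in this thread concatenates values in order, a combiner may
not be appropriate for that particular job at all.

  // (imports assumed: org.apache.hadoop.io.*, org.apache.hadoop.mapreduce.Reducer,
  //  java.io.IOException)
  public static class BFIDACombiner
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable sum = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable val : values) {
        total += val.get();
      }
      sum.set(total);
      // The combiner writes IntWritable values, matching the map output type,
      // so it stays correct whether it runs zero times or many times.
      context.write(key, sum);
    }
  }

  // In the driver:
  // job.setCombinerClass(BFIDACombiner.class);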

-- 
Harsh J
www.harshj.com


Re: wrong value class error

2010-11-16 Thread Arindam Khaled
This website has answered my question somewhat:

http://blog.pfa-labs.com/2010/01/first-stab-at-hadoop-and-map-reduce.html

When I comment out the combiner class, it seems to work fine. Thanks.

- Original Message -
From: "Arindam Khaled" 
To: common-user@hadoop.apache.org
Sent: Monday, November 15, 2010 6:05:58 PM
Subject: wrong value class error

Hello,

I am new to Hadoop and I think I'm doing something silly. I sent this
e-mail from another account which isn't registered to the hadoop user group.

I am getting the following error in my reducer.

10/11/15 15:29:11 WARN mapred.LocalJobRunner: job_local_0001
java.io.IOException: wrong value class: class  
org.apache.hadoop.io.Text is not class org.apache.hadoop.io.IntWritable

Here is my reduce class:

  public static class BFIDAReducer
      extends Reducer<Text, IntWritable, Text, Text> {
    private Text result = new Text();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      Text result = new Text();
      GameFunctions gf = GameFunctions.getInstance();

      String line = "";

      for (IntWritable val : values) {
        line = line + val.toString() + ",";
      }

      if (line.length() > 1)
        line = (String) line.subSequence(0, line.length() - 1);

      if (gf.isSolved(key.toString(), size))
        solved = true;

      result.set(line);
      context.write(key, result);
    }
  }

And here is my partial code from job configuration:

 job.setOutputKeyClass(Text.class);
 job.setOutputValueClass(Text.class);
 job.setMapOutputKeyClass(Text.class);
 job.setMapOutputValueClass(IntWritable.class);

Can anyone help me?



I know I'll have more question in near future.

Thanks in advance.

Arindam







Re: 0.21 found interface but class was expected

2010-11-16 Thread Steve Loughran

On 14/11/10 08:05, Allen Wittenauer wrote:

> [Yes, gmail people, this likely went to your junk folder. ]

> On Nov 13, 2010, at 5:28 PM, Lance Norskog wrote:


>> It is considered good manners :)
>>
>> Seriously, if you want to attract a community you have an obligation
>> to tell them when you're going to jerk the rug out from under their
>> feet.


> The rug has been jerked in various ways for every micro version since
> as long as I've been with Hadoop.  Such jerkings have always (eventually) been
> for the positive with a happy ending almost every time.  No pain, no gain.


I think Steve Lewis is probably miffed at the way something came out and 
broke things. With 0.20.x being so stable for a while, it's set up 
expectations about stability that don't match what the developers have 
been working off, which is "we are a 0.x project and anything not marked 
as stable isn't".


One thing Hadoop is very sensitive about is preserving filesystem data;
I haven't heard anything bad there. It's just that here we've done some
transitions that are forcing recompiles and that has follow on effects.


Steve Lewis: one way to stay ahead of these problems is to hook your 
hudson server up to checking out and building hadoop, then testing your 
code against it. This is how you can find and report problems before 
releases. It'd be great to get you involved in this, though I will warn 
you that merging and retesting against trunk can be quite time consuming 
at times.




> Oh, one other thing.
>
> Here we are, several fairly significant conferences later (both as the main focus
> and as one of the leading topics) and I still don't understand why people have concerns
> about "attracting a community".  When you have what seems like 100s of
> companies creating products either built around or integrating Hadoop (the full gamut of
> several stealth startups to Major Players like IBM), it doesn't really seem like that is
> much of an issue anymore.
>
> At this point, I'm actually in the opposite camp:  the community has
> grown TOO fast to the point that major problems in the source won't be able to
> be fixed because folks will expect less breakage.  This is especially easy for
> sites with a few hundred nodes (or with enough frosting on top) because
> everything seems to be working for them.  Many of them will not really
> understand that at super large scales, some things just don't work.  In order
> to fix some of the issues, breakage will occur.


1. I view a few hundred nodes as large. That's because the nodes are 
getting bigger, that many machines can still be multi-PB. Even 20-50 
nodes is a place for fun, and its where most people are playing. these 
are the majority of end users, even if the few large clusters run by 
facebook, Yahoo! and LinkedIn are in a different league.


2. I am with Allen here in that there are big changes we need to get in, 
and there is a lot of inertia about doing this. But it's the cost of 
success: there is too much value in files in the filesystem, too much of 
a tangible cost of any performance problems, that people whose business 
depends on Hadoop are worried about the costs of changes. In particular, 
the NN and JT are scale limits in the big clusters, and anything that 
uses more memory or keeps things around for longer threatens the big 
clusters, no matter the benefit to the smaller ones.




> The end result can either be a community divided into multiple camps
> due to forking or a community that has learned to tolerate these minor
> inconveniences when they pop up.  I for one would rather be in the latter, but
> it sure seems like some parts of the community (and in many ways, the ASF
> itself) would rather it be the former.


I don't think forking helps, but there might be benefits for
 -more agility in deployments, better Hadoop on VM work, where even the 
JT asks for machines as part of its execution plan.
 -a design of the NN that scales better by having bits of the 
filesystem handled by other namenodes, so reducing the conflict between 
the very large clusters (anything with -Xmx32g for the NN heap) and 
everyone else.


-Steve


Re: Dealing with Jobs with different memory and slots requirements

2010-11-16 Thread Steve Loughran

On 16/11/10 00:29, Marc Sturlese wrote:


> I have a hadoop test cluster (12 nodes) and I am running different MapReduce
> jobs. These jobs are executed sequentially, as the input of one needs the
> output of the other.
> I am wondering if there is a way to manage the memory of the nodes per job.
> I mean, there are jobs that use all the reduce slots of my cluster and don't
> use much memory; these scale so well. But there are others that don't use
> all the reduce slots (and can't be more parallelized) and would be much
> faster if I was able to assign more memory to them. I don't see a way to do
> something similar to that if I don't turn off the cluster, change the nodes'
> conf and turn it on again. Which is pretty dirty...



> It would be good if, in the same cluster, I could have some nodes with fewer
> reducers and more memory for them, and I could tell a job to use those
> nodes... but I don't think it's possible.


You can't say where reducers will run, though you can give nodes
different numbers of map or reduce slots. If your reducers are all
memory hungry, give the machines fewer reduce slots than map slots.
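
A related point, as a hedged aside: mapred.child.java.opts and the number of
reduce tasks are job-level settings, so the memory-hungry step can ask for a
bigger child heap without restarting anything. A minimal driver sketch (the
class name and the values are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MemoryHungryStep {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Per-job setting, picked up from the job conf by the tasktrackers:
    conf.set("mapred.child.java.opts", "-Xmx2048m"); // bigger heap for this job's tasks only

    Job job = new Job(conf, "memory-hungry-step");
    job.setNumReduceTasks(4);                        // fewer, fatter reducers for this step
    // ... mapper/reducer/input/output setup as in the existing driver ...
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The slot counts themselves (mapred.tasktracker.*.tasks.maximum) are still
tasktracker-side settings, so the "some nodes with fewer, bigger slots"
layout does need per-node configuration and a tasktracker restart.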


There's work underway to make scheduling more aware of system load:
rather than a fairly simplistic "slot" model, it would look more at
system load and memory load as a way of measuring how idle machines are.
If you were to be really devious, you'd look at io load, network, 
machine temperature, etc. If you find this an interesting problem to get 
involved in, the mapreduce-dev mailing list is the place to get involved.


Be advised, scheduling and placement are CS-hard problems: fun to work 
in if you enjoy the issues, but there is no perfect solution


steve


Re: Hadoop installation on Windows

2010-11-16 Thread Steve Loughran

On 13/11/10 23:34, Christopher Worley wrote:

> Thanks for the advice, guys.  I found this tutorial that covers
> installation of 0.21.0:
> http://alans.se/blog/2010/hadoop-hbase-cygwin-windows-7-x64/
>
> The author suggests adding the "CLASSPATH=`cygpath -wp "$CLASSPATH"`"
> to bin/hadoop-config.sh just like you suggested.  I made that change
> and then checked the version.  Here's what I got:
>
> $ bin/hadoop version
> cygwin warning:
>    MS-DOS style path detected: C:\cygwin\usr\local\hadoop-0.21.0\/build/native
>    Preferred POSIX equivalent is: /usr/local/hadoop-0.21.0/build/native
>    CYGWIN environment variable option "nodosfilewarning" turns off this warning.
>    Consult the user's guide for more details about POSIX paths:
>      http://cygwin.com/cygwin-ug-net/using.html#using-pathnames
> Hadoop 0.21.0
> Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.21 -r 985326
> Compiled by tomwhite on Tue Aug 17 01:02:28 EDT 2010
> From source with checksum a1aeb15b4854808d152989ba76f90fac
>
> It gives a warning on the path format, but it appears to work--it
> gives the correct Hadoop version.
>
> When I try to do the next step, formatting the namenode, I get the
> following exceptions:
> http://pastebin.com/YhZ8JZjG
>
> When I try to run "bin/start-dfs.sh" or "bin/start-mapred.sh" I get
> "Hadoop common not found".
>
> I appreciate any help.


If you can get the classpath printed out, it may be that 
hadoop-common.jar isn't on it.


As an aside, based on my experience on other projects, the shell scripts 
to start java apps are an extreme source of pain, for various reasons

 -shell script engine variations on Unix platforms (bash, unix sh)
 -bash isn't that good a language for complex development (at least by 
java programmers)

 -cygwin causes extra confusion
 -zero/not enough testing of the scripts in the unit tests

For the limited number of lines, they generate a lot of support calls.

I prefer using python as a launcher, as it's more consistent. Hadoop may 
benefit from a python launcher, although Hadoop will probably still 
expect cygwin on win32 just for the shell commands it likes to execute.


The other option, assuming the cluster is just for local development, is to
have an ant build file to do the classpath setup. They have done the 
windows and cygwin support [1], and if you use  with 
failonerror=true and failonerror=false the ant process will stay hooked 
up (so streaming console output both ways), and pass up failures to the 
caller.


Just a thought.

-steve

[1] http://ant.apache.org/manual/running.html


Re: Hadoop installation on Windows

2010-11-16 Thread Steve Loughran

On 12/11/10 22:29, Vijay wrote:

> It seems like on Windows java doesn't work well with cygwin-style paths on
> classpath (/cygdrive/d/). The PlatformName error at the beginning is due to
> that. This is coming from the bin/hadoop-config.sh script which is using
> cygwin-style paths for the jar files.


Windows Java isn't cygwin aware, so it only likes native paths in any 
path it sees.


wrong value class error

2010-11-16 Thread Arindam Khaled
Hello,

I am new to Hadoop. I am getting the following error in my reducer.

10/11/15 15:29:11 WARN mapred.LocalJobRunner: job_local_0001
java.io.IOException: wrong value class: class org.apache.hadoop.io.Text is
not class org.apache.hadoop.io.IntWritable

Here is my reduce class:

  public static class BFIDAReducer
      extends Reducer<Text, IntWritable, Text, Text> {
    private Text result = new Text();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      Text result = new Text();
      GameFunctions gf = GameFunctions.getInstance();

      String line = "";

      for (IntWritable val : values) {
        line = line + val.toString() + ",";
      }

      if (line.length() > 1)
        line = (String) line.subSequence(0, line.length() - 1);

      if (gf.isSolved(key.toString(), size))
        solved = true;

      result.set(line);
      context.write(key, result);
    }
  }

And here is my partial code from job configuration:

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);

Can anyone help me?

Thanks in advance.

Arindam