Re: Hadoop/Elastic MR on AWS

2010-12-24 Thread Ted Dunning
EMR instances are started near each other.  This increases the bandwidth
between nodes.

There may also be some enhancements in terms of access to the SAN that
supports EBS.

On Fri, Dec 24, 2010 at 4:41 AM, Otis Gospodnetic <
otis_gospodne...@yahoo.com> wrote:

> - Original Message 
> > From: Amandeep Khurana 
> > To: common-user@hadoop.apache.org
> > Sent: Fri, December 10, 2010 1:14:45 AM
> > Subject: Re: Hadoop/Elastic MR on AWS
> >
> > Mark,
> >
> > Using EMR makes it very easy to start a cluster and add/reduce capacity
> > as and when required. There are certain optimizations that make EMR an
> > attractive choice as compared to building your own cluster out. Using EMR
>
>
> Could you please point out what optimizations you are referring to?
>


Custom input split

2010-12-24 Thread Black, Michael (IS)
Using hadoop-0.20


I'm doing custom input splits from a Lucene index.

I want to split the document IDs across N mappers (I'm testing the
scalability of the problem across 4 nodes and 8 cores).

So the key is the document number, and the IDs are not sequential.

At this point I'm using splits.add to add each document, but that sets up
one task for every document, which is not something I want to do, of course.

How can I add a group of documents to each split?  I found a scant reference
to PrimeInputSplit, but that doesn't seem to resolve on hadoop-0.20.
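
One possible approach, sketched below against the new org.apache.hadoop.mapreduce API, is to carry a whole batch of document IDs in each split so that one map task handles many documents. The class names DocIdBatchSplit and DocIdBatchInputFormat, the docs.per.split property, and the collectDocIdsFromIndex() helper are illustrative placeholders, not existing Hadoop or Lucene APIs.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.InputFormat;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;

/** One split carrying a whole batch of (non-sequential) document IDs. */
public class DocIdBatchSplit extends InputSplit implements Writable {
  private int[] docIds = new int[0];

  public DocIdBatchSplit() {}                                // no-arg constructor for deserialization
  public DocIdBatchSplit(int[] docIds) { this.docIds = docIds; }

  public int[] getDocIds() { return docIds; }                // the RecordReader iterates over these

  @Override
  public long getLength() { return docIds.length; }          // rough size, used only as a scheduling hint

  @Override
  public String[] getLocations() { return new String[0]; }   // no HDFS locality for index doc IDs

  public void write(DataOutput out) throws IOException {
    out.writeInt(docIds.length);
    for (int id : docIds) out.writeInt(id);
  }

  public void readFields(DataInput in) throws IOException {
    docIds = new int[in.readInt()];
    for (int i = 0; i < docIds.length; i++) docIds[i] = in.readInt();
  }
}

/** InputFormat skeleton (separate source file): only getSplits() is shown. */
abstract class DocIdBatchInputFormat<K, V> extends InputFormat<K, V> {

  /** However the Lucene document IDs are actually gathered from the index. */
  protected abstract int[] collectDocIdsFromIndex(JobContext context) throws IOException;

  @Override
  public List<InputSplit> getSplits(JobContext context) throws IOException {
    int[] allDocIds = collectDocIdsFromIndex(context);
    int docsPerSplit = context.getConfiguration().getInt("docs.per.split", 1000);

    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (int start = 0; start < allDocIds.length; start += docsPerSplit) {
      int end = Math.min(start + docsPerSplit, allDocIds.length);
      splits.add(new DocIdBatchSplit(Arrays.copyOfRange(allDocIds, start, end)));
    }
    return splits;                                           // one map task per batch, not per document
  }
}

The job's RecordReader would then emit one record per ID from getDocIds(), and docs.per.split controls how many map tasks are created.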


Michael D. Black
Senior Scientist
Northrop Grumman Information Systems
Advanced Analytics Directorate







Re: Hadoop/Elastic MR on AWS

2010-12-24 Thread Otis Gospodnetic
Hello Amandeep,



- Original Message 
> From: Amandeep Khurana 
> To: common-user@hadoop.apache.org
> Sent: Fri, December 10, 2010 1:14:45 AM
> Subject: Re: Hadoop/Elastic MR on AWS
> 
> Mark,
> 
> Using EMR makes it very easy to start a cluster and add/reduce capacity as
> and when required. There are certain optimizations that make EMR an
> attractive choice as compared to building your own cluster out. Using EMR


Could you please point out what optimizations you are referring to?

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch - Hadoop - HBase
Hadoop ecosystem search :: http://search-hadoop.com/

> also ensures you are using a production quality, stable system backed by the
> EMR engineers. You can always use bootstrap actions to put your own tweaked
> version of Hadoop in there if you want to do that.
> 
> Also, you don't have to tear down your cluster after every job. You can set
> the alive option when you start your cluster and it will stay there even
> after your Hadoop job completes.
> 
> If you face any issues with EMR, send me a mail offline and I'll be happy to
> help.
> 
> -Amandeep
> 
> 
> On Thu, Dec 9, 2010 at 9:47 PM, Mark wrote:
> 
> > Does anyone have any thoughts/experiences on running Hadoop in AWS? What
> > are some pros/cons?
> >
> > Are there any good AMI's out there for this?
> >
> > Thanks for any advice.
> >
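
For reference, the "alive" option Amandeep describes corresponds to keeping the job flow running when no steps are queued. A minimal sketch of starting such a cluster with the AWS SDK for Java follows; the class name StartKeepAliveCluster, the credentials, instance types and count, cluster name, and S3 log bucket are all placeholders, and the thread itself does not prescribe this SDK (the same option is exposed by the EMR console and command-line client).

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.JobFlowInstancesConfig;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowRequest;
import com.amazonaws.services.elasticmapreduce.model.RunJobFlowResult;

public class StartKeepAliveCluster {
  public static void main(String[] args) {
    // Placeholder credentials; in practice these come from a secure source.
    AmazonElasticMapReduceClient emr =
        new AmazonElasticMapReduceClient(new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));

    JobFlowInstancesConfig instances = new JobFlowInstancesConfig()
        .withInstanceCount(5)                         // placeholder sizing: 1 master + 4 workers
        .withMasterInstanceType("m1.large")
        .withSlaveInstanceType("m1.large")
        .withKeepJobFlowAliveWhenNoSteps(true);       // the "alive" option: cluster survives job completion

    RunJobFlowRequest request = new RunJobFlowRequest()
        .withName("long-lived-cluster")               // placeholder name
        .withLogUri("s3://my-bucket/emr-logs/")       // placeholder bucket
        .withInstances(instances);

    RunJobFlowResult result = emr.runJobFlow(request);
    System.out.println("Started job flow: " + result.getJobFlowId());
  }
}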


RE: EXTERNAL:Tasktracker failing and getting black listed

2010-12-24 Thread Black, Michael (IS)
#1 Check that the CPU fan is working.  A hot CPU can give flaky errors, especially
during high CPU load.
#2 Run a memory test on the machine.  You might have a bad memory stick that is
getting hit (though I would tend to think that would be a bit more random).
I've used memtest86 before to find such problems: http://www.memtest86.com
#3 Check the disk.  Hopefully smartctl is on your system, which will tell you
if any disk errors are occurring.  Otherwise use the manufacturer's disk
testing tools.

Or...
If you can, move the CPU and/or memory and/or disk between two machines and 
see if the problem migrates to the other machine.  I'd probably do all 3 at 
once just to confirm it's one of them, then move them back one at a time.



Michael D. Black
Senior Scientist
Northrop Grumman Information Systems
Advanced Analytics Directorate



-Original Message-
From: Jim Twensky [mailto:jim.twen...@gmail.com]
Sent: Thursday, December 23, 2010 4:37 PM
To: core-u...@hadoop.apache.org
Subject: EXTERNAL:Tasktracker failing and getting black listed

Hi,

I have a 16+1 node hadoop cluster where all tasktrackers (and
datanodes) are connected to the same switch and share the exact same
hardware and software configuration. When I run a hadoop job, one of
the task trackers always produces one of these two errors ONLY during
the reduce tasks and gets blacklisted eventually.

-
org.apache.hadoop.fs.ChecksumException: Checksum Error
at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:164)
at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101)
at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:328)
at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:358)
at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:342)
at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:374)
at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:220)
at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:330)
at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:350)
at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:111)
at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:86)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:173)
at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1214)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1500)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1116)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:512)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:585)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
-

or

-
java.lang.RuntimeException: next value iterator failed
at org.apache.hadoop.mapreduce.ReduceContext$ValueIterator.next(ReduceContext.java:160)
at src.expinions.PhraseGen.ReduceClass.reduce(ReduceClass.java:17)
at src.expinions.PhraseGen.ReduceClass.reduce(ReduceClass.java:10)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:174)
at org.apache.hadoop.mapred.Task$NewCombinerRunner.combine(Task.java:1214)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1500)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1116)
at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:512)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:585)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: org.apache.hadoop.fs.ChecksumException: Checksum Error
at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:164)
at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101)
at org.apache.hadoop.mapred.IFile$Reader.readData(IFile.java:328)
at org.apache.hadoop.mapred.IFile$Reader.rejigData(IFile.java:358)
at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:342)
at org.apache.hadoop.mapred.IFile$Reader.next(IFile.java:404)
at org.apache.hadoop.mapred.Merger$Segment.next(Merger.java:220)
at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:330)
at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:350)
at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:111)
at org.apach

Help: Could not obtain block: blk_ Exception

2010-12-24 Thread Tali K




Hi All,
I am getting "Could not obtain block: blk_2706642997966533027_4482
file=/user/outputwc425729652_0/part-r-0".
I checked, and the file is actually there.
What should I do?
Please help.


Could not obtain block: blk_2706642997966533027_4482 file=/user/outputwc425729652_0/part-r-0
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1812)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1638)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1767)
at java.io.DataInputStream.read(DataInputStream.java:132)
at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:264)
at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:306)
at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:158)
at java.io.InputStreamReader.read(InputStreamReader.java:167)
at java.io.BufferedReader.fill(BufferedReader.java:136)
at java.io.BufferedReader.readLine(BufferedReader.java:299)
at java.io.BufferedReader.readLine(BufferedReader.java:362)
at speeditup.ClusterByWordCountFSDriver$ClusterBasedOnWordCountMapper.map(ClusterByWordCountFSDriver.java:157)
at speeditup.ClusterByWordCountFSDriver$ClusterBasedOnWordCountMapper.map(ClusterByWordCountFSDriver.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)

  

how to build hadoop-hdfs trunk locally

2010-12-24 Thread sravankumar
Hi,

 

I am facing problems building hadoop-hdfs: it is trying to download hadoop-common-.jar and cannot resolve it.

Does anyone know how to overcome this problem?

 

[ivy:resolve]  WARNINGS
[ivy:resolve]  module not found: org.apache.hadoop#hadoop-common;0.23.0-SNAPSHOT
[ivy:resolve]  apache-snapshot: tried
[ivy:resolve]    https://repository.apache.org/content/repositories/snapshots/org/apache/hadoop/hadoop-common/0.23.0-SNAPSHOT/hadoop-common-0.23.0-SNAPSHOT.pom
[ivy:resolve]    -- artifact org.apache.hadoop#hadoop-common;0.23.0-SNAPSHOT!hadoop-common.jar:
[ivy:resolve]    https://repository.apache.org/content/repositories/snapshots/org/apache/hadoop/hadoop-common/0.23.0-SNAPSHOT/hadoop-common-0.23.0-SNAPSHOT.jar
[ivy:resolve]  maven2: tried
[ivy:resolve]    http://repo1.maven.org/maven2/org/apache/hadoop/hadoop-common/0.23.0-SNAPSHOT/hadoop-common-0.23.0-SNAPSHOT.pom
[ivy:resolve]    -- artifact org.apache.hadoop#hadoop-common;0.23.0-SNAPSHOT!hadoop-common.jar:
[ivy:resolve]    http://repo1.maven.org/maven2/org/apache/hadoop/hadoop-common/0.23.0-SNAPSHOT/hadoop-common-0.23.0-SNAPSHOT.jar
[ivy:resolve]  ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve]  ::          UNRESOLVED DEPENDENCIES         ::
[ivy:resolve]  ::::::::::::::::::::::::::::::::::::::::::::::
[ivy:resolve]  :: org.apache.hadoop#hadoop-common;0.23.0-SNAPSHOT: not found
[ivy:resolve]  ::::::::::::::::::::::::::::::::::::::::::::::

 

Thanks & regards,

Sravan kumar.