Re: Jobs stalling forever

2009-03-10 Thread Amareshwari Sriramadasu

This is due to HADOOP-5233, which is fixed in the 0.19.2 branch.

-Amareshwari
Nathan Marz wrote:
Every now and then, I have jobs that stall forever with one map task 
remaining. The last map task remaining says it is at "100%" and in the 
logs, it says it is in the process of committing. However, the task 
never times out, and the job just sits there forever. Has anyone else 
seen this? Is there a JIRA ticket open for it already?




Re: How to increase replication factor

2009-03-10 Thread Edwin Chu
Thank you very much.

On Tue, Mar 10, 2009 at 6:58 PM, Tamir Kamara  wrote:

> You can use the setrep option to (re)set the replication of specific files
> and directories. More details can be found here:
> http://hadoop.apache.org/core/docs/current/hdfs_shell.html#setrep
>
>
> On Tue, Mar 10, 2009 at 12:28 PM, Edwin Chu  wrote:
>
> > Hi
> > I am adding some new nodes to a Hadoop cluster and trying to increase the
> > replication factor. I changed the replication factor value in
> > hadoop-site.xml and then restarted the cluster using the stop-all.sh and
> > start-all.sh scripts. Then I ran hadoop fsck. It reports that the fs is
> > healthy, but I found that the Average block replication value is less
> than
> > the configured replication factor. I guess the existing blocks are not
> > re-replicated after changing the replication factor. Can I force the
> > existing blocks to be replicated according to the new replication factor?
> >
> > Regards
> > Edwin
> >
>


Why is a large number of [(heavy) key, (light) value] pairs faster than [(light) key, (heavy) value]?

2009-03-10 Thread Gyanit

I have a large number of key/value pairs, and I don't actually care whether the
data goes in the value or the key. Let me be more exact:
the number of (k,v) pairs after the combiner is about 1 million, and I have
approximately 1 KB of data for each pair. I can put that data in the keys or in
the values.
I have experimented with both options, (heavy key, light value) vs. (light
key, heavy value). It turns out that the (hk, lv) option is much, much faster
than (lk, hv).
Has someone else also noticed this?
Is there a way to make things faster with the (light key, heavy value) option,
since some applications will need that as well?
Remember that in both cases we are talking about at least a dozen or so million
pairs.
The time difference is in the shuffle phase, which is weird since the amount of
data transferred is the same.

-gyanit
-- 
View this message in context: 
http://www.nabble.com/Why-is-large-number-of---%28heavy%29-keys-%2C-%28light%29-value--faster-than-%28light%29key-%2C-%28heavy%29-value-tp22447877p22447877.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Error while putting data onto hdfs

2009-03-10 Thread Amandeep Khurana
I was trying to put a 1 gig file onto HDFS and I got the following error:

09/03/10 18:23:16 WARN hdfs.DFSClient: DataStreamer Exception:
java.net.SocketTimeoutException: 5000 millis timeout while waiting for
channel to be ready for write. ch :
java.nio.channels.SocketChannel[connected local=/171.69.102.53:34414 remote=/171.69.102.51:50010]
at
org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:162)
at
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:146)
at
org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:107)
at java.io.BufferedOutputStream.write(Unknown Source)
at java.io.DataOutputStream.write(Unknown Source)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2209)

09/03/10 18:23:16 WARN hdfs.DFSClient: Error Recovery for block
blk_2971879428934911606_36678 bad datanode[0] 171.69.102.51:50010
put: All datanodes 171.69.102.51:50010 are bad. Aborting...
Exception closing file /user/amkhuran/221rawdata/1g
java.io.IOException: Filesystem closed
at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:198)
at org.apache.hadoop.hdfs.DFSClient.access$600(DFSClient.java:65)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.closeInternal(DFSClient.java:3084)
at
org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.close(DFSClient.java:3053)
at
org.apache.hadoop.hdfs.DFSClient$LeaseChecker.close(DFSClient.java:942)
at org.apache.hadoop.hdfs.DFSClient.close(DFSClient.java:210)
at
org.apache.hadoop.hdfs.DistributedFileSystem.close(DistributedFileSystem.java:243)
at org.apache.hadoop.fs.FsShell.close(FsShell.java:1842)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:1856)


What's going wrong?

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


Re: question about released version id

2009-03-10 Thread ChihChun Chu
Thanks a lot, Owen and Rasit.
Is there any rule for deciding whether incompatible API changes go into trunk
or into a branch?
Is the criterion time, the importance of the changes, the number of changes, or
something else?

Chu

2009/3/10 Owen O'Malley 

> On Mar 2, 2009, at 11:46 PM, 鞠適存 wrote:
>
>> I wonder how to make the hadoop version number.
>>
>
> 0.18, 0.19 and 0.20 each have their own branch. The first release on each
> branch is 0.X.0, then 0.X.1 and so on. New features only go into trunk, and
> only important bug fixes go into the branches. So there will be no new
> functionality going from 0.X.1 to 0.X.2, but there will be when going from a
> release of 0.X to 0.X+1.
>
> -- Owen


Re: Extending ClusterMapReduceTestCase

2009-03-10 Thread jason hadoop
The other goofy thing is that the XML parser that is commonly first on the
classpath validates XML in a way that is opposite to what Jetty wants.

This line in the preamble before the ClusterMapReduceTestCase setup takes
care of the XML errors:

System.setProperty("javax.xml.parsers.SAXParserFactory","org.apache.xerces.jaxp.SAXParserFactoryImpl");


On Tue, Mar 10, 2009 at 2:28 PM, jason hadoop wrote:

> There are a couple of failures that happen in tests derived from
> ClusterMapReduceTestCase that are run outside of the Hadoop unit test
> framework.
>
> The basic issue is that the unit test doesn't have the benefit of a runtime
> environment set up by the bin/hadoop script.
>
> The classpath is usually missing the lib/jetty-ext/*.jar files, and doesn't
> get the conf/hadoop-default.xml and conf/hadoop-site.xml.
> The *standard* properties are also unset: hadoop.log.dir,
> hadoop.log.file, hadoop.home.dir, hadoop.id.str, hadoop.root.logger.
>
> I find that I can get away with just defining hadoop.log.dir.


>
> You can read about this in detail in the chapter on unit testing map/reduce
> jobs in my book, out real soon now :)
>
>
>
>
> On Tue, Mar 10, 2009 at 12:08 PM, Brian Forney wrote:
>
>> Hi all,
>>
>> I'm trying to write a JUnit test case that extends
>> ClusterMapReduceTestCase
>> to test some code I've written to ease job submission and monitoring
>> between
>> some existing code. Unfortunately, I see the following problem and cannot
>> find the jetty 5.1.4 code anywhere online. Any ideas about why this is
>> happening?
>>
>>[junit] Testsuite: com.integral7.batch.hadoop.test.TestJobController
>>[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 1.384 sec
>>[junit]
>>[junit] - Standard Output ---
>>[junit] 2009-03-10 12:52:26,303 [main] ERROR
>>
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java
>> :290) - FSNamesystem initialization failed.
>>[junit] java.io.IOException: Problem starting http server
>>[junit] at
>> org.apache.hadoop.http.HttpServer.start(HttpServer.java:343)
>>[junit] at
>>
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.
>> java:379)
>>[junit] at
>>
>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java
>> :288)
>>[junit] at
>>
>> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:163
>> )
>>[junit] at
>> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:208)
>>[junit] at
>> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:194)
>>[junit] at
>>
>> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java
>> :859)
>>[junit] at
>> org.apache.hadoop.hdfs.MiniDFSCluster.(MiniDFSCluster.java:275)
>>[junit] at
>> org.apache.hadoop.hdfs.MiniDFSCluster.(MiniDFSCluster.java:119)
>>[junit] at
>>
>> org.apache.hadoop.mapred.ClusterMapReduceTestCase.startCluster(ClusterMapRed
>> uceTestCase.java:81)
>>[junit] at
>>
>> org.apache.hadoop.mapred.ClusterMapReduceTestCase.setUp(ClusterMapReduceTest
>> Case.java:56)
>>[junit] at
>>
>> com.integral7.batch.hadoop.test.TestJobController.setUp(TestJobController.ja
>> va:49)
>>[junit] at junit.framework.TestCase.runBare(TestCase.java:132)
>>[junit] at
>> junit.framework.TestResult$1.protect(TestResult.java:110)
>>[junit] at
>> junit.framework.TestResult.runProtected(TestResult.java:128)
>>[junit] at junit.framework.TestResult.run(TestResult.java:113)
>>[junit] at junit.framework.TestCase.run(TestCase.java:124)
>>[junit] at junit.framework.TestSuite.runTest(TestSuite.java:232)
>>[junit] at junit.framework.TestSuite.run(TestSuite.java:227)
>>[junit] at
>>
>> org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:81
>> )
>>[junit] at
>> junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:36)
>>[junit] at
>>
>> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRu
>> nner.java:421)
>>[junit] at
>>
>> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTes
>> tRunner.java:912)
>>[junit] at
>>
>> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestR
>> unner.java:766)
>>[junit] Caused by:
>> org.mortbay.util.MultiException[org.xml.sax.SAXParseException: The
>> processing instruction target matching "[xX][mM][lL]" is not allowed.,
>> org.xml.sax.SAXParseException: The processing instruction target matching
>> "[xX][mM][lL]" is not allowed.]
>>[junit] at org.mortbay.http.HttpServer.doStart(HttpServer.java:731)
>>[junit] at org.mortbay.util.Container.start(Container.java:72)
>>[junit] at
>> org.apache.hadoop.http.HttpServer.start(HttpServer.java:321)
>>[junit] ... 23 more
>>[junit] -  ---
>>[junit] Testc

Re: Extending ClusterMapReduceTestCase

2009-03-10 Thread jason hadoop
There are a couple of failures that happen in tests derived from
ClusterMapReduceTestCase that are run outside of the Hadoop unit test
framework.

The basic issue is that the unit test doesn't have the benefit of a runtime
environment set up by the bin/hadoop script.

The classpath is usually missing the lib/jetty-ext/*.jar files, and doesn't
get the conf/hadoop-default.xml and conf/hadoop-site.xml.
The *standard* properties are also unset: hadoop.log.dir, hadoop.log.file,
hadoop.home.dir, hadoop.id.str, hadoop.root.logger.

I find that I can get away with just defining hadoop.log.dir.

You can read about this in detail in the chapter on unit testing map/reduce
jobs in my book, out real soon now :)



On Tue, Mar 10, 2009 at 12:08 PM, Brian Forney wrote:

> Hi all,
>
> I'm trying to write a JUnit test case that extends ClusterMapReduceTestCase
> to test some code I've written to ease job submission and monitoring
> between
> some existing code. Unfortunately, I see the following problem and cannot
> find the jetty 5.1.4 code anywhere online. Any ideas about why this is
> happening?
>
>[junit] Testsuite: com.integral7.batch.hadoop.test.TestJobController
>[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 1.384 sec
>[junit]
>[junit] - Standard Output ---
>[junit] 2009-03-10 12:52:26,303 [main] ERROR
>
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java
> :290) - FSNamesystem initialization failed.
>[junit] java.io.IOException: Problem starting http server
>[junit] at
> org.apache.hadoop.http.HttpServer.start(HttpServer.java:343)
>[junit] at
>
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.
> java:379)
>[junit] at
>
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java
> :288)
>[junit] at
>
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:163
> )
>[junit] at
> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:208)
>[junit] at
> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:194)
>[junit] at
>
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java
> :859)
>[junit] at
> org.apache.hadoop.hdfs.MiniDFSCluster.(MiniDFSCluster.java:275)
>[junit] at
> org.apache.hadoop.hdfs.MiniDFSCluster.(MiniDFSCluster.java:119)
>[junit] at
>
> org.apache.hadoop.mapred.ClusterMapReduceTestCase.startCluster(ClusterMapRed
> uceTestCase.java:81)
>[junit] at
>
> org.apache.hadoop.mapred.ClusterMapReduceTestCase.setUp(ClusterMapReduceTest
> Case.java:56)
>[junit] at
>
> com.integral7.batch.hadoop.test.TestJobController.setUp(TestJobController.ja
> va:49)
>[junit] at junit.framework.TestCase.runBare(TestCase.java:132)
>[junit] at junit.framework.TestResult$1.protect(TestResult.java:110)
>[junit] at
> junit.framework.TestResult.runProtected(TestResult.java:128)
>[junit] at junit.framework.TestResult.run(TestResult.java:113)
>[junit] at junit.framework.TestCase.run(TestCase.java:124)
>[junit] at junit.framework.TestSuite.runTest(TestSuite.java:232)
>[junit] at junit.framework.TestSuite.run(TestSuite.java:227)
>[junit] at
>
> org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:81
> )
>[junit] at
> junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:36)
>[junit] at
>
> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRu
> nner.java:421)
>[junit] at
>
> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTes
> tRunner.java:912)
>[junit] at
>
> org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestR
> unner.java:766)
>[junit] Caused by:
> org.mortbay.util.MultiException[org.xml.sax.SAXParseException: The
> processing instruction target matching "[xX][mM][lL]" is not allowed.,
> org.xml.sax.SAXParseException: The processing instruction target matching
> "[xX][mM][lL]" is not allowed.]
>[junit] at org.mortbay.http.HttpServer.doStart(HttpServer.java:731)
>[junit] at org.mortbay.util.Container.start(Container.java:72)
>[junit] at
> org.apache.hadoop.http.HttpServer.start(HttpServer.java:321)
>[junit] ... 23 more
>[junit] -  ---
>[junit] Testcase:
> testJobSubmission(com.integral7.batch.hadoop.test.TestJobController):
> Caused an ERROR
>[junit] Problem starting http server
>[junit] java.io.IOException: Problem starting http server
>[junit] at
> org.apache.hadoop.http.HttpServer.start(HttpServer.java:343)
>[junit] at
>
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.
> java:379)
>[junit] at
>
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java
> :288)
>[junit] at
>
> org.apache.hadoop.hdfs.

Jobs stalling forever

2009-03-10 Thread Nathan Marz
Every now and then, I have jobs that stall forever with one map task  
remaining. The last map task remaining says it is at "100%" and in the  
logs, it says it is in the process of committing. However, the task  
never times out, and the job just sits there forever. Has anyone else  
seen this? Is there a JIRA ticket open for it already?


Re: HDFS is corrupt, need to salvage the data.

2009-03-10 Thread Mayuran Yogarajah

Raghu Angadi wrote:

The block files usually don't disappear easily. Check on the datanode whether
you find any files starting with "blk". Also check the datanode log to see
what happened there... maybe you started it on a different directory or
something like that.

Raghu.
  


There are indeed blk files:
find -name 'blk*' | wc -l
158

I didn't see anything out of the ordinary in the datanode log.

At this point is there anything I can do to recover the files? Or do I
need to reformat the data node and load the data in again?

thanks


Re: HDFS is corrupt, need to salvage the data.

2009-03-10 Thread Raghu Angadi

Mayuran Yogarajah wrote:

lohit wrote:

How many Datanodes do you have?
From the output it looks like at the point when you ran fsck, you had
only one datanode connected to your NameNode. Did you have others?
Also, I see that your default replication is set to 1. Can you check
whether your datanodes are up and running.

Lohit


  
There is only one data node at the moment.  Does this mean the data is
not recoverable?
The HD on the machine seems fine so I'm a little confused as to what
caused the HDFS to become corrupted.


The block files usually don't disappear easily. Check on the datanode whether
you find any files starting with "blk". Also check the datanode log to see
what happened there... maybe you started it on a different directory or
something like that.


Raghu.


Why does LineRecordReader have 3 constructors?

2009-03-10 Thread Abdul Qadeer
org.apache.hadoop.mapred.LineRecordReader has 3 constructors.

The following one is normally used:

public LineRecordReader(Configuration job, FileSplit split) throws
IOException

But when are the following ones used?  I commented them out and re-compiled
the code without errors, so they are probably not used directly in the Hadoop
core code. Why are they there, then?


public LineRecordReader(InputStream in, long offset, long endOffset, int
maxLineLength)
public LineRecordReader(InputStream in, long offset, long endOffset,
Configuration job)
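
For what it's worth, here is the sort of use I could imagine for the
stream-based constructors -- purely my own sketch (the gzip wrapping and the
Long.MAX_VALUE end offset are assumptions), not something I found in the core
code: another reader hands LineRecordReader an already-opened stream instead
of a FileSplit.

import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.util.ReflectionUtils;

public class GzipLineReaderSketch {
    public static LineRecordReader open(Configuration conf, Path gzFile)
            throws IOException {
        FileSystem fs = gzFile.getFileSystem(conf);
        CompressionCodec codec =
                ReflectionUtils.newInstance(GzipCodec.class, conf);
        // Decompress the whole file and read it line by line from offset 0;
        // Long.MAX_VALUE just means "until the stream runs out".
        InputStream in = codec.createInputStream(fs.open(gzFile));
        return new LineRecordReader(in, 0L, Long.MAX_VALUE, conf);
    }
}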


Thanks,
Abdul Qadeer


Extending ClusterMapReduceTestCase

2009-03-10 Thread Brian Forney
Hi all,

I'm trying to write a JUnit test case that extends ClusterMapReduceTestCase
to test some code I've written to ease job submission and monitoring between
some existing code. Unfortunately, I see the following problem and cannot
find the jetty 5.1.4 code anywhere online. Any ideas about why this is
happening?

[junit] Testsuite: com.integral7.batch.hadoop.test.TestJobController
[junit] Tests run: 1, Failures: 0, Errors: 1, Time elapsed: 1.384 sec
[junit] 
[junit] - Standard Output ---
[junit] 2009-03-10 12:52:26,303 [main] ERROR
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java
:290) - FSNamesystem initialization failed.
[junit] java.io.IOException: Problem starting http server
[junit] at 
org.apache.hadoop.http.HttpServer.start(HttpServer.java:343)
[junit] at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.
java:379)
[junit] at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java
:288)
[junit] at 
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:163
)
[junit] at 
org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:208)
[junit] at 
org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:194)
[junit] at 
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java
:859)
[junit] at 
org.apache.hadoop.hdfs.MiniDFSCluster.(MiniDFSCluster.java:275)
[junit] at 
org.apache.hadoop.hdfs.MiniDFSCluster.(MiniDFSCluster.java:119)
[junit] at 
org.apache.hadoop.mapred.ClusterMapReduceTestCase.startCluster(ClusterMapRed
uceTestCase.java:81)
[junit] at 
org.apache.hadoop.mapred.ClusterMapReduceTestCase.setUp(ClusterMapReduceTest
Case.java:56)
[junit] at 
com.integral7.batch.hadoop.test.TestJobController.setUp(TestJobController.ja
va:49)
[junit] at junit.framework.TestCase.runBare(TestCase.java:132)
[junit] at junit.framework.TestResult$1.protect(TestResult.java:110)
[junit] at 
junit.framework.TestResult.runProtected(TestResult.java:128)
[junit] at junit.framework.TestResult.run(TestResult.java:113)
[junit] at junit.framework.TestCase.run(TestCase.java:124)
[junit] at junit.framework.TestSuite.runTest(TestSuite.java:232)
[junit] at junit.framework.TestSuite.run(TestSuite.java:227)
[junit] at 
org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:81
)
[junit] at 
junit.framework.JUnit4TestAdapter.run(JUnit4TestAdapter.java:36)
[junit] at 
org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.run(JUnitTestRu
nner.java:421)
[junit] at 
org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.launch(JUnitTes
tRunner.java:912)
[junit] at 
org.apache.tools.ant.taskdefs.optional.junit.JUnitTestRunner.main(JUnitTestR
unner.java:766)
[junit] Caused by:
org.mortbay.util.MultiException[org.xml.sax.SAXParseException: The
processing instruction target matching "[xX][mM][lL]" is not allowed.,
org.xml.sax.SAXParseException: The processing instruction target matching
"[xX][mM][lL]" is not allowed.]
[junit] at org.mortbay.http.HttpServer.doStart(HttpServer.java:731)
[junit] at org.mortbay.util.Container.start(Container.java:72)
[junit] at 
org.apache.hadoop.http.HttpServer.start(HttpServer.java:321)
[junit] ... 23 more
[junit] -  ---
[junit] Testcase:
testJobSubmission(com.integral7.batch.hadoop.test.TestJobController):
Caused an ERROR
[junit] Problem starting http server
[junit] java.io.IOException: Problem starting http server
[junit] at 
org.apache.hadoop.http.HttpServer.start(HttpServer.java:343)
[junit] at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.
java:379)
[junit] at 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java
:288)
[junit] at 
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:163
)
[junit] at 
org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:208)
[junit] at 
org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:194)
[junit] at 
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java
:859)
[junit] at 
org.apache.hadoop.hdfs.MiniDFSCluster.(MiniDFSCluster.java:275)
[junit] at 
org.apache.hadoop.hdfs.MiniDFSCluster.(MiniDFSCluster.java:119)
[junit] at 
org.apache.hadoop.mapred.ClusterMapReduceTestCase.startCluster(ClusterMapRed
uceTestCase.java:81)
[junit] at 
org.apache.hadoop.mapred.ClusterMapReduceTestCase.setUp(ClusterMapReduceTest
Case.java:56)
[junit] at 
com.integral7.batch.hadoop.test.TestJobController.setUp(TestJobController.ja
va:49)
[junit] Caused by:
org.mortbay.util.MultiException[org.xml.sax.SAXParseExcepti

streaming inputformat: class not found

2009-03-10 Thread t-alleyne

Hello,

I'm trying to run a MapReduce job on a data file in which the keys and values
alternate rows.  E.g.

key1
value1
key2
...

I've written my own InputFormat by extending FileInputFormat (the code for
this class is below).  The problem is that when I run hadoop streaming with
the command

bin/hadoop jar contrib/streaming/hadoop-0.18.3-streaming.jar -mapper
mapper.pl -reducer org.apache.hadoop.mapred.lib.IdentityReducer -input
test.data -output test-output -file  -inputformat
MyFormatter

I get the error

-inputformat : class not found : MyFormatter
java.lang.RuntimeException: -inputformat : class not found : MyFormatter
at org.apache.hadoop.streaming.StreamJob.fail(StreamJob.java:550)
...

I have tried putting the .java, .class, and .jar files of MyFormatter in the
job jar using the -file parameter.  I have also tried putting them on HDFS
using -copyFromLocal, but I still get the same error.  Can anyone give me
some hints as to what the problem might be?  Also, I tried to hack together
my formatter based on the Hadoop examples, so does it seem like it should
properly process the input files I described above?

Trevis




import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public final class MyFormatter extends FileInputFormat<Text, Text>
{
    @Override
    public RecordReader<Text, Text> getRecordReader( InputSplit split,
            JobConf job, Reporter reporter ) throws IOException
    {
        return new MyRecordReader( job, (FileSplit) split );
    }

    static class MyRecordReader implements RecordReader<Text, Text>
    {
        // Delegate the actual line reading to LineRecordReader; its
        // LongWritable byte-offset key is discarded into _junk because this
        // format's keys come from the data itself.
        private LineRecordReader _in   = null;
        private LongWritable     _junk = null;

        public MyRecordReader( JobConf job, FileSplit split ) throws IOException
        {
            _junk = new LongWritable();
            _in = new LineRecordReader( job, split );
        }

        @Override
        public void close() throws IOException
        {
            _in.close();
        }

        @Override
        public Text createKey()
        {
            return new Text();
        }

        @Override
        public Text createValue()
        {
            return new Text();
        }

        @Override
        public long getPos() throws IOException
        {
            return _in.getPos();
        }

        @Override
        public float getProgress() throws IOException
        {
            return _in.getProgress();
        }

        @Override
        public boolean next( Text key, Text value ) throws IOException
        {
            // Consume two physical lines per record: the first is the key,
            // the second is the value.
            if ( _in.next( _junk, key ) )
            {
                if ( _in.next( _junk, value ) )
                {
                    return true;
                }
            }

            key.clear();
            value.clear();

            return false;
        }
    }
}
-- 
View this message in context: 
http://www.nabble.com/streaming-inputformat%3A-class-not-found-tp22439420p22439420.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.



Re: HDFS is corrupt, need to salvage the data.

2009-03-10 Thread Mayuran Yogarajah

lohit wrote:

How many Datanodes do you have?
From the output it looks like at the point when you ran fsck, you had only one
datanode connected to your NameNode. Did you have others?
Also, I see that your default replication is set to 1. Can you check whether
your datanodes are up and running.
Lohit


  
There is only one data node at the moment.  Does this mean the data is
not recoverable?
The HD on the machine seems fine so I'm a little confused as to what
caused the HDFS to become corrupted.

M


Re: Support for zipped input files

2009-03-10 Thread Ken Weiner
Thanks very much, Tom.  You saved me a lot of time by confirming that it
isn't available yet.  I'll go vote for HADOOP-1824.

On Tue, Mar 10, 2009 at 3:23 AM, Tom White  wrote:

> Hi Ken,
>
> Unfortunately, Hadoop doesn't yet support MapReduce on zipped files
> (see https://issues.apache.org/jira/browse/HADOOP-1824), so you'll
> need to write a program to unzip them and write them into HDFS first.
>
> Cheers,
> Tom
>
> On Tue, Mar 10, 2009 at 4:11 AM, jason hadoop 
> wrote:
> > Hadoop has support for S3, the compression support is handled at another
> > level and should also work.
> >
> >
> > On Mon, Mar 9, 2009 at 9:05 PM, Ken Weiner  wrote:
> >
> >> I have a lot of large zipped (not gzipped) files sitting in an Amazon S3
> >> bucket that I want to process.  What is the easiest way to process them
> >> with
> >> a Hadoop map-reduce job?  Do I need to write code to transfer them out
> of
> >> S3, unzip them, and then move them to HDFS before running my job, or
> does
> >> Hadoop have support for processing zipped input files directly from S3?
> >>
> >
> >
> >
> > --
> > Alpha Chapters of my book on Hadoop are available
> > http://www.apress.com/book/view/9781430219422
> >
>


Re: Native Libraries

2009-03-10 Thread Stefan Podkowinski
I've been able to see this kind of output in the JobTracker web interface.
Open a job and drill down to one of the task logs and select 'All'.
It should be somewhere near the top of the output.
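
If you'd rather check programmatically than dig through task logs, a tiny
driver like the one below also works -- just a sketch on my part; run it
through bin/hadoop so the classpath and java.library.path are set up the same
way as for the daemons, and loading the class should print the same INFO/WARN
line from NativeCodeLoader.

import org.apache.hadoop.util.NativeCodeLoader;

public class NativeCheck {
    public static void main(String[] args) {
        // NativeCodeLoader tries to load libhadoop in a static initializer,
        // so simply touching the class triggers the load attempt.
        if (NativeCodeLoader.isNativeCodeLoaded()) {
            System.out.println("native-hadoop library loaded");
        } else {
            System.out.println("falling back to builtin-java classes");
        }
    }
}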

On Tue, Mar 10, 2009 at 2:52 PM, Tamir Kamara  wrote:
> Hi,
>
> I'm using hadoop 0.18.3 and I wish to see the status of hadoop native
> libraries. According to
> http://hadoop.apache.org/core/docs/r0.18.3/native_libraries.html I should be
> seeing something like:
> INFO util.NativeCodeLoader - Loaded the native-hadoop library
> or:
> INFO util.NativeCodeLoader - Unable to load native-hadoop library for your
> platform... using builtin-java classes where applicable
>
> I've scanned all the files in the log directory on several nodes but I can't
> locate anything about this.
>
> Why do you think I can't see anything in the logs?
> Do I need to do anything specific in order to use the libraries?
>
> Thanks,
> Tamir
>


Native Libraries

2009-03-10 Thread Tamir Kamara
Hi,

I'm using hadoop 0.18.3 and I wish to see the status of hadoop native
libraries. According to
http://hadoop.apache.org/core/docs/r0.18.3/native_libraries.html I should be
seeing something like:
INFO util.NativeCodeLoader - Loaded the native-hadoop library
or:
INFO util.NativeCodeLoader - Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable

I've scanned all the files in the log directory on several nodes but I can't
locate anything about this.

Why do you think I can't see anything in the logs?
Do I need to do anything specific in order to use the libraries?

Thanks,
Tamir


Re: Support for zipped input files

2009-03-10 Thread tim robertson
There is LZO support with a patch:
http://blog.oskarsson.nu/2009/03/hadoop-feat-lzo-save-disk-space-and.html

Cheers

Tim


On Tue, Mar 10, 2009 at 12:23 PM, Tom White  wrote:
> Hi Ken,
>
> Unfortunately, Hadoop doesn't yet support MapReduce on zipped files
> (see https://issues.apache.org/jira/browse/HADOOP-1824), so you'll
> need to write a program to unzip them and write them into HDFS first.
>
> Cheers,
> Tom
>
> On Tue, Mar 10, 2009 at 4:11 AM, jason hadoop  wrote:
>> Hadoop has support for S3, the compression support is handled at another
>> level and should also work.
>>
>>
>> On Mon, Mar 9, 2009 at 9:05 PM, Ken Weiner  wrote:
>>
>>> I have a lot of large zipped (not gzipped) files sitting in an Amazon S3
>>> bucket that I want to process.  What is the easiest way to process them
>>> with
>>> a Hadoop map-reduce job?  Do I need to write code to transfer them out of
>>> S3, unzip them, and then move them to HDFS before running my job, or does
>>> Hadoop have support for processing zipped input files directly from S3?
>>>
>>
>>
>>
>> --
>> Alpha Chapters of my book on Hadoop are available
>> http://www.apress.com/book/view/9781430219422
>>
>


Re: Support for zipped input files

2009-03-10 Thread Tom White
Hi Ken,

Unfortunately, Hadoop doesn't yet support MapReduce on zipped files
(see https://issues.apache.org/jira/browse/HADOOP-1824), so you'll
need to write a program to unzip them and write them into HDFS first.
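
A rough sketch of that unzip-and-load step is below -- my own illustration
only: it reads the zip from the local filesystem (so you'd fetch it out of S3
first), and the class name and paths are placeholders.

import java.io.FileInputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ZipToHdfs {
    public static void main(String[] args) throws Exception {
        String localZip = args[0];         // e.g. /tmp/input.zip
        Path hdfsDir = new Path(args[1]);  // e.g. /user/me/unzipped
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        ZipInputStream zin = new ZipInputStream(new FileInputStream(localZip));
        ZipEntry entry;
        while ((entry = zin.getNextEntry()) != null) {
            if (entry.isDirectory()) {
                continue;
            }
            // Each zip entry becomes one file in HDFS.
            FSDataOutputStream out = fs.create(new Path(hdfsDir, entry.getName()));
            // close=false: the zip stream must stay open for the next entry.
            IOUtils.copyBytes(zin, out, conf, false);
            out.close();
        }
        zin.close();
    }
}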

Cheers,
Tom

On Tue, Mar 10, 2009 at 4:11 AM, jason hadoop  wrote:
> Hadoop has support for S3, the compression support is handled at another
> level and should also work.
>
>
> On Mon, Mar 9, 2009 at 9:05 PM, Ken Weiner  wrote:
>
>> I have a lot of large zipped (not gzipped) files sitting in an Amazon S3
>> bucket that I want to process.  What is the easiest way to process them
>> with
>> a Hadoop map-reduce job?  Do I need to write code to transfer them out of
>> S3, unzip them, and then move them to HDFS before running my job, or does
>> Hadoop have support for processing zipped input files directly from S3?
>>
>
>
>
> --
> Alpha Chapters of my book on Hadoop are available
> http://www.apress.com/book/view/9781430219422
>


Re: How to increase replication factor

2009-03-10 Thread Tamir Kamara
You can use the setrep option to (re)set the replication of specific files
and directories. More details can be found here:
http://hadoop.apache.org/core/docs/current/hdfs_shell.html#setrep
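
If you'd rather do it from code, roughly the same thing via the FileSystem API
looks like the sketch below (the replication factor and starting path are
examples; it walks directories the way -setrep -R does):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RaiseReplication {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Example: raise everything under / to a replication factor of 3.
        setRecursively(fs, new Path("/"), (short) 3);
    }

    private static void setRecursively(FileSystem fs, Path path, short rep)
            throws Exception {
        FileStatus status = fs.getFileStatus(path);
        if (status.isDir()) {
            for (FileStatus child : fs.listStatus(path)) {
                setRecursively(fs, child.getPath(), rep);
            }
        } else {
            fs.setReplication(path, rep);  // per-file equivalent of setrep
        }
    }
}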


On Tue, Mar 10, 2009 at 12:28 PM, Edwin Chu  wrote:

> Hi
> I am adding some new nodes to a Hadoop cluster and trying to increase the
> replication factor. I changed the replication factor value in
> hadoop-site.xml and then restarted the cluster using the stop-all.sh and
> start-all.sh scripts. Then I ran hadoop fsck. It reports that the fs is
> healthy, but I found that the Average block replication value is less than
> the configured replication factor. I guess the existing blocks are not
> re-replicated after changing the replication factor. Can I force the
> existing blocks to be replicated according to the new replication factor?
>
> Regards
> Edwin
>


How to increase replication factor

2009-03-10 Thread Edwin Chu
Hi
I am adding some new nodes to a Hadoop cluster and trying to increase the
replication factor. I changed the replication factor value in
hadoop-site.xml and then restarted the cluster using the stop-all.sh and
start-all.sh scripts. Then I ran hadoop fsck. It reports that the fs is
healthy, but I found that the Average block replication value is less than
the configured replication factor. I guess the existing blocks are not
re-replicated after changing the replication factor. Can I force the
existing blocks to be replicated according to the new replication factor?

Regards
Edwin


Re: Batch processing map reduce jobs

2009-03-10 Thread Jimmy Wan
Check out Cascading, it worked great for me.
http://www.cascading.org/
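
If you only need to chain a couple of plain MapReduce jobs without pulling in
another library, the low-tech alternative is a driver that runs them back to
back with JobClient -- a sketch only; the mapper/reducer setup and the paths
below are placeholders:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TwoStepDriver {
    public static void main(String[] args) throws Exception {
        JobConf step1 = new JobConf(TwoStepDriver.class);
        step1.setJobName("step1");
        // step1.setMapperClass(...); step1.setReducerClass(...);
        FileInputFormat.setInputPaths(step1, new Path("/data/raw"));
        FileOutputFormat.setOutputPath(step1, new Path("/data/step1-out"));
        JobClient.runJob(step1);  // blocks until step 1 completes

        JobConf step2 = new JobConf(TwoStepDriver.class);
        step2.setJobName("step2");
        // step2.setMapperClass(...); step2.setReducerClass(...);
        FileInputFormat.setInputPaths(step2, new Path("/data/step1-out"));
        FileOutputFormat.setOutputPath(step2, new Path("/data/final"));
        JobClient.runJob(step2);  // consumes the first job's output
    }
}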

Jimmy Wan

On Thu, Mar 5, 2009 at 17:53, Richa Khandelwal  wrote:
> Hi All,
> Does anyone know how to run map reduce jobs using pipes or batch process map
> reduce jobs?