Successful hadoop/pig jobs didn't remove _temporary folder

2012-08-14 Thread Stanley Xu
Dear all,

We have met this issue randomly. Some of our jobs will occasionally leave
the _temporary folder behind in the output directory, and that causes the job
that depends on the output directory to fail, since the _temporary directory
cannot be processed correctly by the input format.

We are using hadoop 0.20.3 with some patches and pig 0.8.1. I am wondering
if someone has met the same issue?
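
One defensive workaround on the consuming side that we are considering (just a
sketch, not something discussed in this thread) is to set an explicit input
path filter on the downstream job, so that _temporary and other underscore- or
dot-prefixed paths are skipped even when the input path is given as a glob.
The class and usage below assume the new (mapreduce) API; names are
illustrative.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;

// Skips _temporary (and any other hidden, underscore/dot-prefixed paths)
// that may be left behind in the upstream job's output directory.
public class SkipTemporaryFilter implements PathFilter {
  public boolean accept(Path path) {
    String name = path.getName();
    return !name.startsWith("_") && !name.startsWith(".");
  }
}

// In the downstream job's driver:
//   FileInputFormat.setInputPathFilter(job, SkipTemporaryFilter.class);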

Best wishes,
Stanley Xu


Is there any way I could use ClusterMapReduceTestCase with the 0.20 new API?

2011-06-13 Thread Stanley Xu
Dear All,

I am trying to write a test case for my mapreduce job, which uses the new API
in 0.20 (the mapreduce package rather than the mapred package). Since the
mapper uses the distributed cache, I could not use mrunit for the test, so I
thought I could use ClusterMapReduceTestCase to set up a mini cluster for
testing.

But it looks like if I just call job.waitForCompletion, it tries to find the
input path on the local file system rather than on the HDFS created by the
mini cluster. I am wondering if there is anything I could do to run a hadoop
0.20 new-API job on the cluster created by ClusterMapReduceTestCase?
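
For reference, here is a minimal sketch of the kind of test I mean (mapper,
reducer and path names are made up); my expectation was that passing the mini
cluster's configuration from createJobConf() into the new-API Job would make
it resolve paths against the mini HDFS:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.ClusterMapReduceTestCase;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyJobTest extends ClusterMapReduceTestCase {
  public void testJobOnMiniCluster() throws Exception {
    // createJobConf() returns a JobConf wired to the mini cluster; since
    // JobConf extends Configuration, it can seed the new-API Job.
    Job job = new Job(createJobConf(), "test job");
    job.setJarByClass(MyJobTest.class);
    job.setMapperClass(MyMapper.class);    // MyMapper/MyReducer are illustrative
    job.setReducerClass(MyReducer.class);
    FileInputFormat.addInputPath(job, new Path("/test/input"));
    FileOutputFormat.setOutputPath(job, new Path("/test/output"));
    assertTrue(job.waitForCompletion(true));
  }
}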

Thanks in advance.

Best wishes,
Stanley Xu


How does the region server know if a block is moved from one datanode to another?

2011-05-15 Thread Stanley Xu
Dear all,

We were tracing an issue we have with our hbase cluster. We are almost sure
it is a network issue, since the problem seems to have disappeared after we
disabled ip_forward on all the machines and configured the routes the same
way. But we don't really know how these configurations might impact the
cluster.

The problem we have met can be found at the following link:
http://search-hadoop.com/m/ZpgJ623GoyU1/.META.+inconsistencysubj=The+META+data+inconsistency+issue
(The title does not really describe the issue, in fact.)

And by tracing the logs from the region server, data node and name node, I
also found something I have doubts about, from after we thought the issue was
fixed and before the issue appeared.

In a region server, I could still find logs showing that the RegionServer
tried to get a block from a data node which no longer serves that block.

I see the following log in the region server for block 5056551999889621449:
http://pastebin.com/epEt37JK

And the following log in the data node the region server tried to get the block from:
http://pastebin.com/pnif75rX

And the following log in the name node, which told the data node to delete the
block:
http://pastebin.com/rQ4QjUcS

And if I use fsck to check the file on hdfs, it shows 4 replicas, including
the data node that should already have deleted the block:
http://pastebin.com/2DecD9GD

But if I check that data node's local file system, I can see that the block
no longer exists in the local fs.

But after 6-7 hours, when I re-ran fsck, the data node that should have
deleted the block was no longer listed:
http://pastebin.com/014h3qNE

I am wondering whether this is correct behavior for hadoop and hbase? I am
using hadoop branch-0.20-append and hbase 0.20.6.

I am also wondering, short of reading all the code, whether there is a
document or tutorial describing how hadoop and hbase keep the data
synchronized, in more detail than the hbase book or the official documentation?

Best wishes,
Stanley Xu


Re: c++ program

2011-03-15 Thread Stanley Xu
http://wiki.apache.org/hadoop/C%2B%2BWordCount

The famous word count example in C++ can be found on the hadoop wiki.

Best wishes,
Stanley Xu



On Tue, Mar 15, 2011 at 7:09 PM, Manish Yadav manish.ya...@orkash.com wrote:

 hi
 can anyone tell me how to run a simple hello world program written in c++
 on hadoop? i know that hadoop is for map reduce, and hadoop uses pipes for
 c++.
 but just for experimentation purposes, can anybody tell me how to do this?
 can anyone tell me how to run a simple Hello World program in c or c++?





Re: setJarByClass question

2011-02-24 Thread Stanley Xu
The jar on the command line might only be the jar used to submit the
map-reduce job, rather than the jar that contains the Mapper and Reducer,
which is the one that gets shipped to the task nodes.

What hadoop jar your-jar really does is set up the classpath and other related
environment, then run the main method in your-jar. You might have a different
map-reduce jar on the classpath which contains the real mapper and reducer
used to do the job.
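
A rough sketch of what I mean (class names are illustrative, not from this
thread): the driver below could live in the jar passed to hadoop jar, while
MyMapper and MyReducer live in a different jar on the classpath; setJarByClass
picks the jar that actually contains them so it can be shipped to the task
nodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyDriver {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "example job");
    // Ship the jar that contains MyMapper, which may not be the jar named
    // on the hadoop jar command line.
    job.setJarByClass(MyMapper.class);
    job.setMapperClass(MyMapper.class);
    job.setReducerClass(MyReducer.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}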

Best wishes,
Stanley Xu



On Fri, Feb 25, 2011 at 7:23 AM, Mark Kerzner markkerz...@gmail.com wrote:

 Hi, this call,

 job.setJarByClass

 tells Hadoop which jar to use. But we also tell Hadoop which jar to use on
 the command line,

 hadoop jar your-jar parameters

 Why do we need this in both places?

 Thank you,
 Mark



Re: JobConf.setQueueName(xxx) with the new api using hadoop 0.20.2

2011-02-22 Thread Stanley Xu
Set mapreduce.queue.name in the Configuration object the job uses.
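
A minimal sketch of what I mean (the queue name is illustrative; note that in
0.20.x, JobConf.setQueueName() simply writes the mapred.job.queue.name
property, so setting that key on the Configuration should have the same effect
with the new-API Job):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class QueueExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Same property that JobConf.setQueueName() sets in 0.20.x.
    conf.set("mapred.job.queue.name", "my-queue");  // queue name is illustrative
    Job job = new Job(conf, "my job");
    // ... set mapper/reducer/paths, then submit as usual ...
  }
}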

On 2011-2-22 at 11:42 PM, Marc Sturlese marc.sturl...@gmail.com wrote:

 I'm trying to use the fair scheduler. I have jobs written using the new api
 and hadoop 0.20.2.
 I've seen that to associate a job with a queue you have to call:
 JobConf.setQueueName()
 The Job class of the new api does not have this method. How can I do that?
 Thanks in advance.

 --
 View this message in context:
http://lucene.472066.n3.nabble.com/JobConf-setQueueName-xxx-with-the-new-api-using-hadoop-0-20-2-tp2553042p2553042.html
 Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


How could I let my map-reduce job use the log4j.properties configuration in the jar file that contains the map-reduce classes?

2011-02-16 Thread Stanley Xu
Dear Buddies,

I am running a map-reduce job packaged in a jar file through the shell, like
the following:

#! /bin/sh
export HADOOP_CLASSPATH=/home/xuwh/log-fetcher.jar
export CLASSPATH=/home/xuwh/:$CLASSPATH
/opt/hadoop/bin/hadoop \
  com.companyname.context.processor.log.preprocessor.LogCleaner

In the LogCleaner class, besides submitting the map-reduce job, I also wait
for the job to complete and then send the result to a server for further
processing.

I added some logging through log4j in the log-uploading part, and I wanted to
receive the error logs through an SMTP appender, so I created my own
log4j.properties file in the jar that contains LogCleaner, but it didn't work.

I don't want to change the log configuration in hadoop itself, because I would
have to change the configuration on all nodes, and different map-reduce jars
might need different log4j configurations.

Is there any way I could make the log4j code in the jar file use the
log4j.properties inside the jar? Not the code in the map-reduce job itself,
but the code that sets up the job and the code that runs after the job is
completed.
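
One workaround I am considering (just an untested sketch; the class and
resource names follow the layout above but are illustrative) is to configure
log4j programmatically at the start of the driver's main method, from the
properties file bundled in the jar:

import java.net.URL;
import org.apache.log4j.PropertyConfigurator;

public class LogCleaner {
  public static void main(String[] args) throws Exception {
    // Load the log4j.properties packaged inside log-fetcher.jar and apply it
    // to this client JVM only; the map/reduce tasks keep Hadoop's own config.
    URL log4jConf = LogCleaner.class.getClassLoader().getResource("log4j.properties");
    if (log4jConf != null) {
      PropertyConfigurator.configure(log4jConf);
    }
    // ... submit the map-reduce job, wait for completion, upload results ...
  }
}

The idea is that this would only affect the client-side JVM that submits the
job and does the post-processing, which is the part I care about.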

Thanks.

Best wishes,
Stanley Xu