Re: Multiple Output Format - Unrecognizable Characters in Output File

2011-07-18 Thread Yaozhen Pan
Hi James,

Not sure if you meant to write both the key and the value as text.
key.write(output);
This line of code writes the long key in binary format, which is likely the
reason you saw unrecognizable characters in the output file.
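
For example, a minimal sketch of a text-based write(), reusing the output and
separatorBytes fields from your quoted code below (the UTF-8 encoding and tab
separator are assumptions):

  @Override
  public void write(LongWritable key, Text value)
      throws IOException, InterruptedException {
    // Write the key as its decimal string form rather than its raw 8-byte encoding.
    output.write(String.valueOf(key.get()).getBytes("UTF-8"));
    output.write(separatorBytes);
    // Text already holds UTF-8 bytes; write only the valid portion of its buffer.
    output.write(value.getBytes(), 0, value.getLength());
    output.write("\n".getBytes("UTF-8"));
  }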

Yaozhen

On Mon, Jul 18, 2011 at 2:00 PM, Teng, James xt...@ebay.com wrote:

 Hi,

 I encountered a problem while trying to define my own MultipleOutputFormat
 class; here is the code below.

 public class MultipleOutputFormat extends FileOutputFormat<LongWritable, Text> {

   public class LineWriter extends RecordWriter<LongWritable, Text> {

     private DataOutputStream output;
     private byte separatorBytes[];

     public LineWriter(DataOutputStream output, String separator)
         throws UnsupportedEncodingException {
       this.output = output;
       this.separatorBytes = separator.getBytes("UTF-8");
     }

     @Override
     public synchronized void close(TaskAttemptContext context)
         throws IOException, InterruptedException {
       output.close();
     }

     @Override
     public void write(LongWritable key, Text value)
         throws IOException, InterruptedException {
       System.out.println("key:" + key.get());
       System.out.println("value:" + value.toString());
       //output.writeLong(key.)
       //output.write(separatorBytes);
       //output.write(value.toString().getBytes("UTF-8"));
       //output.write("\n".getBytes("UTF-8"));
       //key.write(output);
       key.write(output);     // writes the long key in its binary form
       value.write(output);   // writes the value with a length prefix, then its bytes
       output.write("\n".getBytes("UTF-8"));
     }
   }

   private Path path;

   protected String generateFileNameForKeyValue(LongWritable key, Text value, String name) {
     return key + "" + Math.random();   // key plus a random suffix
   }

   @Override
   public RecordWriter<LongWritable, Text> getRecordWriter(TaskAttemptContext context)
       throws IOException, InterruptedException {
     path = getOutputPath(context);
     System.out.println("d");   // debug print
     Path file = getDefaultWorkFile(context, "");
     FileSystem fs = file.getFileSystem(context.getConfiguration());
     FSDataOutputStream fileOut = fs.create(file, false);
     return new LineWriter(fileOut, "\t");
   }
 }

 However, there is a problem with unrecognizable characters appearing in the
 output file.

 Has anyone encountered this problem before? Any comment is greatly
 appreciated; thanks in advance.

 James, Teng (Teng Linxiao)
 eRL, CDC, eBay, Shanghai
 Extension: 86-21-28913530
 MSN: tenglinx...@hotmail.com
 Skype: James,Teng
 Email: xt...@ebay.com



Re: [Doubt]: Submission of Mapreduce from outside Hadoop Cluster

2011-07-01 Thread Yaozhen Pan
Narayanan,

Regarding the client installation, you should make sure that the client and
the server use the same Hadoop version for submitting jobs and transferring data.
If you use a different user on the client than the one that runs the Hadoop jobs,
configure the Hadoop UGI property (sorry, I forget the exact name).

On 2011-7-1 15:28, Narayanan K knarayana...@gmail.com wrote:
 Hi Harsh

 Thanks for the quick response...

 I have a few clarifications regarding the 1st point:

 Let me give the background first.

 We have actually set up a Hadoop cluster with HBase installed. We are
 planning to load HBase with data, perform some computations with the data,
 and present the data in a report format.
 The report should be accessible from outside the cluster, and the report
 accepts certain parameters to show data; it will in turn pass these
 parameters to the hadoop master server, where a mapreduce job will be run
 that queries HBase to retrieve the data.

 So the report will be run from a different machine outside the cluster, and
 we need a way to pass the parameters to the hadoop cluster (master) and
 initiate a mapreduce job dynamically. Similarly, the output of the mapreduce
 job needs to be tunneled back to the machine from where the report was run.

 Some more clarification I need: does the machine (outside the cluster)
 which runs the report require something like a client installation that
 will talk with the Hadoop master server via TCP? Or can it run a job on the
 hadoop server by using passwordless scp to the master machine, or something
 of the like?


 Regards,
 Narayanan




 On Fri, Jul 1, 2011 at 11:41 AM, Harsh J ha...@cloudera.com wrote:

 Narayanan,


 On Fri, Jul 1, 2011 at 11:28 AM, Narayanan K knarayana...@gmail.com
 wrote:
  Hi all,
 
  We are basically working on a research project and I require some help
  regarding this.

 Always glad to see research work being done! What're you working on? :)

  How do I submit a mapreduce job from outside the cluster i.e from a
  different machine outside the Hadoop cluster?

 If you use the Java APIs, use the Job#submit(…) method and/or the
 JobClient.runJob(…) method.
 Basically Hadoop will try to create a jar with all requisite classes
 within and will push it out to the JobTracker's filesystem (HDFS, if
 you run HDFS). From there on, it's like a regular operation.

 This even happens on the Hadoop nodes themselves, so doing it from an
 external place, as long as that place has access to Hadoop's JT and
 HDFS, should be no different at all.

 If you are packing custom libraries along, don't forget to use
 DistributedCache. If you are packing custom MR Java code, don't forget
 to use Job#setJarByClass/JobClient#setJarByClass and other appropriate
 API methods.
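
 As a rough sketch of the above (the host names, input/output paths, and the
 MyMapper/MyReducer classes are placeholders; the property names assume an
 0.20-style configuration):

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.Path;
   import org.apache.hadoop.mapreduce.Job;
   import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
   import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

   public class RemoteSubmit {
     public static void main(String[] args) throws Exception {
       Configuration conf = new Configuration();
       // Point the client at the cluster; both addresses are placeholders.
       conf.set("fs.default.name", "hdfs://namenode-host:9000");
       conf.set("mapred.job.tracker", "jobtracker-host:9001");

       Job job = new Job(conf, "remote-submit-example");
       job.setJarByClass(RemoteSubmit.class);   // so the job jar is built and shipped
       job.setMapperClass(MyMapper.class);      // placeholder mapper class
       job.setReducerClass(MyReducer.class);    // placeholder reducer class
       FileInputFormat.addInputPath(job, new Path(args[0]));
       FileOutputFormat.setOutputPath(job, new Path(args[1]));

       job.submit();   // returns immediately; use job.waitForCompletion(true) to block
     }
   }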

  If the above can be done, How can I schedule map reduce jobs to run in
  hadoop like crontab from a different machine?
  Are there any webservice APIs that I can leverage to access a hadoop
 cluster
  from outside and submit jobs or read/write data from HDFS.

 For scheduling jobs, have a look at Oozie: http://yahoo.github.com/oozie/
 It is well supported and is very useful for writing MR workflows (which
 is a common requirement). You also get coordinator features and can
 schedule jobs with crontab-like functionality.

 For HDFS r/w over the web, I'm not sure of an existing web app specifically
 for this purpose without limitations, but there is a contrib/thriftfs
 you can leverage (if not writing your own webserver in Java, in
 which case it's as simple as using the HDFS APIs).
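
 For the plain-Java route, a minimal sketch of remote HDFS read/write (the
 NameNode URI and the path are placeholders):

   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.FSDataInputStream;
   import org.apache.hadoop.fs.FSDataOutputStream;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;

   public class HdfsReadWrite {
     public static void main(String[] args) throws Exception {
       Configuration conf = new Configuration();
       conf.set("fs.default.name", "hdfs://namenode-host:9000");  // placeholder
       FileSystem fs = FileSystem.get(conf);

       Path p = new Path("/user/narayanan/report-params.txt");    // placeholder path
       FSDataOutputStream out = fs.create(p, true);               // overwrite if present
       out.writeBytes("param1=value1\n");
       out.close();

       FSDataInputStream in = fs.open(p);
       System.out.println(in.readLine());                         // read the line back
       in.close();
     }
   }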

 Also have a look at the pretty mature Hue project which aims to
 provide a great frontend that lets you design jobs, submit jobs,
 monitor jobs and upload files or browse the filesystem (among several
 other things): http://cloudera.github.com/hue/

 --
 Harsh J



Does hadoop local mode support running multiple jobs in different threads?

2011-07-01 Thread Yaozhen Pan
Hi,

I am not sure if this question (as in the title) has been asked before, but I
didn't find an answer by googling.

I'd like to explain the scenario of my problem:
My program launches several threads at the same time, and each thread submits
a hadoop job and waits for the job to complete.
The unit tests were run in local mode, on a mini-cluster, and on a real hadoop
cluster.
I found the unit tests may fail in local mode, but they always succeeded on the
mini-cluster and the real hadoop cluster.
When a unit test failed in local mode, the causes could differ (stack
traces are posted at the end of this mail).

It seems running multiple jobs from multiple threads is not supported in local
mode, is it?
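
For reference, a stripped-down sketch of that setup (createJob is a
hypothetical helper that builds each thread's Job; only the threading pattern
matters here):

  // needs java.util.*, java.util.concurrent.* and org.apache.hadoop.mapreduce.Job
  ExecutorService pool = Executors.newFixedThreadPool(4);
  List<Future<Boolean>> results = new ArrayList<Future<Boolean>>();
  for (int i = 0; i < 4; i++) {
    final int id = i;
    results.add(pool.submit(new Callable<Boolean>() {
      public Boolean call() throws Exception {
        Job job = createJob(id);            // hypothetical per-thread job setup
        return job.waitForCompletion(true); // each thread blocks on its own job
      }
    }));
  }
  for (Future<Boolean> r : results) {
    r.get();                                // surfaces a failure from any thread
  }
  pool.shutdown();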

Error 1:
2011-07-01 20:24:36,460 WARN  [Thread-38] mapred.LocalJobRunner
(LocalJobRunner.java:run(256)) - job_local_0001
java.io.FileNotFoundException: File
build/test/tmp/mapred/local/taskTracker/jobcache/job_local_0001/attempt_local_0001_m_00_0/output/spill0.out
does not exist.
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:142)
at
org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:253)
at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1447)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
at
org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:549)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:623)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)

Error 2:
2011-07-01 19:00:25,546 INFO  [Thread-32] fs.FSInputChecker
(FSInputChecker.java:readChecksumChunk(247)) - Found checksum error: b[3584,
4096]=696f6e69643c2f6e616d653e3c76616c75653e47302e4120636f696e636964656e63652047312e413c2f76616c75653e3c2f70726f70657274793e0a3c70726f70657274793e3c6e616d653e6d61707265642e6a6f622e747261636b65722e706572736973742e6a6f627374617475732e6469723c2f6e616d653e3c76616c75653e2f6a6f62747261636b65722f6a6f6273496e666f3c2f76616c75653e3c2f70726f70657274793e0a3c70726f70657274793e3c6e616d653e6d61707265642e6a61723c2f6e616d653e3c76616c75653e66696c653a2f686f6d652f70616e797a682f6861646f6f7063616c632f6275696c642f746573742f746d702f6d61707265642f73797374656d2f6a6f625f6c6f63616c5f303030332f6a6f622e6a61723c2f76616c75653e3c2f70726f70657274793e0a3c70726f70657274793e3c6e616d653e66732e73332e627565722e6469723c2f6e616d653e3c76616c75653e247b6861646f6f702e746d702e6469727d2f7c2f76616c75653e3c2f70726f70657274793e0a3c70726f70657274793e3c6e616d653e6a6f622e656e642e72657472792e617474656d7074733c2f6e616d653e3c76616c75653e303c2f76616c75653e3c2f70726f70657274793e0a3c70726f70657274793e3c6e616d653e66732e66696c652e696d706c3c2f6e616d653e3c76616c75653e6f
org.apache.hadoop.fs.ChecksumException: Checksum error:
file:/home/hadoop-user/hadoop-proj/build/test/tmp/mapred/system/job_local_0003/job.xml
at 3584
at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
at
org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
at java.io.DataInputStream.read(DataInputStream.java:83)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:49)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:87)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:209)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:142)
at
org.apache.hadoop.fs.LocalFileSystem.copyToLocalFile(LocalFileSystem.java:61)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1197)
at
org.apache.hadoop.mapred.LocalJobRunner$Job.init(LocalJobRunner.java:92)
at
org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:373)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:800)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:448)
at hadoop.GroupingRunnable.run(GroupingRunnable.java:126)
at java.lang.Thread.run(Thread.java:619)


Re: Hadoop eclipse plugin stopped working after replacing hadoop-0.20.2 jar files with hadoop-0.20-append jar files

2011-06-22 Thread Yaozhen Pan
Hi,

I am using Eclipse Helios Service Release 2.
I encountered a similar problem (the map/reduce perspective failed to load)
when upgrading the eclipse plugin from 0.20.2 to the 0.20.3-append version.

I compared the source code of the eclipse plugins and found only a few
differences. I tried to revert the differences one by one to see if it would
work.
What surprised me was that when I only reverted the jar name from
hadoop-0.20.3-eclipse-plugin.jar to hadoop-0.20.2-eclipse-plugin.jar, it
worked in eclipse.

Yaozhen


On Thu, Jun 23, 2011 at 1:22 AM, praveenesh kumar praveen...@gmail.com wrote:

 I am doing that.. it's not working. If I replace the hadoop-core inside
 hadoop-plugin.jar, I am not able to see the map-reduce perspective at all.
 Guys.. any help would be appreciated!

 Thanks,
 Praveenesh

 On Wed, Jun 22, 2011 at 12:34 PM, Devaraj K devara...@huawei.com wrote:

  Every time hadoop builds, it also builds the hadoop eclipse plug-in
  using the latest hadoop core jar. In your case the eclipse plug-in contains
  one version of the jar and the cluster is running another version. That's why
  it is giving the version mismatch error.
 
 
 
  Just replace the hadoop-core jar in your eclipse plug-in with whatever jar
  the hadoop cluster is using and check.
 
 
 
  Devaraj K
 
   _
 
  From: praveenesh kumar [mailto:praveen...@gmail.com]
  Sent: Wednesday, June 22, 2011 12:07 PM
  To: common-user@hadoop.apache.org; devara...@huawei.com
  Subject: Re: Hadoop eclipse plugin stopped working after replacing
  hadoop-0.20.2 jar files with hadoop-0.20-append jar files
 
 
 
   I followed michael noll's tutorial for making hadoop-0-20-append jars..
 
 
 
 http://www.michael-noll.com/blog/2011/04/14/building-an-hadoop-0-20-x-versio
  n-for-hbase-0-90-2/
 
  After following the article, we get 5 jar files which we need to use to
  replace the corresponding hadoop-0.20.2 jar files.
  There is no jar file for the hadoop-eclipse plugin that I can see in my
  repository if I follow that tutorial.
 
  Also the hadoop-plugin I am using..has no info on JIRA MAPREDUCE-1280
  regarding whether it is compatible with hadoop-0.20-append.
 
  Has anyone else faced this kind of issue?
 
  Thanks,
  Praveenesh
 
 
 
  On Wed, Jun 22, 2011 at 11:48 AM, Devaraj K devara...@huawei.com
 wrote:
 
  The Hadoop eclipse plugin also uses the hadoop-core.jar file to communicate
  with the hadoop cluster. For this it needs to have the same version of
  hadoop-core.jar on the client as on the server (hadoop cluster).

  Update the hadoop eclipse plugin for your eclipse to the one provided with
  the hadoop-0.20-append release, and it will work fine.
 
 
  Devaraj K
 
  -Original Message-
  From: praveenesh kumar [mailto:praveen...@gmail.com]
  Sent: Wednesday, June 22, 2011 11:25 AM
  To: common-user@hadoop.apache.org
  Subject: Hadoop eclipse plugin stopped working after replacing
  hadoop-0.20.2
  jar files with hadoop-0.20-append jar files
 
 
  Guys,
  I was using hadoop eclipse plugin on hadoop 0.20.2 cluster..
  It was working fine for me.
  I was using Eclipse SDK Helios 3.6.2 with the plugin
  hadoop-eclipse-plugin-0.20.3-SNAPSHOT.jar downloaded from JIRA
  MAPREDUCE-1280
 
  Now for the HBase installation I had to use hadoop-0.20-append compiled
  jars, and I had to replace the old jar files with the new 0.20-append
  compiled jar files.
  But now, after replacing them, my hadoop eclipse plugin is not working well
  for me.
  Whenever I am trying to connect to my hadoop master node from that and
 try
  to see DFS locations..
  it is giving me the following error:
  Error : Protocol org.apache.hadoop.hdfs.protocol.clientprotocol version
  mismatch (client 41 server 43)
 
  However, the hadoop cluster is working fine if I go directly to the hadoop
  namenode and use hadoop commands.
  I can add files to HDFS and run jobs from there; the HDFS web console and
  the Map-Reduce web console are also working fine, but I am not able to use
  my previous hadoop eclipse plugin.
 
  Any suggestions or help for this issue ?
 
  Thanks,
  Praveenesh
 
 
 
 



Re: Hadoop eclipse plugin stopped working after replacing hadoop-0.20.2 jar files with hadoop-0.20-append jar files

2011-06-22 Thread Yaozhen Pan
Hi,

Our hadoop version was built on 0.20-append with a few patches.
However, I didn't see big differences in the eclipse plugin.

Yaozhen

On Thu, Jun 23, 2011 at 11:29 AM, 叶达峰 (Jack Ye) kobe082...@qq.com wrote:

 Do you use hadoop 0.20.203.0?
 I also have a problem with this plugin.



Re: Make reducer task exit early

2011-06-04 Thread Yaozhen Pan
It can be achieved by overriding Reducer.run() in the new mapreduce API.
But I don't know how to achieve it with the old API.
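
A minimal sketch of that approach with the new API (the 1000-group cutoff and
the Text/LongWritable types are placeholders for your actual job):

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;

  public class TopNReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
    private static final int MAX_GROUPS = 1000;   // placeholder cutoff

    @Override
    public void run(Context context) throws IOException, InterruptedException {
      setup(context);
      int groups = 0;
      // Stop pulling keys once enough groups have been written, instead of
      // draining the rest of the input through reduce() calls.
      while (groups < MAX_GROUPS && context.nextKey()) {
        reduce(context.getCurrentKey(), context.getValues(), context);
        groups++;
      }
      cleanup(context);
    }

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
        throws IOException, InterruptedException {
      long sum = 0;
      for (LongWritable v : values) {
        sum += v.get();
      }
      context.write(key, new LongWritable(sum));
    }
  }

Note that this only skips processing: the framework has already shuffled and
sorted the reducer's full input by the time run() starts.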

On Sat, Jun 4, 2011 at 8:14 AM, Aaron Baff aaron.b...@telescope.tv wrote:

 Is there a way to make a Reduce task exit early, before it has finished
 reading all of its data? Basically I'm doing a group-by with a sum, and I
 only want to return, say, the top 1000 records. So I have a local class int
 variable to keep track of how many have currently been written to the output,
 and as soon as that is exceeded, I simply return at the top of the reduce()
 function.

 Is there any way to optimize it even more, to tell the Reduce task: stop
 reading data, I don't need any more?

 --Aaron



Re: Unable to start hadoop-0.20.2 but able to start hadoop-0.20.203 cluster

2011-05-31 Thread Yaozhen Pan
How many datanodes are in your cluster? And what is the value of
dfs.replication in hdfs-site.xml (if not specified, the default value is 3)?

From the error log, it seems there are not enough datanodes to replicate the
files in HDFS.
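
If it helps, a quick way to see what the client-side configuration reports (a
sketch; it assumes the cluster's config files are on the classpath):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;

  public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      // dfs.replication defaults to 3; a cluster with fewer live datanodes than
      // this cannot fully replicate newly written blocks.
      System.out.println("dfs.replication = " + conf.getInt("dfs.replication", 3));
      System.out.println("filesystem default replication = " + fs.getDefaultReplication());
    }
  }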

On 2011-5-31 22:23, Harsh J ha...@cloudera.com wrote:
Xu,

Please post the output of `hadoop dfsadmin -report` and attach the
tail of a started DN's log?


On Tue, May 31, 2011 at 7:44 PM, Xu, Richard richard...@citi.com wrote:
 2. Also, Configured Cap...
This might easily be the cause. I'm not sure if its a Solaris thing
that can lead to this though.


 3. in datanode server, no error in logs, but tasktracker logs has the
following suspicious thing:...
I don't see any suspicious log message in what you'd posted. Anyhow,
the TT does not matter here.

--
Harsh J