Re: Multiple Output Format - Unrecognizable Characters in Output File

2011-07-18 Thread Yaozhen Pan
Hi James,

Not sure if you meant to write both key and value as text.
key.write(output);
This line of code writes the long key in binary format, which might be the
reason you see unrecognizable characters in the output file.
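
As a rough sketch (not tested), a write() along these lines would emit plain
text instead; the output and separatorBytes fields refer to your LineWriter
code quoted below:

@Override
public void write(LongWritable key, Text value)
        throws IOException, InterruptedException {
    // Emit the key as its decimal string form rather than its raw binary encoding.
    output.write(String.valueOf(key.get()).getBytes("UTF-8"));
    output.write(separatorBytes);
    // Text.toString() gives the UTF-8 content without the length prefix
    // that value.write(output) would add.
    output.write(value.toString().getBytes("UTF-8"));
    output.write("\n".getBytes("UTF-8"));
}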

Yaozhen

On Mon, Jul 18, 2011 at 2:00 PM, Teng, James xt...@ebay.com wrote:


 Hi,

 I encountered a problem while trying to define my own MultipleOutputFormat class;
 here is the code below.

 import java.io.DataOutputStream;
 import java.io.IOException;
 import java.io.UnsupportedEncodingException;

 import org.apache.hadoop.fs.FSDataOutputStream;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.RecordWriter;
 import org.apache.hadoop.mapreduce.TaskAttemptContext;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 public class MultipleOutputFormat extends FileOutputFormat<LongWritable, Text> {

     public class LineWriter extends RecordWriter<LongWritable, Text> {

         private DataOutputStream output;
         private byte[] separatorBytes;

         public LineWriter(DataOutputStream output, String separator)
                 throws UnsupportedEncodingException {
             this.output = output;
             this.separatorBytes = separator.getBytes("UTF-8");
         }

         @Override
         public synchronized void close(TaskAttemptContext context)
                 throws IOException, InterruptedException {
             output.close();
         }

         @Override
         public void write(LongWritable key, Text value)
                 throws IOException, InterruptedException {
             System.out.println("key:" + key.get());
             System.out.println("value:" + value.toString());
             //output.writeLong(key.)
             //output.write(separatorBytes);
             //output.write(value.toString().getBytes("UTF-8"));
             //output.write("\n".getBytes("UTF-8"));
             //key.write(output);
             // LongWritable.write() / Text.write() use binary Writable serialization here.
             key.write(output);
             value.write(output);
             output.write("\n".getBytes("UTF-8"));
         }
     }

     private Path path;

     protected String generateFileNameForKeyValue(LongWritable key, Text value, String name) {
         return key + "" + Math.random();
     }

     @Override
     public RecordWriter<LongWritable, Text> getRecordWriter(TaskAttemptContext context)
             throws IOException, InterruptedException {
         path = getOutputPath(context);
         System.out.println("d");
         Path file = getDefaultWorkFile(context, "");
         FileSystem fs = file.getFileSystem(context.getConfiguration());
         FSDataOutputStream fileOut = fs.create(file, false);
         return new LineWriter(fileOut, "\t");
     }
 }

 However, unrecognizable characters appear in the output file.

 Has anyone encountered this problem before? Any comment is greatly
 appreciated; thanks in advance.


 James, Teng (Teng Linxiao)
 eRL, CDC, eBay, Shanghai
 Extension: 86-21-28913530
 MSN: tenglinx...@hotmail.com
 Skype: James,Teng
 Email: xt...@ebay.com



Does hadoop local mode support running multiple jobs in different threads?

2011-07-01 Thread Yaozhen Pan
Hi,

I am not sure if this question (as in the title) has been asked before, but I
didn't find an answer by googling.

I'd like to explain the scenario of my problem:
My program launches several threads at the same time, and each thread submits a
Hadoop job and waits for the job to complete.
The unit tests were run in local mode, on a mini-cluster, and on a real Hadoop
cluster.
I found that the unit tests may fail in local mode, but they always succeed on the
mini-cluster and the real Hadoop cluster.
When a unit test fails in local mode, the causes vary (stack traces are posted at
the end of this mail).

It seems that running multiple jobs from multiple threads is not supported in
local mode, is it?
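
For reference, here is a minimal sketch of the pattern described above (the job
names, paths, and the driver class are made up for illustration; the real code
differs):

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ParallelJobLauncher {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(3);
        for (int i = 0; i < 3; i++) {
            final int part = i;
            pool.submit(new Runnable() {
                public void run() {
                    try {
                        // Each thread builds its own Configuration and Job,
                        // submits it, and blocks until it completes.
                        Job job = new Job(new Configuration(), "grouping-" + part);
                        job.setJarByClass(ParallelJobLauncher.class);
                        // Mapper/Reducer classes would be set here as well.
                        FileInputFormat.addInputPath(job, new Path("in/part-" + part));
                        FileOutputFormat.setOutputPath(job, new Path("out/part-" + part));
                        job.waitForCompletion(true);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
    }
}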

Error 1:
2011-07-01 20:24:36,460 WARN  [Thread-38] mapred.LocalJobRunner
(LocalJobRunner.java:run(256)) - job_local_0001
java.io.FileNotFoundException: File
build/test/tmp/mapred/local/taskTracker/jobcache/job_local_0001/attempt_local_0001_m_00_0/output/spill0.out
does not exist.
at
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:361)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:142)
at
org.apache.hadoop.fs.RawLocalFileSystem.rename(RawLocalFileSystem.java:253)
at
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1447)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
at
org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:549)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:623)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:177)

Error 2:
2011-07-01 19:00:25,546 INFO  [Thread-32] fs.FSInputChecker
(FSInputChecker.java:readChecksumChunk(247)) - Found checksum error: b[3584,
4096]=696f6e69643c2f6e616d653e3c76616c75653e47302e4120636f696e636964656e63652047312e413c2f76616c75653e3c2f70726f70657274793e0a3c70726f70657274793e3c6e616d653e6d61707265642e6a6f622e747261636b65722e706572736973742e6a6f627374617475732e6469723c2f6e616d653e3c76616c75653e2f6a6f62747261636b65722f6a6f6273496e666f3c2f76616c75653e3c2f70726f70657274793e0a3c70726f70657274793e3c6e616d653e6d61707265642e6a61723c2f6e616d653e3c76616c75653e66696c653a2f686f6d652f70616e797a682f6861646f6f7063616c632f6275696c642f746573742f746d702f6d61707265642f73797374656d2f6a6f625f6c6f63616c5f303030332f6a6f622e6a61723c2f76616c75653e3c2f70726f70657274793e0a3c70726f70657274793e3c6e616d653e66732e73332e627565722e6469723c2f6e616d653e3c76616c75653e247b6861646f6f702e746d702e6469727d2f7c2f76616c75653e3c2f70726f70657274793e0a3c70726f70657274793e3c6e616d653e6a6f622e656e642e72657472792e617474656d7074733c2f6e616d653e3c76616c75653e303c2f76616c75653e3c2f70726f70657274793e0a3c70726f70657274793e3c6e616d653e66732e66696c652e696d706c3c2f6e616d653e3c76616c75653e6f
org.apache.hadoop.fs.ChecksumException: Checksum error:
file:/home/hadoop-user/hadoop-proj/build/test/tmp/mapred/system/job_local_0003/job.xml
at 3584
at org.apache.hadoop.fs.FSInputChecker.verifySum(FSInputChecker.java:277)
at
org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:241)
at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
at java.io.DataInputStream.read(DataInputStream.java:83)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:49)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:87)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:209)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:142)
at
org.apache.hadoop.fs.LocalFileSystem.copyToLocalFile(LocalFileSystem.java:61)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1197)
at
org.apache.hadoop.mapred.LocalJobRunner$Job.init(LocalJobRunner.java:92)
at
org.apache.hadoop.mapred.LocalJobRunner.submitJob(LocalJobRunner.java:373)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:800)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:448)
at hadoop.GroupingRunnable.run(GroupingRunnable.java:126)
at java.lang.Thread.run(Thread.java:619)


Re: Hadoop eclipse plugin stopped working after replacing hadoop-0.20.2 jar files with hadoop-0.20-append jar files

2011-06-22 Thread Yaozhen Pan
Hi,

I am using Eclipse Helios Service Release 2.
I encountered a similar problem (the map/reduce perspective failed to load) when
upgrading the eclipse plugin from 0.20.2 to the 0.20.3-append version.

I compared the source code of the two eclipse plugin versions and found only a few
differences. I tried reverting the differences one by one to see whether it would
work.
What surprised me was that when I only reverted the jar name from
hadoop-0.20.3-eclipse-plugin.jar to hadoop-0.20.2-eclipse-plugin.jar, it
worked in eclipse.

Yaozhen


On Thu, Jun 23, 2011 at 1:22 AM, praveenesh kumar praveen...@gmail.comwrote:

 I am doing that.. it's not working. If I replace the hadoop-core jar inside the
 hadoop-plugin.jar, I am not able to see the map-reduce perspective at all.
 Guys, any help?!

 Thanks,
 Praveenesh

 On Wed, Jun 22, 2011 at 12:34 PM, Devaraj K devara...@huawei.com wrote:

  Every time hadoop builds, it also builds the hadoop eclipse plug-in
  using the latest hadoop core jar. In your case the eclipse plug-in contains
  one version of the jar and the cluster is running with another version. That's
  why it is giving the version mismatch error.
 
 
 
  Just replace the hadoop-core jar in your eclipse plug-in with whatever jar
  the hadoop cluster is using, and check.
 
 
 
  Devaraj K
 
 
  From: praveenesh kumar [mailto:praveen...@gmail.com]
  Sent: Wednesday, June 22, 2011 12:07 PM
  To: common-user@hadoop.apache.org; devara...@huawei.com
  Subject: Re: Hadoop eclipse plugin stopped working after replacing
  hadoop-0.20.2 jar files with hadoop-0.20-append jar files
 
 
 
   I followed Michael Noll's tutorial for building the hadoop-0.20-append jars:
 
 
 
 http://www.michael-noll.com/blog/2011/04/14/building-an-hadoop-0-20-x-versio
  n-for-hbase-0-90-2/
 
  After following the article, we get 5 jar files which we need to use to replace
  the hadoop-0.20.2 jar files.
  There is no jar file for the hadoop-eclipse plugin that I can see in my
  repository if I follow that tutorial.
 
  Also, the hadoop plugin I am using has no info on JIRA MAPREDUCE-1280
  regarding whether it is compatible with hadoop-0.20-append.

  Has anyone else faced this kind of issue?
 
  Thanks,
  Praveenesh
 
 
 
  On Wed, Jun 22, 2011 at 11:48 AM, Devaraj K devara...@huawei.com
 wrote:
 
  The Hadoop eclipse plugin also uses the hadoop-core.jar file to communicate with
  the hadoop cluster. For this it needs the same version of hadoop-core.jar
  on the client as well as the server (hadoop cluster).

  Update the hadoop eclipse plugin for your eclipse to the one provided with the
  hadoop-0.20-append release; it will work fine.
 
 
  Devaraj K
 
  -Original Message-
  From: praveenesh kumar [mailto:praveen...@gmail.com]
  Sent: Wednesday, June 22, 2011 11:25 AM
  To: common-user@hadoop.apache.org
  Subject: Hadoop eclipse plugin stopped working after replacing
  hadoop-0.20.2
  jar files with hadoop-0.20-append jar files
 
 
  Guys,
  I was using the hadoop eclipse plugin on a hadoop 0.20.2 cluster, and
  it was working fine for me.
  I was using Eclipse SDK Helios 3.6.2 with the plugin
  hadoop-eclipse-plugin-0.20.3-SNAPSHOT.jar downloaded from JIRA
  MAPREDUCE-1280.
 
  Now for the HBase installation I had to use the hadoop-0.20-append compiled
  jars, and I had to replace the old jar files with the new 0.20-append compiled
  jar files.
  But now, after replacing them, my hadoop eclipse plugin is not working well for
  me.
  Whenever I try to connect to my hadoop master node from the plugin and try
  to see DFS locations,
  it gives me the following error:

  Error: Protocol org.apache.hadoop.hdfs.protocol.clientprotocol version
  mismatch (client 41 server 43)

  However, the hadoop cluster is working fine if I go directly to the hadoop
  namenode and use hadoop commands.
  I can add files to HDFS and run jobs from there; the HDFS web console and
  Map-Reduce web console are also working fine, but I am not able to use my
  previous hadoop eclipse plugin.
 
  Any suggestions or help for this issue ?
 
  Thanks,
  Praveenesh
 
 
 
 



Re: Hadoop eclipse plugin stopped working after replacing hadoop-0.20.2 jar files with hadoop-0.20-append jar files

2011-06-22 Thread Yaozhen Pan
Hi,

Our hadoop version was built on 0.20-append with a few patches.
However, I didn't see big differences in the eclipse plugin.

Yaozhen

On Thu, Jun 23, 2011 at 11:29 AM, 叶达峰 (Jack Ye) kobe082...@qq.com wrote:

 Do you use hadoop 0.20.203.0?
 I also have a problem with this plugin.




Re: Make reducer task exit early

2011-06-04 Thread Yaozhen Pan
It can be achieved by overriding Reducer.run() in the new mapreduce API.
But I don't know how to achieve it in the old API.
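
As a rough, untested sketch of what I mean (the key/value types, the 1000
limit, and the summing logic are just assumptions to match your description):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer that stops consuming input once it has emitted
// MAX_RECORDS groups.
public class TopNSumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    private static final int MAX_RECORDS = 1000;
    private int written = 0;

    @Override
    protected void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable v : values) {
            sum += v.get();
        }
        context.write(key, new LongWritable(sum));
        written++;
    }

    @Override
    public void run(Context context) throws IOException, InterruptedException {
        setup(context);
        // Stop pulling keys once enough records have been written,
        // instead of reading the rest of the input.
        while (written < MAX_RECORDS && context.nextKey()) {
            reduce(context.getCurrentKey(), context.getValues(), context);
        }
        cleanup(context);
    }
}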

On Sat, Jun 4, 2011 at 8:14 AM, Aaron Baff aaron.b...@telescope.tv wrote:

 Is there a way to make a Reduce task exit early, before it has finished
 reading all of its data? Basically I'm doing a group-by with a sum, and I
 only want to return, say, the top 1000 records. So I have a local int field
 to keep track of how many records have currently been written to the output,
 and as soon as that count is exceeded, I simply return at the top of the
 reduce() function.

 Is there any way to optimize it even more, to tell the Reduce task: stop
 reading data, I don't need any more?

 --Aaron