Is hdfs reliable? Very odd error

2010-08-13 Thread Raymond Jennings III
I copied a 230GB file into my hadoop cluster.  After my MR job kept failing I 
tracked down the error to one line of formatted text.

I copied the file back out of hdfs and when I compare it to the original file 
there are about 20 bytes on one line (out of 230GB) that are different.

Is there no CRC or checksum done when copying files into hdfs?

(Just to be clear, I copied the original file out of hdfs - not the output of 
my 
MR job.)
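
For what it's worth, HDFS does checksum data (by default a CRC32 for every io.bytes.per.checksum = 512 bytes, verified on read), so a quick way to pin down exactly which bytes differ is to copy the file back out and compare byte-by-byte; file names here are placeholders:

  bin/hadoop dfs -copyToLocal /user/ray/bigfile.txt bigfile.fromhdfs.txt
  cmp -l bigfile.original.txt bigfile.fromhdfs.txt | head    # lists the offsets and differing byte values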



  


Preferred Java version

2010-07-16 Thread Raymond Jennings III
Is 1.6.0_17 or 1.6.0_20 preferred as the JRE for hadoop?  Thank you.



  


Help with Hadoop runtime error

2010-07-09 Thread Raymond Jennings III
Does anyone know what might be causing this error?  I am using Hadoop version 
0.20.2 and it happens when I run bin/hadoop dfs -copyFromLocal ...

10/07/09 15:51:45 INFO hdfs.DFSClient: Exception in createBlockOutputStream 
java.io.IOException: Bad connect ack with firstBadLink 128.238.55.43:50010
10/07/09 15:51:45 INFO hdfs.DFSClient: Abandoning block 
blk_2932625575574450984_1002


  


Re: Help with Hadoop runtime error

2010-07-09 Thread Raymond Jennings III
Hi Ted, thanks for your reply.  That does not seem to make a difference, 
though.  I put that property in the xml file, restarted everything, and tried to 
transfer the file again, but the same thing occurred.

I had my cluster working perfectly for about a year, but I recently had some 
disk failures, scrubbed all of my machines, reinstalled Linux (same version), and 
moved from Hadoop 0.20.1 to 0.20.2.
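
For reference, the xceivers property Ted mentions below normally goes in hdfs-site.xml on every datanode and needs a datanode restart to take effect; a sketch (the property name really is spelled this way, and 4096 is only the value the HBase docs suggest):

  <property>
    <name>dfs.datanode.max.xcievers</name>
    <value>4096</value>
  </property>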



- Original Message 
From: Ted Yu yuzhih...@gmail.com
To: common-user@hadoop.apache.org
Sent: Fri, July 9, 2010 4:26:30 PM
Subject: Re: Help with Hadoop runtime error

Please see the description about xcievers at:
http://hbase.apache.org/docs/r0.20.5/api/overview-summary.html#requirements

You can confirm that you have a xcievers problem by grepping the
datanode logs with the error message pasted in the last bullet point.

On Fri, Jul 9, 2010 at 1:10 PM, Raymond Jennings III
raymondj...@yahoo.comwrote:

 Does anyone know what might be causing this error?  I am using version
 Hadoop
 0.20.2 and it happens when I run bin/hadoop dfs -copyFromLocal ...

 10/07/09 15:51:45 INFO hdfs.DFSClient: Exception in createBlockOutputStream
 java.io.IOException: Bad connect ack with firstBadLink 128.238.55.43:50010
 10/07/09 15:51:45 INFO hdfs.DFSClient: Abandoning block
 blk_2932625575574450984_1002







  


Newbie to HDFS compression

2010-06-24 Thread Raymond Jennings III
Are there instructions on how to enable compression (and which type?) on hdfs?  
Does this have to be done during installation or can it be added to a running 
cluster?

Thanks,
Ray


  


Re: Newbie to HDFS compression

2010-06-24 Thread Raymond Jennings III
Oh, maybe that's what I meant :-)  I recall reading something on this mailing 
list that the compression codec is not included with the Hadoop binary and that 
you have to get and install it separately due to license incompatibilities.  
Looking at the config xml files it's not clear what I need to do.  Thanks.
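
For what it's worth, the separately-installed codec being recalled is most likely LZO, which was pulled out over GPL licensing; the codecs that ship with Hadoop, such as gzip, need no extra install and are turned on per job rather than in the cluster config. A minimal driver sketch using the 0.20 new API:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.compress.GzipCodec;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  Job job = new Job(new Configuration(), "compressed output");
  // Compress the final reducer output with the bundled gzip codec
  FileOutputFormat.setCompressOutput(job, true);
  FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
  // Optionally compress the intermediate map output as well (0.20 property name)
  job.getConfiguration().setBoolean("mapred.compress.map.output", true);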



- Original Message 
From: Eric Sammer esam...@cloudera.com
To: common-user@hadoop.apache.org
Sent: Thu, June 24, 2010 5:09:33 PM
Subject: Re: Newbie to HDFS compression

There is no file system level compression in HDFS. You can store
compressed files in HDFS, however.

On Thu, Jun 24, 2010 at 11:26 AM, Raymond Jennings III
raymondj...@yahoo.com wrote:
 Are there instructions on how to enable (which type?) of compression on hdfs? 
  Does this have to be done during installation or can it be added to a 
 running cluster?

 Thanks,
 Ray







-- 
Eric Sammer
twitter: esammer
data: www.cloudera.com



  


Which version of java is the preferred version?

2010-06-18 Thread Raymond Jennings III
I recall reading some time ago on this mailing list that certain JRE versions 
were recommended and others were not.  Was 1.6.0_17 the preferred one?  

Thank you.



  


Custom partitioner question

2010-06-03 Thread Raymond Jennings III
I am trying to create my partitioner but I am getting an exception.  Is 
anything required other than providing the method public int getPartition and 
extending the Partitioner class?



java.lang.RuntimeException: java.lang.NoSuchMethodException: 
TSPmrV6$TSPPartitioner.<init>()
at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
at 
org.apache.hadoop.mapred.MapTask$NewOutputCollector.<init>(MapTask.java:527)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.lang.NoSuchMethodException: TSPmrV6$TSPPartitioner.<init>()
at java.lang.Class.getConstructor0(Unknown Source)
at java.lang.Class.getDeclaredConstructor(Unknown Source)
at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:109)
... 4 more



  


Re: Custom partitioner question

2010-06-03 Thread Raymond Jennings III
Hi Ted, that does not appear to be the problem I am having.  I tried adding it 
as you said but I get the same runtime error.  Here is my partitioner:

  public class MyPartitioner extends Partitioner<Text, Text> {
  
public MyPartitioner() {

}
  
public int getPartition(Text key, Text value, int num_partitions) {
String key2 = key.toString();
int hash = key2.hashCode();

hash = hash % num_partitions;

return(hash);
} 
  }


and in my main I have:

job.setMapOutputValueClass(Text.class);
job.setMapOutputKeyClass(Text.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);

job.setPartitionerClass(MyPartitioner.class);


Thanks.
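
For what it's worth, the Outer$Inner form of the name in the trace (TSPmrV6$TSPPartitioner) suggests a non-static nested class, and ReflectionUtils cannot instantiate one of those even with an empty constructor. A sketch of a declaration it can construct, assuming it stays nested in the driver class and the usual org.apache.hadoop.io.Text / org.apache.hadoop.mapreduce.Partitioner imports:

  public static class TSPPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
      // mask the sign bit so a negative hashCode() never yields a negative partition
      return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }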

--- On Thu, 6/3/10, Ted Yu yuzhih...@gmail.com wrote:

 From: Ted Yu yuzhih...@gmail.com
 Subject: Re: Custom partitioner question
 To: common-user@hadoop.apache.org
 Date: Thursday, June 3, 2010, 2:10 PM
 An empty ctor is needed for your
 Partitioner class.
 
 On Thu, Jun 3, 2010 at 10:13 AM, Raymond Jennings III
 raymondj...@yahoo.com
  wrote:
 
  I am trying to create my partitioner but I am getting
 an exception.  Is
  anything required other than providing the method
 public int getPartition
  and extending the Partitioner class?
 
 
 
  java.lang.RuntimeException:
 java.lang.NoSuchMethodException:
  TSPmrV6$TSPPartitioner.init()
         at
 
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
         at
 
 org.apache.hadoop.mapred.MapTask$NewOutputCollector.init(MapTask.java:527)
         at
 org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:613)
         at
 org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
         at
 org.apache.hadoop.mapred.Child.main(Child.java:170)
  Caused by: java.lang.NoSuchMethodException:
 TSPmrV6$TSPPartitioner.init()
         at
 java.lang.Class.getConstructor0(Unknown Source)
         at
 java.lang.Class.getDeclaredConstructor(Unknown Source)
         at
 
 org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:109)
         ... 4 more
 
 
 
 
 
 





Getting zero length files on the reduce output.

2010-06-02 Thread Raymond Jennings III
I have a cluster of 12 slave nodes.  I see that for some jobs half of the 
part-r-* output files are zero in size after the job completes.  Does this 
mean the hash function that splits the data among the reducer nodes is not working 
all that well?  On other jobs it's pretty much even across all reducers, but on 
certain jobs only half of the reducers have files bigger than 0.  It is 
reproducible, though.  Can I change this hash function in any way?  Thanks.
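
For reference, the default is HashPartitioner, which assigns (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, so a small or skewed key space can easily leave some reducers with nothing to do. It can be swapped per job; a sketch, where MyPartitioner is a placeholder class extending Partitioner<KEY, VALUE>:

  job.setPartitionerClass(MyPartitioner.class);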


  


How can I synchronize writing to an hdfs file

2010-05-07 Thread Raymond Jennings III
I want to write to a common hdfs file from within my map method.  Given that 
each task runs in a separate JVM (on separate machines), making a method 
synchronized will not work, I assume.  Are there any file locking or other 
mechanisms to guarantee mutual exclusion on hdfs?

(I want to append to this file and I have the append option turned on.)  Thanks.
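
As far as I know HDFS offers no file locking, so mutual exclusion across map tasks cannot be done at the filesystem level; the usual pattern is one output file per task, merged afterwards, or a single writer doing all the appends. A sketch of the single-writer append, assuming dfs.support.append is on and the path is a placeholder:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);
  FSDataOutputStream out = fs.append(new Path("/user/ray/shared.log"));  // placeholder path
  out.writeBytes("one record\n");
  out.close();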


  


Decommissioning a node

2010-04-23 Thread Raymond Jennings III
I've got a dead machine on my cluster.  I want to safely update HDFS so that 
nothing references this machine then I want to rebuild it and put it back in 
service in the cluster.

Does anyone have any pointers on how to do this (the first part - updating HDFS 
so that it no longer references the machine)?  Thank you.
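
For the first part, the usual route is an exclude file plus a refresh; the namenode then stops scheduling anything on that host and re-replicates its blocks (with a machine that is already dead, re-replication happens on its own once the datanode times out). A sketch, with the excludes path and hostname as placeholders - in hdfs-site.xml on the namenode:

  <property>
    <name>dfs.hosts.exclude</name>
    <value>/usr/local/hadoop/conf/excludes</value>
  </property>

Then add the machine's hostname to that excludes file and run:

  bin/hadoop dfsadmin -refreshNodes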


  


Re: Hadoop does not follow my setting

2010-04-22 Thread Raymond Jennings III
Isn't the number of mappers specified only a suggestion?
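
For what it's worth, with FileInputFormat the number of maps follows the input splits (roughly one per block), so mapred.map.tasks is only a hint. Forcing fewer maps usually means raising the minimum split size; a sketch with illustrative numbers (1500000000 bytes gives roughly 22 splits for a 30 GB file):

  bin/hadoop jar hadoop-0.20.0-examples.jar wordcount \
      -Dmapred.min.split.size=1500000000 /input/30g.txt /output/wc-22maps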

--- On Thu, 4/22/10, He Chen airb...@gmail.com wrote:

 From: He Chen airb...@gmail.com
 Subject: Hadoop does not follow my setting
 To: common-user@hadoop.apache.org
 Date: Thursday, April 22, 2010, 12:50 PM
 Hi everyone
 
 I am doing a benchmark by using Hadoop 0.20.0's wordcount
 example. I have a
 30GB file. I plan to test differenct number of mappers'
 performance. For
 example, for a wordcount job, I plan to test 22 mappers, 44
 mappers, 66
 mappers and 110 mappers.
 
 However, I set the mapred.map.tasks equals to 22. But
 when I ran the job,
 it shows 436 mappers total.
 
 I think maybe the wordcount set its parameters inside the
 its own program. I
 give -Dmapred.map.tasks=22 to this program. But it is
 still 436 again in
 my another try.  I found out that 30GB divide by 436
 is just 64MB, it is
 just my block size.
 
 Any suggestions will be appreciated.
 
 Thank you in advance!
 
 -- 
 Best Wishes!
 顺送商祺!
 
 --
 Chen He
 (402)613-9298
 PhD. student of CSE Dept.
 Holland Computing Center
 University of Nebraska-Lincoln
 Lincoln NE 68588
 


   


JobTracker website data - can it be increased?

2010-04-02 Thread Raymond Jennings III
I am running an application that has many iterations and I find that the 
JobTracker's website cuts off many of the initial runs.  Is there any way to 
increase the number of completed jobs retained so that they are still available on 
the JobTracker's website?  Thank you.
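
For reference, the number of completed jobs kept per user on that page is controlled by a jobtracker property (default 100); a sketch for mapred-site.xml, applied with a jobtracker restart:

  <property>
    <name>mapred.jobtracker.completeuserjobs.maximum</name>
    <value>500</value>
  </property>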


  


why does 'jps' lose track of hadoop processes ?

2010-03-29 Thread Raymond Jennings III
After running hadoop for some period of time, the command 'jps' fails to report 
any hadoop process on any node in the cluster.  The processes are still running 
as can be seen with 'ps -ef|grep java'

In addition, scripts like stop-dfs.sh and stop-mapred.sh no longer find the 
processes to stop.


  


RE: why does 'jps' lose track of hadoop processes ?

2010-03-29 Thread Raymond Jennings III
That would explain why the processes cannot be stopped, but the mystery of why 
jps loses track of these active processes still remains.  Even when jps does 
not report any hadoop process I can still submit and run jobs just fine.  I 
will have to check the next time it happens whether the hadoop pids are the 
same as what is in the file.  If they are different, that would somehow mean the 
hadoop process was being restarted?

--- On Mon, 3/29/10, Bill Habermaas b...@habermaas.us wrote:

 From: Bill Habermaas b...@habermaas.us
 Subject: RE: why does 'jps' lose track of hadoop processes ?
 To: common-user@hadoop.apache.org
 Date: Monday, March 29, 2010, 11:44 AM
 Sounds like your pid files are
 getting cleaned out of whatever directory
 they are being written (maybe garbage collection on a temp
 directory?). 
 
 Look at (taken from hadoop-env.sh):
 # The directory where pid files are stored. /tmp by
 default.
 # export HADOOP_PID_DIR=/var/hadoop/pids
 
 The hadoop shell scripts look in the directory that is
 defined.
 
 Bill
 
 -Original Message-
 From: Raymond Jennings III [mailto:raymondj...@yahoo.com]
 
 Sent: Monday, March 29, 2010 11:37 AM
 To: common-user@hadoop.apache.org
 Subject: why does 'jps' lose track of hadoop processes ?
 
 After running hadoop for some period of time, the command
 'jps' fails to
 report any hadoop process on any node in the cluster. 
 The processes are
 still running as can be seen with 'ps -ef|grep java'
 
 In addition, scripts like stop-dfs.sh and stop-mapred.sh no
 longer find the
 processes to stop.
 
 
       
 
 
 


 


Re: why does 'jps' lose track of hadoop processes ?

2010-03-29 Thread Raymond Jennings III
Yes, I am.

--- On Mon, 3/29/10, Bill Au bill.w...@gmail.com wrote:

 From: Bill Au bill.w...@gmail.com
 Subject: Re: why does 'jps' lose track of hadoop processes ?
 To: common-user@hadoop.apache.org
 Date: Monday, March 29, 2010, 1:04 PM
 Are you running jps under the same
 user id that the hadoop processes are
 running under?
 
 Bill
 
 On Mon, Mar 29, 2010 at 11:37 AM, Raymond Jennings III
 
 raymondj...@yahoo.com
 wrote:
 
  After running hadoop for some period of time, the
 command 'jps' fails to
  report any hadoop process on any node in the
 cluster.  The processes are
  still running as can be seen with 'ps -ef|grep java'
 
  In addition, scripts like stop-dfs.sh and
 stop-mapred.sh no longer find the
  processes to stop.
 
 
 
 
 





Question about ChainMapper

2010-03-29 Thread Raymond Jennings III
I would like to try to use a ChainMapper/ChainReducer but I see that the last 
parameter is a JobConf, which I am not creating as I am using the latest API 
version.  Has anyone tried to do this with the newer API?  Can I 
extract a JobConf object from somewhere?

Thanks
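
For reference, in 0.20 ChainMapper/ChainReducer live only in the old org.apache.hadoop.mapred API, so there is no JobConf to pull out of a new-API Job; the chain has to be driven old-style. A rough sketch (MyDriver, FirstMapper and SecondMapper are placeholders, the mappers implementing the old Mapper interface):

  JobConf job = new JobConf(MyDriver.class);
  job.setJobName("chained job");

  JobConf map1Conf = new JobConf(false);
  ChainMapper.addMapper(job, FirstMapper.class, LongWritable.class, Text.class,
                        Text.class, Text.class, true, map1Conf);

  JobConf map2Conf = new JobConf(false);
  ChainMapper.addMapper(job, SecondMapper.class, Text.class, Text.class,
                        Text.class, IntWritable.class, true, map2Conf);

  // input/output formats and paths set as usual on the JobConf, then
  JobClient.runJob(job);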


  


Is there a size limit on a line for a text file?

2010-03-25 Thread Raymond Jennings III
for the input to a mapper or as the output of either mapper or reducer?


  


java.io.IOException: Spill failed

2010-03-25 Thread Raymond Jennings III
Any pointers on what might be causing this?  Thanks!



java.io.IOException: Spill failed
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1006)
at java.io.DataOutputStream.write(Unknown Source)
at org.apache.hadoop.io.Text.write(Text.java:282)
at 
org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
at 
org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:854)
at 
org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:504)
at 
org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at TSPmrV2$TSPMapper3.MapEmit(TSPmrV2.java:587)
at TSPmrV2$TSPMapper3.map(TSPmrV2.java:571)
at TSPmrV2$TSPMapper3.map(TSPmrV2.java:1)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not 
find any valid local directory for 
taskTracker/jobcache/job_201003181420_4634/attempt_201003181420_4634_m_00_0/output/spill142.out
at 
org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
at 
org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
at 
org.apache.hadoop.mapred.MapOutputFile.getSpillFileForWrite(MapOutputFile.java:107)
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1183)
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)




  


Is there an easy way to clear old jobs from the jobtracker webpage?

2010-03-17 Thread Raymond Jennings III
I'd like to be able to clear the contents of the jobs that have completed 
running on the jobtracker webpage.  Is there an easy way to do this without 
restarting the cluster?


  


Can I pass a user value to my reducer?

2010-03-15 Thread Raymond Jennings III
I need to pass a counter value to my reducer from the main program.  Can this 
be done through the context parameter somehow?
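
Yes: the usual route is to stash the value in the job's Configuration from the driver and read it back in the reducer's setup(); a sketch with a made-up property name:

  // In the driver, before submitting the job ("my.counter.value" is an arbitrary name):
  job.getConfiguration().setInt("my.counter.value", counterValue);

  // In the reducer:
  @Override
  protected void setup(Context context) {
    int counterValue = context.getConfiguration().getInt("my.counter.value", 0);
  }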


  


I want to group similar keys in the reducer.

2010-03-15 Thread Raymond Jennings III
Is it possible to override a method in the reducer so that similar keys will be 
grouped together?  For example, I want all keys of value KEY1 and KEY2 to be 
merged together.  (My reducer has a KEY of type Text.)  Thanks.
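
For reference, the new API lets a job install a grouping comparator so that keys which compare equal are fed to a single reduce() call; a sketch where the grouping rule (everything before the first '#') is only an example:

  public static class MergeKeysComparator extends WritableComparator {
    protected MergeKeysComparator() {
      super(Text.class, true);   // true: create Text instances so deserialized keys can be compared
    }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
      // example rule: group on the part of the key before the first '#'
      String ka = a.toString().split("#")[0];
      String kb = b.toString().split("#")[0];
      return ka.compareTo(kb);
    }
  }
  ...
  job.setGroupingComparatorClass(MergeKeysComparator.class);

For KEY1 and KEY2 to actually meet in one reduce() call they must also land in the same partition, so the partitioner has to apply the same rule.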


  


How do I upgrade my hadoop cluster using hadoop?

2010-03-11 Thread Raymond Jennings III
I thought there was a utility to do the upgrade for you - one that you run from one 
node and that copies to every other node?
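
As far as I know there is no tool that pushes the new release to every node (that part is rsync or your own scripts), but the HDFS metadata upgrade is driven from the namenode with the -upgrade flag; a sketch:

  bin/stop-all.sh                       # on the old version
  # install the new release on every node (rsync, etc.), then:
  bin/start-dfs.sh -upgrade
  bin/hadoop dfsadmin -upgradeProgress status
  bin/hadoop dfsadmin -finalizeUpgrade  # once satisfied the upgrade is good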


  


SEQ

2010-03-09 Thread Raymond Jennings III
Are there any examples that show how to create a SequenceFile (SEQ file) in HDFS?
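
A minimal sketch of writing one directly with SequenceFile.createWriter; the path and key/value types are only an example:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  Configuration conf = new Configuration();
  FileSystem fs = FileSystem.get(conf);
  Path path = new Path("/user/ray/example.seq");   // placeholder path

  SequenceFile.Writer writer = SequenceFile.createWriter(
      fs, conf, path, Text.class, IntWritable.class);
  writer.append(new Text("key-1"), new IntWritable(42));
  writer.close();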


  


Anyone use MapReduce for TSP approximations?

2010-03-02 Thread Raymond Jennings III
I am interested in seeing how mapreduce could be used to approximate the 
traveling salesman problem.  Anyone have a pointer?

Thanks.


  


How do I get access to the Reporter within Mapper?

2010-02-24 Thread Raymond Jennings III
I am using the non-deprecated Mapper.  Can I obtain it from the Context 
somehow?  Anyone have an example of this?  Thanks.
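
In the new API the Reporter's roles are folded into the Context handed to map(); a sketch, assuming a Mapper<LongWritable, Text, Text, LongWritable>:

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.getCounter("MyApp", "RecordsSeen").increment(1);  // counter group/name are arbitrary
    context.setStatus("processing offset " + key.get());      // same role as Reporter.setStatus
    context.progress();                                       // keeps a long-running task from timing out
    context.write(value, key);
  }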


  


Is it possible to run multiple mapreduce jobs from within the same application

2010-02-23 Thread Raymond Jennings III
In other words:  I have a situation where I want to feed the output from the 
first iteration of my mapreduce job to a second iteration and so on.  I have a 
for loop in my main method to setup the job parameters and to run it through 
all iterations but on about the third run the Hadoop processes lose their 
association with the 'jps' command and then weird things start happening.  I 
remember reading somewhere about chaining - is that what is needed?  I'm not 
sure what causes jps to not report the hadoop processes even though they are 
still active as can be seen with the ps command.  Thanks.  (This is on 
version 0.20.1)
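
For reference, running several jobs from one driver is the normal way to iterate (ChainMapper is not needed for this), with each round reading the previous round's output directory. A sketch with placeholder class names and paths, assuming the usual new-API imports:

  Path input = new Path("/data/iter0");            // placeholder
  for (int i = 0; i < numIterations; i++) {
    Job job = new Job(new Configuration(), "iteration-" + i);
    job.setJarByClass(MyDriver.class);             // MyDriver, MyMapper, MyReducer are placeholders
    job.setMapperClass(MyMapper.class);
    job.setReducerClass(MyReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    Path output = new Path("/data/iter" + (i + 1));
    FileInputFormat.addInputPath(job, input);
    FileOutputFormat.setOutputPath(job, output);

    if (!job.waitForCompletion(true)) {
      break;                                       // stop if an iteration fails
    }
    input = output;                                // next round reads this round's output
  }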


  


Question about Join.java example

2010-02-17 Thread Raymond Jennings III
Is there a typo in the Join.java example that comes with hadoop?  It has the 
line:

JobConf jobConf = new JobConf(getConf(), Sort.class);

Shouldn't that be Join.class ?  Is there an equivalent example that uses the 
later API instead of the deprecated calls?


  


Re: Need to re replicate

2010-01-27 Thread Raymond Jennings III
I would try running the rebalance utility.  I would be curious to see what that 
will do and if that will fix it.
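
For reference, the balancer evens out disk usage across datanodes, while re-replication of under-replicated blocks is something the namenode does on its own; fsck shows the progress. A sketch of both commands:

  bin/hadoop balancer -threshold 10     # or bin/start-balancer.sh
  bin/hadoop fsck / | grep -i "under-replicated"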

--- On Wed, 1/27/10, Ananth T. Sarathy ananth.t.sara...@gmail.com wrote:

 From: Ananth T. Sarathy ananth.t.sara...@gmail.com
 Subject: Need to re replicate
 To: common-user@hadoop.apache.org
 Date: Wednesday, January 27, 2010, 9:28 PM
 One of our datanodes went bye bye. We
 added a bunch more data nodes, but
 when I do a fsck i get a report that a bunch of files are
 only replicated on
 2 server, which makes sense, because we had 3, and lost
 one. Now that we
 have 6 more, is there anything i need to do replicate the
 those files are
 will the cluster fix itself?
 Ananth
 


  


Re: Passing whole text file to a single map

2010-01-23 Thread Raymond Jennings III
Not sure if this solves your problem but I had a similar case where there was 
unique data at the beginning of the file and if that file was split between 
maps I would lose that for the 2nd and subsequent maps.  I was able to pull the 
file name from the conf and read the first two lines for every map.
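
For the non-splitting part of the question, the new-API equivalent is a TextInputFormat subclass that refuses to split; a sketch, assuming the usual mapreduce.lib.input imports:

  public class WholeFileTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
      return false;   // hand every file to a single map task intact
    }
  }
  ...
  job.setInputFormatClass(WholeFileTextInputFormat.class);

Records are still delivered line by line, but every line of a given file goes to the same map task.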

--- On Sat, 1/23/10, stolikp stol...@o2.pl wrote:

 From: stolikp stol...@o2.pl
 Subject: Passing whole text file to a single map
 To: core-u...@hadoop.apache.org
 Date: Saturday, January 23, 2010, 9:49 AM
 
 I've got some text files in my input directory and I want
 to pass each single
 text file (whole file not just a line) to a map (one file
 per one map). How
 can I do this ? TextInputFormat splits text into lines and
 I do not want
 this to happen.
 I tried:
 http://hadoop.apache.org/common/docs/r0.20./streaming.html#How+do+I+process+files%2C+one+per+map%3F
 but it doesn't work for me, compiler doesn't know what
 NonSplitableTextInputFormat.class is.
 I'm using hadoop 0.20.1 
 -- 
 View this message in context: 
 http://old.nabble.com/Passing-whole-text-file-to-a-single-map-tp27286204p27286204.html
 Sent from the Hadoop core-user mailing list archive at
 Nabble.com.
 
 


  


Re: Google has obtained the patent over mapreduce

2010-01-20 Thread Raymond Jennings III
I am not a patent attorney either but for what it's worth - many times a patent 
is sought solely to protect a company from being sued from another.  So even 
though Hadoop is out there it could be the case that Google has no intent of 
suing anyone who uses it - they just wanted to protect themselves from someone 
else claiming it as their own and then suing Google.  But yes, the patent 
system clearly has problems as you stated.

--- On Wed, 1/20/10, Edward Capriolo edlinuxg...@gmail.com wrote:

 From: Edward Capriolo edlinuxg...@gmail.com
 Subject: Re: Google has obtained the patent over mapreduce
 To: common-user@hadoop.apache.org
 Date: Wednesday, January 20, 2010, 12:09 PM
 Interesting situation.
 
 I try to compare mapreduce to the camera. Let argue Google
 is Kodak,
 Apache is Polaroid, and MapReduce is a Camera. Imagine
 Kodak invented
 the camera privately, never sold it to anyone, but produced
 some
 document describing what a camera did.
 
 Polaroid followed the document and produced a camera and
 sold it
 publicly. Kodak later patents a camera, even though no one
 outside of
 Kodak can confirm Kodak ever made a camera before
 Polaroid.
 
 Not saying that is what happened here, but google releasing
 the GFS
 pdf was a large factor in causing hadoop to happen.
 Personally, it
 seems like they gave away too much information before they
 had the
 patent.
 
 The patent system faces many problems including this 'back
 to the
 future' issue. Where it takes so long to get a patent no
 one can wait,
 by the time a patent is issued there are already multiple
 viable
 implementations of a patent.
 
 I am no patent layer or anything, but I notice the phrase
 master
 process all over the claims. Maybe if a piece of software
 (hadoop)
 had a distributed process that would be sufficient to say
 hadoop
 technology does not infringe on this patent.
 
 I think it would be interesting to look deeply at each
 claim and
 determine if hadoop could be designed to not infringe on
 these
 patents, to deal with what if scenarios.
 
 
 
 On Wed, Jan 20, 2010 at 11:29 AM, Ravi ravindra.babu.rav...@gmail.com
 wrote:
  Hi,
   I too read about that news. I don't think that it
 will be any problem.
  However Google didn't invent the model.
 
  Thanks.
 
  On Wed, Jan 20, 2010 at 9:47 PM, Udaya Lakshmi udaya...@gmail.com
 wrote:
 
  Hi,
    As an user of hadoop, Is there anything to
 worry about Google obtaining
  the patent over mapreduce?
 
  Thanks.
 
 
 





Obtaining name of file in map task

2010-01-12 Thread Raymond Jennings III
I am trying to determine the name of the file that is being used for the 
map task.  I am trying to use the setup() method to read the input file name with:

public void setup(Context context) {

    Configuration conf = context.getConfiguration();
    String inputfile = conf.get("map.input.file");
    ..

But inputfile is always null.  Anyone have a pointer on how to do this?  Thanks.
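
With the new API the input split is reachable from the context, which is more dependable than the map.input.file property; a sketch for setup(), assuming imports of org.apache.hadoop.mapreduce.InputSplit and org.apache.hadoop.mapreduce.lib.input.FileSplit:

  @Override
  protected void setup(Context context) {
    InputSplit split = context.getInputSplit();
    if (split instanceof FileSplit) {
      String inputFile = ((FileSplit) split).getPath().toString();  // or getName() for just the file name
    }
  }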


  


Re: Is it possible to share a key across maps?

2010-01-12 Thread Raymond Jennings III
Hi Gang, 
I was able to use this on an older version that uses the JobClient class to run 
the job, but not with the newer API and the Job class.  The Job class appears to 
use a setup() method instead of a configure() method, but the map.input.file 
attribute does not appear to be available via the Configuration in the setup() 
method.  Have you tried to do what you described using the newer API?  Thank 
you.

--- On Fri, 1/8/10, Gang Luo lgpub...@yahoo.com.cn wrote:

 From: Gang Luo lgpub...@yahoo.com.cn
 Subject: Re: Is it possible to share a key across maps?
 To: common-user@hadoop.apache.org
 Date: Friday, January 8, 2010, 10:03 PM
 I don't do that in map method, but in
 configure( JobConf ) method which runs ahead of any map
 method call in that map task. JobConf.get(map.input.file)
 can tell you which file this map task is processing. Use
 this path to read first line of corresponding file. All
 these are done in configure method, that means, before any
 map method is called.
 
 
 -Gang
 
 
 
 - 原始邮件 
 发件人: Raymond Jennings III raymondj...@yahoo.com
 收件人: common-user@hadoop.apache.org
 发送日期: 2010/1/8 (周五) 7:54:30 下午
 主   题: Re: Is it possible to share a
 key across maps?
 
 Hi, you do this in the map method (open the file and read
 the first line?)  Could you explain a little more how
 you do it with configure(), thank you.
 
 --- On Fri, 1/8/10, Gang Luo lgpub...@yahoo.com.cn
 wrote:
 
  From: Gang Luo lgpub...@yahoo.com.cn
  Subject: Re: Is it possible to share a key across
 maps?
  To: common-user@hadoop.apache.org
  Date: Friday, January 8, 2010, 4:46 PM
  I will do that like this: at each map
  task, I get the input file to
  this mapper in the configure(), and manually read the
 first
  line of
  that file to get the user ID. Then start running the
 map
  function.
  
  
  -Gang
  
  
  - 原始邮件 
  发件人: Raymond Jennings III raymondj...@yahoo.com
  收件人: common-user@hadoop.apache.org
  发送日期: 2010/1/8 (周五) 4:23:15 下午
  主   题: Is it possible to share a
 key
  across maps?
  
  I have large files where the userid is the first line
 of
  each file.  I want to use that value as the
 output of
  the map phase for each subsequent line of the
 file.  If
  each map task gets a chunk of this file only one map
 task
  will read the key value from the first line.  Is
 there
  anyway I can force the other map tasks to wait until
 this
  key is read and then somehow pass this value to other
 map
  tasks?  Or is my reasoning incorrect? 
 Thanks.
  
  
       
 
 ___
  
    好玩贺卡等你发,邮箱贺卡全新上线!
  
  http://card.mail.cn.yahoo.com/
  
 
 
      
 ___
 
   好玩贺卡等你发,邮箱贺卡全新上线!
 
 http://card.mail.cn.yahoo.com/
 





Re: Is it possible to share a key across maps?

2010-01-11 Thread Raymond Jennings III
It looks like what you are referring to is the deprecated class - which has 
made for some confusing conversations in the past.  It seems like many users 
still use the older API and most of the examples still use it.  I would like to 
stay with the more recent API, where it looks like the call is actually setup() 
instead of configure().  Not sure if it's a one to one mapping though.

--- On Fri, 1/8/10, Jeff Zhang zjf...@gmail.com wrote:

 From: Jeff Zhang zjf...@gmail.com
 Subject: Re: Is it possible to share a key across maps?
 To: common-user@hadoop.apache.org
 Date: Friday, January 8, 2010, 11:15 PM
 Actually you can treat the mapper
 task as a template design pattern, here's
 the pseudo code:
 
 Mapper.configure(JobConf)
 for each record in InputSplit:
       do
 Mapper.map(key,value,outputkey,outputvalue)
 Mapper.close()
 
 Any sub class of mapper can override the three method:
 configure(),
 map(),close() to do customization.
 
 
 
 2010/1/8 Gang Luo lgpub...@yahoo.com.cn
 
  I don't do that in map method, but in configure(
 JobConf ) method which
  runs ahead of any map method call in that map task.
  JobConf.get(map.input.file) can tell you which file
 this map task is
  processing. Use this path to read first line of
 corresponding file. All
  these are done in configure method, that means, before
 any map method is
  called.
 
 
  -Gang
 
 
 
  - 原始邮件 
  发件人: Raymond Jennings III raymondj...@yahoo.com
  收件人: common-user@hadoop.apache.org
  发送日期: 2010/1/8 (周五) 7:54:30 下午
  主   题: Re: Is it possible to
 share a key across maps?
 
  Hi, you do this in the map method (open the file and
 read the first line?)
   Could you explain a little more how you do it
 with configure(), thank you.
 
  --- On Fri, 1/8/10, Gang Luo lgpub...@yahoo.com.cn
 wrote:
 
   From: Gang Luo lgpub...@yahoo.com.cn
   Subject: Re: Is it possible to share a key across
 maps?
   To: common-user@hadoop.apache.org
   Date: Friday, January 8, 2010, 4:46 PM
   I will do that like this: at each map
   task, I get the input file to
   this mapper in the configure(), and manually read
 the first
   line of
   that file to get the user ID. Then start running
 the map
   function.
  
  
   -Gang
  
  
   - 原始邮件 
   发件人: Raymond Jennings III raymondj...@yahoo.com
   收件人: common-user@hadoop.apache.org
   发送日期: 2010/1/8 (周五) 4:23:15 下午
   主   题: Is it possible to
 share a key
   across maps?
  
   I have large files where the userid is the first
 line of
   each file.  I want to use that value as the
 output of
   the map phase for each subsequent line of the
 file.  If
   each map task gets a chunk of this file only one
 map task
   will read the key value from the first
 line.  Is there
   anyway I can force the other map tasks to wait
 until this
   key is read and then somehow pass this value to
 other map
   tasks?  Or is my reasoning incorrect? 
 Thanks.
  
  
  
  
 ___
  
 
    好玩贺卡等你发,邮箱贺卡全新上线!
  
   http://card.mail.cn.yahoo.com/
  
 
 
    
    ___
    好玩贺卡等你发,邮箱贺卡全新上线!
  http://card.mail.cn.yahoo.com/
 
 
 
 
 -- 
 Best Regards
 
 Jeff Zhang
 





Can map reduce methods print to console in eclipse?

2010-01-11 Thread Raymond Jennings III
I tried writing to stderr but I guess that is not valid.  Can someone tell me 
how I can output some text during either the map or reduce methods?
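
For what it's worth, writes to System.out/System.err from a task do work, but the output lands in that task attempt's stdout/stderr files under logs/userlogs on the tasktracker node (viewable from the job's web UI) rather than in the Eclipse console, unless the job runs with the local runner. A sketch:

  System.err.println("map saw key: " + key);   // appears in the task attempt's stderr log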


  


Is it possible to share a key across maps?

2010-01-08 Thread Raymond Jennings III
I have large files where the userid is the first line of each file.  I want to 
use that value as the output of the map phase for each subsequent line of the 
file.  If each map task gets a chunk of this file only one map task will read 
the key value from the first line.  Is there anyway I can force the other map 
tasks to wait until this key is read and then somehow pass this value to other 
map tasks?  Or is my reasoning incorrect?  Thanks.


  


Other sources for hadoop api help

2010-01-07 Thread Raymond Jennings III
I am trying to develop some hadoop programs and I see that most of the examples 
included in the distribution are using deprecated classes and methods.  Are 
there any other sources to learn about the API other than the javadocs, which, 
for beginners trying to write hadoop programs, are not the best source?  Thanks.



  


Jobs stop at 0%

2009-12-24 Thread Raymond Jennings III
I have been recently seeing a problem where jobs stop at map 0% that previously 
worked fine (with no code changes.)  Restarting hadoop on the cluster solves 
this problem but there is nothing in the log files to indicate what the problem 
is.  Has anyone seen something similar?


  


Errors seen on the jobtracker node

2009-12-18 Thread Raymond Jennings III
Does anyone have any idea what might be causing the following three errors that 
I am seeing.  I am not able to determine what job or what was happening at the 
times listed but I am hoping that if I have a little more information I can 
track down what is happening:

hadoop-root-jobtracker-pingo-2.poly.edu.log.2009-11-11:2009-11-11 11:38:57,720 
ERROR org.apache.hadoop.mapred.JobHistory: Failed creating job history log 
file, disabling history

hadoop-root-jobtracker-pingo-2.poly.edu.log.2009-11-11:2009-11-11 11:38:57,782 
ERROR org.apache.hadoop.mapred.JobHistory: Failed to store job conf on the 
local filesystem 

hadoop-root-jobtracker-pingo-2.poly.edu.log.2009-12-13:2009-12-13 22:30:04,495 
ERROR org.apache.hadoop.hdfs.DFSClient: Exception closing file . by 
DFSClient_-340809610





  


Combiner phase question

2009-12-04 Thread Raymond Jennings III
Does the combiner run once per data node or once per map task?  (That is, can it 
run multiple times on the same data node, once after each map task?)  Thanks.


  


Good idea to run NameNode and JobTracker on same machine?

2009-11-26 Thread Raymond Jennings III
Do people normally combine these two processes onto one machine?  Currently I 
have them on separate machines, but I am wondering whether they really use that 
much CPU time; maybe I should combine them and create another DataNode.


  


Has anyone gotten the Hadoop eclipse plugin to work on Windows?

2009-11-21 Thread Raymond Jennings III
I have been pulling my hair out on this one.  I tried building it within 
eclipse - no errors, but when I put the jar file in and restart eclipse I can 
see the Map/Reduce perspective, but once I try to do anything it bombs with 
random cryptic errors.  I looked at Stephen's notes on JIRA but still no go.  I 
am desperate to get this working, so bribes, kick-backs, and other reciprocity 
will be gladly considered.  ;-)  Thanks!

Ray


  


build / install hadoop plugin question

2009-11-20 Thread Raymond Jennings III
The plugin that is included in the hadoop distribution under 
src/contrib/eclipse-plugin - how does that get installed as it does not appear 
to be in a standard plugin format.  Do I have to build it first and if so can 
you tell me how.  Thanks.  Ray


  


Re: build / install hadoop plugin question

2009-11-20 Thread Raymond Jennings III
That's what I would normally do for a plugin, but this one has a sub-directory of 
eclipse-plugin (and not plugins) and the files are all java files and not 
class files.  This is in the hadoop directory src/contrib/eclipse-plugin.  It 
looks to me like it has to be built first and then copied into the plugins 
directory?

--- On Fri, 11/20/09, Dhaivat Pandit ceo.co...@gmail.com wrote:

 From: Dhaivat Pandit ceo.co...@gmail.com
 Subject: Re: build / install hadoop plugin question
 To: common-user@hadoop.apache.org common-user@hadoop.apache.org
 Date: Friday, November 20, 2009, 9:05 PM
 Just paste it in eclipse installation
 plugins folder and restart eclipse
 
 -dp
 
 
 On Nov 20, 2009, at 2:08 PM, Raymond Jennings III raymondj...@yahoo.com
 wrote:
 
  The plugin that is included in the hadoop distribution
 under src/contrib/eclipse-plugin - how does that get
 installed as it does not appear to be in a standard plugin
 format.  Do I have to build it first and if so can you
 tell me how.  Thanks.  Ray
  
  
  
 


 


Re: build / install hadoop plugin question

2009-11-20 Thread Raymond Jennings III
Could you explain further on how to do this.  I have never built a plugin 
before.  Do I do this from within eclipse?  Thanks!

--- On Fri, 11/20/09, Dhaivat Pandit ceo.co...@gmail.com wrote:

 From: Dhaivat Pandit ceo.co...@gmail.com
 Subject: Re: build / install hadoop plugin question
 To: common-user@hadoop.apache.org common-user@hadoop.apache.org
 Date: Friday, November 20, 2009, 9:53 PM
 Yes if it's not built you can do ant
 eclipse. It will generate the plugin jar and you can paste
 it in plugin directory.
 
 -dp
 
 
 On Nov 20, 2009, at 6:49 PM, Raymond Jennings III raymondj...@yahoo.com
 wrote:
 
  That's what I would normally do for a plugin but this
 has a sub-directory of eclipse-plugin (and not plugins)
 and the files are all java files and not class files. 
 This in the hadoop directory of
 src/contrib/eclipse-plugin.  It looks to me like it has
 to be built first and then copied into the plugins
 directory?
  
  --- On Fri, 11/20/09, Dhaivat Pandit ceo.co...@gmail.com
 wrote:
  
  From: Dhaivat Pandit ceo.co...@gmail.com
  Subject: Re: build / install hadoop plugin
 question
  To: common-user@hadoop.apache.org
 common-user@hadoop.apache.org
  Date: Friday, November 20, 2009, 9:05 PM
  Just paste it in eclipse installation
  plugins folder and restart eclipse
  
  -dp
  
  
  On Nov 20, 2009, at 2:08 PM, Raymond Jennings III
 raymondj...@yahoo.com
  wrote:
  
  The plugin that is included in the hadoop
 distribution
  under src/contrib/eclipse-plugin - how does that
 get
  installed as it does not appear to be in a
 standard plugin
  format.  Do I have to build it first and if
 so can you
  tell me how.  Thanks.  Ray
  
  
  
  
  
  
  
 





Can I change the block size and then restart?

2009-11-19 Thread Raymond Jennings III
Can I just change the block size in the config and restart or do I have to 
reformat?  It's okay if what is currently in the file system stays at the old 
block size if that's possible ?
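
For reference, dfs.block.size is read by the writing client, so after changing it only newly written files pick up the new size; existing files keep the block size they were written with and no reformat is needed. A sketch for hdfs-site.xml (value in bytes, 128 MB here):

  <property>
    <name>dfs.block.size</name>
    <value>134217728</value>
  </property>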


  


Re: About Hadoop pseudo distribution

2009-11-12 Thread Raymond Jennings III
If I understand you correctly you can run jps and see the java jvm's running 
on each machine - that should tell you if you are running in pseudo mode or not.

--- On Thu, 11/12/09, kvorion kveinst...@gmail.com wrote:

 From: kvorion kveinst...@gmail.com
 Subject: About Hadoop pseudo distribution
 To: core-u...@hadoop.apache.org
 Date: Thursday, November 12, 2009, 12:02 PM
 
 Hi All,
 
 I have been trying to set up a hadoop cluster on a number
 of machines, a few
 of which are multicore machines. I have been wondering
 whether the hadoop
 pseudo distribution is something that can help me take
 advantage of the
 multiple cores on my machines. All the tutorials say that
 the pseudo
 distribution mode lets you start each daemon in a separate
 java process. I
 have the following configuration settings for
 hadoop-site.xml:
 
 <property>
   <name>fs.default.name</name>
   <value>hdfs://athena:9000</value>
 </property>
 
 <property>
   <name>mapred.job.tracker</name>
   <value>athena:9001</value>
 </property>
 
 <property>
   <name>dfs.replication</name>
   <value>2</value>
 </property>
 
 I am not sure if this is really running in the
 pseudo-distribution mode. Are
 there any indicators or outputs that confirm what mode you
 are running in?
 
 
 -- 
 View this message in context: 
 http://old.nabble.com/About-Hadoop-pseudo-distribution-tp26322382p26322382.html
 Sent from the Hadoop core-user mailing list archive at
 Nabble.com.
 
 





User permissions on dfs ?

2009-11-11 Thread Raymond Jennings III
Is there a way that I can set up directories in dfs for individual users and set 
the permissions such that only that user can read and write to them, so that if I 
do a hadoop dfs -ls I would get /user/user1, /user/user2, etc., each directory 
readable and writable only by the respective user?  I don't want to format 
an entire dfs filesystem for each user, just let them have one sub-directory off 
of the main /users dfs directory that only they (and root) can read and write 
to.

Right now if I run a mapreduce app as any user but root I am unable to save the 
intermediate files in dfs.

Thanks!


  


Re: User permissions on dfs ?

2009-11-11 Thread Raymond Jennings III
Ah okay, I was looking at the options for hadoop and it only shows fs and not 
dfs - now that I realize they are one and the same.  Thanks!

--- On Wed, 11/11/09, Allen Wittenauer awittena...@linkedin.com wrote:

 From: Allen Wittenauer awittena...@linkedin.com
 Subject: Re: User permissions on dfs ?
 To: common-user@hadoop.apache.org
 Date: Wednesday, November 11, 2009, 1:59 PM
 
 
 
 On 11/11/09 8:50 AM, Raymond Jennings III raymondj...@yahoo.com
 wrote:
 
  Is there a way that I can setup directories in dfs for
 individual users and
  set the permissions such that only that user can read
 write such that if I do
  a hadoop dfs -ls I would get /user/user1
 /user/user2  etc each directory
  only being able to read and write to by the respective
 user?  I don't want to
  format an entire dfs filesystem for each user just let
 them have one
  sub-directory off of the main /users dfs directory
 that only they (and root)
  can read and write to.
  
  Right now if I run a mapreduce app as any user but
 root I am unable to save
  the intermediate files in dfs.
 
 
 A) Don't run Hadoop as root.  All of your user
 submitted code will also run
 as root. This is bad. :)
 
 B) You should be able to create user directories:
 
 hadoop dfs -mkdir /user/username
 hadoop dfs -chown username /user/username
 ...
 
 C) If you are attempting to run pig (and some demos), it
 has a dependency on
 a world writable /tmp. :(
 
 hadoop dfs -mkdir /tmp
 hadoop dfs -chmod a+w /tmp
 
 D) If you are on Solaris, whoami isn't in the default path.
 This confuses
 the hell out of Hadoop so you may need to hack all your
 machines to make
 Hadoop happy here.
 
 
 





Error with replication and namespaceID

2009-11-10 Thread Raymond Jennings III
On the actual datanodes I see the following exception:  I am not sure what the 
namespaceID is or how to sync them.  Thanks for any advice!



/
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = pingo-3.poly.edu/128.238.55.33
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.1
STARTUP_MSG:   build = 
http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.20.1-rc1 -r 
810220; compiled by 'oom' on Tue Sep  1 20:55:56 UTC 2009
/
2009-11-09 09:57:45,328 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
java.io.IOException: Incompatible namespaceIDs in /tmp/hadoop-root/dfs/data: 
namenode namespaceID = 1016244663; datanode namespaceID = 1687029285
at 
org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:233)
at 
org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:148)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:298)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:216)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)


--- On Mon, 11/9/09, Boris Shkolnik bo...@yahoo-inc.com wrote:

 From: Boris Shkolnik bo...@yahoo-inc.com
 Subject: Re: newbie question - error with replication
 To: common-user@hadoop.apache.org
 Date: Monday, November 9, 2009, 5:02 PM
 Make sure you have at least one
 datanode running.
 Look at the data node log file. (logs/*-datanode-*.log)
 
 Boris.
 
 
 On 11/9/09 7:15 AM, Raymond Jennings III raymondj...@yahoo.com
 wrote:
 
  I am trying to resolve an IOException error.  I
 have a basic setup and shortly
  after running start-dfs.sh I get a:
  
  error: java.io.IOException: File
  /tmp/hadoop-root/mapred/system/jobtracker.info could
 only be replicated to 0
  nodes, instead of 1
  java.io.IOException: File
 /tmp/hadoop-root/mapred/system/jobtracker.info could
  only be replicated to 0 nodes, instead of 1
  
  Any pointers how to resolve this?  Thanks!
  
  
  
        
 
 





Re: Error with replication and namespaceID

2009-11-10 Thread Raymond Jennings III
Thanks!!!  That worked!  I guess I can edit the number on the datanodes as well 
but if there is an even more official way to resolve this I would be 
interested in hearing about it.

--- On Tue, 11/10/09, Edmund Kohlwey ekohl...@gmail.com wrote:

 From: Edmund Kohlwey ekohl...@gmail.com
 Subject: Re: Error with replication and namespaceID
 To: common-user@hadoop.apache.org
 Date: Tuesday, November 10, 2009, 1:46 PM
 Hi Ray,
 You'll probably find that even though the name node starts,
 it doesn't 
 have any data nodes and is completely empty.
 
 Whenever hadoop creates a new filesystem, it assigns a
 large random 
 number to it to prevent you from mixing datanodes from
 different 
 filesystems on accident. When you reformat the name node
 its FS has one 
 ID, but your data nodes still have chunks of the old FS
 with a different 
 ID and so will refuse to connect to the namenode. You need
 to make sure 
 these are cleaned up before reformatting. You can do it
 just by deleting 
 the data node directory, although there's probably a more
 official way 
 to do it.
 
 
 On 11/10/09 11:01 AM, Raymond Jennings III wrote:
  On the actual datanodes I see the following
 exception:  I am not sure what the namespaceID is or
 how to sync them.  Thanks for any advice!
 
 
 
 
 /
  STARTUP_MSG: Starting DataNode
  STARTUP_MSG:   host =
 pingo-3.poly.edu/128.238.55.33
  STARTUP_MSG:   args = []
  STARTUP_MSG:   version = 0.20.1
  STARTUP_MSG:   build = 
  http://svn.apache.org/repos/asf/hadoop/common/tags/release-0.20.1-rc1
 -r 810220; compiled by 'oom' on Tue Sep  1 20:55:56 UTC
 2009
 
 /
  2009-11-09 09:57:45,328 ERROR
 org.apache.hadoop.hdfs.server.datanode.DataNode:
 java.io.IOException: Incompatible namespaceIDs in
 /tmp/hadoop-root/dfs/data: namenode namespaceID =
 1016244663; datanode namespaceID = 1687029285
           at
 org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:233)
           at
 org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:148)
           at
 org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:298)
           at
 org.apache.hadoop.hdfs.server.datanode.DataNode.init(DataNode.java:216)
           at
 org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
           at
 org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
           at
 org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
           at
 org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)
 
 
  --- On Mon, 11/9/09, Boris Shkolnikbo...@yahoo-inc.com 
 wrote:
 
     
  From: Boris Shkolnikbo...@yahoo-inc.com
  Subject: Re: newbie question - error with
 replication
  To: common-user@hadoop.apache.org
  Date: Monday, November 9, 2009, 5:02 PM
  Make sure you have at least one
  datanode running.
  Look at the data node log file.
 (logs/*-datanode-*.log)
 
  Boris.
 
 
  On 11/9/09 7:15 AM, Raymond Jennings IIIraymondj...@yahoo.com
  wrote:
 
       
  I am trying to resolve an IOException
 error.  I
         
  have a basic setup and shortly
       
  after running start-dfs.sh I get a:
 
  error: java.io.IOException: File
  /tmp/hadoop-root/mapred/system/jobtracker.info
 could
         
  only be replicated to 0
       
  nodes, instead of 1
  java.io.IOException: File
         
  /tmp/hadoop-root/mapred/system/jobtracker.info
 could
       
  only be replicated to 0 nodes, instead of 1
 
  Any pointers how to resolve this? 
 Thanks!
 
 
 
 
         
 
       
 
 
     
 
 





newbie question - error with replication

2009-11-09 Thread Raymond Jennings III
I am trying to resolve an IOException error.  I have a basic setup and shortly 
after running start-dfs.sh I get a:

error: java.io.IOException: File /tmp/hadoop-root/mapred/system/jobtracker.info 
could only be replicated to 0 nodes, instead of 1
java.io.IOException: File /tmp/hadoop-root/mapred/system/jobtracker.info could 
only be replicated to 0 nodes, instead of 1

Any pointers how to resolve this?  Thanks!