Re: How does an offline Datanode come back up ?

2008-10-28 Thread Steve Loughran
wmitchell wrote: Hi All, I've been working through Michael Noll's multi-node cluster setup example (Running_Hadoop_On_Ubuntu_Linux) for Hadoop and I have a working setup. On my slave machine, which is currently running a datanode, I then killed the process in an effort to try to simulate some sort of

Re: namenode failure

2008-10-28 Thread Alex Loddengaard
Manually killing a process might create a situation where only a portion of your data is written to disk, and other data queued to be written is lost. This is most likely what caused the corruption in your namenode. Start by reading about bin/hadoop namenode -fsck:

Understanding file splits

2008-10-28 Thread Malcolm Matalka
I am trying to write an InputFormat and I am having some trouble understanding how my data is being broken up. My input is a previous hadoop job and I have added code to my record reader to print out the FileSplit's start and end position, as well as where the last record I read was located. My

SecondaryNameNode on separate machine

2008-10-28 Thread Tomislav Poljak
Hi, I'm trying to implement NameNode failover (or at least NameNode local data backup), but it is hard since there is no official documentation. Pages on this subject are created, but still empty: http://wiki.apache.org/hadoop/NameNodeFailover http://wiki.apache.org/hadoop/SecondaryNameNode I

I am attempting to use setOutputValueGroupingComparator as a secondary sort on the values

2008-10-28 Thread David M. Coe
I am attempting to write a map/reduce that will sort by the key and then by the values. The output should look like (one pair per line): 0 0, 0 1, 0 5, 0 123, 0 89245, 1 0, 1 234, 1 23423. My mapper is Mapper&lt;LongWritable, Text, IntWritable, IntWritable&gt; and my reducer is the identity. I configure the program using:

Re: Understanding file splits

2008-10-28 Thread Owen O'Malley
On Oct 28, 2008, at 6:29 AM, Malcolm Matalka wrote: I am trying to write an InputFormat and I am having some trouble understanding how my data is being broken up. My input is a previous hadoop job and I have added code to my record reader to print out the FileSplit's start and end position,

Re: I am attempting to use setOutputValueGroupingComparator as a secondary sort on the values

2008-10-28 Thread Owen O'Malley
On Oct 28, 2008, at 7:53 AM, David M. Coe wrote: My mapper is Mapper&lt;LongWritable, Text, IntWritable, IntWritable&gt; and my reducer is the identity. I configure the program using: conf.setOutputKeyClass(IntWritable.class); conf.setOutputValueClass(IntWritable.class);
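
[The trick Owen is pointing at can be illustrated without Hadoop types. The job's output-key comparator sorts the composite (key, value) pairs fully, while the comparator passed to setOutputValueGroupingComparator decides which consecutive pairs land in the same reduce() call — it compares only the primary key. A pure-Java sketch of those two comparators' semantics, with plain int[] pairs standing in for the composite WritableComparable; all names here are illustrative, not Hadoop API:]

```java
import java.util.*;

public class SecondarySortSketch {
    // Full sort order: primary key first, then value — what the composite-key
    // output comparator enforces on map output in the actual job.
    static final Comparator<int[]> SORT =
        Comparator.<int[]>comparingInt(p -> p[0]).thenComparingInt(p -> p[1]);

    // Grouping comparator: compares only the primary key, so every value for
    // one key reaches a single reduce() call, already in sorted order.
    static final Comparator<int[]> GROUP = Comparator.comparingInt(p -> p[0]);

    public static void main(String[] args) {
        List<int[]> pairs = new ArrayList<>(Arrays.asList(
            new int[]{1, 23423}, new int[]{0, 123}, new int[]{0, 0},
            new int[]{1, 0}, new int[]{0, 89245}, new int[]{0, 5},
            new int[]{1, 234}, new int[]{0, 1}));
        pairs.sort(SORT);
        for (int i = 0; i < pairs.size(); i++) {
            int[] p = pairs.get(i);
            boolean newGroup = i == 0 || GROUP.compare(pairs.get(i - 1), p) != 0;
            System.out.println((newGroup ? "reduce(" + p[0] + "): " : "  ") + p[1]);
        }
    }
}
```

[Running this prints key 0's values 0, 1, 5, 123, 89245 in one group and key 1's values 0, 234, 23423 in another — exactly the ordering David's post asks for.]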

Re: Understanding file splits

2008-10-28 Thread Doug Cutting
This is hard to diagnose without knowing your InputFormat. Each split returned by your #getSplits() implementation is passed to your #getRecordReader() implementation. If your RecordReader is not stopping when you expect it to, then that's a problem in your RecordReader, no? Have you written

Question about: file system lockups with xfs, hadoop 0.16.3 and linux 2.6.18-92.1...PAE i686

2008-10-28 Thread Jason Venner
We are seeing some strange lockups on a couple of our machines (in multiple clusters). Basically, the hadoop processes (datanode, tasktracker and tasktracker$child) will hang on the machine. And if you happen to tail the log files, the tail will hang; if you do a find in the dfs data directory

RE: Understanding file splits

2008-10-28 Thread Malcolm Matalka
Thanks for the response Owen. As for the 'isSplittable' thing. The FAQ calls this function 'isSplittable' but in the API it is actually 'isSplitable'. I am not sure who to contact to fix the FAQ. I am extending FileInputFormat in this case so it was actually returning true. In this case the

RE: Understanding file splits

2008-10-28 Thread Malcolm Matalka
Thanks Doug. I have written my RecordReader from scratch, using LineRecordReader as a template. In my response to Owen I showed that if I set isSplitable to false I get splits that represent my entire input file, but I am only able to read up to byte 67108800 (which I believe is a block boundary).
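
[The convention Malcolm is bumping into can be sketched in plain Java. A record reader for split [start, end) skips a partial first record (unless start == 0) and keeps reading *past* end to finish the last record that began inside the split; together the splits then cover every record exactly once, with no truncation at block boundaries. A minimal sketch over an in-memory byte array with newline-delimited records — not the actual LineRecordReader code, just its boundary rule:]

```java
import java.util.*;

public class SplitReaderSketch {
    /**
     * Return the newline-terminated records "owned" by split [start, end):
     * skip a partial first record unless start == 0, and keep reading past
     * `end` to finish the last record that began inside the split.
     */
    static List<String> readSplit(byte[] data, int start, int end) {
        int pos = start;
        if (start != 0) {                        // skip the partial first record
            while (pos < data.length && data[pos - 1] != '\n') pos++;
        }
        List<String> records = new ArrayList<>();
        while (pos < end && pos < data.length) { // last record may cross `end`
            int recStart = pos;
            while (pos < data.length && data[pos] != '\n') pos++;
            records.add(new String(data, recStart, pos - recStart));
            pos++;                               // step over the newline
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "aa\nbb\ncc\n".getBytes();
        System.out.println(readSplit(data, 0, 4));  // [aa, bb] — "bb" crosses end
        System.out.println(readSplit(data, 4, 9));  // [cc] — partial "bb" skipped
    }
}
```

[The same rule explains the 64 MB symptom: if the reader stops dead at the split/block end instead of finishing the straddling record, everything after that byte offset is silently lost.]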

Re: I am attempting to use setOutputValueGroupingComparator as a secondary sort on the values

2008-10-28 Thread Hien Luu
This is a nice feature for sorting keys and values. Is there more documentation somewhere that I can find, or is there a MapReduce example that uses this feature? Thanks, Hien

RE: Merge of the inmemory files threw an exception and diffs between 0.17.2 and 0.18.1

2008-10-28 Thread Deepika Khera
I am getting a similar exception too with Hadoop 0.18.1 (see stacktrace below), though it's an EOFException. Does anyone have any idea about what it means and how it can be fixed? 2008-10-27 16:53:07,407 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200810241922_0844_r_06_0 Merge of the

Re: Merge of the inmemory files threw an exception and diffs between 0.17.2 and 0.18.1

2008-10-28 Thread Arun C Murthy
On Oct 27, 2008, at 7:05 PM, Grant Ingersoll wrote: Hi, Over in Mahout (lucene.a.o/mahout), we are seeing an oddity with some of our clustering code and Hadoop 0.18.1. The thread in context is at: http://mahout.markmail.org/message/vcyvlz2met7fnthr The problem seems to occur when

sorting inputs to reduce tasks

2008-10-28 Thread Mark Tozzi
Greetings Hadoop users, I'm relatively new to MapReduce (I've been working on my own with the Hadoop code for about a month and a half now), and I'm having difficulty understanding how the values for a given key are passed to the reducer. As per the API, the reducer expects a single key and an

Re: Why separate Map/Reduce task limits per node ?

2008-10-28 Thread Doug Balog
Hi Alex, I'm sorry, I think you misunderstood my question. Let me explain some more. I have a hadoop cluster of dual quad core machines. I'm using hadoop-0.18.1 with Matei's fairscheduler patch https://issues.apache.org/jira/browse/HADOOP-3746 running in FIFO mode. I have about 5 different

Re: Why separate Map/Reduce task limits per node ?

2008-10-28 Thread Alex Loddengaard
I understand your question now, Doug; thanks for clarifying. However, I don't think I can give you a great answer. I'll give it a shot, though: It does seem like having a single task configuration in theory would improve utilization, but it might also make things worse. For example, generally

Re: Ideal number of mappers and reducers; any physical limits?

2008-10-28 Thread Edward J. Yoon
Hi, I'm interested in graph algorithms. On a single machine, as far as we know, a graph can be stored as a linked list or a matrix. Do you know the relative benefits of a linked list versus a matrix? So, I guess Google's web graph will be stored as a matrix in a BigTable. Have you seen my 2D block

Re: SecondaryNameNode on separate machine

2008-10-28 Thread Jean-Daniel Cryans
Tomislav, contrary to popular belief the secondary namenode does not provide failover; it's only used to do what is described here: http://hadoop.apache.org/core/docs/r0.18.1/hdfs_user_guide.html#Secondary+NameNode So the term secondary does not mean a second one but is more like a second part
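
[For running the secondary namenode on a separate machine, the 0.18-era configuration it needs boils down to a few properties. A sketch of the hadoop-site.xml on the secondary machine — the host name and paths below are placeholders, not values from this thread:]

```xml
<!-- hadoop-site.xml on the SecondaryNameNode machine (sketch, 0.18-era property names) -->
<configuration>
  <property>
    <name>dfs.http.address</name>
    <!-- NameNode's HTTP address, from which the fsimage/edits are fetched;
         "master" is a placeholder host name -->
    <value>master:50070</value>
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <!-- hypothetical local path where merged checkpoints are stored -->
    <value>/var/hadoop/dfs/namesecondary</value>
  </property>
  <property>
    <name>fs.checkpoint.period</name>
    <!-- seconds between checkpoints -->
    <value>3600</value>
  </property>
</configuration>
```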

One key per output file

2008-10-28 Thread Florian Leibert
Hi, for convenience reasons, I was wondering if there is a simple way to produce one output file per key in the Reducer? Thanks, Florian

Re: help: InputFormat problem ?

2008-10-28 Thread ZhiHong Fu
I'm a little confused about the implementation of DBInputFormat. In my view, the getSplits method of DBInputFormat splits the result set into several splits logically, so the DBRecordReader should process the DBInputSplit. But I find that in the real implementation of DBRecordReader it processes the result set

Re: How does an offline Datanode come back up ?

2008-10-28 Thread Norbert Burger
Along these lines, I'm curious what management tools folks are using to ensure cluster availability (i.e., auto-restart failed datanodes/namenodes). Are you using a custom cron script, or maybe something more complex (Ganglia, Nagios, puppet, etc.)? Thanks, Norbert On 10/28/08, Steve Loughran

Re: How does an offline Datanode come back up ?

2008-10-28 Thread David Wei
I think using crontab would be a good solution: just use a test script to check for the living processes and restart them when they are down. Norbert Burger wrote: Along these lines, I'm curious what management tools folks are using to ensure cluster availability (i.e., auto-restart failed

Re: Merge of the inmemory files threw an exception and diffs between 0.17.2 and 0.18.1

2008-10-28 Thread Devaraj Das
Quick question (I haven't looked at your comparator code yet) - is this reproducible/consistent? On 10/28/08 11:52 PM, Deepika Khera [EMAIL PROTECTED] wrote: I am getting a similar exception too with Hadoop 0.18.1(See stacktrace below), though its an EOFException. Does anyone have any idea

Re: How do I include customized InputFormat, InputSplit and RecordReader in a C++ pipes job?

2008-10-28 Thread Amareshwari Sriramadasu
Hi, How are you passing your classes to the pipes job? If you are passing them as a jar file, you can use -libjars option. From branch 0.19, the libjar files are added to the client classpath also. Thanks Amareshwari Zhengguo 'Mike' SUN wrote: Hi, I implemented customized classes for

Re: One key per output file

2008-10-28 Thread Florian Leibert
Thanks Mice, tried using that already - however this doesn't yield the desired results - upon output collection (using the OutputCollector), it still produces only one output file (note, I only have one input file, not multiple input files, but want a file per key for the output...)

Re: One key per output file

2008-10-28 Thread Mice
Did you override generateFileNameForKeyValue? 2008/10/29 Florian Leibert [EMAIL PROTECTED]: Thanks Mice, tried using that already - however this doesn't yield the desired results - upon output collection (using the OutputCollector), it still produces only one output file (note, I only have

Re: Ideal number of mappers and reducers; any physical limits?

2008-10-28 Thread Ted Dunning
On Tue, Oct 28, 2008 at 5:15 PM, Edward J. Yoon [EMAIL PROTECTED] wrote: ... In single machine, as far as we know graph can be stored to linked list or matrix. Since the matrix is normally very sparse for large graphs, these two approaches are pretty similar. ... So, I guess google's web

Re: One key per output file

2008-10-28 Thread Florian Leibert
Great, thanks for that hint - for some reason I expected that behavior to be a feature of the MultipleTextOutputFormat class - doing so solved my problem! Thanks!! Here my code (I wanted to specifically omit outputting the key however still having a file per key) if anyone is interested:
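
[The routing idea behind the thread's fix (in Hadoop 0.18, MultipleTextOutputFormat's generateFileNameForKeyValue picks the output file, and generateActualKey can return null so only the value is written) can be shown without Hadoop at all. A pure-Java sketch, with an in-memory map standing in for HDFS files; names here are illustrative:]

```java
import java.util.*;

public class PerKeyOutputSketch {
    // One "file" per key: the file name is derived from the key (mirroring
    // generateFileNameForKeyValue), and only the value is recorded (mirroring
    // generateActualKey returning null, so the key is omitted from the line).
    static Map<String, List<String>> route(List<String[]> keyValues) {
        Map<String, List<String>> files = new LinkedHashMap<>();
        for (String[] kv : keyValues) {
            String fileName = kv[0];             // file name derived from the key
            files.computeIfAbsent(fileName, k -> new ArrayList<>())
                 .add(kv[1]);                    // value only; key is omitted
        }
        return files;
    }

    public static void main(String[] args) {
        List<String[]> kvs = Arrays.asList(
            new String[]{"apple", "1"}, new String[]{"pear", "2"},
            new String[]{"apple", "3"});
        System.out.println(route(kvs)); // {apple=[1, 3], pear=[2]}
    }
}
```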