Re: Datanode not detecting full disk

2008-10-28 Thread Stefan Will
Hi Jeff, Yeah, it looks like I'm running into the issues described in the bug. I'm running 0.18.1 on CentOS 5 by the way. Measuring available disk space appears to be harder than I thought ... and here I was under the impression the percentage in df was a pretty clear indicator of how full the dis

Re: SecondaryNameNode on separate machine

2008-10-28 Thread Otis Gospodnetic
Hi, So what is the "recipe" for avoiding NN SPOF using only what comes with Hadoop? From what I can tell, I think one has to do the following two things: 1) configure primary NN to save namespace and xa logs to multiple dirs, one of which is actually on a remotely mounted disk, so that the data
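As a sketch of step (1): in the 0.18-era configuration, dfs.name.dir takes a comma-separated list of directories and the namenode writes its image and edits log to all of them. The paths below are illustrative, with the second assumed to be an NFS mount on a remote machine:

```xml
<!-- hadoop-site.xml fragment (paths are examples, not from the thread) -->
<property>
  <name>dfs.name.dir</name>
  <value>/home/hadoop/dfs/name,/mnt/remote-nfs/dfs/name</value>
  <description>The namespace image and edits log are written to every
  directory in this comma-separated list; losing the local disk then
  still leaves a current copy on the remote mount.</description>
</property>
```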

Re: Ideal number of mappers and reducers; any physical limits?

2008-10-28 Thread Edward J. Yoon
> extracting a block is only efficient if it is full width or height. For > very sparse matrix operations, the savings due to reuse of intermediate > results are completely dominated by the I/O cost so block decompositions are > much less helpful. Hmm, Yes. Thanks for your great comments. :-) >

Re: One key per output file

2008-10-28 Thread Florian Leibert
Great, thanks for that hint - for some reason I expected that behavior to be a feature of the MultipleTextOutputFormat class - doing so solved my problem! Thanks!! Here's my code (I wanted to specifically omit outputting the key while still having a file per key), if anyone is interested:
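Florian's actual code is cut off in the archive; a minimal sketch of the approach being discussed (against the 0.18-era org.apache.hadoop.mapred API, with an illustrative class name) might look like:

```java
// Sketch only, not Florian's original code. Emits one output file per
// key, named after the key, and suppresses the key in the records.
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class OneFilePerKeyOutputFormat
        extends MultipleTextOutputFormat<Text, Text> {

    @Override
    protected String generateFileNameForKeyValue(Text key, Text value,
                                                 String name) {
        // Route each record to a file named after its key.
        return key.toString();
    }

    @Override
    protected Text generateActualKey(Text key, Text value) {
        // Returning null makes the underlying LineRecordWriter
        // write only the value, omitting the key.
        return null;
    }
}
```

Set it on the job with conf.setOutputFormat(OneFilePerKeyOutputFormat.class); note that one file per distinct key can produce a very large number of small HDFS files.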

Re: Ideal number of mappers and reducers; any physical limits?

2008-10-28 Thread Ted Dunning
On Tue, Oct 28, 2008 at 5:15 PM, Edward J. Yoon <[EMAIL PROTECTED]>wrote: > ... > In single machine, as far as we > know graph can be stored to linked list or matrix. > Since the matrix is normally very sparse for large graphs, these two approaches are pretty similar. > ... So, I guess google's

Re: One key per output file

2008-10-28 Thread Mice
Did you override generateFileNameForKeyValue? 2008/10/29 Florian Leibert <[EMAIL PROTECTED]>: > Thanks Mice, > tried using that already - however this doesn't yield the desired results - > upon output collection (using the OutputCollector), it still produces only > one output file (note, I only ha

Re: One key per output file

2008-10-28 Thread Florian Leibert
Thanks Mice, tried using that already - however this doesn't yield the desired results - upon output collection (using the OutputCollector), it still produces only one output file (note, I only have one input file, not multiple input files, but want a file per key for the output...) Thanks

Re: How do I include customized InputFormat, InputSplit and RecordReader in a C++ pipes job?

2008-10-28 Thread Amareshwari Sriramadasu
Hi, How are you passing your classes to the pipes job? If you are passing them as a jar file, you can use the -libjars option. From branch 0.19, the libjar files are added to the client classpath as well. Thanks Amareshwari Zhengguo 'Mike' SUN wrote: Hi, I implemented customized classes for InputF

Re: "Merge of the inmemory files threw an exception" and diffs between 0.17.2 and 0.18.1

2008-10-28 Thread Devaraj Das
Quick question (I haven't looked at your comparator code yet) - is this reproducible/consistent? On 10/28/08 11:52 PM, "Deepika Khera" <[EMAIL PROTECTED]> wrote: > I am getting a similar exception too with Hadoop 0.18.1 (see stacktrace > below), though it's an EOFException. Does anyone have any id

Re: How does an offline Datanode come back up ?

2008-10-28 Thread David Wei
I think using crontab will be a good solution: just use a test script to check for the living processes and restart them when they are down. Norbert Burger wrote: Along these lines, I'm curious what "management tools" folks are using to ensure cluster availability (i.e., auto-restart failed datan
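A hedged sketch of that kind of cron-driven watchdog (paths, daemon names, and the reliance on jps output are all assumptions about a stock 0.18-style install, not code from the thread):

```shell
#!/bin/sh
# Watchdog sketch: restart a Hadoop daemon if its JVM is no longer running.
# Assumes $HADOOP_HOME points at a standard Hadoop install whose daemons
# show up in `jps` output by main class name (DataNode, TaskTracker, ...).
HADOOP_HOME=${HADOOP_HOME:-/usr/local/hadoop}

# True if a JVM whose main class matches $1 is running.
daemon_running() {
  jps 2>/dev/null | grep -q "$1"
}

# Restart daemon $2 (e.g. "datanode") if its main class $1 is not running.
restart_if_down() {
  if ! daemon_running "$1"; then
    "$HADOOP_HOME/bin/hadoop-daemon.sh" start "$2"
  fi
}

# Intended use from cron, e.g. every 5 minutes:
# */5 * * * * /usr/local/hadoop/bin/watchdog.sh
# with a body of:
#   restart_if_down DataNode datanode
#   restart_if_down TaskTracker tasktracker
```

This only catches dead processes, not hung ones; tools like Nagios or Ganglia (mentioned below in the thread) can additionally alert on unresponsive daemons.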

Re: How does an offline Datanode come back up ?

2008-10-28 Thread Norbert Burger
Along these lines, I'm curious what "management tools" folks are using to ensure cluster availability (i.e., auto-restart failed datanodes/namenodes). Are you using a custom cron script, or maybe something more complex (Ganglia, Nagios, Puppet, etc.)? Thanks, Norbert On 10/28/08, Steve Loughran <

Re: help: InputFormat problem ?

2008-10-28 Thread ZhiHong Fu
I'm a little confused about the implementation of DBInputFormat. In my view, the getSplits method of DBInputFormat splits the result set into several splits logically, so the DBRecordReader should process the DBSplit. But I find that in the real implementation, DBRecordReader processes the result set inste

Re: One key per output file

2008-10-28 Thread Mice
MultipleOutputFormat meets your need. It is in 0.18.1. 2008/10/29 Florian Leibert <[EMAIL PROTECTED]>: > Hi, > for convenience reasons, I was wondering if there is a simple way to produce > one output file per key in the Reducer? > > Thanks, > Florian >

Re: sorting inputs to reduce tasks

2008-10-28 Thread Mice
I didn't find any "secondary value sorter" either. One workaround you can try is joining your cache and input with CompositeInputFormat if both of them are too large; but you need to sort both of them with equal partitioning before joining. There is another join utility in contrib/data_join; it is not

One key per output file

2008-10-28 Thread Florian Leibert
Hi, for convenience reasons, I was wondering if there is a simple way to produce one output file per key in the Reducer? Thanks, Florian

Re: SecondaryNameNode on separate machine

2008-10-28 Thread Jean-Daniel Cryans
Tomislav, Contrary to popular belief, the secondary namenode does not provide failover; it's only used to do what is described here: http://hadoop.apache.org/core/docs/r0.18.1/hdfs_user_guide.html#Secondary+NameNode So the term "secondary" does not mean "a second one" but is more like "a second p

Re: Ideal number of mappers and reducers; any physical limits?

2008-10-28 Thread Edward J. Yoon
Hi, I'm interested in graph algorithms. On a single machine, as far as we know, a graph can be stored as a linked list or a matrix. Do you know the relative benefits of a linked list versus a matrix? So, I guess Google's web graph will be stored as a matrix in a BigTable. Have you seen my 2D block algori

Re: Why separate Map/Reduce task limits per node ?

2008-10-28 Thread Doug Balog
Thanks Alex. I found a JIRA that relates to my question https://issues.apache.org/jira/browse/HADOOP-3420 If I decide to do something about this, I'll follow up with HADOOP-3420. Thanks, DougB On Oct 28, 2008, at 5:49 PM, Alex Loddengaard wrote: I understand your question now, Doug; thanks fo

Re: Why separate Map/Reduce task limits per node ?

2008-10-28 Thread Doug Cutting
Alex Loddengaard wrote: That's the best I can do I think. Can others chime in? Another complicating factor is that, if a node dies, reduce tasks can be stalled waiting for map data to be re-generated. So if all tasks were scheduled out of a single pool, one would need to be careful to never

small feature request

2008-10-28 Thread Elia Mazzawi
It would be very useful if the web interface to the job tracker (jobtracker.jsp) showed the priority of each job somewhere next to it. I use this interface all the time, and when I have 10+ programs scheduled I have to keep clicking through them to see the priorities...

Re: Why separate Map/Reduce task limits per node ?

2008-10-28 Thread Alex Loddengaard
I understand your question now, Doug; thanks for clarifying. However, I don't think I can give you a great answer. I'll give it a shot, though: It does seem like having a single task configuration in theory would improve utilization, but it might also make things worse. For example, generally sp

Re: Why separate Map/Reduce task limits per node ?

2008-10-28 Thread Doug Balog
Hi Alex, I'm sorry, I think you misunderstood my question. Let me explain some more. I have a hadoop cluster of dual quad core machines. I'm using hadoop-0.18.1 with Matei's fairscheduler patch https://issues.apache.org/jira/browse/HADOOP-3746 running in FIFO mode. I have about 5 different jobs
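For reference, these are the two per-node limits the thread is about, as they appear in an 0.18-era hadoop-site.xml (the values are illustrative for a dual quad-core box, not taken from Doug's setup):

```xml
<!-- Separate per-tasktracker slot limits for maps and reduces. -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>6</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>
```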

Re: LHadoop Server simple Hadoop input and output

2008-10-28 Thread Ariel Rabkin
Chukwa is not quite ready for prime time. The collection part works OK and shouldn't be too evil to set up, but the analysis part and the data storage documentation aren't there yet. On Mon, Oct 27, 2008 at 12:51 PM, Jeff Hammerbacher <[EMAIL PROTECTED]> wrote: > It could, but we have been unable

sorting inputs to reduce tasks

2008-10-28 Thread Mark Tozzi
Greetings Hadoop users, I'm relatively new to MapReduce (I've been working on my own with the Hadoop code for about a month and a half now), and I'm having difficulty with how the values for a given key are passed to the reducer. As per the API, the reducer expects a single Key and an iterato

RE: Understanding file splits

2008-10-28 Thread Malcolm Matalka
I did a test where I created a free-standing Java application that will just open up one of the URIs and try to read all of it, just as I do in the RecordReader. This worked fine and successfully read the entire file. The M/R job seems to be getting EOF at the end of the first block, though. I a

Re: "Merge of the inmemory files threw an exception" and diffs between 0.17.2 and 0.18.1

2008-10-28 Thread Arun C Murthy
On Oct 27, 2008, at 7:05 PM, Grant Ingersoll wrote: Hi, Over in Mahout (lucene.a.o/mahout), we are seeing an oddity with some of our clustering code and Hadoop 0.18.1. The thread in context is at: http://mahout.markmail.org/message/vcyvlz2met7fnthr The problem seems to occur when going

RE: "Merge of the inmemory files threw an exception" and diffs between 0.17.2 and 0.18.1

2008-10-28 Thread Deepika Khera
I am getting a similar exception too with Hadoop 0.18.1 (see stacktrace below), though it's an EOFException. Does anyone have any idea about what it means and how it can be fixed? 2008-10-27 16:53:07,407 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200810241922_0844_r_06_0 Merge of the inme

Re: I am attempting to use setOutputValueGroupingComparator as a secondary sort on the values

2008-10-28 Thread Hien Luu
This is a nice feature for sorting keys and values. Is there more documentation somewhere that I can find, or is there a MapReduce example that uses this feature? Thanks, Hien From: Owen O'Malley <[EMAIL PROTECTED]> To: core-user@hadoop.apache.org Sent: Tues

Re: Still find this problem! -_-!!!

2008-10-28 Thread Jason Venner
We have this problem when the configuration object being used has been configured using a different hadoop-site.xml/hadoop-default.xml than we expect. It usually comes down to a coding error in the setup of the Configuration object, or a system administration error in the setup of the hadoop-site.x

RE: Understanding file splits

2008-10-28 Thread Malcolm Matalka
Thanks Doug. I have written my RecordReader from scratch, using LineRecordReader as a template. In my response to Owen I showed that if I set isSplitable to false I get splits that represent my entire input file, but I am only able to read up to byte 67108800 (which I believe is one 64 MB block).

RE: Understanding file splits

2008-10-28 Thread Malcolm Matalka
Thanks for the response, Owen. As for the 'isSplittable' thing: the FAQ calls this function 'isSplittable', but in the API it is actually 'isSplitable'. I am not sure who to contact to fix the FAQ. I am extending FileInputFormat in this case, so it was actually returning true. In this case the out

Question about: file system lockups with xfs, hadoop 0.16.3 and linux 2.6.18-92.1...PAE i686

2008-10-28 Thread Jason Venner
We are seeing some strange lockups on a couple of our machines (in multiple clusters). Basically, the hadoop processes will hang on the machine (datanode, tasktracker and tasktracker$child). And if you happen to tail the log files, the tail will hang; if you do a find in the dfs data directory

Re: Understanding file splits

2008-10-28 Thread Doug Cutting
This is hard to diagnose without knowing your InputFormat. Each split returned by your #getSplits() implementation is passed to your #getRecordReader() implementation. If your RecordReader is not stopping when you expect it to, then that's a problem in your RecordReader, no? Have you written

Re: I am attempting to use setOutputValueGroupingComparator as a secondary sort on the values

2008-10-28 Thread Owen O'Malley
On Oct 28, 2008, at 7:53 AM, David M. Coe wrote: My mapper is Mapper and my reducer is the identity. I configure the program using: conf.setOutputKeyClass(IntWritable.class); conf.setOutputValueClass(IntWritable.class); conf.setMapperClass(MapClass.class); conf.setReducerClass(IdentityRedu

Re: Understanding file splits

2008-10-28 Thread Owen O'Malley
On Oct 28, 2008, at 6:29 AM, Malcolm Matalka wrote: I am trying to write an InputFormat and I am having some trouble understanding how my data is being broken up. My input is a previous hadoop job and I have added code to my record reader to print out the FileSplit's start and end position, as

I am attempting to use setOutputValueGroupingComparator as a secondary sort on the values

2008-10-28 Thread David M. Coe
I am attempting to write a map/reduce that will sort by the key and then by the values. The output should look like:

0 0
0 1
0 5
0 123
0 89245
1 0
1 234
1 23423

My mapper is Mapper and my reducer is the identity. I configure the program using: conf.setOutputKeyClass(IntWritable.class); conf.setOutputValueClass(IntWritable.class); conf.setMapperClass(MapClass.class); conf.se
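The usual shape of the solution (a sketch against the 0.18-era org.apache.hadoop.mapred API, with hypothetical class names, and not necessarily the exact answer given in this thread) is to pack the value into a composite map-output key, sort on both fields, but partition and group on the natural key only:

```java
// Sketch: secondary sort via setOutputValueGroupingComparator.
// IntPairWritable, IntPairComparator, FirstFieldComparator and
// FirstFieldPartitioner are hypothetical classes the reader would supply.
JobConf conf = new JobConf(SecondarySortExample.class);

// Map emits (naturalKey, value) packed into one composite key.
conf.setMapOutputKeyClass(IntPairWritable.class);

// Full sort order: by natural key, then by value.
conf.setOutputKeyComparatorClass(IntPairComparator.class);

// Grouping for the reduce call: compare the natural key only, so all
// values for one natural key arrive in a single, value-sorted iterator.
conf.setOutputValueGroupingComparator(FirstFieldComparator.class);

// Partition on the natural key only, so one key never spans reducers.
conf.setPartitionerClass(FirstFieldPartitioner.class);
```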

SecondaryNameNode on separate machine

2008-10-28 Thread Tomislav Poljak
Hi, I'm trying to implement NameNode failover (or at least NameNode local data backup), but it is hard since there is no official documentation. Pages on this subject are created, but still empty: http://wiki.apache.org/hadoop/NameNodeFailover http://wiki.apache.org/hadoop/SecondaryNameNode I ha

Understanding file splits

2008-10-28 Thread Malcolm Matalka
I am trying to write an InputFormat and I am having some trouble understanding how my data is being broken up. My input is a previous hadoop job and I have added code to my record reader to print out the FileSplit's start and end position, as well as where the last record I read was located. My r

Re: Simple MapReduce example failed

2008-10-28 Thread chaitanya krishna
Hi, I faced a similar problem some time back. I think it's the network/communication latency between master and slaves that is an issue in your case. Try increasing the timeout interval in hadoop-site.xml. V.V.Chaitanya Krishna IIIT,Hyderabad India On Thu, Oct 16, 2008 at 4:53 AM, Lucas Di Penti

Re: namenode failure

2008-10-28 Thread Alex Loddengaard
Manually killing a process might create a situation where only a portion of your data is written to disk, and other data queued to be written is lost. This is most likely what caused the corruption in your namenode. Start by reading about bin/hadoop fsck:

Re: How does an offline Datanode come back up ?

2008-10-28 Thread Steve Loughran
wmitchell wrote: Hi All, I've been working through Michael Noll's multi-node cluster setup example (Running_Hadoop_On_Ubuntu_Linux) for Hadoop, and I have a working setup. Then, on my slave machine -- which is currently running a datanode -- I killed the process in an effort to try to simulate some sort of fai