Re: Linear slowdown producing streaming output

2011-03-10 Thread Keith Wiley
I'm still trying to solve this problem. One person mentioned that mappers have to sort the data and that the sort buffer may be relevant, but I'm seeing the same linear slowdown from the reducer, and more importantly, my data sizes are so small (a few MBs) that if the Hadoop settings
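
For reference, the map-side sort buffer referred to above is controlled in mapred-site.xml; a minimal sketch with Hadoop 0.20-era property names (the values shown are just the shipped defaults, not tuning advice):

    <!-- mapred-site.xml: map-side sort buffer settings (0.20.x names) -->
    <property>
      <name>io.sort.mb</name>
      <value>100</value> <!-- in-memory sort buffer size, in MB -->
    </property>
    <property>
      <name>io.sort.spill.percent</name>
      <value>0.80</value> <!-- fraction of the buffer filled before spilling to disk -->
    </property>

With inputs of only a few MB, the entire map output fits in a default-sized buffer, which is consistent with the poster's doubt that the sort buffer explains the slowdown.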

Setting up hadoop on a cluster

2011-03-10 Thread Lai Will
Hello, Currently I've been playing around with my single-node cluster. I'm planning to test my code on a real cluster in the next few weeks. I've read some manuals on how to deploy it. It seems that a lot still has to be done manually. As the cluster I will be working on will probably format

Re: Setting up hadoop on a cluster

2011-03-10 Thread James Seigel
How many nodes? Sent from my mobile. Please excuse the typos. On 2011-03-10, at 7:05 AM, Lai Will l...@student.ethz.ch wrote: Hello, Currently I've been playing around with my single node cluster. I'm planning to test my code on a real cluster in the next few weeks. I've read some

Re: Setting up hadoop on a cluster

2011-03-10 Thread James Seigel
Sorry, and where are you hosting the cluster? Cloud? Physical? Garage? Sent from my mobile. Please excuse the typos. On 2011-03-10, at 7:05 AM, Lai Will l...@student.ethz.ch wrote: Hello, Currently I've been playing around with my single node cluster. I'm planning to test my code on a

DFSClient: Could not complete file

2011-03-10 Thread Chris Curtin
Hi, For the last couple of days we have been seeing tens of thousands of these errors in the logs: INFO org.apache.hadoop.hdfs.DFSClient: Could not complete file /offline/working/3/aat/_temporary/_attempt_201103100812_0024_r_03_0/4129371_172307245/part-3 retrying... When this is going on

Re: Could not obtain block

2011-03-10 Thread Todd Lipcon
[moving to common-user, since this spans both MR and HDFS - probably easier than cross-posting] Can you check the DN logs for "exceeds the limit of concurrent xcievers"? You may need to bump the dfs.datanode.max.xcievers parameter in hdfs-site.xml, and also possibly the nfiles ulimit. -Todd On
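
For anyone hitting the same message, the property Todd names goes in hdfs-site.xml on each DataNode; a minimal sketch (4096 is a commonly cited value, not an official recommendation; note the property name really is spelled "xcievers"):

    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
    </property>

The open-files ulimit for the user running the DataNode is raised separately at the OS level (e.g. in /etc/security/limits.conf on Linux).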

Efficiently partition broadly distributed keys

2011-03-10 Thread Luca Aiello
Dear users, hope this is the right list to submit this one; otherwise I apologize. I'd like to have your opinion about a problem that I'm facing on the MapReduce framework. I am writing my code in Java and running on a grid. I have a textual input structured in key, value pairs. My task is to

Re: Efficiently partition broadly distributed keys

2011-03-10 Thread Niels Basjes
If I understand your problem correctly, you actually need some way of knowing whether you need to chop a large set with a specific key into subsets. In MapReduce, the map only has information about a single key at a time, so you need something extra. One way of handling this is to start by doing a
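
The snippet cuts off here, but one plausible reading of the suggestion is a preliminary job that gathers per-key statistics. A minimal sketch under that assumption (hypothetical class names, new-API Hadoop Java, input assumed to be key<TAB>value lines):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Hypothetical first pass: count occurrences per key so a later job
    // can decide which keys must be chopped into subsets.
    public class KeyFrequency {
      public static class CountMapper
          extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
          // Only the key part matters for the statistics.
          outKey.set(line.toString().split("\t", 2)[0]);
          ctx.write(outKey, ONE);
        }
      }

      public static class SumReducer
          extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> counts, Context ctx)
            throws IOException, InterruptedException {
          long sum = 0;
          for (LongWritable c : counts) sum += c.get();
          ctx.write(key, new LongWritable(sum));
        }
      }
    }

The output of this pass tells you which keys are hot enough to be worth splitting in the main job.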

Re: Efficiently partition broadly distributed keys

2011-03-10 Thread Luca Aiello
Dear Niels, thanks for the quick response. So in your opinion there is nothing like a Hadoop-embedded tool to do this; this is what I suspected, indeed. Since the key, value pairs in the initial dataset are randomly partitioned across the input files, I suppose that I can avoid the initial statistic

Open HDFS in mappers

2011-03-10 Thread maha
Hello, My main function prepares an HDFS file called inputPaths containing all the input files' paths, one path per line. I set the job's input path to be this HDFS file inputPaths. Hence each mapper's value is something like this: -

Re: Efficiently partition broadly distributed keys

2011-03-10 Thread Niels Basjes
Hi Luca, 2011/3/10 Luca Aiello alu...@yahoo-inc.com wrote: "thanks for the quick response. So in your opinion there is nothing like a Hadoop-embedded tool to do this. This is what I suspected indeed." The MapReduce model simply uses the key as the pivot of the processing. In your application

RE: Efficiently partition broadly distributed keys

2011-03-10 Thread Alex Dorman
Luca, You can avoid the post-processing step if you use composite keys as the output from the map. So if you know your input is composed of 70% A keys, 20% Bs, and 10% Cs, you can emit from the mapper {{key, mod(key_count++, key_probability/lowest_key_probability)}, {value}}. The number of reducers can
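
A minimal sketch of the composite-key idea (illustrative only: class names are hypothetical, and the salt below is a simple per-key round-robin counter rather than Alex's exact probability formula):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper that spreads a hot key over several reducers
    // by appending a salt to it.
    public class SaltingMapper extends Mapper<LongWritable, Text, Text, Text> {
      // Salt buckets per key, sized from the assumed frequencies
      // (A = 70%, B = 20%, everything else = 10%).
      private static final Map<String, Integer> SALTS = new HashMap<String, Integer>();
      static {
        SALTS.put("A", 7);
        SALTS.put("B", 2);
      }
      private final Map<String, Long> seen = new HashMap<String, Long>();

      @Override
      protected void map(LongWritable offset, Text line, Context ctx)
          throws IOException, InterruptedException {
        String[] kv = line.toString().split("\t", 2); // assumes key<TAB>value lines
        String val = kv.length > 1 ? kv[1] : "";
        Long n = seen.get(kv[0]);
        n = (n == null) ? 0L : n + 1;
        seen.put(kv[0], n);
        int buckets = SALTS.containsKey(kv[0]) ? SALTS.get(kv[0]) : 1;
        // Composite key "key#salt" round-robins the hot key over `buckets` reducers.
        ctx.write(new Text(kv[0] + "#" + (n % buckets)), new Text(val));
      }
    }

If per-key aggregation is needed afterwards, a light second pass merges the partial results for key#0 ... key#N back together.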

Re: Efficiently partition broadly distributed keys

2011-03-10 Thread Luca Aiello
Hi Alex, sure, it helps! You are right, I can avoid the post-processing by cleverly adding an additional field just for partitioning purposes. When I said "calculate probability on the fly" I meant something similar to what you said: re-calculate the key probability on every row you process in each

Re: Reason of Formatting Namenode

2011-03-10 Thread Edward Capriolo
On Thu, Mar 10, 2011 at 12:48 AM, Adarsh Sharma adarsh.sha...@orkash.com wrote: Thanks Harsh, i.e., why, if we format the namenode again after loading some data, does the Incompatible namespaceIDs error occur? Best Regards, Adarsh Sharma Harsh J wrote: Formatting the NameNode initializes the

Re: Reason of Formatting Namenode

2011-03-10 Thread Boris Shkolnik
On the first run you want the namenode to initialize its directories (where it stores the VERSION file, fsimage, and edits). On subsequent formats, you are making sure you have a new EMPTY file system. If you don't format, the NameNode will load up the existing fsimage and edits. There is also the matter of generating

Re: Open HDFS in mappers

2011-03-10 Thread Harsh J
Once you have a JobConf/Configuration conf object in your Mapper (via setup/configure methods), you can do the following to get the default file-system impl: FileSystem fs = FileSystem.get(conf); // Gets the fs.default.name file-system impl. Then use fs to open/create/etc. any file you need
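
Filled out a little, that suggestion could look like this in a new-API mapper, for the scenario from the question where each map value is itself a path to read (class and variable names here are illustrative):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class HdfsReadingMapper extends Mapper<LongWritable, Text, Text, Text> {
      private FileSystem fs;

      @Override
      protected void setup(Context ctx) throws IOException {
        // Resolves whatever fs.default.name points at: HDFS on a cluster,
        // the local file system in a bare standalone setup.
        fs = FileSystem.get(ctx.getConfiguration());
      }

      @Override
      protected void map(LongWritable offset, Text value, Context ctx)
          throws IOException, InterruptedException {
        Path p = new Path(value.toString().trim()); // the map value is a path line
        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(p)));
        try {
          String line;
          while ((line = in.readLine()) != null) {
            ctx.write(new Text(p.getName()), new Text(line));
          }
        } finally {
          in.close();
        }
      }
    }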

Re: Open HDFS in mappers

2011-03-10 Thread maha
Thanks for the reply Harsh, as usual :) Yet the problem is in the value of the mapper being = hdfs://localhost:9000/tmp/in/file1 . I thought I wasn't using the same HDFS, but in fact I was, using the same idea you presented. The problem, however, is that the map value =

Re: Open HDFS in mappers

2011-03-10 Thread Harsh J
How do you store the filenames into the file? Instead of storing the entire Path URI (if that is the trouble [it mustn't be if both your driver's and the cluster's fs.default.name are the same]), you can store just the name component of the path (i.e. just /user/me/blah.txt instead of the whole proper URI).
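
A small illustration of the difference (paths and host are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PathDemo {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Store only the name component in the listing file...
        Path stored = new Path("/user/me/blah.txt");
        // ...and let each process qualify it against its own default FS,
        // e.g. hdfs://namenode:9000/user/me/blah.txt on the cluster.
        System.out.println(fs.makeQualified(stored));
      }
    }

This keeps the listing valid whether the job runs against the cluster's HDFS or a local file system.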