Namenode in inconsistent state: how to reinitialize the storage directory

2011-10-25 Thread Stephen Boesch
I am relatively new here and am getting started with CDH3u1 (on VMware). The namenode is not coming up due to the following error: 2011-10-25 22:47:00,547 INFO org.apache.hadoop.hdfs.server.common.Storage: Cannot access storage directory /var/lib/hadoop-0.20/cache/hadoop/dfs/name 2011-10-25 22:47:00,549

Re: Has anyone written a program to show total use on hdfs by directory

2011-10-25 Thread Tsz Wo (Nicholas), Sze
Hi Steve, You may use the shell command "hadoop fs -count" or call FileSystem.getContentSummary(Path f) in Java. Hope it helps. Tsz-Wo From: Steve Lewis To: mapreduce-user ; hdfs-u...@hadoop.apache.org Sent: Tuesday, October 25, 2011 5:51 PM Subject: Has
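As a small illustration of the Java route Nicholas mentions, here is a sketch that walks the immediate subdirectories of a path and prints the bytes each one holds via FileSystem.getContentSummary. The class name and the default path argument are only placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.ContentSummary;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DirUsage {
    public static void main(String[] args) throws Exception {
        // Connects to the default filesystem configured in core-site.xml.
        FileSystem fs = FileSystem.get(new Configuration());
        Path root = new Path(args.length > 0 ? args[0] : "/");
        // Print the total bytes stored under each immediate subdirectory.
        for (FileStatus status : fs.listStatus(root)) {
            if (status.isDir()) {
                ContentSummary summary = fs.getContentSummary(status.getPath());
                System.out.printf("%15d bytes  %s%n", summary.getLength(), status.getPath());
            }
        }
    }
}

The shell command "hadoop fs -count" that Nicholas names gives the same per-directory totals without writing any code.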

Has anyone written a program to show total use on hdfs by directory

2011-10-25 Thread Steve Lewis
While I can see file sizes with the web interface, it is very difficult to tell which directories are taking up space, especially when they are nested several levels deep. -- Steven M. Lewis PhD 4221 105th Ave NE Kirkland, WA 98033 206-384-1340 (cell) Skype lordjoe_com

How to create Output files of about fixed size

2011-10-25 Thread Mapred Learn
Hi, I am trying to create output files of fixed size by using: -Dmapred.max.split.size=6442450812 (6 GB). But the problem is that the input data size and metadata vary, and I have to adjust the above value manually to achieve a fixed size. Is there a way I can programmatically determine the split size
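If the goal is to pick the split size from code rather than hard-coding the -D flag, one hedged sketch (new mapreduce API; the input path and the target of 10 output files are made-up values) is to measure the input first and divide. FileInputFormat.setMaxInputSplitSize is the programmatic counterpart of the mapred.max.split.size flag used above.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class FixedSizeSplits {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path input = new Path("/data/in");          // hypothetical input directory
        // Measure the input and derive a split size instead of hard-coding 6 GB.
        long totalBytes = fs.getContentSummary(input).getLength();
        int desiredOutputFiles = 10;                // illustrative target
        long splitSize = Math.max(1L, totalBytes / desiredOutputFiles);

        Job job = new Job(conf, "fixed-size-output");
        FileInputFormat.addInputPath(job, input);
        // Caps each map input split at roughly totalBytes / desiredOutputFiles.
        FileInputFormat.setMaxInputSplitSize(job, splitSize);
        // ... set mapper, output format, etc., then job.waitForCompletion(true);
    }
}

Note that this only controls the input side: the final output size also depends on what the job filters out and on compression, and non-splittable inputs (for example a single .gz file) ignore the setting entirely.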

Re: Hadoop cluster on EC2: hangs on big chunks of data

2011-10-25 Thread Artem Yankov
It looks like the input data is not being split correctly. It always generates only one map task and gives it to one of the nodes. I tried to pass parameters like -D mapred.max.split.size, but it doesn't seem to have any effect. So the question would be: how do I specify the maximum number of input records
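A single map task usually means either the input is one non-splittable file (for example, gzip-compressed) or the split setting never reaches the job. If the intent is really a cap on input records per map task, a sketch using the old mapred API's NLineInputFormat (the paths and the 100,000-line cap are made up) would look like this:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class BoundedSplits {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(BoundedSplits.class);
        conf.setJobName("bounded-splits");
        // Each map task gets at most this many input lines (records).
        conf.setInt("mapred.line.input.format.linespermap", 100000);
        conf.setInputFormat(NLineInputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path("/data/in"));   // hypothetical paths
        FileOutputFormat.setOutputPath(conf, new Path("/data/out"));
        // With no mapper/reducer set, the identity defaults pass records through.
        JobClient.runJob(conf);
    }
}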

Re: Questions about JVM Reuse

2011-10-25 Thread Joey Echeverria
> Is the configured number of tasks for reuse a suggestion, or will it actually > be used? For example, if I've configured it to use a JVM for 4 tasks, will a > TaskTracker that has 8 tasks to process use 2 JVMs? Or does it decide whether it > actually wants to reuse one, up to the maximum configured num

Questions about JVM Reuse

2011-10-25 Thread Adam Shook
Hello All, I have a few questions concerning the TaskTracker's JVM reuse about which I couldn't unearth any details: Is the configured number of tasks for reuse a suggestion, or will it actually be used? For example, if I've configured it to use a JVM for 4 tasks, will a TaskTracker that has 8
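For context on where this knob lives: the per-job setting is mapred.job.reuse.jvm.num.tasks, which can also be set from code. The sketch below just mirrors the "4 tasks per JVM" example from the question; it says nothing about whether the limit is a hint or a hard cap.

import org.apache.hadoop.mapred.JobConf;

public class JvmReuseExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Allow each task JVM to run up to 4 tasks of this job on a TaskTracker;
        // -1 means "no limit". Equivalent to -Dmapred.job.reuse.jvm.num.tasks=4.
        conf.setNumTasksToExecutePerJvm(4);
        System.out.println(conf.get("mapred.job.reuse.jvm.num.tasks"));   // prints 4
    }
}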

Re: unsort algorithm in map/reduce

2011-10-25 Thread Owen O'Malley
On Tue, Oct 25, 2011 at 8:35 AM, Radim Kolar wrote: > On 25.10.2011 14:21, Niels Basjes wrote: > > Why not do something very simple: Use the MD5 of the URL as the key you do >> the sorting by. >> This scales very easily and gives a highly randomized order. >> Maybe not the optimal maximum distance, b

Re: unsort algorithm in map/reduce

2011-10-25 Thread Radim Kolar
On 25.10.2011 14:21, Niels Basjes wrote: Why not do something very simple: Use the MD5 of the URL as the key you do the sorting by. This scales very easily and gives a highly randomized order. Maybe not the optimal maximum distance, but certainly a very good distribution and very easy to build. I tr

Re: unsort algorithm in map/reduce

2011-10-25 Thread Niels Basjes
Why not do something very simple: Use the MD5 of the URL as the key you do the sorting by. This scales very easily and gives a highly randomized order. Maybe not the optimal maximum distance, but certainly a very good distribution and very easy to build. Niels Basjes 2011/10/25 Radim Kolar > Hi, I am hav
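A sketch of Niels's MD5-as-key idea as a new-API mapper (the class name is made up, and it assumes the input is one URL per line): the shuffle sorts by the hash, which scatters URLs from the same domain across the output.

import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.StringUtils;

// Emits (md5(url), url). The framework's sort on the key then yields a
// pseudo-random order, so consecutive URLs rarely come from the same domain.
public class UnsortByMd5Mapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text url, Context context)
            throws IOException, InterruptedException {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(url.toString().trim().getBytes("UTF-8"));
            context.write(new Text(StringUtils.byteToHexString(digest)), url);
        } catch (NoSuchAlgorithmException e) {
            throw new IOException("MD5 not available", e);
        }
    }
}

If the hash should not appear in the final output, a pass-through reducer can emit only the value.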

unsort algorithm in map/reduce

2011-10-25 Thread Radim Kolar
Hi, I am having a problem implementing an unsort for a crawler in map/reduce. I have a list of URLs waiting to be fetched; they need to be reordered for maximum distance between URLs from the same domain. The idea is to map URL -> domain, URL: test.com, http://www.test.com/page1.html test.com, http://www.test