S3 fs -ls is returning incorrect data

2008-11-14 Thread Josh Ferguson
Hi all, I'm pretty new to Hadoop and I was testing out using S3 as a backend store for a few things, and I am having a problem with a few of the filesystem commands. I've been looking around the web for a while and couldn't sort it out, but I'm sure it's just a newb problem. A little in
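
For reference, the usual way an S3-backed listing is wired up in this generation of Hadoop; the bucket name and credentials below are placeholders, not taken from the message:

  <!-- hadoop-site.xml: credentials for the s3:// block filesystem -->
  <property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>YOUR_AWS_ACCESS_KEY_ID</value>
  </property>
  <property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>YOUR_AWS_SECRET_ACCESS_KEY</value>
  </property>

  bin/hadoop fs -ls s3://your-bucket/some/path

Note that the s3:// scheme stores data as blocks rather than whole objects, so its listings will not line up with what other S3 tools show for the same bucket.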

Re: Cleaning up files in HDFS?

2008-11-14 Thread lohit
Have you tried fs.trash.interval? Its default is 0: "Number of minutes between trash checkpoints. If zero, the trash feature is disabled." More info about the trash feature here: http://hadoop.apache.org/core/docs/current/hdfs_design.html Thanks, Lohit - Original Message From
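
A sketch of the corresponding hadoop-site.xml entry; the 1440-minute (one day) value is just an example:

  <property>
    <name>fs.trash.interval</name>
    <value>1440</value>
    <description>Number of minutes between trash checkpoints.
    If zero, the trash feature is disabled.</description>
  </property>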

Re: Quickstart: only replicated to 0 nodes

2008-11-14 Thread Sean Laurent
On Thu, Nov 6, 2008 at 12:45 AM, Sean Laurent <[EMAIL PROTECTED]> wrote: > So I'm new to Hadoop and I have been trying unsuccessfully to work through the Quickstart tutorial to get a single node working in pseudo-distributed mode. I can't seem to put data into HDFS using release 0.18.2 unde

Re: Cleaning up files in HDFS?

2008-11-14 Thread Alex Loddengaard
A Python script that queries HDFS through the command line (using hadoop fs -lsr) would definitely suffice. I don't know of any toolsets or frameworks for pruning HDFS, other than this: Alex On Fri, Nov 14, 2008 at 5:08 PM, Erik Holstad <[EMAIL PR
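
A rough sketch of such a script; the 30-day cutoff and the /some/dir root are placeholders, and the assumptions about the listing format are spelled out in the comments:

  #!/usr/bin/env python
  # Prune HDFS files older than N days by shelling out to the FsShell.
  # Sketch only: assumes each file line of `hadoop fs -lsr` carries a
  # "yyyy-MM-dd HH:mm" modification stamp and an absolute path; adjust
  # the parsing if your Hadoop version prints a different layout.
  import re, subprocess, datetime

  DAYS = 30
  ROOT = "/some/dir"
  cutoff = datetime.datetime.now() - datetime.timedelta(days=DAYS)

  out = subprocess.Popen(["hadoop", "fs", "-lsr", ROOT],
                         stdout=subprocess.PIPE).communicate()[0].decode("utf-8")
  for line in out.splitlines():
      fields = line.split()
      # skip directories and header lines such as "Found N items"
      if not fields or fields[0].startswith("d") or "<dir>" in fields:
          continue
      paths = [f for f in fields if f.startswith("/")]
      dates = [i for i, f in enumerate(fields)
               if re.match(r"\d{4}-\d{2}-\d{2}$", f)]
      if not paths or not dates or dates[0] + 1 >= len(fields):
          continue
      try:
          mtime = datetime.datetime.strptime(
              fields[dates[0]] + " " + fields[dates[0] + 1], "%Y-%m-%d %H:%M")
      except ValueError:
          continue                  # adjust the time format if your listing differs
      if mtime < cutoff:
          subprocess.call(["hadoop", "fs", "-rm", paths[-1]])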

Cleaning up files in HDFS?

2008-11-14 Thread Erik Holstad
Hi! We would like to run a delete script that deletes all files older than x days that are stored in lib l in HDFS. What is the best way of doing that? Regards Erik

Hadoop User Group (Bay Area) Nov 19th

2008-11-14 Thread Ajay Anand
The next Bay Area Hadoop User Group meeting is scheduled for Wednesday, November 19th at Yahoo! 2821 Mission College Blvd, Santa Clara, Building 1, Training Rooms 3 & 4, from 6:00-7:30 pm. Join us for a talk on Chukwa, a Hadoop-based large-scale performance monitoring system. Please register at: h

RE: Cannot access svn.apache.org -- mirror?

2008-11-14 Thread Dan Segel
Please remove me from these emails! -Original Message- From: Kevin Peterson <[EMAIL PROTECTED]> Sent: Friday, November 14, 2008 7:33 PM To: core-user@hadoop.apache.org Subject: Cannot access svn.apache.org -- mirror? I'm trying to import Hadoop Core into our local repository using piston

Cannot access svn.apache.org -- mirror?

2008-11-14 Thread Kevin Peterson
I'm trying to import Hadoop Core into our local repository using piston ( http://piston.rubyforge.org/index.html ). I can't seem to access svn.apache.org though. I've also tried the EU mirror. No errors, nothing but eventual timeout. Traceroute fails at corv-car1-gw.nero.net. I got the same errors

Re: Recovery of files in hadoop 18

2008-11-14 Thread lohit
Yes, what you did is right. One last check: in the secondary namenode log you should see the timestamp of the last checkpoint (or download of edits). Just make sure those are from before you ran the delete command. Basically, trying to make sure your delete command isn't in the edits. (Another way wou
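
A quick way to do that sanity check, assuming default log naming and whatever directory fs.checkpoint.dir points to (paths are illustrative):

  # last checkpoint activity recorded by the secondary namenode
  grep -i checkpoint logs/hadoop-*-secondarynamenode-*.log | tail
  # modification times of the checkpointed image and edits
  ls -lR /path/to/fs.checkpoint.dir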

Re: Recovery of files in hadoop 18

2008-11-14 Thread Sagar Naik
I had a secondary namenode running on the namenode machine. I deleted the dfs.name.dir, then ran bin/hadoop namenode -importCheckpoint and restarted the DFS. I guess the deletion of name.dir will delete the edit logs. Can you please tell me that this will not lead to replaying the delete transactions?
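
For the archives, roughly the sequence being described; the paths are placeholders for whatever dfs.name.dir and fs.checkpoint.dir point to in your config:

  bin/stop-dfs.sh
  # dfs.name.dir must exist and be empty; keep the old contents aside just in case
  mv /path/to/dfs.name.dir /path/to/dfs.name.dir.bak && mkdir /path/to/dfs.name.dir
  # fs.checkpoint.dir must point at the secondary namenode's checkpoint directory
  bin/hadoop namenode -importCheckpoint

The -importCheckpoint start loads the image from fs.checkpoint.dir and saves it into dfs.name.dir, so only transactions up to the last checkpoint are replayed.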

Re: Recovery of files in hadoop 18

2008-11-14 Thread lohit
The NameNode would not come out of safe mode as it is still waiting for datanodes to report the blocks it expects. I should have added: try to get a full output of fsck with fsck -openforwrite -files -blocks -locations. -openforwrite should tell you what files were open during the checkpoi
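
The full invocation, run against the filesystem root:

  bin/hadoop fsck / -openforwrite -files -blocks -locations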

Re: Recovery of files in hadoop 18

2008-11-14 Thread Sagar Naik
Hey Lohit, Thanks for your help. I did as per your suggestion and imported from the secondary namenode. We have some corrupted files. But for some reason, the namenode is still in safe_mode. It has been an hour or so. The fsck report is: Total size: 6954466496842 B (Total open files size: 5434692
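
To check the situation and, once the missing files are accepted as lost, clear safe mode manually:

  bin/hadoop dfsadmin -report          # how many datanodes have reported in
  bin/hadoop dfsadmin -safemode get    # confirm the namenode is still in safe mode
  bin/hadoop dfsadmin -safemode leave  # force it out once the loss is accepted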

Re: Recovery of files in hadoop 18

2008-11-14 Thread lohit
If you have enabled trash, the files should have been moved to the trash folder before being permanently deleted; restore them from there (hope you have fs.trash.interval set). If not: shut down the cluster. Take a backup of your dfs.data.dir (both on the namenode and secondary namenode). The secondary namenode should h
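
If trash is on, restoring is just a move out of the per-user trash directory; the username and path below are placeholders:

  bin/hadoop fs -ls /user/someuser/.Trash/Current
  bin/hadoop fs -mv /user/someuser/.Trash/Current/some/deleted/path /some/deleted/path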

RE: Any suggestion on performance improvement ?

2008-11-14 Thread souravm
Hi Alex, I get 30-40 secs of response time for around 60MB of data. The number of Map and Reduce tasks is 1 each. This is because the default HDFS block size is 64 MB and Pig assigns 1 Map task for each HDFS block - I believe that is optimal. Now, this being the unit of performance, even if I incr

Re: HDFS NameNode and HA: best strategy?

2008-11-14 Thread Alex Loddengaard
The image and edits files are copied to the secondary namenode periodically, so if you provision a new namenode from the secondary namenode, then your new namenode may be lacking state that the original namenode had. You should grab from the namenode NFS mount, not from the secondary namenode imag
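
A sketch of that provisioning step, assuming dfs.name.dir already included an NFS-mounted directory; the paths are illustrative:

  # on the replacement namenode, with dfs.name.dir pointing at /local/dfs/name
  mkdir -p /local/dfs/name
  cp -r /mnt/namenode-nfs/dfs/name/* /local/dfs/name/
  bin/hadoop-daemon.sh start namenode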

Recovery of files in hadoop 18

2008-11-14 Thread Sagar Naik
Hi, I accidentally deleted the root folder in our HDFS. I have stopped HDFS. Is there any way to recover the files from the secondary namenode? Please help. -Sagar

Re: HDFS NameNode and HA: best strategy?

2008-11-14 Thread Bill Au
There is a "secondary" NameNode which performs periodic checkpoints: http://wiki.apache.org/hadoop/FAQ?highlight=(secondary)#7 Are there any instructions out there on how to copy the FS image and edits log from the secondary NameNode to a new machine when the original NameNode fails? Bill On Fr

Re: Recommendations on Job Status and Dependency Management

2008-11-14 Thread Jimmy Wan
Figured I should respond to my own question and list the solution for the archives: Since I already had a bunch of existing MapReduce jobs created, I was able to quickly migrate my code to Cascading to take care of all the inter-hadoop job dependencies. By making use of the MapReduceFlow and dump

Re: HDFS NameNode and HA: best strategy?

2008-11-14 Thread Alex Loddengaard
HDFS does have a single point of failure, and there is no way around this in its current implementation. The namenode keeps track of an FS image and an edits log. It's common for these to be stored both on the local disk and on an NFS mount. In the case when the namenode fails, a new machine can
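
The common configuration for that, writing the image and edits to both places; the NFS path is a placeholder:

  <property>
    <name>dfs.name.dir</name>
    <value>/local/dfs/name,/mnt/namenode-nfs/dfs/name</value>
    <description>Comma-separated list of directories; the namenode writes
    the FS image and edits log to every directory in the list.</description>
  </property>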

Re: Any suggestion on performance improvement ?

2008-11-14 Thread Alex Loddengaard
How big is the data that you're loading and filtering? Your cluster is pretty small, so if you have data on the magnitude of tens or hundreds of GBs, then the performance you're describing is probably to be expected. How many map and reduce tasks are you running on each node? Alex On Thu, Nov 13
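
For reference, the per-node task slots are controlled by these properties; the values shown are just the defaults:

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>2</value>
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>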

Re: Could Not Find file.out.index (Help starting Hadoop!)

2008-11-14 Thread KevinAWorkman
If I replace the mapred.job.tracker in hadoop-site with local, then the job seems to work: [EMAIL PROTECTED] hadoop-0.18.1]$ bin/hadoop jar hadoop-0.18.1-examples.jar wordcount books booksOutput 08/11/14 12:06:13 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId
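
The distinction in hadoop-site.xml, roughly; the host and port are placeholders:

  <!-- local mode: everything runs in one JVM, no JobTracker/TaskTracker needed -->
  <property>
    <name>mapred.job.tracker</name>
    <value>local</value>
  </property>
  <!-- the pseudo-distributed setting that was failing uses host:port, e.g. localhost:9001 -->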

Re: Datanode block scans

2008-11-14 Thread Steve Loughran
Raghu Angadi wrote: How often is safe depends on what probabilities you are willing to accept. I just checked on one of our clusters with 4PB of data; the scanner fixes about 1 block a day. Assuming an avg size of 64MB per block (pretty high), the probability that 3 copies of one replica go bad in 3 wee
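
A back-of-envelope version of that estimate; since the message is truncated here, all the inputs below are assumptions (4 PB of replicas at 64 MB each, one bad replica found per day, a 3-week scan window, independence between copies), not figures from the thread:

  # Rough estimate only; every input is an assumption for illustration.
  replicas = 4 * 1024**5 / (64 * 1024**2)   # ~67 million 64 MB block replicas in 4 PB
  p_bad = 21.0 / replicas                   # chance a given replica goes bad in 3 weeks
  blocks = replicas / 3                     # distinct blocks at replication factor 3
  p_lose_block = blocks * p_bad**3          # all three copies bad in the same window
  print(replicas, p_bad, p_lose_block)      # p_lose_block comes out around 1e-12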