Re: Datanode block scans

2008-11-14 Thread Steve Loughran

Raghu Angadi wrote:


How often is safe depends on what probabilities you are willing to accept.

I just checked on one of our clusters with 4PB of data; the scanner fixes 
about 1 block a day. Assuming an average size of 64MB per block (pretty high), 
the probability that all 3 replicas of one block go bad within 3 weeks is on 
the order of 1e-12. In reality it is probably 2-3 orders of magnitude less.
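
A rough back-of-the-envelope reading of that estimate (the inputs below are 
assumptions, not figures from the original post: that the 4PB counts 
replicated storage and that corruptions are independent and uniform across 
replicas):

  4PB / 64MB per replica            ~ 6 x 10^7 replicas (~2 x 10^7 distinct blocks)
  1 repaired replica per day        => per-replica corruption rate ~ 1.7 x 10^-8 per day
  over a 3-week window              ~ 3.5 x 10^-7 per replica
  all 3 replicas of a given block   ~ (3.5 x 10^-7)^3 ~ 4 x 10^-20
  over ~2 x 10^7 blocks             ~ 8 x 10^-13, i.e. in the 1e-12 range quoted above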


Raghu.



That's quite interesting data. Any plans to publish a paper on disk 
failures in an HDFS cluster?


on a related note: do you ever scan the rest of the disk for trouble, 
that is the OS filesystem as root, just to catch problems in the server 
itself that could lead to failing jobs?





Re: Could Not Find file.out.index (Help starting Hadoop!)

2008-11-14 Thread KevinAWorkman

If I replace the mapred.job.tracker value in hadoop-site.xml with local, then
the job seems to work:

[EMAIL PROTECTED] hadoop-0.18.1]$ bin/hadoop jar hadoop-0.18.1-examples.jar
wordcount books booksOutput
08/11/14 12:06:13 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
08/11/14 12:06:13 INFO mapred.FileInputFormat: Total input paths to process
: 3
08/11/14 12:06:13 INFO mapred.FileInputFormat: Total input paths to process
: 3
08/11/14 12:06:13 INFO mapred.FileInputFormat: Total input paths to process
: 3
08/11/14 12:06:13 INFO mapred.FileInputFormat: Total input paths to process
: 3
08/11/14 12:06:14 INFO mapred.JobClient: Running job: job_local_0001
08/11/14 12:06:14 INFO mapred.MapTask: numReduceTasks: 1
08/11/14 12:06:14 INFO mapred.MapTask: io.sort.mb = 100
08/11/14 12:06:14 INFO mapred.MapTask: data buffer = 79691776/99614720
08/11/14 12:06:14 INFO mapred.MapTask: record buffer = 262144/327680
08/11/14 12:06:14 INFO mapred.MapTask: Starting flush of map output
08/11/14 12:06:14 INFO mapred.MapTask: bufstart = 0; bufend = 1086784;
bufvoid = 99614720
08/11/14 12:06:14 INFO mapred.MapTask: kvstart = 0; kvend = 109855; length =
327680
08/11/14 12:06:14 INFO mapred.MapTask: Index: (0, 267034, 267034)
08/11/14 12:06:14 INFO mapred.MapTask: Finished spill 0
08/11/14 12:06:15 INFO mapred.LocalJobRunner:
hdfs://localhost:54310/user/hadoop/books/one.txt:0+662001
08/11/14 12:06:15 INFO mapred.TaskRunner: Task
'attempt_local_0001_m_00_0' done.
08/11/14 12:06:15 INFO mapred.TaskRunner: Saved output of task
'attempt_local_0001_m_00_0' to
hdfs://localhost:54310/user/hadoop/booksOutput
08/11/14 12:06:15 INFO mapred.JobClient:  map 100% reduce 0%
08/11/14 12:06:15 INFO mapred.MapTask: numReduceTasks: 1
08/11/14 12:06:15 INFO mapred.MapTask: io.sort.mb = 100
08/11/14 12:06:15 INFO mapred.MapTask: data buffer = 79691776/99614720
08/11/14 12:06:15 INFO mapred.MapTask: record buffer = 262144/327680
08/11/14 12:06:15 INFO mapred.MapTask: Spilling map output: buffer full =
false and record full = true
08/11/14 12:06:15 INFO mapred.MapTask: bufstart = 0; bufend = 2545957;
bufvoid = 99614720
08/11/14 12:06:15 INFO mapred.MapTask: kvstart = 0; kvend = 262144; length =
327680
08/11/14 12:06:15 INFO mapred.MapTask: Starting flush of map output
08/11/14 12:06:16 INFO mapred.MapTask: Index: (0, 717078, 717078)
08/11/14 12:06:16 INFO mapred.MapTask: Finished spill 0
08/11/14 12:06:16 INFO mapred.MapTask: bufstart = 2545957; bufend = 2601773;
bufvoid = 99614720
08/11/14 12:06:16 INFO mapred.MapTask: kvstart = 262144; kvend = 267975;
length = 327680
08/11/14 12:06:16 INFO mapred.MapTask: Index: (0, 23156, 23156)
08/11/14 12:06:16 INFO mapred.MapTask: Finished spill 1
08/11/14 12:06:16 INFO mapred.Merger: Merging 2 sorted segments
08/11/14 12:06:16 INFO mapred.Merger: Down to the last merge-pass, with 2
segments left of total size: 740234 bytes
08/11/14 12:06:16 INFO mapred.MapTask: Index: (0, 740232, 740232)
08/11/14 12:06:16 INFO mapred.LocalJobRunner:
hdfs://localhost:54310/user/hadoop/books/three.txt:0+1539989
08/11/14 12:06:16 INFO mapred.TaskRunner: Task
'attempt_local_0001_m_01_0' done.
08/11/14 12:06:16 INFO mapred.TaskRunner: Saved output of task
'attempt_local_0001_m_01_0' to
hdfs://localhost:54310/user/hadoop/booksOutput
08/11/14 12:06:17 INFO mapred.MapTask: numReduceTasks: 1
08/11/14 12:06:17 INFO mapred.MapTask: io.sort.mb = 100
08/11/14 12:06:17 INFO mapred.MapTask: data buffer = 79691776/99614720
08/11/14 12:06:17 INFO mapred.MapTask: record buffer = 262144/327680
08/11/14 12:06:17 INFO mapred.MapTask: Starting flush of map output
08/11/14 12:06:17 INFO mapred.MapTask: bufstart = 0; bufend = 2387689;
bufvoid = 99614720
08/11/14 12:06:17 INFO mapred.MapTask: kvstart = 0; kvend = 251356; length =
327680
08/11/14 12:06:18 INFO mapred.MapTask: Index: (0, 466648, 466648)
08/11/14 12:06:18 INFO mapred.MapTask: Finished spill 0
08/11/14 12:06:18 INFO mapred.LocalJobRunner:
hdfs://localhost:54310/user/hadoop/books/two.txt:0+1391690
08/11/14 12:06:18 INFO mapred.TaskRunner: Task
'attempt_local_0001_m_02_0' done.
08/11/14 12:06:18 INFO mapred.TaskRunner: Saved output of task
'attempt_local_0001_m_02_0' to
hdfs://localhost:54310/user/hadoop/booksOutput
08/11/14 12:06:18 INFO mapred.ReduceTask: Initiating final on-disk merge
with 3 files
08/11/14 12:06:18 INFO mapred.Merger: Merging 3 sorted segments
08/11/14 12:06:18 INFO mapred.Merger: Down to the last merge-pass, with 3
segments left of total size: 1473914 bytes
08/11/14 12:06:18 INFO mapred.LocalJobRunner: reduce  reduce
08/11/14 12:06:18 INFO mapred.TaskRunner: Task
'attempt_local_0001_r_00_0' done.
08/11/14 12:06:18 INFO mapred.TaskRunner: Saved output of task
'attempt_local_0001_r_00_0' to
hdfs://localhost:54310/user/hadoop/booksOutput
08/11/14 12:06:19 INFO mapred.JobClient: Job complete: job_local_0001
08/11/14 12:06:19 INFO mapred.JobClient: Counters: 13
08/11/14 12:06:19 INFO mapred.JobClient:   File 
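
That change amounts to something like the following in hadoop-site.xml (a 
sketch; when running against a real cluster the value is normally a host:port 
pair such as localhost:54311 rather than local):

<property>
  <name>mapred.job.tracker</name>
  <value>local</value>
</property>

With local, Hadoop uses the LocalJobRunner and runs the whole job in a single 
JVM (hence job_local_0001 and mapred.LocalJobRunner in the output above), so 
the TaskTrackers are bypassed entirely; that suggests the original 
file.out.index error comes from the TaskTracker side (for example its 
mapred.local.dir) rather than from the job itself.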

Re: Any suggestion on performance improvement ?

2008-11-14 Thread Alex Loddengaard
How big is the data that you're loading and filtering?  Your cluster is
pretty small, so if you have data on the order of tens or hundreds of
GBs, then the performance you're describing is probably to be expected.
How many map and reduce tasks are you running on each node?

Alex

On Thu, Nov 13, 2008 at 4:55 PM, souravm [EMAIL PROTECTED] wrote:

 Hi,

 I'm testing with a 4 node setup of Hadoop hdfs.

 Each node has 2GB of memory, a dual-core CPU, and around 30-60 GB of
 disk space.

 I've kept files of different sizes in the hdfs ranging from 10MB to 5 GB.

 I'm querying those files using Pig. What I'm seeing is that even a simple
 select query (LOAD and FILTER) takes at least 30-40 sec. The MAP
 process on one node takes at least 25 sec.

 I've kept the jvm max heap size to 1024m.

 Any suggestion on how to improve the performance with different
 configuration at Hadoop level (by changing hdfs and MapReduce parameters) ?

 Regards,
 Sourav




Re: HDFS NameNode and HA: best strategy?

2008-11-14 Thread Alex Loddengaard
HDFS does have a single point of failure, and there is no way around this in
its current implementation.  The namenode keeps track of an FS image and an
edits log.  It's common for these to be stored both on the local disk and on
an NFS mount.  If the namenode fails, a new machine can be provisioned to be
the namenode by loading the backed-up image and edits files.
Can you say more about how you'll use HDFS?  It's not a low-latency file
system, so it shouldn't be used to serve images, videos, etc. in a web
environment.  Its most common use is as the basis of batch Map/Reduce
jobs.

Alex
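
The local-plus-NFS setup described above is usually done by listing more than 
one directory in dfs.name.dir; the namenode then writes the image and edits 
log to every listed directory. A sketch with hypothetical paths:

<property>
  <name>dfs.name.dir</name>
  <value>/home/hadoop/dfs/name,/mnt/namenode-backup/name</value>
</property>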

On Thu, Nov 13, 2008 at 5:18 PM, S. L. [EMAIL PROTECTED] wrote:

 Hi list
 I am kind of new to Hadoop but have some good background. I am seriously
 considering adopting Hadoop, and especially HDFS first, to be able to store
 various files (in the low hundreds of thousands at first) on a few nodes in a
 manner where I don't need a RAID system or a SAN. HDFS seems a perfect fit
 for the job...

 BUT

 from what I have learned in the past couple of days, it seems that the single
 point of failure in HDFS is the NameNode. So I was wondering, for anyone on
 the list who has deployed HDFS in a production environment, what is your
 strategy for High Availability of the system? Having the NameNode unavailable
 basically brings the whole HDFS system offline. So what are the scripts or
 other techniques recommended to add HA to HDFS?

 Thanks!

 -- S.



Re: Recommendations on Job Status and Dependency Management

2008-11-14 Thread Jimmy Wan
Figured I should respond to my own question and list the solution for
the archives:

Since I already had a bunch of existing MapReduce jobs created, I was able to
quickly migrate my code to Cascading to take care of all the inter-hadoop
job dependencies.

By making use of the MapReduceFlow and dumping those flows into a Cascade
with a CascadeConnector, I was able to throw out several hundred lines of
hand-created Thread and dependency management code in favor of an automated
solution that actually worked a wee bit better in terms of concurrency. I was
able to see an immediate increase in the utilization of my cluster.
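
As a rough illustration of that setup, a sketch only: the exact Cascading 
constructor signatures may differ between versions, and buildDailyJobConf() / 
buildRollupJobConf() are hypothetical helpers standing in for the JobConfs of 
the existing MapReduce jobs.

import org.apache.hadoop.mapred.JobConf;

import cascading.cascade.Cascade;
import cascading.cascade.CascadeConnector;
import cascading.flow.Flow;
import cascading.flow.MapReduceFlow;

public class RunCascade {
  public static void main(String[] args) {
    // Each existing MapReduce job is described by its own JobConf with its
    // input and output paths already set (hypothetical helpers).
    JobConf dailyConf = buildDailyJobConf();
    JobConf rollupConf = buildRollupJobConf();

    // Wrap each JobConf in a MapReduceFlow so Cascading can see its
    // source and sink paths.
    Flow daily = new MapReduceFlow("daily", dailyConf);
    Flow rollup = new MapReduceFlow("rollup", rollupConf);

    // The CascadeConnector works out the ordering from the flows'
    // input/output dependencies and runs independent flows concurrently.
    Cascade cascade = new CascadeConnector().connect(daily, rollup);
    cascade.complete(); // blocks until all flows have finished
  }

  private static JobConf buildDailyJobConf()  { return new JobConf(); } // placeholder
  private static JobConf buildRollupJobConf() { return new JobConf(); } // placeholder
}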

I covered how I worked out the initial HDFS-dependencies in the other reply
to this message.

For determining whether the trigger conditions are met (a reliance on outside
processes for which there is no easy way to read a signal), I'm currently
polling a database for that data, and I'm working with Chris to add a hook into
Cascade to allow pluggable predicates to specify that condition.

So yeah, I'm sold on Cascading. =)

Relevant links:
http://www.cascading.org

Relevant API Docs
http://www.cascading.org/javadoc/cascading/flow/MapReduceFlow.html
http://www.cascading.org/javadoc/cascading/cascade/CascadeConnector.html
http://www.cascading.org/javadoc/cascading/cascade/Cascade.html

On Tue, 11 Nov 2008, Jimmy Wan wrote:

I'd like to take my prototype batch processing of hadoop jobs and implement
some type of real dependency management and scheduling in order to better
utilize my cluster as well as spread out more work over time. I was thinking
of adopting one of the existing packages (Cascading, Zookeeper, existing
JobControl?) and I was hoping to find some better advice from the mailing
list. I tried to find a more direct comparison of Cascading and Zookeeper but
I couldn't find one.

This is a grossly simplified description of my current, completely naive
approach:

1) for each day in a month, spawn N threads that each contain a dependent
series of map/reduce jobs.

2) for each day in a month, spawn N threads that each contain a dependent
series of map/reduce jobs that are dependent on the output of step #1. These
are currently separated from the tasks in step #1 mainly because it's easier
to group them up this way in the event of a failure, but I expect this
separation to go away.

3) At the end of the month, serially run a series of jobs outside of
Map/Reduce that basically consist of a single SQL query (I could easily
convert these to be very simple map/reduce jobs, and probably will, if it
makes my job processing easier).

The main problems I have are the following:
1) right now I have a hard time determining which processes need to be run
in the event of a failure.

Every job has an expected input/output in HDFS, so if I have to rerun
something I usually just use something like hadoop dfs -rmr path in a
shell script and then hand-edit the jobs that need to be rerun.

Is there an example somewhere of code that can read HDFS in order to
determine if files exist? I poked around a bit and couldn't find one.
Ideally, my code would be able to read the HDFS config info right out of the
standard config files so I wouldn't need to create additional configuration
information.
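
For the archives, a minimal sketch of such an existence check using the plain 
Hadoop FileSystem API (the output path is hypothetical); new Configuration() 
picks up hadoop-default.xml and hadoop-site.xml from the classpath, so no 
extra configuration is needed:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExistsCheck {
  public static void main(String[] args) throws Exception {
    // Reads fs.default.name etc. from the standard config files on the
    // classpath, so the same settings drive both jobs and these checks.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical output path of an upstream job.
    Path output = new Path("/user/jimmy/daily/2008-11-01/part-00000");
    if (fs.exists(output)) {
      System.out.println(output + " exists; the dependent job can run.");
    } else {
      System.out.println(output + " is missing; rerun the upstream job.");
    }
  }
}

Checking fs.exists() on each expected output path (or on a small marker file 
each job writes when it finishes) is usually enough to decide what needs to be 
rerun.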

The job dependencies, while enumerated well, are not isolated all that well.
Example: I find a bug in 1 of 10 processes in step #1. I'd like to rerun just
that one process and any dependent processes, but not have to rerun
everything again.

2) I typically run everything 1 month at a time, but I want to keep the
option of doing rollups by day. On the 2nd of the month, I'd like to be able
to run anything that requires data from the 1st of the month. On the 1st of
the month, I'd like to run anything that requires a full month of data from
the previous month.

I'd also like my process to be able to account for system failures on
previous days. i.e. On any given day I'd like to be able to run everything
for which data is available.

3) Certain types of jobs have external dependencies (ex. MySQL) and I don't
want to run too many of those types of jobs at the same time since it affects
my MySQL performance. I'd like some way of describing some type of
lock on external resources that can be shared across jobs.

Any recommendations on how to best model these things?

I'm thinking that something like Cascading or Zookeeper could help me here.
My initial take was that Zookeeper was more heavyweight than Cascading,
requiring additional processes to be running at all times. However, it seems
like Zookeeper would be better suited to describing mutual exclusions on
usage of external resources. Can Cascading even do this?

I'd also appreciate any recommendations on how best to tune the hadoop
processes. My hadoop 0.16.4 cluster is currently relatively small (10 nodes)
so I'm thinking the 1GB defaults for my NameNode, DataNodes, and JobTracker
might be overkill. I also plan to upgrade to 0.17.* or 0.18.* at some point
in 

Re: HDFS NameNode and HA: best strategy?

2008-11-14 Thread Bill Au
There is a secondary NameNode which performs periodic checkpoints:

http://wiki.apache.org/hadoop/FAQ?highlight=(secondary)#7

Are there any instructions out there on how to copy the FS image and edits
log from the secondary NameNode to a new machine when the original NameNode
fails?

Bill




Re: HDFS NameNode and HA: best strategy?

2008-11-14 Thread Alex Loddengaard
The image and edits files are copied to the secondary namenode periodically,
so if you provision a new namenode from the secondary namenode, your new
namenode may be lacking state that the original namenode had.  You should grab
the image and edits files from the namenode's NFS mount, not from the secondary
namenode.
As for a script to do this, I'm not aware of one.  However, it should be as
easy as an scp or rsync of those files plus a call to start-all.sh, etc.

Alex




RE: Any suggestion on performance improvement ?

2008-11-14 Thread souravm
Hi Alex,

I get 30-40 secs of response time for around 60MB of data. The number of Map 
and Reduce tasks is 1 each. This is because the default HDFS block size is 64 MB 
and Pig assigns 1 Map task for each HDFS block - I believe that is optimal.

Now, with this being the unit of performance, I don't think the performance 
would get better even if I increase the number of nodes.

Regards,
Sourav
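
One knob that does change the map count is the block size the input files were 
written with: dfs.block.size only applies to files written after it is 
changed, but loading the same 60MB as several smaller-block files (or as 
several separate files) produces more map tasks. A sketch with an assumed 
value (16777216 bytes = 16MB):

<property>
  <name>dfs.block.size</name>
  <value>16777216</value>
</property>

That said, for a 60MB input much of the 30-40 sec is likely fixed per-job 
startup and scheduling overhead, which neither more nodes nor more map tasks 
will remove; the per-block numbers should look much better once the inputs are 
in the multi-GB range.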



Re: Recovery of files in hadoop 18

2008-11-14 Thread lohit


If you have trash enabled, deleted files are moved to the trash folder before 
being permanently deleted, so you can restore them from there (this requires 
fs.trash.interval to be set).

If not: shut down the cluster.
Take a backup of your dfs.name.dir (on the namenode) and of the secondary 
namenode's checkpoint directory.

The secondary namenode should have the last updated image. Try to start the 
namenode from that image, and don't use the edits from the namenode yet. Try 
the importCheckpoint procedure explained here: 
https://issues.apache.org/jira/browse/HADOOP-2585?focusedCommentId=12558173#action_12558173.
Start only the namenode and run fsck -files. It will throw a lot of messages 
saying you are missing blocks, but that's fine since you haven't started the 
datanodes yet. If it shows your files, that means they haven't been deleted yet. 
This will give you a view of the system as of the last checkpoint. Then start 
the datanodes and, once they are up, run fsck again and check the consistency 
of the system. You will lose all changes that have happened since the last 
checkpoint.


Hope that helps,
Lohit



- Original Message 
From: Sagar Naik [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Friday, November 14, 2008 10:38:45 AM
Subject: Recovery of files in hadoop 18

Hi,
I accidentally deleted the root folder in our HDFS.
I have stopped HDFS.

Is there any way to recover the files from the secondary namenode?

Please help.


-Sagar


Re: Recovery of files in hadoop 18

2008-11-14 Thread Sagar Naik

Hey Lohit,

Thanks for your help.
I did as per your suggestion and imported from the secondary namenode.
We have some corrupted files.

But for some reason, the namenode is still in safe mode. It has been an 
hour or so.

The fsck report is :

Total size:6954466496842 B (Total open files size: 543469222 B)
Total dirs:1159
Total files:   1354155 (Files currently being written: 7673)
Total blocks (validated):  1375725 (avg. block size 5055128 B) 
(Total open file blocks (not validated): 50)

 
 CORRUPT FILES:1574
 MISSING BLOCKS:   1574
 MISSING SIZE: 1165735334 B
 CORRUPT BLOCKS:   1574
 
Minimally replicated blocks:   1374151 (99.88559 %)
Over-replicated blocks:0 (0.0 %)
Under-replicated blocks:   26619 (1.9349071 %)
Mis-replicated blocks: 0 (0.0 %)
Default replication factor:3
Average block replication: 2.977127
Corrupt blocks:1574
Missing replicas:  26752 (0.65317154 %)


Do you think I should manually override safe mode, delete all the 
corrupted files, and restart?


-Sagar






Re: Recovery of files in hadoop 18

2008-11-14 Thread lohit
The NameNode will not come out of safe mode, as it is still waiting for 
datanodes to report the blocks it expects.
I should have added: try to get a full output of fsck:
fsck <path> -openforwrite -files -blocks -locations
The -openforwrite files should tell you which files were open during the 
checkpoint; you might want to double-check that this is the case, i.e. that 
those files were being written at that moment. Maybe by looking at the 
filenames you can tell whether they were part of a job that was running.

For any missing block, you might also want to cross-verify on the datanode to 
see if it is really missing.

Once you are convinced that those are the only corrupt files, and that you can 
live with losing them, start the datanodes.
The namenode will still not come out of safe mode because you have missing 
blocks; leave it for a while, run fsck, look around, and if everything is OK, 
bring the namenode out of safe mode (bin/hadoop dfsadmin -safemode leave).
I hope you started this namenode with the old image and empty edits. You do not 
want your latest edits to be replayed, since they contain your delete 
transactions.

Thanks,
Lohit





Re: Recovery of files in hadoop 18

2008-11-14 Thread Sagar Naik

I had a secondary namenode running on the namenode machine.
I deleted the dfs.name.dir,
then ran bin/hadoop namenode -importCheckpoint
and restarted the dfs.

I guess the deletion of dfs.name.dir will have deleted the edit logs.
Can you please confirm that this will not lead to replaying the delete 
transactions?

Thanks for the help/advice.


-Sagar






Re: Recovery of files in hadoop 18

2008-11-14 Thread lohit
Yes, what you did is right. One last check:
in the secondary namenode log you should see the timestamp of the last 
checkpoint (or of the last download of edits). Just make sure it is from before 
you ran the delete command.
Basically, you are trying to make sure your delete command isn't in the edits. 
(Another way would have been to open the edits file in a hex editor or similar 
to check, but this should work.)
Once that is done, you can start.
Thanks,
Lohit






Cannot access svn.apache.org -- mirror?

2008-11-14 Thread Kevin Peterson
I'm trying to import Hadoop Core into our local repository using piston
( http://piston.rubyforge.org/index.html ).

I can't seem to access svn.apache.org though. I've also tried the EU
mirror. No errors, nothing but eventual timeout. Traceroute fails at
corv-car1-gw.nero.net. I got the same errors a couple weeks ago, but
assumed they were just temporary downtime. I have found some messages
from earlier this year about a similar problem where some people can
access it fine, and others just can't connect. I'm able to access it
from a remote shell account, but not from my machine.

Has anyone been able to work around this? Is there any mirror of the
Hadoop repository?


RE: Cannot access svn.apache.org -- mirror?

2008-11-14 Thread Dan Segel
Please remove me from these emails!




Cleaning up files in HDFS?

2008-11-14 Thread Erik Holstad
Hi!
We would like to run a delete script that deletes all files older than
x days that are stored in lib l in HDFS. What is the best way of doing that?

Regards Erik


Re: Cleaning up files in HDFS?

2008-11-14 Thread Alex Loddengaard
A Python script that queries HDFS through the command line (use hadoop fs
-lsr) would definitely suffice.  I don't know of any toolsets or frameworks
for pruning HDFS, other than this:
http://issues.apache.org/jira/browse/HADOOP-4412

Alex
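
Alternatively, a small Java sketch of the same idea against the FileSystem API 
(the directory and retention period are made-up values, and this assumes the 
files to prune sit directly under a single directory rather than in nested 
subdirectories):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PruneOldFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path dir = new Path("/lib/l");   // hypothetical directory to prune
    long maxAgeDays = 30;            // hypothetical retention period
    long cutoff = System.currentTimeMillis() - maxAgeDays * 24L * 60 * 60 * 1000;

    for (FileStatus status : fs.listStatus(dir)) {
      // getModificationTime() is milliseconds since the epoch.
      if (!status.isDir() && status.getModificationTime() < cutoff) {
        System.out.println("Deleting " + status.getPath());
        fs.delete(status.getPath(), false); // false = do not delete recursively
      }
    }
  }
}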



Re: Quickstart: only replicated to 0 nodes

2008-11-14 Thread Sean Laurent
On Thu, Nov 6, 2008 at 12:45 AM, Sean Laurent [EMAIL PROTECTED] wrote:

 So I'm new to Hadoop and I have been trying unsuccessfully to work
 through the Quickstart tutorial to get a single node working in
 pseudo-distributed mode. I can't seem to put data into HDFS using
 release 0.18.2 under Java 1.6.0_04-b12:

 $ bin/hadoop fs -put conf input
 08/11/05 18:32:23 INFO dfs.DFSClient:
 org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
 /user/slaurent/input/commons-logging.properties could only be
 replicated to 0 nodes, instead of 1
 ...

So I finally discovered my problems... :)

First, I didn't have an entry /etc/hosts for my machine name.

Second (and far more important), the HDFS system was getting created
in /tmp and the partition on which /tmp resides was running out of
disk space. Once I moved the HDFS to a partition with enough space, my
replication problems went away.

I have to admit that it kinda seems like a bug that Hadoop never gave
me ANY indication that I was out of disk space.
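
For anyone hitting the same thing: the quickstart configuration keeps HDFS 
data under hadoop.tmp.dir, which defaults to a directory under /tmp, so 
pointing it at a partition with enough space is one way to avoid this. A 
sketch with an assumed path:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop-${user.name}</value>
</property>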

-Sean


Re: Cleaning up files in HDFS?

2008-11-14 Thread lohit
Have you tried fs.trash.interval?

property
  namefs.trash.interval/name
  value0/value
  descriptionNumber of minutes between trash checkpoints.
  If zero, the trash feature is disabled.
  /description
/property

more info about trash feature here.
http://hadoop.apache.org/core/docs/current/hdfs_design.html
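
For example (a sketch with an assumed retention period), setting the interval 
to 1440 minutes keeps deleted files under each user's .Trash directory for at 
least a day before they are permanently removed:

<property>
  <name>fs.trash.interval</name>
  <value>1440</value>
</property>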


Thanks,
Lohit




S3 fs -ls is returning incorrect data

2008-11-14 Thread Josh Ferguson
Hi all, I'm pretty new to Hadoop. I was testing out using S3 as a
backend store for a few things, and I am having a problem with a few
of the filesystem commands. I've been looking around the web for a
while and couldn't sort it out, but I'm sure it's just a newbie
problem.


A little information
1) I'm using version 0.18.2 on java 1.5.0_16 (the latest for OSX 10.4).
2) -get and -put work just fine and I can transfer data in and out  
and run basic map-reduce tasks with it.


The error is as follows:

$ bin/hadoop fs -ls /
Found 2 items
drwxrwxrwx   - ls: -0s
Usage: java FsShell [-ls path]

If I use an absolute path (ie: s3://mybucket/) I get the same problem.

Any help is appreciated thanks.

Josh
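
For reference, the kind of hadoop-site.xml configuration assumed when using S3 
as the default store looks something like the following (the bucket name and 
keys are placeholders; the credentials can also be embedded in the s3:// URI 
itself):

<property>
  <name>fs.default.name</name>
  <value>s3://YOUR-BUCKET</value>
</property>
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YOUR-AWS-ACCESS-KEY-ID</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YOUR-AWS-SECRET-ACCESS-KEY</value>
</property>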