Re: :!

2009-08-03 Thread A BlueCoder
unsubscribe

On Mon, Aug 3, 2009 at 12:01 AM, Sugandha Naolekar
sugandha@gmail.com wrote:

 That's fine. But if I place the data in HDFS and then run MapReduce code to
 provide compression, then the data will get compressed into sequence files,
 but even the original data will reside in the memory, thereby causing a kind
 of redundancy of data...

 Can you please suggest a way out?

 On Mon, Aug 3, 2009 at 12:07 PM, prashant ullegaddi 
 prashullega...@gmail.com wrote:

  I don't think you will be able to compress the data unless it's on HDFS.
  What you can do is:
  1. Manually compress the data on the machine where the data resides, then
  copy it to HDFS; or
  2. Copy the data to HDFS without compressing, then run a job which just
  emits the data as it reads it in key/value pairs. You can set
  FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class) so
  that the output gets gzipped.
 
  Does that solve your problem?
 
  By the way, you didn't exactly specify your data size (how many TBs).
 
  On Mon, Aug 3, 2009 at 11:02 AM, Sugandha Naolekar
  sugandha@gmail.com wrote:
 
   Yes, you are right. Here are the related details:
  
   - I have a Hadoop cluster of 7 nodes. There is also an 8th machine, which
   is not a part of the Hadoop cluster.
   - I want to place the data of that machine into HDFS. Before placing it
   in HDFS, I want to compress it, and then dump it into HDFS.
   - I have 4 datanodes in my cluster; the data might grow to terabytes.
   - Also, I have set the replication factor to 2.
   - I guess, for compression, I will have to run MapReduce, right? Please
   tell me the complete approach that needs to be followed.
  
   On Mon, Aug 3, 2009 at 10:48 AM, prashant ullegaddi 
   prashullega...@gmail.com wrote:
  
     By "I want to compress the data first and then place it in HDFS", do you
     mean you want to compress the data locally and then copy it to DFS?
    
     What's the size of your data? What's the capacity of HDFS?
   
On Mon, Aug 3, 2009 at 10:45 AM, Sugandha Naolekar
 sugandha@gmail.com wrote:
   
  I want to compress the data first and then place it in HDFS. Again, while
  retrieving the same, I want to uncompress it and place it at the desired
  destination. Can this be possible? How do I get started? Also, I want to
  get started with the actual coding part of compression and MapReduce.
  Please advise me aptly...!



 --
 Regards!
 Sugandha

   
  
  
  
   --
   Regards!
   Sugandha
  
 



 --
 Regards!
 Sugandha



Re: :!

2009-08-03 Thread prashant ullegaddi
How files are written can be controlled. Maybe you are using
SequenceFileOutputFormat. You can setOutputFormat() to TextOutputFormat.

I guess this should solve your problem!
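
A minimal sketch of the identity copy-and-recompress job described in this
thread (reading the data as key/value pairs and writing it back as gzipped
text via TextOutputFormat), using the old org.apache.hadoop.mapred API; the
driver class name and the input/output paths are placeholders, not something
from the thread itself:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class CompressToGzippedText {                      // hypothetical driver class
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CompressToGzippedText.class);
    conf.setJobName("compress-to-gzipped-text");
    conf.setMapperClass(IdentityMapper.class);            // pass every record through unchanged
    conf.setReducerClass(IdentityReducer.class);
    conf.setOutputFormat(TextOutputFormat.class);         // plain text output instead of SequenceFiles
    FileOutputFormat.setCompressOutput(conf, true);
    FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);  // gzip each output file
    FileInputFormat.setInputPaths(conf, new Path(args[0]));   // e.g. the uncompressed data already in HDFS
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // e.g. a new directory for the gzipped copy
    JobClient.runJob(conf);
  }
}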

On Mon, Aug 3, 2009 at 12:31 PM, Sugandha Naolekar
sugandha@gmail.com wrote:

 That's fine. But if I place the data in HDFS and then run MapReduce code to
 provide compression, then the data will get compressed into sequence files,
 but even the original data will reside in the memory, thereby causing a kind
 of redundancy of data...

 Can you please suggest a way out?



RE: :!

2009-08-03 Thread Amogh Vasekar

Maybe I'm missing the point, but in terms of execution performance benefit,
what does copying to DFS and then compressing it via a map/reduce job
provide? Isn't it better to compress offline / outside the latency window and
make it available on DFS?
Also, your mapreduce program will launch one map task per compressed file, so
make sure you design your compression accordingly.

Thanks,
Amogh
-Original Message-
From: Sugandha Naolekar [mailto:sugandha@gmail.com] 
Sent: Monday, August 03, 2009 12:32 PM
To: common-user@hadoop.apache.org
Subject: Re: :!

That's fine. But if I place the data in HDFS and then run MapReduce code to
provide compression, then the data will get compressed into sequence files,
but even the original data will reside in the memory, thereby causing a kind
of redundancy of data...

Can you please suggest a way out?



Re: :!

2009-08-03 Thread Vibhooti Verma
In my opinion it is best to compress it outside and then copy it to HDFS. In
case you want to compress while copying the files to HDFS, you can make use
of GZIPOutputStream to open the file and write content to it. It will then
be compressed automatically.
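
A minimal sketch of that approach (not from the thread): open the target file
in HDFS with the FileSystem API, wrap the stream in a GZIPOutputStream, and
copy the local bytes through it. The class name and paths are placeholders.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.GZIPOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class GzipIntoHdfs {                               // hypothetical helper class
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // args[0] = local source file, args[1] = target path in HDFS (e.g. /user/sugandha/data.gz)
    InputStream in = new BufferedInputStream(new FileInputStream(args[0]));
    OutputStream out = new GZIPOutputStream(fs.create(new Path(args[1])));
    IOUtils.copyBytes(in, out, 4096, true);               // copies the bytes and closes both streams
  }
}

Reading the file back would be the mirror image: wrap fs.open(path) in a
java.util.zip.GZIPInputStream and copy to the local destination.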


On Mon, Aug 3, 2009 at 12:48 PM, Amogh Vasekar am...@yahoo-inc.com wrote:


 Maybe I'm missing the point, but in terms of execution performance benefit,
 what does copying to dfs and then compressing to be fed to a map/reduce job
 provide? Isn't it better to compress offline / outside latency window and
 make available on dfs?
 Also, your mapreduce program will launch one map task per compressed file,
 so make sure you design your compression accordingly.

 Thanks,
 Amogh




-- 
cheers,
Vibhooti


Re: MapFile performance

2009-08-03 Thread Tom White
On Mon, Aug 3, 2009 at 3:09 AM, Billy Pearson billy_pear...@sbcglobal.net wrote:


 Not sure if it's still there, but there was a param in the hadoop-site conf
 file that would allow you to skip x number of index entries when reading it
 into memory.

This is io.map.index.skip (default 0), which will skip this number of
keys for every key in the index. For example, if set to 2, one third
of the keys will end up in memory.
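
For reference, a small sketch (not from the thread) of setting that parameter
programmatically before opening a reader; the path is a placeholder and the
snippet is assumed to sit in a method that may throw IOException:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;

Configuration conf = new Configuration();
conf.setInt("io.map.index.skip", 2);                       // keep roughly 1 of every 3 index keys in memory
FileSystem fs = FileSystem.get(conf);
MapFile.Reader reader = new MapFile.Reader(fs, "/data/my-mapfile", conf);  // placeholder path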

 From what I understand, we can find the key offset just before the data and
 seek once and read until we find the key.

 Billy


 - Original Message - From: Andy Liu
 andyliu1227-re5jqeeqqe8avxtiumw...@public.gmane.org
 Newsgroups: gmane.comp.jakarta.lucene.hadoop.user
 To: core-user-7ArZoLwFLBtd/SJB6HiN2Ni2O/jbr...@public.gmane.org
 Sent: Tuesday, July 28, 2009 7:53 AM
 Subject: MapFile performance


 I have a bunch of Map/Reduce jobs that process documents and writes the
 results out to a few MapFiles.  These MapFiles are subsequently searched
 in
 an interactive application.

 One problem I'm running into is that if the values in the MapFile data file
 are fairly large, lookup can be slow. This is because the MapFile index
 only stores every 128th key by default (io.map.index.interval), and after
 the binary search it may have to scan/skip through up to 127 values (off of
 disk) before it finds the matching record. I've tried io.map.index.interval
 = 1, which brings average get() times from 1200ms to 200ms, but at the cost
 of memory during runtime, which is undesirable.

 One possible solution is to have the MapFile index store every single key,
 offset pair. Then MapFile.Reader, upon startup, would read every 128th key
 into memory. MapFile.Reader.get() would behave the same way except instead
 of seeking through the values SequenceFile it would seek through the index
 SequenceFile until it finds the matching record, and then it can seek to the
 corresponding offset in the values. I'm going off the assumption that it's
 much faster to scan through the index (small keys) than it is to scan
 through the values (large values).

 Or maybe the index can be some kind of disk-based btree or bdb-like
 implementation?

 Anybody encounter this problem before?

 Andy






Re: RE: No Space Left On Device though space is available

2009-08-03 Thread Mathias Herberts
no quota on the fs?

On Aug 3, 2009 7:13 AM, Palleti, Pallavi pallavi.pall...@corp.aol.com
wrote:

No. These are production jobs which were working pretty fine and
suddenly, we started seeing these issues. And, if you see the error log,
the jobs are failing at the time of submission itself while copying the
application jar. And, when I see the client machine disk size and also
HDFS, it is only 60% full.

Thanks
Pallavi

-Original Message- From: prashant ullegaddi [mailto:
prashullega...@gmail.com] Sent: Monday...


Some issues!

2009-08-03 Thread Sugandha Naolekar
I want to encrypt the data that will be placed in HDFS. So I will have to
use some kind of encryption algorithm, right?
Also, this encryption is to be done on the data before placing it in HDFS. How
can this be done? Are any special APIs available in Hadoop for this
purpose?

-- 
Regards!
Sugandha


Re: Some issues!

2009-08-03 Thread Steve Loughran

Sugandha Naolekar wrote:

I want to encrypt the data that would be placed in HDFS. So I will have to
use some kind of encryption algorithms, right?
Also, This encryption is to be done on data before placing it in HDFS. How
this can be done? Any special API's available in HADOOP for the above
purpose?



1. Can I point you to the "how to ask questions" article, which
emphasises the value of having meaningful titles:

http://catb.org/~esr/faqs/smart-questions.html

2. There are no encryption layers in Hadoop - not much of any security, in fact.
javax.crypto is what you have to play with.
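
A rough sketch of that do-it-yourself route (this is plain javax.crypto wrapped
around an HDFS output stream, not a Hadoop API); the key, cipher choice, class
name and paths are illustrative only:

import java.io.FileInputStream;
import java.io.OutputStream;
import javax.crypto.Cipher;
import javax.crypto.CipherOutputStream;
import javax.crypto.spec.SecretKeySpec;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class EncryptIntoHdfs {                                 // hypothetical helper class
  public static void main(String[] args) throws Exception {
    byte[] rawKey = "0123456789abcdef".getBytes("UTF-8");      // 16-byte demo key only - never hard-code real keys
    Cipher cipher = Cipher.getInstance("AES");
    cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(rawKey, "AES"));

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // args[0] = local plaintext file, args[1] = encrypted target path in HDFS
    OutputStream out = new CipherOutputStream(fs.create(new Path(args[1])), cipher);
    IOUtils.copyBytes(new FileInputStream(args[0]), out, 4096, true);  // copy and close both streams
    // Reading back: wrap fs.open(new Path(args[1])) in a CipherInputStream initialised for DECRYPT_MODE.
  }
}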






Re: Some issues!

2009-08-03 Thread Sugandha Naolekar
I am very sorry for the inconvenience caused. From next time, I will take care
to ask my questions in a precise manner.

On Mon, Aug 3, 2009 at 3:58 PM, Steve Loughran ste...@apache.org wrote:

 Sugandha Naolekar wrote:

 I want to encrypt the data that would be placed in HDFS. So I will have to
 use some kind of encryption algorithms, right?
 Also, This encryption is to be done on data before placing it in HDFS. How
 this can be done? Any special API's available in HADOOP for the above
 purpose?


 1. Can I point you to the how to ask questions article, which emphasises
 the value in having meaningful titles
 http://catb.org/~esr/faqs/smart-questions.html

 2. no encryption layers in Hadoop -not much of any security, in fact.
 javax.crypto is what you have to play with






-- 
Regards!
Sugandha


Compression related issues..!

2009-08-03 Thread Sugandha Naolekar
Hello!

I want to know - what's the difference between zipping a file (compressing it)
and actually implementing compression algorithms for compressing some sort
of data?

How much difference does it make, and which one is preferable?

I want to compress data to be placed in HDFS.

Thanking You,

-- 
Regards!
Sugandha


Re: Counting no. of keys.

2009-08-03 Thread Enis Soztutar

prashant ullegaddi wrote:

Hi,

I have, say, 800 sequence files written using SequenceFileOutputFormat. Is there
any way to know the number of unique keys in those sequence files?

Thanks,
Prashant.

  
You can use the "map output records" and "reduce output records" counters
for this. If you can guarantee that every output key from reduce is unique,
then the "reduce output records" counter is what you're looking for. If you're
not using the reduce phase, then use "map output records".


Re: Counting no. of keys.

2009-08-03 Thread Zhong Wang
I have the same question, but I want to use the map record count in the
reduce phase right after the map. This is very useful in solving
problems like TF-IDF: in the reduce (IDF-calculating) phase, you must know
the total number of all documents. Is there any method to solve the
problem without running two Map-Reduce jobs?

On Sun, Aug 2, 2009 at 2:08 PM, Ted Dunning ted.dunn...@gmail.com wrote:
 Sure.  Write a word count map-reduce program.  The mapper outputs the key
 from the sequence file as the output key and includes a count.  Then you do
 the normal combiner and reducer from a normal word count program.

 On Sat, Aug 1, 2009 at 9:53 PM, prashant ullegaddi prashullega...@gmail.com
 wrote:

 Hi,

 I've say 800 sequence files written using SequenceFileOutputFormat. Is
 there
 any way to know
 no. of unique keys in those sequence files?

 Thanks,
 Prashant.




 --
 Ted Dunning, CTO
 DeepDyve




-- 
Zhong Wang
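
A minimal sketch of the word-count-style job Ted describes (not code from the
thread), assuming - purely for illustration - that the sequence files hold Text
keys and Text values:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;

public class UniqueKeyCount {
  public static class KeyMapper extends MapReduceBase
      implements Mapper<Text, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    public void map(Text key, Text value,
                    OutputCollector<Text, LongWritable> out, Reporter r) throws IOException {
      out.collect(key, ONE);                        // emit each key with a count of 1
    }
  }

  public static class KeySumReducer extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text key, Iterator<LongWritable> values,
                       OutputCollector<Text, LongWritable> out, Reporter r) throws IOException {
      long sum = 0;
      while (values.hasNext()) sum += values.next().get();
      out.collect(key, new LongWritable(sum));      // one output record per distinct key
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(UniqueKeyCount.class);
    conf.setJobName("unique-key-count");
    conf.setInputFormat(SequenceFileInputFormat.class);
    conf.setMapperClass(KeyMapper.class);
    conf.setCombinerClass(KeySumReducer.class);
    conf.setReducerClass(KeySumReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

The number of distinct keys is then the job's "reduce output records" counter
(or, equivalently, the number of lines in the job's output).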


Re: Difference between Killed Task Attempts and Killed Tasks

2009-08-03 Thread Enis Soztutar

Hi,

A task attempt is an attempt to execute a task. At any given time, one or
more (speculative exec.) task attempts can be running. For a task,
there can be many attempts at different nodes. A task is complete if any
of its attempts is complete. For a task to be marked as failed, all of its
mapred.map.max.attempts attempts must fail. For every task in the job, a TaskID
is assigned. For every attempt, a TaskAttemptID is assigned (which ends
with _0, _1, etc.).


Harish Mallipeddi wrote:

Hi,

Can anyone tell me what's the difference between "Killed Task Attempts" and
"Killed Tasks"? I ran a big job (14820 maps and 0 reduces). On the
job-details page, the web GUI reports 62 killed task attempts; I'm
assuming this is due to speculative execution. Now when I go to the
job-history page for the job, it reports 54 killed tasks (and 14820
successful map tasks, as expected).

A few questions:

* Why 62 killed task attempts vs 54 killed tasks?
* Under speculative execution, does hadoop launch a new MapTask with new
task-id or does it just launch a new MapTaskAttempt with a new
task-attempt-id?
* When a MapTaskAttempt fails, and when hadoop tries to re-launch the
MapTask, does it create a new task-id or just a new task-attempt-id?
* Does 'mapred.map.max.attempts' include all attempts launched due to
speculative-execution?

Btw this job is basically a trivial no-op job - it just scans around 1TB of
data and does nothing else in the map. I looked at the killed tasks' syslog
output and I didn't see any errors.

  




Re: Difference between Killed Task Attempts and Killed Tasks

2009-08-03 Thread Harish Mallipeddi
Agreed. But how did I manage to get 54 killed tasks vs 62 killed
task-attempts? I understand what a failed task is (a task for which
'mapred.map.max.attempts' attempts have failed). But what's a killed task?

On Mon, Aug 3, 2009 at 6:41 PM, Enis Soztutar enis@gmail.com wrote:

 Hi,

 Task attempt is an attempt to a task. At any given time, one or
 more(speculative exec.) of task attempts can be running. For a task, there
 can be many attempts at different nodes. A task is complete if any of its
 attempts is complete.  For a task to be marked as failed all of
 mapred.map.max.attempts should fail. For every task in the job, a TaskID is
 assigned. For every attempt, a TaskAttemptID is assigned (which ends with
 _0, _1, etc).




-- 
Harish Mallipeddi
http://blog.poundbang.in


Re: Status of 0.19.2

2009-08-03 Thread Tom White
I've now updated the news section, and the documentation on the
website to reflect the 0.19.2 release.

There were several reports of it being more stable than 0.19.1 in the
voting thread: 
http://www.mail-archive.com/common-...@hadoop.apache.org/msg00051.html

Cheers,
Tom

On Tue, Jul 28, 2009 at 12:37 PM, Tamir Kamara tamirkam...@gmail.com wrote:

 Hi,

 I've seen that the 0.19.2 version was added recently to the downloads but
 there's no entry under the news section.
 Is it stable enough for deployment?

 Thanks,
 Tamir


FYI X-RIME: Hadoop based large scale social network analysis released

2009-08-03 Thread Bin Cai
X-RIME (http://xrime.sourceforge.net/): Hadoop based large scale social
network analysis

Motivation
Today's telecom service providers and Internet-based social network sites
possess huge user communities. They hold large amount of data about their
users and want to generate core competency from the data. A key enabler for
this is a cost efficient solution for social data management and social
network analysis (SNA).

Such a solution faces a few challenges. The most important one is that the
solution should be able to handle massive and heterogeneous data sets.
Facing this challenge, the traditional data warehouse based solutions are
usually not cost efficient enough. On the other hand, existing SNA tools are
mostly used in single workstation mode, and not scalable enough. To this
end, low cost and highly scalable data management and processing
technologies from cloud computing society should be brought in to help.

However, most existing cloud-based data analysis solutions are trying to
provide SQL-like general purpose query languages, and do not directly
support social network analysis. This makes them hard to optimize and hard
to use for SNA users. So, we came up with X-RIME to fix this gap.

So, briefly speaking, X-RIME wants to provide a few value-added layers on
top of existing cloud infrastructure, to support smart decision loops based
on massive data sets and SNA. To end users, X-RIME is a library consisting of
Map-Reduce programs, which are used to do raw data pre-processing,
transformation, SNA metrics and structures calculation, and graph / network
visualization. The library could be integrated with other Hadoop based data
warehouses (e.g., HIVE) to build more comprehensive solutions.

Currently Supported SNA Metrics and Structures
vertex degree statistics
weakly connected components (WCC)
strongly connected components (SCC)
bi-connected components (BCC)
ego-centric density
breadth first search / single source shortest path (BFS/SSSP)
K-core
maximal cliques
pagerank
hyperlink-induced topic search (HITS)
minimal spanning tree (MST)


Job status task attempt 120%:

2009-08-03 Thread joerg . schad
Hi, 
when I check my running jobs via the jobtracker web interface I see that one 
task attempt is at 120%.
Is there a logical explanation?
Thanks



Task process exit with nonzero status of 255

2009-08-03 Thread Mathias De Maré
I'm getting a rather cryptic error while running a Map job with
MultithreadedMapper (no idea if it has anything to do with the
MultithreadedMapper).
It only occurs sometimes, occurs at different times during the Map
(sometimes at the start, sometimes at a random location), and it doesn't
really give any information.

Task Id : attempt_200908031207_0009_m_00_0, Status : FAILED
java.io.IOException: Task process exit with nonzero status of 255.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)

I'm a bit at a loss. The only results I'm finding from Google are a few
other people who have had the same problem, but nobody with a solution.

Mathias De Maré


Re: Too many open files error, which gets resolved after some time

2009-08-03 Thread Raghu Angadi
For writes, there is an extra thread waiting on i/o. So it would be 3 
fds more. To simplify earlier equation, on the client side :


for writes :  max fds (for io bound load) = 7 * #write_streams
for reads  :  max fds (for io bound load) = 4 * #read_streams

The main socket is cleared as soon as you close the stream.
The rest of fds stay for 10 sec (they get reused if you open more 
streams meanwhile).


I hope this is clear enough.

Raghu.

Stas Oskin wrote:

Hi.

I'd like to raise this issue once again, just to clarify a point.

If I have only one thread writing to HDFS, the amount of fd's should be 4,
resulting from:

1) input
2) output
3) epoll
4) stream itself

And these 4 fds should be cleared out after 10 seconds.

Is this correct?

Thanks in advance for the information!

2009/6/24 Stas Oskin stas.os...@gmail.com


Hi.

So if I open one stream, it should be 4?



2009/6/23 Raghu Angadi rang...@yahoo-inc.com


how many threads do you have? Number of active threads is very important.
Normally,

#fds = (3 * #threads_blocked_on_io) + #streams

12 per stream is certainly way off.

Raghu.


Stas Oskin wrote:


Hi.

In my case it was actually ~ 12 fd's per stream, which included pipes and
epolls.

Could it be that HDFS opens 3 x 3 (input - output - epoll) fds per each
thread, which makes it close to the number I mentioned? Or is it always 3 at
maximum per thread / stream?

Up to 10 sec looks like the correct number; it seems it gets freed around
this time indeed.

Regards.

2009/6/23 Raghu Angadi rang...@yahoo-inc.com

 To be more accurate, once you have HADOOP-4346,

fds for epoll and pipes = 3 * threads blocked on Hadoop I/O

Unless you have hundreds of threads at a time, you should not see
hundreds
of these. These fds stay up to 10sec even after the
threads exit.

I am a bit confused about your exact situation. Please check number of
threads if you still facing the problem.

Raghu.


Raghu Angadi wrote:

 Since you have HADOOP-4346, you should not have excessive epoll/pipe fds
open. First of all, do you still have the problem? If yes, how many Hadoop
streams do you have at a time?

System.gc() won't help if you have HADOOP-4346.

Raghu.

 Thanks for your opinion!


2009/6/22 Stas Oskin stas.os...@gmail.com

 Ok, seems this issue is already patched in the Hadoop distro I'm using
(Cloudera).

Any idea if I still should call GC manually/periodically to clean out all
the stale pipes / epolls?

2009/6/22 Steve Loughran ste...@apache.org

 Stas Oskin wrote:


 Hi.

 So what would be the recommended approach for the pre-0.20.x series?

To ensure each file is used only by one thread, and that it is then safe to
close the handle in that thread?

Regards.

 Good question - I'm not sure. For anything you get with FileSystem.get(),
it's now dangerous to close, so try just setting the reference to null and
hoping that GC will do the finalize() when needed.









Re: :!

2009-08-03 Thread Brian Bockelman

Hey Sugandha,

It's a common mistake - I think he was trying to unsubscribe from the
mailing list (which is done by sending a message to a specific email
address with the command "unsubscribe"), not telling you to unsubscribe.


Brian

On Aug 3, 2009, at 2:09 AM, Sugandha Naolekar wrote:

This is ridiculous. What do you mean by unsubscribe? I have a few
queries, and that's why I have logged in to the corresponding forum.

On Mon, Aug 3, 2009 at 12:33 PM, A BlueCoder  
bluecoder...@gmail.com wrote:



unsubscribe





Re: Task process exit with nonzero status of 255

2009-08-03 Thread Mathias De Maré
Thanks! Because of your information, I managed to find out that my crashes
have to do with the somewhat standard 'Too many open files'.
I added some 'close' and 'disconnect' to my InputStreams and
HttpURLConnections, but that only works up to a certain point.
The strange thing is that I do execute 'ulimit -n 32768', but I'm thinking
it somehow 'won't' stick.
The other alternative is that Hadoop is using 32768 file descriptors, but
that seems a bit over the top, as I'm using at most 200 threads (each of
which only sets up one HttpURLConnection).

Suggestions are welcome :-)
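
For what it's worth, a sketch of the cleanup pattern described above (a made-up
fetch helper, not code from this thread): close the stream and disconnect the
connection in a finally block so each map thread releases its descriptors even
when the read fails.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class UrlFetcher {                                      // hypothetical helper class
  public static byte[] fetch(URL url) throws IOException {
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    InputStream in = null;
    try {
      in = conn.getInputStream();
      ByteArrayOutputStream buf = new ByteArrayOutputStream();
      byte[] chunk = new byte[4096];
      for (int n; (n = in.read(chunk)) != -1; ) {
        buf.write(chunk, 0, n);
      }
      return buf.toByteArray();
    } finally {
      if (in != null) {
        try { in.close(); } catch (IOException ignored) { }    // always release the stream's descriptor
      }
      conn.disconnect();                                       // signal that the connection is no longer needed
    }
  }
}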

On Mon, Aug 3, 2009 at 3:58 PM, Jason Venner jason.had...@gmail.com wrote:

 That generally means that the process that was running the task crashed.
 The actual map/reduce task is run in a separate JVM by the task tracker,
 and that JVM is exiting abnormally.
 This used to happen to my jobs quite a bit when they were using a buggy
 native library via jni.

 If you are trying to use the colorspace transforms via the java imaging
 APIs, it is not thread safe (at least through 1.6.10 under linux).

 There may be additional information available in the per task logs.





 --
 Pro Hadoop, a book to guide you from beginner to hadoop mastery,
 http://www.amazon.com/dp/1430219424?tag=jewlerymall
 www.prohadoopbook.com a community for Hadoop Professionals



namenode -upgrade problem

2009-08-03 Thread bharath vissapragada
Hi all ,

I have noticed a problem in my cluster when I changed the Hadoop version
on the same DFS directory. The namenode log on the master says the
following:


File system image contains an old layout version -16.
An upgrade to version -18 is required.
Please restart NameNode with -upgrade option.
at
org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:312)
at
org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:309)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:288)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:163)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:208)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:194)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:859)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:868)
2009-08-04 00:27:51,498 INFO org.apache.hadoop.ipc.Server: Stopping server
on 54310
2009-08-04 00:27:51,498 ERROR
org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.IOException:
File system image contains an old layout version -16.
An upgrade to version -18 is required.
Please restart NameNode with -upgrade option.
at
org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:312)
at
org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:309)
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.<init>(FSNamesystem.java:288)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:163)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:208)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:194)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:859)
at
org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:868)

2009-08-04 00:27:51,499 INFO
org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG

Can anyone explain the reason? I googled it, but those explanations
weren't very useful.

Thanks
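
The log itself names the fix: the on-disk image was written by an older release,
so the namenode has to be restarted once with the -upgrade option, e.g.
(assuming the standard scripts):

$ bin/start-dfs.sh -upgrade

and, once the cluster has been verified, the upgrade can be made permanent with
bin/hadoop dfsadmin -finalizeUpgrade.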


Re: Too many open files error, which gets resolved after some time

2009-08-03 Thread Raghu Angadi

Stas Oskin wrote:

Hi.

Thanks for the explanation.

Just to clarify, does the extra thread waiting on writes happen in
multi-threading as well?

Meaning if I have 10 writing threads, for example, would it actually be 70
fds?


unfortunately, yes.

There are different proposals to fix this : async I/O in Hadoop, RPCs 
for data transfers.


It is not just the fds, the applications that hit fd limits hit thread 
limits as well. Obviously Hadoop can not sustain this as the range of 
applications increases. It will be fixed one way or the other.


Raghu.


Regards.




Problem getting scheduler to work.

2009-08-03 Thread Erik Holstad
Hi!
I'm testing out the FairScheduler. I'm getting it to start, and the pools
that I've defined in the pools.xml file show up and everything.
But when trying to submit a job, I don't really know where to put the name
of the pool to use for the job. All the examples that I've seen are
using JobConf and I'm currently on 0.20. I tried to put the name on the
Configuration like:

conf.set("mapred.job.queue.name", "fast");

but am just getting

org.apache.hadoop.ipc.RemoteException: java.io.IOException: Queue "fast"
does not exist

So, how and where do I set the pool to use for the individual jobs?
Erik
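
For what it's worth, the error suggests mapred.job.queue.name is being checked
against the JobTracker's queue list (mapred.queue.names) rather than against
fair-scheduler pools. A sketch of the usual fair-scheduler wiring, assuming the
cluster's mapred-site.xml sets mapred.fairscheduler.poolnameproperty to
pool.name (its default is user.name); the job name and pool are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

// Client side, inside a driver's main(String[] args) throws Exception:
Configuration conf = new Configuration();
conf.set("pool.name", "fast");            // "pool.name" is whatever poolnameproperty points at
Job job = new Job(conf, "my fast job");
// ... set mapper/reducer/input/output as usual, then job.waitForCompletion(true);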


Re: Too many open files error, which gets resolved after some time

2009-08-03 Thread Stas Oskin
Hi Raghu.

Thanks for the clarification and for explaining the potential issue.

It is not just the fds, the applications that hit fd limits hit thread
 limits as well. Obviously Hadoop can not sustain this as the range of
 applications increases. It will be fixed one way or the other.


Can you please clarify the thread limit matter?

AFAIK it only happens if the allocated stack is too large, and we are speaking
about thousands of threads (a possible solution is described here:
http://candrews.integralblue.com/2009/01/preventing-outofmemoryerror-native-thread/
).

So how is it tied to fds?

Thanks.


Hadoop BootCamp in Berlin Aug 27, 28th (reminder)

2009-08-03 Thread Chris K Wensel


Hi all,

A quick reminder that Scale Unlimited will run a 2 day Hadoop BootCamp  
in Berlin on August 27th and 28th.


This 2 day course is for managers and developers who want to quickly  
become experienced with Hadoop and related technologies.


The BootCamp provides training in MapReduce Theory, Hadoop  
Architecture, configuration, and API's through our hands-on labs.


All our courses are taught by practitioners with years of Hadoop and  
related experience in large data architectures.


** Professional independent consultants may take this course for free,  
please email i...@scaleunlimited.com to inquire.

http://www.scaleunlimited.com/courses/programs

Detailed information and registration information is at:

  http://www.scaleunlimited.com/courses/berlin08 (german) or
  http://www.scaleunlimited.com/courses/hadoop-boot-camp-berlin-en  
(english)


cheers,
chris

P.S Apologies for the cross posting.
P.P.S. Please spread the word!

~~~
Hadoop training and consulting
http://www.scaleunlimited.com


Problem with starting Hadoop in Pseudo Distributed Mode

2009-08-03 Thread Onur AKTAS

Hi,

I'm having trouble running Hadoop on RHEL 5. I did everything as
documented in:
http://hadoop.apache.org/common/docs/r0.20.0/quickstart.html

And configured:
conf/core-site.xml, conf/hdfs-site.xml, 
conf/mapred-site.xml.

Connected to localhost with ssh (did passphrase stuff etc.), then I did the 
following:

$ bin/hadoop namenode -format
$ bin/start-all.sh 
starting namenode, logging to 
/hda3/ps/hadoop-0.20.0/bin/../logs/hadoop-oracle-namenode-localhost.localdomain.out
localhost: starting datanode, logging to 
/hda3/ps/hadoop-0.20.0/bin/../logs/hadoop-oracle-datanode-localhost.localdomain.out
localhost: starting secondarynamenode, logging to 
/hda3/ps/hadoop-0.20.0/bin/../logs/hadoop-oracle-secondarynamenode-localhost.localdomain.out
starting jobtracker, logging to 
/hda3/ps/hadoop-0.20.0/bin/../logs/hadoop-oracle-jobtracker-localhost.localdomain.out
localhost: starting tasktracker, logging to 
/hda3/ps/hadoop-0.20.0/bin/../logs/hadoop-oracle-tasktracker-localhost.localdomain.out

Everything seems OK, but when I check the Hadoop logs I see many errors (and
they all cause HBase connection problems).
How can I solve this problem? Here are the logs:

 hadoop-oracle-datanode-localhost.localdomain.log:
2009-08-04 02:54:28,971 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
STARTUP_MSG: 
/
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = localhost.localdomain/127.0.0.1
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.0
STARTUP_MSG:   build = 
https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.20 -r 763504; 
compiled by 'ndaley' on Thu Apr  9 05:18:40 UTC 2009
/
2009-08-04 02:54:29,562 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: 
java.io.IOException: Incompatible namespaceIDs in /tmp/hadoop-oracle/dfs/data: 
namenode namespaceID = 36527197; datanode namespaceID = 2138759529
at 
org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:233)
at 
org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:148)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:298)
at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:216)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
at 
org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)

2009-08-04 02:54:29,563 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
SHUTDOWN_MSG: 
/
SHUTDOWN_MSG: Shutting down DataNode at localhost.localdomain/127.0.0.1
/
--
hadoop-oracle-namenode-localhost.localdomain.log
2009-08-04 02:54:26,987 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
STARTUP_MSG: 
/
STARTUP_MSG: Starting NameNode
STARTUP_MSG:   host = localhost.localdomain/127.0.0.1
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.0
STARTUP_MSG:   build = 
https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.20 -r 763504; 
compiled by 'ndaley' on Thu Apr  9 05:18:40 UTC 2009
/
2009-08-04 02:54:27,116 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: 
Initializing RPC Metrics with hostName=NameNode, port=9000
2009-08-04 02:54:27,174 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: 
Namenode up at: localhost.localdomain/127.0.0.1:9000
2009-08-04 02:54:27,179 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
Initializing JVM Metrics with processName=NameNode, sessionId=null
2009-08-04 02:54:27,180 INFO 
org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing 
NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext
2009-08-04 02:54:27,278 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 
fsOwner=oracle,oinstall,root,dba,oper,asmadmin
2009-08-04 02:54:27,278 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
2009-08-04 02:54:27,278 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true
2009-08-04 02:54:27,294 INFO 
org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: 
Initializing FSNamesystemMetrics using context 
object:org.apache.hadoop.metrics.spi.NullContext
2009-08-04 02:54:27,297 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered 
FSNamesystemStatusMBean
2009-08-04 02:54:27,341 INFO org.apache.hadoop.hdfs.server.common.Storage: 
Number of files = 8
2009-08-04 

Re: Problem with starting Hadoop in Pseudo Distributed Mode

2009-08-03 Thread Amandeep Khurana
I'm assuming that you have no data in HDFS since it never came up... So, go
ahead and clean up the directory where you are storing the datanode's data
and the namenode's metadata. After that format the namenode and restart
hadoop.


2009/8/3 Onur AKTAS onur.ak...@live.com



Re: Problem with starting Hadoop in Pseudo Distributed Mode

2009-08-03 Thread Amandeep Khurana
Yes, you need to change these directories. The config goes in
hadoop-site.xml - or, in this case, separately in the three site xmls. See
the default xml for the syntax and property names.

On 8/3/09, Onur AKTAS onur.ak...@live.com wrote:

 Is it the directory that Hadoop uses?

 /tmp/hadoop-oracle
 /tmp/hadoop-oracle/dfs/
 /tmp/hadoop-oracle/mapred/

 If yes, how can I change the directory to somewhere else? I do not want it
 to be kept in the /tmp folder.


RE: Problem with starting Hadoop in Pseudo Distributed Mode

2009-08-03 Thread Onur AKTAS

There is no default.xml in Hadoop 0.20.0, but luckily I also have release
0.18.3 and found this:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/tmp/hadoop-${user.name}</value>
  <description>A base for other temporary directories.</description>
</property>

It seems /tmp/hadoop-${user.name} is a temporary directory, as the description
indicates; then where is the real directory?
I deleted the whole tmp directory and formatted again, started the server,
checked the logs, and still have the same errors.
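
In 0.20 the defaults ship inside the jars as core-default.xml, hdfs-default.xml
and mapred-default.xml, and overrides go into the matching conf/*-site.xml files.
A sketch of moving the storage out of /tmp (the path below is only an example)
would be to put something like this in conf/core-site.xml:

<property>
  <name>hadoop.tmp.dir</name>
  <value>/hda3/hadoop-data/tmp</value>
  <description>Base directory for HDFS and MapReduce local storage.</description>
</property>

dfs.name.dir and dfs.data.dir (in hdfs-site.xml) can likewise point the namenode
metadata and the datanode blocks at specific directories.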

   org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
   /
   SHUTDOWN_MSG: Shutting down DataNode at localhost.localdomain/127.0.0.1
   /
  
   --
   hadoop-oracle-namenode-localhost.localdomain.log
   2009-08-04 02:54:26,987 INFO
   org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
   

Re: Problem with starting Hadoop in Pseudo Distributed Mode

2009-08-03 Thread Amandeep Khurana
1. The default xmls are in $HADOOP_HOME/build/classes
2. You have to override the parameters and put them in the site.xml's so
you can have the data in some other directory and not /tmp

Do that and try starting hadoop.
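
For reference: in a 0.20 release tarball the defaults ship inside the core jar as
core-default.xml, hdfs-default.xml and mapred-default.xml. A minimal override along
these lines (the /data/hadoop path is only an illustration; use whatever local
directory suits the machine) goes into conf/core-site.xml:

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>

Since dfs.name.dir, dfs.data.dir and mapred.local.dir all default to subdirectories
of hadoop.tmp.dir, overriding this single property is usually enough to move
everything out of /tmp.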


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


2009/8/3 Onur AKTAS onur.ak...@live.com


 There is no default.xml in Hadoop 0.20.0, but luckily I also have release
 0.18.3 and found these:

 <property>
   <name>hadoop.tmp.dir</name>
   <value>/tmp/hadoop-${user.name}</value>
   <description>A base for other temporary directories.</description>
 </property>

 It seems /tmp/hadoop-${user.name} is a temporary directory, as the description
 indicates; where is the real directory, then?
 I deleted the whole tmp directory and formatted again, started the server,
 checked the logs, and still see the same errors.

  Date: Mon, 3 Aug 2009 17:29:52 -0700
  Subject: Re: Problem with starting Hadoop in Pseudo Distributed Mode
  From: ama...@gmail.com
  To: common-user@hadoop.apache.org
 
  Yes, you need to change these directories. The config is put in the
  hadoop-site.xml. Or in this case, separately in the 3 xmls. See the
  default xml for syntax and property name.
 
  On 8/3/09, Onur AKTAS onur.ak...@live.com wrote:
  
   Is it the directory that Hadoop uses?
  
   /tmp/hadoop-oracle
   /tmp/hadoop-oracle/dfs/
   /tmp/hadoop-oracle/mapred/
  
   If yes, how can I change the directory to anywhere else? I do not want
 it to
   be kept in /tmp folder.
  
   From: ama...@gmail.com
   Date: Mon, 3 Aug 2009 17:02:50 -0700
   Subject: Re: Problem with starting Hadoop in Pseudo Distributed Mode
   To: common-user@hadoop.apache.org
  
   I'm assuming that you have no data in HDFS since it never came up...
 So,
   go
   ahead and clean up the directory where you are storing the datanode's
 data
   and the namenode's metadata. After that format the namenode and
 restart
   hadoop.
  
  
   2009/8/3 Onur AKTAS onur.ak...@live.com
  
   
Hi,
   
I'm having troubles with running Hadoop in RHEL 5, I did everything
 as
documented in:
http://hadoop.apache.org/common/docs/r0.20.0/quickstart.html
   
And configured:
conf/core-site.xml, conf/hdfs-site.xml,
conf/mapred-site.xml.
   
Connected to localhost with ssh (did passphrase stuff etc.), then
 I
did
the following:
   
$ bin/hadoop namenode -format
$ bin/start-all.sh
starting namenode, logging to
   
 /hda3/ps/hadoop-0.20.0/bin/../logs/hadoop-oracle-namenode-localhost.localdomain.out
localhost: starting datanode, logging to
   
 /hda3/ps/hadoop-0.20.0/bin/../logs/hadoop-oracle-datanode-localhost.localdomain.out
localhost: starting secondarynamenode, logging to
   
 /hda3/ps/hadoop-0.20.0/bin/../logs/hadoop-oracle-secondarynamenode-localhost.localdomain.out
starting jobtracker, logging to
   
 /hda3/ps/hadoop-0.20.0/bin/../logs/hadoop-oracle-jobtracker-localhost.localdomain.out
localhost: starting tasktracker, logging to
   
 /hda3/ps/hadoop-0.20.0/bin/../logs/hadoop-oracle-tasktracker-localhost.localdomain.out
   
Everything seems ok, but when I check the Hadoop Logs I see many
 errors.
(and they all cause HBase connection problems.)
How can I solve this problem? Here are the Logs
   
 hadoop-oracle-datanode-localhost.localdomain.log:
2009-08-04 02:54:28,971 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
/
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = localhost.localdomain/127.0.0.1
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.20.0
STARTUP_MSG:   build =
https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.20-r
763504; compiled by 'ndaley' on Thu Apr  9 05:18:40 UTC 2009
/
2009-08-04 02:54:29,562 ERROR
org.apache.hadoop.hdfs.server.datanode.DataNode:
 java.io.IOException:
Incompatible namespaceIDs in /tmp/hadoop-oracle/dfs/data: namenode
namespaceID = 36527197; datanode namespaceID = 2138759529
   at
   
 org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:233)
   at
   
 org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:148)
   at
   
 org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:298)
   at
   
 org.apache.hadoop.hdfs.server.datanode.DataNode.init(DataNode.java:216)
   at
   
 org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
   at
   
 org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
   at
   
 org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
   at
   
 org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)
   
2009-08-04 02:54:29,563 INFO
org.apache.hadoop.hdfs.server.datanode.DataNode: 

Re: Problem with starting Hadoop in Pseudo Distributed Mode

2009-08-03 Thread Amandeep Khurana
No probs.

I hope you also got the data directory to point outside /tmp... If not, do
that too. Otherwise, when /tmp gets cleaned up, you'll lose your
data.
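
Since /tmp/hadoop-onur below is still under /tmp, the warning applies. A hedged
sketch of pinning the HDFS directories explicitly (the paths are only examples)
is to add to conf/hdfs-site.xml:

  <property>
    <name>dfs.name.dir</name>
    <value>/data/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data/hadoop/dfs/data</value>
  </property>

with mapred.local.dir set similarly in conf/mapred-site.xml; alternatively, pointing
hadoop.tmp.dir itself at a non-/tmp path covers all of them, since they default to
subdirectories of it. After moving dfs.name.dir you need to re-run
bin/hadoop namenode -format, or copy the old name directory over, before HDFS will start.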



Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


2009/8/3 Onur AKTAS onur.ak...@live.com


 Thank you very much!

 I added the property below to conf/core-site.xml and formatted the namenode again;
 it started without any problems, and I also started HBase and connected to it
 with a client!

 <property>
   <name>hadoop.tmp.dir</name>
   <value>/tmp/hadoop-onur</value>
   <description>A base for other temporary directories.</description>
 </property>

 Thank you again..

  From: ama...@gmail.com
  Date: Mon, 3 Aug 2009 17:48:24 -0700
  Subject: Re: Problem with starting Hadoop in Pseudo Distributed Mode
  To: common-user@hadoop.apache.org
 
  1. The default xmls are in $HADOOP_HOME/build/classes
  2. You have to override the parameters and put them in the site.xml's so
  you can have the data in some other directory and not /tmp
 
  Do that and try starting hadoop.
 
 
  Amandeep Khurana
  Computer Science Graduate Student
  University of California, Santa Cruz
 
 
  2009/8/3 Onur AKTAS onur.ak...@live.com
 
  
   There is no default.xml in Hadoop 0.20.0, but luckily I also have release
   0.18.3 and found these:
  
   <property>
     <name>hadoop.tmp.dir</name>
     <value>/tmp/hadoop-${user.name}</value>
     <description>A base for other temporary directories.</description>
   </property>
  
   It seems /tmp/hadoop-${user.name} is a temporary directory, as the description
   indicates; where is the real directory, then?
   I deleted the whole tmp directory and formatted again, started the server,
   checked the logs, and still see the same errors.
  
Date: Mon, 3 Aug 2009 17:29:52 -0700
Subject: Re: Problem with starting Hadoop in Pseudo Distributed Mode
From: ama...@gmail.com
To: common-user@hadoop.apache.org
   
Yes, you need to change these directories. The config is put in the
hadoop-site.xml. Or in this case, separately in the 3 xmls. See the
default xml for syntax and property name.
   
On 8/3/09, Onur AKTAS onur.ak...@live.com wrote:

 Is it the directory that Hadoop uses?

 /tmp/hadoop-oracle
 /tmp/hadoop-oracle/dfs/
 /tmp/hadoop-oracle/mapred/

 If yes, how can I change the directory to anywhere else? I do not
 want
   it to
 be kept in /tmp folder.

 From: ama...@gmail.com
 Date: Mon, 3 Aug 2009 17:02:50 -0700
 Subject: Re: Problem with starting Hadoop in Pseudo Distributed
 Mode
 To: common-user@hadoop.apache.org

 I'm assuming that you have no data in HDFS since it never came
 up...
   So,
 go
 ahead and clean up the directory where you are storing the
 datanode's
   data
 and the namenode's metadata. After that format the namenode and
   restart
 hadoop.


 2009/8/3 Onur AKTAS onur.ak...@live.com

 
  Hi,
 
  I'm having troubles with running Hadoop in RHEL 5, I did
 everything
   as
  documented in:
  http://hadoop.apache.org/common/docs/r0.20.0/quickstart.html
 
  And configured:
  conf/core-site.xml, conf/hdfs-site.xml,
  conf/mapred-site.xml.
 
  Connected to localhost with ssh (did passphrase stuff etc.),
 then
   I
  did
  the following:
 
  $ bin/hadoop namenode -format
  $ bin/start-all.sh
  starting namenode, logging to
 
  
 /hda3/ps/hadoop-0.20.0/bin/../logs/hadoop-oracle-namenode-localhost.localdomain.out
  localhost: starting datanode, logging to
 
  
 /hda3/ps/hadoop-0.20.0/bin/../logs/hadoop-oracle-datanode-localhost.localdomain.out
  localhost: starting secondarynamenode, logging to
 
  
 /hda3/ps/hadoop-0.20.0/bin/../logs/hadoop-oracle-secondarynamenode-localhost.localdomain.out
  starting jobtracker, logging to
 
  
 /hda3/ps/hadoop-0.20.0/bin/../logs/hadoop-oracle-jobtracker-localhost.localdomain.out
  localhost: starting tasktracker, logging to
 
  
 /hda3/ps/hadoop-0.20.0/bin/../logs/hadoop-oracle-tasktracker-localhost.localdomain.out
 
  Everything seems ok, but when I check the Hadoop Logs I see many
   errors.
  (and they all cause HBase connection problems.)
  How can I solve this problem? Here are the Logs
 
   hadoop-oracle-datanode-localhost.localdomain.log:
  2009-08-04 02:54:28,971 INFO
  org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
  /
  STARTUP_MSG: Starting DataNode
  STARTUP_MSG:   host = localhost.localdomain/127.0.0.1
  STARTUP_MSG:   args = []
  STARTUP_MSG:   version = 0.20.0
  STARTUP_MSG:   build =
 
 https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.20-r
  763504; compiled by 'ndaley' on Thu Apr  9 05:18:40 UTC 2009
  /
  2009-08-04 02:54:29,562 ERROR
 

Re: namenode -upgrade problem

2009-08-03 Thread bharath vissapragada
Todd, thanks for replying.

I stopped the cluster and issued the command

bin/hadoop namenode -upgrade, and I am getting this exception:

09/08/04 07:52:39 ERROR namenode.NameNode: java.net.BindException: Problem
binding to master/10.2.24.21:54310 : Address already in use
    at org.apache.hadoop.ipc.Server.bind(Server.java:171)
    at org.apache.hadoop.ipc.Server$Listener.<init>(Server.java:234)
    at org.apache.hadoop.ipc.Server.<init>(Server.java:960)
    at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:465)
    at org.apache.hadoop.ipc.RPC.getServer(RPC.java:427)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:153)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:208)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:194)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:859)
    at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:868)
Caused by: java.net.BindException: Address already in use
    at sun.nio.ch.Net.bind(Native Method)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:119)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59)
    at org.apache.hadoop.ipc.Server.bind(Server.java:169)
    ... 9 more

any clue?
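
"Address already in use" on master/10.2.24.21:54310 usually means a NameNode (or
some other process) is still bound to the NameNode port, so the -upgrade run cannot
bind it. A rough checklist before retrying (the grep pattern is just for this port;
adapt as needed):

  $ bin/stop-all.sh              # make sure no daemons from the old version are still up
  $ jps                          # should list no NameNode/DataNode/JobTracker processes
  $ netstat -tlnp | grep 54310   # confirm nothing is still listening on the NameNode port
  $ bin/hadoop namenode -upgrade # then retry the upgrade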

On Tue, Aug 4, 2009 at 12:51 AM, Todd Lipcon t...@cloudera.com wrote:

 On Mon, Aug 3, 2009 at 12:08 PM, bharath vissapragada 
 bharathvissapragada1...@gmail.com wrote:

  Hi all ,
 
  I have noticed a problem in my cluster when I changed the Hadoop version
  on the same DFS directory. The namenode log on the master says the
  following:
 
 
  File system image contains an old layout version -16.
  *An upgrade to version -18 is required.
  Please restart NameNode with -upgrade option.
  *


 See bolded text above -- you need to run namenode -upgrade to upgrade your
 metadata format to the current version.

 -Todd
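
As a sketch of the overall flow, using the scripts shipped in bin/ (verify your data
before the last step, since finalizing discards the pre-upgrade image and with it the
ability to roll back):

  $ bin/stop-all.sh                              # stop the daemons running the old version
  $ bin/start-dfs.sh -upgrade                    # start HDFS and upgrade the metadata layout
  $ bin/hadoop dfsadmin -upgradeProgress status  # check that the upgrade has completed
  $ bin/hadoop dfsadmin -finalizeUpgrade         # make the upgrade permanent once satisfied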

   at
 

 
 
 org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:312)
 at
 
 
 org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
 at
 
 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:309)
 at
 
 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:288)
 at
 
 
 org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:163)
 at
  org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:208)
 at
  org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:194)
 at
 
 
 org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:859)
 at
  org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:868)
  2009-08-04 00:27:51,498 INFO org.apache.hadoop.ipc.Server: Stopping
 server
  on 54310
  2009-08-04 00:27:51,498 ERROR
  org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.IOException:
  File system image contains an old layout version -16.
  An upgrade to version -18 is required.
  Please restart NameNode with -upgrade option.
 at
 
 
 org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:312)
 at
 
 
 org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
 at
 
 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:309)
 at
 
 
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:288)
 at
 
 
 org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:163)
 at
  org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:208)
 at
  org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:194)
 at
 
 
 org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:859)
 at
  org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:868)
 
  2009-08-04 00:27:51,499 INFO
  org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG
 
  Can anyone explain the reason? I googled it, but those explanations
  weren't very useful.
 
  Thanks
 



how to dump data from a mysql cluster to hdfs?

2009-08-03 Thread Min Zhou
hi all,

We need to dump data from a MySQL cluster with about 50 nodes to an HDFS
file. Because of security concerns, we can't use tools like Sqoop, where all
datanodes would have to hold a connection to MySQL. Any suggestions?
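
One pattern that keeps MySQL credentials off the datanodes (a sketch only; the host
name, user, database, table and HDFS path below are made up, and it assumes a single
gateway box that is allowed to reach MySQL and has the Hadoop client installed) is to
stream each dump through that one machine into HDFS:

  $ mysql --batch --quick -h db-node-01 -u report -p mydb \
      -e 'SELECT * FROM mytable' | hadoop fs -put - /data/mysql/db-node-01/mytable.tsv

Only the gateway connects to MySQL; the datanodes just receive blocks from the HDFS
client as usual. Looping over the 50 nodes and tables in a small shell script gives
one HDFS file per source.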


Thanks,
Min
-- 
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.

My profile:
http://www.linkedin.com/in/coderplay
My blog:
http://coderplay.javaeye.com


RE: Counting no. of keys.

2009-08-03 Thread Amogh Vasekar
Have you had a look at the Reporter counters Hadoop provides? I think they might
be helpful in your case: you can aggregate locally in each map task
and then push the total to a global counter.
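
A rough illustration of that pattern with the old (org.apache.hadoop.mapred) API;
the class name, counter enum and key/value types here are invented for the example,
not taken from the thread:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class RecordCountingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  /** Job-wide counter: every task's contribution is summed by the framework. */
  public enum Records { TOTAL }

  private long local = 0;          // per-task local aggregate
  private Reporter lastReporter;   // kept so close() can flush the remainder

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    lastReporter = reporter;
    local++;
    output.collect(value, new LongWritable(1));
  }

  @Override
  public void close() throws IOException {
    if (lastReporter != null) {
      // push the locally aggregated count to the global counter once per task
      lastReporter.incrCounter(Records.TOTAL, local);
    }
  }
}

After the job finishes, the client can read the total via
RunningJob.getCounters().getCounter(RecordCountingMapper.Records.TOTAL). Within a
single job, though, reduce tasks generally cannot read the job-wide aggregated
counters, which is why two passes (or a side file) remain the usual answer to
Zhong's question.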

-Original Message-
From: Zhong Wang [mailto:wangzhong@gmail.com] 
Sent: Monday, August 03, 2009 6:31 PM
To: common-user@hadoop.apache.org
Subject: Re: Counting no. of keys.

I have the same question, but I want to use the number of map records in the
reduce phase of the same job, right after the map. This is very useful for
problems like TF-IDF: in the reduce (IDF-calculating) phase, you must know
the total number of documents. Is there any way to solve the
problem without running two MapReduce jobs?

On Sun, Aug 2, 2009 at 2:08 PM, Ted Dunningted.dunn...@gmail.com wrote:
 Sure.  Write a word count map-reduce program.  The mapper outputs the key
 from the sequence file as the output key and includes a count.  Then you do
 the normal combiner and reducer from a normal word count program.

 On Sat, Aug 1, 2009 at 9:53 PM, prashant ullegaddi prashullega...@gmail.com
 wrote:

 Hi,

 I have, say, 800 sequence files written using SequenceFileOutputFormat. Is
 there any way to know the number of unique keys in those sequence files?

 Thanks,
 Prashant.




 --
 Ted Dunning, CTO
 DeepDyve




-- 
Zhong Wang
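
A skeleton of the job Ted describes above, in the old (org.apache.hadoop.mapred) API.
The class names are illustrative and the sequence-file key/value types are assumed to
be Text; adjust them to whatever types the files were actually written with:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class UniqueKeyCount {

  /** Emits (key, 1) for every record read from the sequence files. */
  public static class KeyMapper extends MapReduceBase
      implements Mapper<Text, Text, Text, LongWritable> {
    private static final LongWritable ONE = new LongWritable(1);
    public void map(Text key, Text value,
                    OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      out.collect(key, ONE);
    }
  }

  /** Standard word-count style sum; the same class doubles as the combiner. */
  public static class SumReducer extends MapReduceBase
      implements Reducer<Text, LongWritable, Text, LongWritable> {
    public void reduce(Text key, Iterator<LongWritable> values,
                       OutputCollector<Text, LongWritable> out, Reporter reporter)
        throws IOException {
      long sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      out.collect(key, new LongWritable(sum));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(UniqueKeyCount.class);
    conf.setJobName("unique-key-count");
    conf.setInputFormat(SequenceFileInputFormat.class);   // read the existing sequence files
    conf.setOutputFormat(TextOutputFormat.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(LongWritable.class);
    conf.setMapperClass(KeyMapper.class);
    conf.setCombinerClass(SumReducer.class);
    conf.setReducerClass(SumReducer.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

Each reduce output record is one distinct key with its count, so the number of unique
keys is simply the number of output lines, visible as the job's reduce output records
counter, or countable with hadoop fs -cat on the output if it is small.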