Re: :!
unsubscribe

On Mon, Aug 3, 2009 at 12:01 AM, Sugandha Naolekar sugandha@gmail.com wrote:
That's fine. But if I place the data in HDFS and then run MapReduce code to compress it, the data will get compressed into sequence files, yet the original data will still reside there as well, thereby causing a kind of redundancy of data. Can you please suggest a way out?

On Mon, Aug 3, 2009 at 12:07 PM, prashant ullegaddi prashullega...@gmail.com wrote:
I don't think you will be able to compress the data with a MapReduce job unless it is already on HDFS. What you can do is:
1. Manually compress the data on the machine where it resides, then copy it to HDFS; or
2. Copy the data to HDFS without compressing it, then run a job which just emits the data as it reads each key/value pair. You can set FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class) so that the output gets gzipped.
Does that solve your problem? By the way, you didn't specify your data size (how many TB).

On Mon, Aug 3, 2009 at 11:02 AM, Sugandha Naolekar sugandha@gmail.com wrote:
Yes, you are right. Here are the related details:
- I have a Hadoop cluster of 7 nodes. There is an 8th machine which is not part of the Hadoop cluster.
- I want to place the data from that machine into HDFS, and I want to compress it before dumping it into HDFS.
- I have 4 datanodes in my cluster, and the data might grow up to terabytes.
- I have set the replication factor to 2.
- I guess that for compression I will have to run MapReduce, right? Please tell me the complete approach to follow.

On Mon, Aug 3, 2009 at 10:48 AM, prashant ullegaddi prashullega...@gmail.com wrote:
By "I want to compress the data first and then place it in HDFS", do you mean you want to compress the data locally and then copy it to DFS? What's the size of your data? What's the capacity of HDFS?

On Mon, Aug 3, 2009 at 10:45 AM, Sugandha Naolekar sugandha@gmail.com wrote:
I want to compress the data first and then place it in HDFS. Again, while retrieving it, I want to uncompress it and place it at the desired destination. Is this possible? How do I get started? Also, I want to get started with the actual coding of compression and MapReduce. Please advise.

-- Regards! Sugandha
Re: :!
How files are written can be controlled. Maybe you are using SequenceFileOutputFormat; you can call setOutputFormat() with TextOutputFormat instead. I guess this should solve your problem!

On Mon, Aug 3, 2009 at 12:31 PM, Sugandha Naolekar sugandha@gmail.com wrote:
That's fine. But if I place the data in HDFS and then run MapReduce code to compress it, the data will get compressed into sequence files, yet the original data will still reside there as well, thereby causing a kind of redundancy of data. Can you please suggest a way out?

-- Regards! Sugandha
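For anyone who wants to try option 2 from the thread, a minimal sketch of such a pass-through job against the old (JobConf-era) mapred API used in 0.18-0.20 follows. The class name, the args-based paths, and the choice of plain text input are illustrative assumptions; with TextInputFormat the map key is the byte offset, so the copied lines come out prefixed with it.

  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.io.compress.GzipCodec;
  import org.apache.hadoop.mapred.FileInputFormat;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobClient;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.TextInputFormat;
  import org.apache.hadoop.mapred.TextOutputFormat;
  import org.apache.hadoop.mapred.lib.IdentityMapper;

  public class GzipCopyJob {
    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(GzipCopyJob.class);
      conf.setJobName("gzip-copy");
      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);        // plain text output rather than sequence files
      conf.setOutputKeyClass(LongWritable.class);          // byte offset emitted by TextInputFormat
      conf.setOutputValueClass(Text.class);
      conf.setMapperClass(IdentityMapper.class);           // emit every record exactly as read
      conf.setNumReduceTasks(0);                           // map-only pass-through
      FileOutputFormat.setCompressOutput(conf, true);
      FileOutputFormat.setOutputCompressorClass(conf, GzipCodec.class);  // gzip each output file
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
    }
  }

Note that the uncompressed copy still has to be deleted from HDFS afterwards if the goal is to save space, which is exactly the redundancy Sugandha describes above.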
RE: :!
Maybe I'm missing the point, but in terms of execution performance, what benefit does copying to DFS and then compressing via a map/reduce job provide? Isn't it better to compress offline, outside the latency window, and then make the data available on DFS? Also, your MapReduce program will launch one map task per compressed file, so make sure you design your compression accordingly.

Thanks, Amogh

-----Original Message-----
From: Sugandha Naolekar [mailto:sugandha@gmail.com]
Sent: Monday, August 03, 2009 12:32 PM
To: common-user@hadoop.apache.org
Subject: Re: :!

That's fine. But if I place the data in HDFS and then run MapReduce code to compress it, the data will get compressed into sequence files, yet the original data will still reside there as well, thereby causing a kind of redundancy of data. Can you please suggest a way out?

-- Regards! Sugandha
Re: :!
In my opinion it is best to compress the data outside and then copy it to HDFS. In case you want to compress while copying the files to HDFS, you can make use of GZIPOutputStream to open the file and write content to it; the data will be compressed automatically.

On Mon, Aug 3, 2009 at 12:48 PM, Amogh Vasekar am...@yahoo-inc.com wrote:
Maybe I'm missing the point, but in terms of execution performance, what benefit does copying to DFS and then compressing via a map/reduce job provide? Isn't it better to compress offline and make the data available on DFS? Also, your MapReduce program will launch one map task per compressed file, so make sure you design your compression accordingly.

-- cheers, Vibhooti
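A minimal sketch of that approach, assuming a local source path and an HDFS destination path passed on the command line; it wraps the HDFS output stream from FileSystem.create() in java.util.zip.GZIPOutputStream so the bytes are gzipped as they are written:

  import java.io.FileInputStream;
  import java.io.InputStream;
  import java.io.OutputStream;
  import java.util.zip.GZIPOutputStream;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class GzipWhileCopying {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();            // picks up the cluster's site xmls
      FileSystem fs = FileSystem.get(conf);
      InputStream in = new FileInputStream(args[0]);       // local, uncompressed source file
      OutputStream out = new GZIPOutputStream(fs.create(new Path(args[1])));  // e.g. /user/foo/data.gz
      byte[] buf = new byte[64 * 1024];
      int n;
      while ((n = in.read(buf)) > 0) {
        out.write(buf, 0, n);                              // compressed on the way into HDFS
      }
      out.close();                                         // writes the gzip trailer
      in.close();
    }
  }

As Amogh notes above, a later MapReduce job will use one map task per gzipped file, so it is worth splitting very large inputs into several such files.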
Re: MapFile performance
On Mon, Aug 3, 2009 at 3:09 AM, Billy Pearson billy_pear...@sbcglobal.net wrote:
Not sure if it's still there, but there was a param in the hadoop-site conf file that would allow you to skip x number of index entries when reading the index into memory.

This is io.map.index.skip (default 0), which will skip this number of keys for every key kept in the index. For example, if set to 2, one third of the keys will end up in memory. From what I understand, we can find the key offset just before the data, seek once, and read until we find the key.

Billy

----- Original Message -----
From: Andy Liu andyliu1227-re5jqeeqqe8avxtiumw...@public.gmane.org
Newsgroups: gmane.comp.jakarta.lucene.hadoop.user
To: core-user-7ArZoLwFLBtd/SJB6HiN2Ni2O/jbr...@public.gmane.org
Sent: Tuesday, July 28, 2009 7:53 AM
Subject: MapFile performance

I have a bunch of Map/Reduce jobs that process documents and write the results out to a few MapFiles. These MapFiles are subsequently searched in an interactive application.

One problem I'm running into is that if the values in the MapFile data file are fairly large, lookup can be slow. This is because the MapFile index only stores every 128th key by default (io.map.index.interval), and after the binary search it may have to scan/skip through up to 127 values (off of disk) before it finds the matching record.

I've tried io.map.index.interval = 1, which brings average get() times from 1200ms to 200ms, but at the cost of memory during runtime, which is undesirable.

One possible solution is to have the MapFile index store every single key/offset pair. Then MapFile.Reader, upon startup, would read every 128th key into memory. MapFile.Reader.get() would behave the same way, except that instead of seeking through the values SequenceFile it would seek through the index SequenceFile until it finds the matching record, and then it could seek to the corresponding offset in the values file. I'm going off the assumption that it's much faster to scan through the index (small keys) than through the values (large values). Or maybe the index could be some kind of disk-based btree or bdb-like implementation?

Has anybody encountered this problem before?

Andy
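For reference, both knobs discussed here are ordinary configuration settings: io.map.index.interval is consulted when the MapFile is written, io.map.index.skip when it is read. A rough sketch of the reader side, where the MapFile path and the Text key/value types are placeholder assumptions:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.io.MapFile;
  import org.apache.hadoop.io.Text;

  public class MapFileLookup {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Keep only every 3rd indexed key in memory (trades RAM for a longer scan per get()).
      conf.setInt("io.map.index.skip", 2);
      FileSystem fs = FileSystem.get(conf);
      MapFile.Reader reader = new MapFile.Reader(fs, "/data/docs.map", conf);  // hypothetical MapFile dir
      Text key = new Text("doc-42");
      Text value = new Text();
      if (reader.get(key, value) != null) {   // binary search the in-memory index, then scan to the key
        System.out.println(value);
      }
      reader.close();
    }
  }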
Re: RE: No Space Left On Device though space is available
no quota on the fs?

On Aug 3, 2009 7:13 AM, Palleti, Pallavi pallavi.pall...@corp.aol.com wrote:
No. These are production jobs which were working pretty fine, and suddenly we started seeing these issues. And if you look at the error log, the jobs are failing at submission time itself, while copying the application jar. And when I check the client machine's disk and HDFS, they are only 60% full.

Thanks, Pallavi

-----Original Message-----
From: prashant ullegaddi [mailto: prashullega...@gmail.com]
Sent: Monday...
Some issues!
I want to encrypt the data that will be placed in HDFS. So I will have to use some kind of encryption algorithm, right? Also, this encryption is to be done on the data before placing it in HDFS. How can this be done? Are any special APIs available in Hadoop for this purpose? -- Regards! Sugandha
Re: Some issues!
Sugandha Naolekar wrote:
I want to encrypt the data that will be placed in HDFS. So I will have to use some kind of encryption algorithm, right? Also, this encryption is to be done on the data before placing it in HDFS. How can this be done? Are any special APIs available in Hadoop for this purpose?

1. Can I point you to the "how to ask questions" article, which emphasises the value of meaningful subject lines: http://catb.org/~esr/faqs/smart-questions.html
2. There are no encryption layers in Hadoop - not much of any security, in fact. javax.crypto is what you have to play with.
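A minimal sketch of client-side encryption with javax.crypto before the bytes reach HDFS. The all-zero key, the plain "AES" cipher choice, and the paths are illustrative assumptions only; in practice the key must come from somewhere safe and a proper mode/IV should be chosen.

  import java.io.FileInputStream;
  import java.io.InputStream;
  import java.io.OutputStream;
  import javax.crypto.Cipher;
  import javax.crypto.CipherOutputStream;
  import javax.crypto.spec.SecretKeySpec;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class EncryptToHdfs {
    public static void main(String[] args) throws Exception {
      byte[] keyBytes = new byte[16];                      // placeholder 128-bit key; load a real one instead
      Cipher cipher = Cipher.getInstance("AES");
      cipher.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(keyBytes, "AES"));

      FileSystem fs = FileSystem.get(new Configuration());
      InputStream in = new FileInputStream(args[0]);       // local plaintext
      OutputStream out = new CipherOutputStream(fs.create(new Path(args[1])), cipher);
      byte[] buf = new byte[64 * 1024];
      int n;
      while ((n = in.read(buf)) > 0) {
        out.write(buf, 0, n);                              // only ciphertext is written to HDFS
      }
      out.close();
      in.close();
    }
  }

Reading back is symmetric: wrap fs.open() in a CipherInputStream initialized with Cipher.DECRYPT_MODE and the same key.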
Re: Some issues!
I am very sorry for the inconvenience caused. From next time, I will take care to ask questions in a precise manner.

On Mon, Aug 3, 2009 at 3:58 PM, Steve Loughran ste...@apache.org wrote:
1. Can I point you to the "how to ask questions" article, which emphasises the value of meaningful subject lines: http://catb.org/~esr/faqs/smart-questions.html
2. There are no encryption layers in Hadoop - not much of any security, in fact. javax.crypto is what you have to play with.

-- Regards! Sugandha
Compression related issues..!
Hello! I want to know: what's the difference between zipping a file (compressing it) and actually implementing compression algorithms to compress some sort of data? How much difference does it make, and which one is preferable? I want to compress data to be placed in HDFS. Thanking you, -- Regards! Sugandha
Re: Counting no. of keys.
prashant ullegaddi wrote:
Hi, I have, say, 800 sequence files written using SequenceFileOutputFormat. Is there any way to know the number of unique keys in those sequence files? Thanks, Prashant.

You can use the "Map output records" and "Reduce output records" counters for this. If you can guarantee that every output key from the reduce is unique, then "Reduce output records" is what you're looking for. If you're not using the reduce phase, then use "Map output records".
Re: Counting no. of keys.
I have the same question, but I want to use the number of map records in the reduce phase immediately after the map. This is very useful in solving problems like TF-IDF: in the reduce (IDF-calculating) phase, you must know the total number of documents. Is there any way to solve this without running two Map-Reduce jobs?

On Sun, Aug 2, 2009 at 2:08 PM, Ted Dunning ted.dunn...@gmail.com wrote:
Sure. Write a word-count map-reduce program. The mapper outputs the key from the sequence file as the output key and includes a count. Then you use the normal combiner and reducer from a normal word-count program.

On Sat, Aug 1, 2009 at 9:53 PM, prashant ullegaddi prashullega...@gmail.com wrote:
Hi, I have, say, 800 sequence files written using SequenceFileOutputFormat. Is there any way to know the number of unique keys in those sequence files? Thanks, Prashant.

-- Ted Dunning, CTO DeepDyve
-- Zhong Wang
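A sketch of Ted's suggestion against the old mapred API; the reducer emits each distinct key once, so the "Reduce output records" counter on the job page (or the number of output lines) is the number of unique keys. The Text key/value types and paths are assumptions; adjust them to whatever the sequence files actually store.

  import java.io.IOException;
  import java.util.Iterator;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  public class UniqueKeyCount {
    public static class Map extends MapReduceBase
        implements Mapper<Text, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      public void map(Text key, Text value, OutputCollector<Text, IntWritable> out, Reporter r)
          throws IOException {
        out.collect(key, ONE);                            // the sequence-file key, with a count of 1
      }
    }

    public static class Reduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values,
          OutputCollector<Text, IntWritable> out, Reporter r) throws IOException {
        int sum = 0;
        while (values.hasNext()) sum += values.next().get();
        out.collect(key, new IntWritable(sum));           // one output line per unique key
      }
    }

    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(UniqueKeyCount.class);
      conf.setJobName("unique-key-count");
      conf.setInputFormat(SequenceFileInputFormat.class); // read the existing sequence files
      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);
      conf.setMapperClass(Map.class);
      conf.setCombinerClass(Reduce.class);
      conf.setReducerClass(Reduce.class);
      FileInputFormat.setInputPaths(conf, new Path(args[0]));
      FileOutputFormat.setOutputPath(conf, new Path(args[1]));
      JobClient.runJob(conf);
    }
  }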
Re: Difference between Killed Task Attempts and Killed Tasks
Hi, a task attempt is one attempt at a task. At any given time, one or more (with speculative execution) task attempts can be running. For one task there can be many attempts on different nodes. A task is complete if any of its attempts completes. For a task to be marked as failed, all mapred.map.max.attempts attempts must fail. For every task in the job a TaskID is assigned; for every attempt a TaskAttemptID is assigned (ending with _0, _1, etc.).

Harish Mallipeddi wrote:
Hi, can anyone tell me the difference between "Killed Task Attempts" and "Killed Tasks"? I ran a big job (14820 maps and 0 reduces). In the job-details page, the web GUI reports 62 killed task attempts. I'm assuming this is due to speculative execution. Now when I go to the job-history page for the job, it reports 54 killed tasks (and 14820 successful map tasks, as expected). A few questions:
* Why 62 killed task attempts vs 54 killed tasks?
* Under speculative execution, does Hadoop launch a new MapTask with a new task-id, or does it just launch a new map task attempt with a new task-attempt-id?
* When a map task attempt fails and Hadoop tries to re-launch the MapTask, does it create a new task-id or just a new task-attempt-id?
* Does 'mapred.map.max.attempts' include attempts launched due to speculative execution?
Btw, this job is basically a trivial no-op job - it just scans around 1TB of data and does nothing else in the map. I looked at the killed tasks' syslog output and I didn't see any errors.
Re: Difference between Killed Task Attempts and Killed Tasks
Agreed. But how did I manage to get 54 killed tasks vs 62 killed task attempts? I understand what a failed task is (a task for which 'mapred.map.max.attempts' attempts have failed). But what is a killed task?

On Mon, Aug 3, 2009 at 6:41 PM, Enis Soztutar enis@gmail.com wrote:
Hi, a task attempt is one attempt at a task. At any given time, one or more (with speculative execution) task attempts can be running. For one task there can be many attempts on different nodes. A task is complete if any of its attempts completes. For a task to be marked as failed, all mapred.map.max.attempts attempts must fail.

-- Harish Mallipeddi http://blog.poundbang.in
Re: Status of 0.19.2
I've now updated the news section, and the documentation on the website to reflect the 0.19.2 release. There were several reports of it being more stable than 0.19.1 in the voting thread: http://www.mail-archive.com/common-...@hadoop.apache.org/msg00051.html Cheers, Tom On Tue, Jul 28, 2009 at 12:37 PM, Tamir Kamara tamirkam...@gmail.com wrote: Hi, I've seen that the 0.19.2 version was added recently to the downloads but there's no entry under the news section. Is it stable enough for deployment? Thanks, Tamir
FYI X-RIME: Hadoop based large scale social network analysis released
*X-RIME* (http://xrime.sourceforge.net/): Hadoop based large scale social network analysis

*Motivation*

Today's telecom service providers and Internet-based social network sites possess huge user communities. They hold large amounts of data about their users and want to generate core competency from that data. A key enabler for this is a cost-efficient solution for social data management and social network analysis (SNA). Such a solution faces a few challenges. The most important one is that it should be able to handle massive and heterogeneous data sets. Facing this challenge, traditional data-warehouse based solutions are usually not cost-efficient enough. On the other hand, existing SNA tools are mostly used in single-workstation mode and are not scalable enough. To this end, low-cost and highly scalable data management and processing technologies from the cloud computing community should be brought in to help. However, most existing cloud-based data analysis solutions try to provide SQL-like general-purpose query languages and do not directly support social network analysis, which makes them hard to optimize and hard to use for SNA users. So we came up with X-RIME to fill this gap.

Briefly speaking, X-RIME provides a few value-added layers on top of existing cloud infrastructure to support smart decision loops based on massive data sets and SNA. To end users, X-RIME is a library of Map-Reduce programs used for raw data pre-processing, transformation, calculation of SNA metrics and structures, and graph / network visualization. The library can be integrated with other Hadoop based data warehouses (e.g., HIVE) to build more comprehensive solutions.

*Currently Supported SNA Metrics and Structures*

- vertex degree statistics
- weakly connected components (WCC)
- strongly connected components (SCC)
- bi-connected components (BCC)
- ego-centric density
- breadth first search / single source shortest path (BFS/SSSP)
- K-core
- maximal cliques
- pagerank
- hyperlink-induced topic search (HITS)
- minimal spanning tree (MST)
Job status task attempt 120%:
Hi, when I check my running jobs via the jobtracker web interface, I see that one task attempt is at 120%. Is there a logical explanation? Thanks
Task process exit with nonzero status of 255
I'm getting a rather cryptic error while running a map job with MultithreadedMapper (no idea if it has anything to do with MultithreadedMapper). It only occurs sometimes, occurs at different times during the map (sometimes at the start, sometimes at a random point), and it doesn't really give any information:

Task Id : attempt_200908031207_0009_m_00_0, Status : FAILED
java.io.IOException: Task process exit with nonzero status of 255.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)

I'm a bit at a loss. The only results I'm finding on Google are a few other people who have had the same problem, but nobody with a solution.

Mathias De Maré
Re: Too many open files error, which gets resolved after some time
For writes, there is an extra thread waiting on I/O, so it would be 3 fds more. To simplify the earlier equation, on the client side:

for writes: max fds (for io-bound load) = 7 * #write_streams
for reads: max fds (for io-bound load) = 4 * #read_streams

The main socket is cleared as soon as you close the stream. The rest of the fds stay for 10 sec (they get reused if you open more streams in the meantime). I hope this is clear enough.

Raghu.

Stas Oskin wrote:
Hi. I'd like to raise this issue once again, just to clarify a point. If I have only one thread writing to HDFS, the number of fds should be 4, resulting from: 1) input, 2) output, 3) epoll, 4) the stream itself. And these 4 fds should be cleared out after 10 seconds. Is this correct? Thanks in advance for the information!

2009/6/24 Stas Oskin stas.os...@gmail.com
Hi. So if I open one stream, it should be 4?

2009/6/23 Raghu Angadi rang...@yahoo-inc.com
How many threads do you have? The number of active threads is very important. Normally, #fds = (3 * #threads_blocked_on_io) + #streams. 12 per stream is certainly way off.
Raghu.

Stas Oskin wrote:
Hi. In my case it was actually ~12 fds per stream, which included pipes and epolls. Could it be that HDFS opens 3 x 3 (input - output - epoll) fds per thread, which would make it close to the number I mentioned? Or is it always 3 at maximum per thread / stream? Up to 10 sec looks like the correct number; they do seem to get freed around that time.
Regards.

2009/6/23 Raghu Angadi rang...@yahoo-inc.com
To be more accurate, once you have HADOOP-4346, fds for epoll and pipes = 3 * threads blocked on Hadoop I/O. Unless you have hundreds of threads at a time, you should not see hundreds of these. These fds stay up to 10 sec even after the threads exit. I am a bit confused about your exact situation. Please check the number of threads if you are still facing the problem.
Raghu.

Raghu Angadi wrote:
Since you have HADOOP-4346, you should not have excessive epoll/pipe fds open. First of all, do you still have the problem? If yes, how many Hadoop streams do you have at a time? System.gc() won't help if you have HADOOP-4346.
Raghu.

Thanks for your opinion!

2009/6/22 Stas Oskin stas.os...@gmail.com
Ok, it seems this issue is already patched in the Hadoop distro I'm using (Cloudera). Any idea whether I should still call GC manually/periodically to clean out all the stale pipes / epolls?

2009/6/22 Steve Loughran ste...@apache.org
Stas Oskin wrote:
Hi. So what would be the recommended approach for the pre-0.20.x series? To ensure each file is used by only one thread, so that it is then safe to close the handle in that thread? Regards.

Good question - I'm not sure. For anything you get with FileSystem.get(), it's now dangerous to close, so try just setting the reference to null and hoping that GC will do the finalize() when needed.
Re: :!
Hey Sugandha, It's a common mistake - I think he was trying to unsubscribe to the mailing list (which is done by sending a message to a specific email address with the command unsubscribe), not telling you to unsubscribe. Brian On Aug 3, 2009, at 2:09 AM, Sugandha Naolekar wrote: This is ridiculous. What do you mean by unsubscribe.?? I have few queries and dats why have logged in to the corresponding forum. On Mon, Aug 3, 2009 at 12:33 PM, A BlueCoder bluecoder...@gmail.com wrote: unsubscribe On Mon, Aug 3, 2009 at 12:01 AM, Sugandha Naolekar sugandha@gmail.comwrote: dats fine. But, if I place the data in HDFS and then run map reduce code to provide compression, then the data will get compressed in sequence files but, even the original data will reside in the memory;thereby leading or causing a kind of redundancy of data... Can u pls suggest me a way out?/ On Mon, Aug 3, 2009 at 12:07 PM, prashant ullegaddi prashullega...@gmail.com wrote: I don't think you will be able to compress some data unless it's on HDFS. What you can do is 1. Manually compress the data on the machine where the data resides. Then, copy the same to HDFS. or 2. Copy the data without compressing to HDFS, then run a job which just emits the data as it reads in key/value pair. You can set FileOutputFormat.setOutputCompressorClass(job,GzipCodec.class) so that output gets gzipped. Does that solve your problem? btw you didn't exactly specify your data size (how many TBs). On Mon, Aug 3, 2009 at 11:02 AM, Sugandha Naolekar sugandha@gmail.comwrote: Yes, You are right. Here goes the details related:: - I have a Hadoop cluster of 7 nodes. Now there is this 8th machine, which is not a part of the hadoop cluster. - I want to place the data of that machine into the HDFS. Thus, before placing it in HDFS, I want to compress it, and then dump in the HDFS. - I have 4 datanodes in my cluster. also, data might get extended upto tera bytes. - Also, i have set thr replication factor as 2. - I guess, for compression, I will have to run map reduce...? right..please tel me the complete approach that is needed to be followed. On Mon, Aug 3, 2009 at 10:48 AM, prashant ullegaddi prashullega...@gmail.com wrote: By I want to compress the data first and then place it in HDFS, do you mean you want to compress the data locally and then copy to DFS? What's the size of your data? What's the capacity of HDFS? On Mon, Aug 3, 2009 at 10:45 AM, Sugandha Naolekar sugandha@gmail.comwrote: I want to compress the data first and then place it in HDFS. Again, while retrieving the same, I want to uncompress it and place on the desired destination. Can this be possible. How to get started? Also, I want to get started with actual coding part of compression and MAP reduce. PLease suggest me aptly...! -- Regards! Sugandha -- Regards! Sugandha -- Regards! Sugandha -- Regards! Sugandha
Re: Task process exit with nonzero status of 255
Thanks! Because of your information, I managed to find out that my crashes have to do with the fairly standard 'Too many open files'. I added some 'close' and 'disconnect' calls to my InputStreams and HttpURLConnections, but that only works up to a certain point. The strange thing is that I do execute 'ulimit -n 32768', but I'm thinking it somehow won't stick. The other alternative is that Hadoop is using 32768 file descriptors, but that seems a bit over the top, as I'm using at most 200 threads (each of which only sets up one HttpURLConnection). Suggestions are welcome :-)

On Mon, Aug 3, 2009 at 3:58 PM, Jason Venner jason.had...@gmail.com wrote:
That generally means that the process running the task crashed. The actual map/reduce task is run in a separate JVM by the task tracker, and that JVM is exiting abnormally. This used to happen to my jobs quite a bit when they were using a buggy native library via JNI. If you are trying to use the colorspace transforms via the Java imaging APIs, note that they are not thread safe (at least through 1.6.10 under Linux). There may be additional information available in the per-task logs.

2009/8/3 Mathias De Maré mathias.dem...@gmail.com
I'm getting a rather cryptic error while running a map job with MultithreadedMapper. It only occurs sometimes, at different times during the map, and it doesn't really give any information:
Task Id : attempt_200908031207_0009_m_00_0, Status : FAILED
java.io.IOException: Task process exit with nonzero status of 255.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)

-- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
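One way to see whether the higher limit actually reaches the task JVM is to log the live descriptor count from inside the task. A small Linux-only diagnostic sketch (the class name and where you call it from are up to you):

  import java.io.File;

  public class FdDiagnostics {
    // Count descriptors currently open in this JVM by listing /proc/self/fd (Linux only).
    public static int openFdCount() {
      String[] fds = new File("/proc/self/fd").list();
      return fds == null ? -1 : fds.length;
    }

    public static void main(String[] args) {
      System.out.println("open fds: " + openFdCount());
    }
  }

Also note that running ulimit -n in a login shell does not change anything for daemons that are already running: the task JVMs inherit the limit in effect when the TaskTracker was started, so the limit usually has to be raised for the daemon's user (for example in /etc/security/limits.conf) and the daemons restarted before it sticks.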
namenode -upgrade problem
Hi all, I have noticed a problem in my cluster after I changed the Hadoop version on top of the same DFS directory. The namenode log on the master says the following:

File system image contains an old layout version -16. An upgrade to version -18 is required. Please restart NameNode with -upgrade option.
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:312)
        at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:309)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:288)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:163)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:208)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:194)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:859)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:868)
2009-08-04 00:27:51,498 INFO org.apache.hadoop.ipc.Server: Stopping server on 54310
2009-08-04 00:27:51,498 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.IOException: File system image contains an old layout version -16. An upgrade to version -18 is required. Please restart NameNode with -upgrade option.
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:312)
        ... (same stack trace as above) ...
2009-08-04 00:27:51,499 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG

Can anyone explain the reason? I googled it, but the explanations I found weren't very useful. Thanks
Re: Too many open files error, which gets resolved after some time
Stas Oskin wrote:
Hi. Thanks for the explanation. Just to clarify: does the extra thread waiting on writes apply under multi-threading as well? Meaning, if I have 10 writing threads, for example, would it actually be 70 fds?

Unfortunately, yes. There are different proposals to fix this: async I/O in Hadoop, RPCs for data transfers. It is not just the fds; the applications that hit fd limits hit thread limits as well. Obviously Hadoop cannot sustain this as the range of applications increases. It will be fixed one way or the other.

Raghu.

Regards.

2009/8/3 Raghu Angadi rang...@yahoo-inc.com
For writes, there is an extra thread waiting on I/O, so it would be 3 fds more. To simplify the earlier equation, on the client side: for writes, max fds (for io-bound load) = 7 * #write_streams; for reads, max fds (for io-bound load) = 4 * #read_streams. The main socket is cleared as soon as you close the stream. The rest of the fds stay for 10 sec (they get reused if you open more streams in the meantime).
Problem getting scheduler to work.
Hi! I'm testing out the FairScheduler. I'm getting it to start, and the pools that I've defined in the pools.xml file show up and everything. But when trying to submit a job, I don't really know where to put the name of the pool to use for the job. All the examples that I've seen use JobConf, and I'm currently on 0.20. I tried to put the name in the Configuration like: conf.set("mapred.job.queue.name", "fast"); but I just get: org.apache.hadoop.ipc.RemoteException: java.io.IOException: Queue "fast" does not exist. So, how and where do I set the pool to use for individual jobs? Erik
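Hedging here, since the fair-scheduler property names have shifted between releases: mapred.job.queue.name belongs to the queue/capacity mechanism, not to fair-scheduler pools. In the 0.20-era FairScheduler the pool is taken from whatever jobconf property mapred.fairscheduler.poolnameproperty names (user.name by default), so one common setup looks roughly like this, where the property name pool.name is an illustrative choice rather than a fixed requirement:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;

  public class FairPoolJob {
    public static void main(String[] args) throws Exception {
      // JobTracker side (mapred-site.xml), telling the scheduler which jobconf property holds the pool:
      //   <property><name>mapred.fairscheduler.poolnameproperty</name><value>pool.name</value></property>
      Configuration conf = new Configuration();
      conf.set("pool.name", "fast");          // must match a pool defined in the allocation (pools.xml) file
      Job job = new Job(conf, "job-in-fast-pool");
      // ...set mapper, reducer, input/output paths as usual, then:
      // job.waitForCompletion(true);
    }
  }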
Re: Too many open files error, which gets resolved after some time
Hi Raghu. Thanks for the clarification and for explaining the potential issue.

"It is not just the fds, the applications that hit fd limits hit thread limits as well. Obviously Hadoop can not sustain this as the range of applications increases. It will be fixed one way or the other."

Can you please clarify the thread-limit matter? AFAIK it only becomes a problem if the allocated stack is too large and we are talking about thousands of threads (a possible solution is described here: http://candrews.integralblue.com/2009/01/preventing-outofmemoryerror-native-thread/ ). So how is it tied to fds? Thanks.
Hadoop BootCamp in Berlin Aug 27, 28th (reminder)
Hi all, A quick reminder that Scale Unlimited will run a 2 day Hadoop BootCamp in Berlin on August 27th and 28th. This 2 day course is for managers and developers who want to quickly become experienced with Hadoop and related technologies. The BootCamp provides training in MapReduce Theory, Hadoop Architecture, configuration, and API's through our hands-on labs. All our courses are taught by practitioners with years of Hadoop and related experience in large data architectures. ** Professional independent consultants may take this course for free, please email i...@scaleunlimited.com to inquire. http://www.scaleunlimited.com/courses/programs Detailed information and registration information is at: http://www.scaleunlimited.com/courses/berlin08 (german) or http://www.scaleunlimited.com/courses/hadoop-boot-camp-berlin-en (english) cheers, chris P.S Apologies for the cross posting. P.P.S. Please spread the word! ~~~ Hadoop training and consulting http://www.scaleunlimited.com
Problem with starting Hadoop in Pseudo Distributed Mode
Hi, I'm having trouble running Hadoop on RHEL 5. I did everything as documented in http://hadoop.apache.org/common/docs/r0.20.0/quickstart.html and configured conf/core-site.xml, conf/hdfs-site.xml and conf/mapred-site.xml. I connected to localhost with ssh (did the passphrase setup etc.), then I did the following:

$ bin/hadoop namenode -format
$ bin/start-all.sh
starting namenode, logging to /hda3/ps/hadoop-0.20.0/bin/../logs/hadoop-oracle-namenode-localhost.localdomain.out
localhost: starting datanode, logging to /hda3/ps/hadoop-0.20.0/bin/../logs/hadoop-oracle-datanode-localhost.localdomain.out
localhost: starting secondarynamenode, logging to /hda3/ps/hadoop-0.20.0/bin/../logs/hadoop-oracle-secondarynamenode-localhost.localdomain.out
starting jobtracker, logging to /hda3/ps/hadoop-0.20.0/bin/../logs/hadoop-oracle-jobtracker-localhost.localdomain.out
localhost: starting tasktracker, logging to /hda3/ps/hadoop-0.20.0/bin/../logs/hadoop-oracle-tasktracker-localhost.localdomain.out

Everything seems ok, but when I check the Hadoop logs I see many errors (and they all cause HBase connection problems). How can I solve this problem? Here are the logs.

hadoop-oracle-datanode-localhost.localdomain.log:

2009-08-04 02:54:28,971 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG:
STARTUP_MSG: Starting DataNode
STARTUP_MSG: host = localhost.localdomain/127.0.0.1
STARTUP_MSG: args = []
STARTUP_MSG: version = 0.20.0
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.20 -r 763504; compiled by 'ndaley' on Thu Apr 9 05:18:40 UTC 2009
2009-08-04 02:54:29,562 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /tmp/hadoop-oracle/dfs/data: namenode namespaceID = 36527197; datanode namespaceID = 2138759529
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:233)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:148)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:298)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.init(DataNode.java:216)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368)
2009-08-04 02:54:29,563 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: SHUTDOWN_MSG:
SHUTDOWN_MSG: Shutting down DataNode at localhost.localdomain/127.0.0.1

hadoop-oracle-namenode-localhost.localdomain.log:

2009-08-04 02:54:26,987 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: STARTUP_MSG:
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = localhost.localdomain/127.0.0.1
STARTUP_MSG: args = []
STARTUP_MSG: version = 0.20.0
STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.20 -r 763504; compiled by 'ndaley' on Thu Apr 9 05:18:40 UTC 2009
2009-08-04 02:54:27,116 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=9000
2009-08-04 02:54:27,174 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at: localhost.localdomain/127.0.0.1:9000
2009-08-04 02:54:27,179 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null
2009-08-04 02:54:27,180 INFO org.apache.hadoop.hdfs.server.namenode.metrics.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext
2009-08-04 02:54:27,278 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: fsOwner=oracle,oinstall,root,dba,oper,asmadmin
2009-08-04 02:54:27,278 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: supergroup=supergroup
2009-08-04 02:54:27,278 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: isPermissionEnabled=true
2009-08-04 02:54:27,294 INFO org.apache.hadoop.hdfs.server.namenode.metrics.FSNamesystemMetrics: Initializing FSNamesystemMetrics using context object:org.apache.hadoop.metrics.spi.NullContext
2009-08-04 02:54:27,297 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Registered FSNamesystemStatusMBean
2009-08-04 02:54:27,341 INFO org.apache.hadoop.hdfs.server.common.Storage: Number of files = 8
2009-08-04
Re: Problem with starting Hadoop in Pseudo Distributed Mode
I'm assuming that you have no data in HDFS since it never came up... So go ahead and clean up the directories where you are storing the datanode's data and the namenode's metadata. After that, format the namenode and restart Hadoop.

2009/8/3 Onur AKTAS onur.ak...@live.com
Hi, I'm having trouble running Hadoop on RHEL 5. Everything seems ok when starting, but when I check the Hadoop logs I see many errors (and they all cause HBase connection problems). How can I solve this problem? For example:
2009-08-04 02:54:29,562 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /tmp/hadoop-oracle/dfs/data: namenode namespaceID = 36527197; datanode namespaceID = 2138759529
Re: Problem with starting Hadoop in Pseudo Distributed Mode
Yes, you need to change these directories. The config goes in hadoop-site.xml, or in this case separately in the three site xmls. See the default xml for the syntax and property names.

On 8/3/09, Onur AKTAS onur.ak...@live.com wrote:
Is it the directory that Hadoop uses?
/tmp/hadoop-oracle
/tmp/hadoop-oracle/dfs/
/tmp/hadoop-oracle/mapred/
If yes, how can I change the directory to somewhere else? I do not want it to be kept in the /tmp folder.
RE: Problem with starting Hadoop in Pseudo Distributed Mode
There is no default xml in Hadoop 0.20.0, but luckily I also have release 0.18.3 and found this:

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/tmp/hadoop-${user.name}</value>
    <description>A base for other temporary directories.</description>
  </property>

It seems /tmp/hadoop-${user.name} is a temporary directory, as the description indicates; so where is the real directory? I deleted the whole tmp directory and formatted again, started the server, checked the logs, and I still have the same errors.
Re: Problem with starting Hadoop in Pseudo Distributed Mode
1. The default xmls are in $HADOOP_HOME/build/classes 2. You have to override the parameters and put them in the site xmls so you can have it in some other directory and not /tmp. Do that and try starting Hadoop. Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz 2009/8/3 Onur AKTAS onur.ak...@live.com There is no default.xml in Hadoop 0.20.0, but luckily I also have release 0.18.3 and found these: <property> <name>hadoop.tmp.dir</name> <value>/tmp/hadoop-${user.name}</value> <description>A base for other temporary directories.</description> </property> It seems /tmp/hadoop-${user.name} is a temporary directory as the description indicates, so where is the real directory? I deleted the whole tmp directory and formatted again. Started the server, checked the logs, and still see the same errors. Date: Mon, 3 Aug 2009 17:29:52 -0700 Subject: Re: Problem with starting Hadoop in Pseudo Distributed Mode From: ama...@gmail.com To: common-user@hadoop.apache.org Yes, you need to change these directories. The config is put in the hadoop-site.xml. Or in this case, separately in the 3 xmls. See the default xml for syntax and property name. On 8/3/09, Onur AKTAS onur.ak...@live.com wrote: Is it the directory that Hadoop uses? /tmp/hadoop-oracle /tmp/hadoop-oracle/dfs/ /tmp/hadoop-oracle/mapred/ If yes, how can I change the directory to anywhere else? I do not want it to be kept in the /tmp folder. From: ama...@gmail.com Date: Mon, 3 Aug 2009 17:02:50 -0700 Subject: Re: Problem with starting Hadoop in Pseudo Distributed Mode To: common-user@hadoop.apache.org I'm assuming that you have no data in HDFS since it never came up... So, go ahead and clean up the directory where you are storing the datanode's data and the namenode's metadata. After that, format the namenode and restart Hadoop. 2009/8/3 Onur AKTAS onur.ak...@live.com Hi, I'm having trouble running Hadoop on RHEL 5. I did everything as documented in: http://hadoop.apache.org/common/docs/r0.20.0/quickstart.html and configured conf/core-site.xml, conf/hdfs-site.xml, conf/mapred-site.xml. Connected to localhost with ssh (did passphrase stuff etc.), then I did the following: $ bin/hadoop namenode -format $ bin/start-all.sh starting namenode, logging to /hda3/ps/hadoop-0.20.0/bin/../logs/hadoop-oracle-namenode-localhost.localdomain.out localhost: starting datanode, logging to /hda3/ps/hadoop-0.20.0/bin/../logs/hadoop-oracle-datanode-localhost.localdomain.out localhost: starting secondarynamenode, logging to /hda3/ps/hadoop-0.20.0/bin/../logs/hadoop-oracle-secondarynamenode-localhost.localdomain.out starting jobtracker, logging to /hda3/ps/hadoop-0.20.0/bin/../logs/hadoop-oracle-jobtracker-localhost.localdomain.out localhost: starting tasktracker, logging to /hda3/ps/hadoop-0.20.0/bin/../logs/hadoop-oracle-tasktracker-localhost.localdomain.out Everything seems OK, but when I check the Hadoop logs I see many errors (and they all cause HBase connection problems). How can I solve this problem?
Here are the Logs hadoop-oracle-datanode-localhost.localdomain.log: 2009-08-04 02:54:28,971 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG: / STARTUP_MSG: Starting DataNode STARTUP_MSG: host = localhost.localdomain/127.0.0.1 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.20.0 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.20-r 763504; compiled by 'ndaley' on Thu Apr 9 05:18:40 UTC 2009 / 2009-08-04 02:54:29,562 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: Incompatible namespaceIDs in /tmp/hadoop-oracle/dfs/data: namenode namespaceID = 36527197; datanode namespaceID = 2138759529 at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:233) at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:148) at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:298) at org.apache.hadoop.hdfs.server.datanode.DataNode.init(DataNode.java:216) at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1283) at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1238) at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1246) at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1368) 2009-08-04 02:54:29,563 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
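As a concrete illustration of point 2 in the reply above, a minimal override in conf/core-site.xml might look like the following; the /data/hadoop path is only an example, any directory outside /tmp will do, and the namenode has to be re-formatted (or the old data moved) after the change, since the existing metadata still lives under the old location:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>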
Re: Problem with starting Hadoop in Pseudo Distributed Mode
No probs. I hope you got the data directory to point out of /tmp as well... If not, do that as well. Otherwise, when /tmp gets cleaned up, you'll lose your data. Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz 2009/8/3 Onur AKTAS onur.ak...@live.com Thank you very much! I added the tags below to conf/core-site.xml and re-formatted again; it started without any problems, and I also started HBase and connected to it with a client! <property> <name>hadoop.tmp.dir</name> <value>/tmp/hadoop-onur</value> <description>A base for other temporary directories.</description> </property> Thank you again.. From: ama...@gmail.com Date: Mon, 3 Aug 2009 17:48:24 -0700 Subject: Re: Problem with starting Hadoop in Pseudo Distributed Mode To: common-user@hadoop.apache.org 1. The default xmls are in $HADOOP_HOME/build/classes 2. You have to override the parameters and put them in the site xmls so you can have it in some other directory and not /tmp. Do that and try starting Hadoop. Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz 2009/8/3 Onur AKTAS onur.ak...@live.com There is no default.xml in Hadoop 0.20.0, but luckily I also have release 0.18.3 and found these: <property> <name>hadoop.tmp.dir</name> <value>/tmp/hadoop-${user.name}</value> <description>A base for other temporary directories.</description> </property> It seems /tmp/hadoop-${user.name} is a temporary directory as the description indicates, so where is the real directory? I deleted the whole tmp directory and formatted again. Started the server, checked the logs, and still see the same errors. Date: Mon, 3 Aug 2009 17:29:52 -0700 Subject: Re: Problem with starting Hadoop in Pseudo Distributed Mode From: ama...@gmail.com To: common-user@hadoop.apache.org Yes, you need to change these directories. The config is put in the hadoop-site.xml. Or in this case, separately in the 3 xmls. See the default xml for syntax and property name. On 8/3/09, Onur AKTAS onur.ak...@live.com wrote: Is it the directory that Hadoop uses? /tmp/hadoop-oracle /tmp/hadoop-oracle/dfs/ /tmp/hadoop-oracle/mapred/ If yes, how can I change the directory to anywhere else? I do not want it to be kept in the /tmp folder. From: ama...@gmail.com Date: Mon, 3 Aug 2009 17:02:50 -0700 Subject: Re: Problem with starting Hadoop in Pseudo Distributed Mode To: common-user@hadoop.apache.org I'm assuming that you have no data in HDFS since it never came up... So, go ahead and clean up the directory where you are storing the datanode's data and the namenode's metadata. After that, format the namenode and restart Hadoop. 2009/8/3 Onur AKTAS onur.ak...@live.com Hi, I'm having trouble running Hadoop on RHEL 5. I did everything as documented in: http://hadoop.apache.org/common/docs/r0.20.0/quickstart.html and configured conf/core-site.xml, conf/hdfs-site.xml, conf/mapred-site.xml.
Connected to localhost with ssh (did passphrase stuff etc.), then I did the following: $ bin/hadoop namenode -format $ bin/start-all.sh starting namenode, logging to /hda3/ps/hadoop-0.20.0/bin/../logs/hadoop-oracle-namenode-localhost.localdomain.out localhost: starting datanode, logging to /hda3/ps/hadoop-0.20.0/bin/../logs/hadoop-oracle-datanode-localhost.localdomain.out localhost: starting secondarynamenode, logging to /hda3/ps/hadoop-0.20.0/bin/../logs/hadoop-oracle-secondarynamenode-localhost.localdomain.out starting jobtracker, logging to /hda3/ps/hadoop-0.20.0/bin/../logs/hadoop-oracle-jobtracker-localhost.localdomain.out localhost: starting tasktracker, logging to /hda3/ps/hadoop-0.20.0/bin/../logs/hadoop-oracle-tasktracker-localhost.localdomain.out Everything seems ok, but when I check the Hadoop Logs I see many errors. (and they all cause HBase connection problems.) How can I solve this problem? Here are the Logs hadoop-oracle-datanode-localhost.localdomain.log: 2009-08-04 02:54:28,971 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: STARTUP_MSG: / STARTUP_MSG: Starting DataNode STARTUP_MSG: host = localhost.localdomain/127.0.0.1 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.20.0 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.20-r 763504; compiled by 'ndaley' on Thu Apr 9 05:18:40 UTC 2009 / 2009-08-04 02:54:29,562 ERROR
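If you want the HDFS directories out of /tmp explicitly, rather than only indirectly through hadoop.tmp.dir, a sketch of the corresponding conf/hdfs-site.xml entries could be (property names are the 0.18/0.20 ones; the paths are illustrative):
<property>
  <name>dfs.name.dir</name>
  <value>/data/hadoop/dfs/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/data/hadoop/dfs/data</value>
</property>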
Re: namenode -upgrade problem
Todd, thanks for replying. I stopped the cluster and issued the command bin/hadoop namenode -upgrade and I am getting this exception: 09/08/04 07:52:39 ERROR namenode.NameNode: java.net.BindException: Problem binding to master/10.2.24.21:54310 : Address already in use at org.apache.hadoop.ipc.Server.bind(Server.java:171) at org.apache.hadoop.ipc.Server$Listener.init(Server.java:234) at org.apache.hadoop.ipc.Server.init(Server.java:960) at org.apache.hadoop.ipc.RPC$Server.init(RPC.java:465) at org.apache.hadoop.ipc.RPC.getServer(RPC.java:427) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:153) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:208) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:194) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:859) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:868) Caused by: java.net.BindException: Address already in use at sun.nio.ch.Net.bind(Native Method) at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:119) at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:59) at org.apache.hadoop.ipc.Server.bind(Server.java:169) ... 9 more Any clue? On Tue, Aug 4, 2009 at 12:51 AM, Todd Lipcon t...@cloudera.com wrote: On Mon, Aug 3, 2009 at 12:08 PM, bharath vissapragada bharathvissapragada1...@gmail.com wrote: Hi all, I have noticed some problem in my cluster when I changed the Hadoop version on the same DFS directory. The namenode log on the master says the following: File system image contains an old layout version -16. *An upgrade to version -18 is required. Please restart NameNode with -upgrade option.* See bolded text above -- you need to run namenode -upgrade to upgrade your metadata format to the current version. -Todd at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:312) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:309) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:288) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:163) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:208) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:194) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:859) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:868) 2009-08-04 00:27:51,498 INFO org.apache.hadoop.ipc.Server: Stopping server on 54310 2009-08-04 00:27:51,498 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.IOException: File system image contains an old layout version -16. An upgrade to version -18 is required. Please restart NameNode with -upgrade option.
at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:312) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:87) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:309) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:288) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:163) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:208) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:194) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:859) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:868) 2009-08-04 00:27:51,499 INFO org.apache.hadoop.hdfs.server.namenode.NameNode: SHUTDOWN_MSG Can anyone explain the reason? I googled it, but those explanations weren't very useful. Thanks
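The BindException above usually just means an old namenode process is still listening on port 54310, so the upgrade cannot bind it. One possible sequence, assuming the standard 0.18/0.20 scripts and the port shown in the log:
$ bin/stop-all.sh                        # stop the running cluster, including the namenode holding 54310
$ netstat -tlnp | grep 54310             # verify nothing is bound to the namenode port any more
$ bin/start-dfs.sh -upgrade              # start HDFS with -upgrade so the image is converted to the new layout
$ bin/hadoop dfsadmin -finalizeUpgrade   # later, once the upgraded cluster looks healthy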
how to dump data from a mysql cluster to hdfs?
Hi all, We need to dump data from a MySQL cluster with about 50 nodes to an HDFS file. For security reasons, we can't use tools like Sqoop, where all datanodes must hold a connection to MySQL. Any suggestions? Thanks, Min -- My research interests are distributed systems, parallel computing and bytecode-based virtual machines. My profile: http://www.linkedin.com/in/coderplay My blog: http://coderplay.javaeye.com
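One low-tech sketch, assuming there is a single gateway host that is allowed to reach MySQL and that has the Hadoop client installed (host name, credentials, database, and HDFS path below are made up): run mysqldump on that host and pipe it straight into HDFS, so only that one machine ever opens a database connection and the datanodes never talk to MySQL.
$ mysqldump -h db01 -u dumper -p mydb mytable | bin/hadoop fs -put - /user/min/dumps/mytable.sql
Repeating this per MySQL node (or gzipping the stream first) keeps the security surface limited to the gateway; whether the resulting SQL text is a convenient input format for later MapReduce jobs is a separate question.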
RE: Counting no. of keys.
Have you had a look at the Reporter counters Hadoop provides? I think they might be helpful in your case, wherein you can locally aggregate for each map task and then push the result to a global counter. -----Original Message----- From: Zhong Wang [mailto:wangzhong@gmail.com] Sent: Monday, August 03, 2009 6:31 PM To: common-user@hadoop.apache.org Subject: Re: Counting no. of keys. I have the same question, but I want to use the number of map records in the reduce phase right after the map. This is very useful in solving problems like TF-IDF: in the reduce (IDF-calculating) phase, you must know the total number of documents. Is there any way to solve the problem without running two Map-Reduce jobs? On Sun, Aug 2, 2009 at 2:08 PM, Ted Dunning ted.dunn...@gmail.com wrote: Sure. Write a word count map-reduce program. The mapper outputs the key from the sequence file as the output key and includes a count. Then you do the normal combiner and reducer from a normal word count program. On Sat, Aug 1, 2009 at 9:53 PM, prashant ullegaddi prashullega...@gmail.com wrote: Hi, I have, say, 800 sequence files written using SequenceFileOutputFormat. Is there any way to know the number of unique keys in those sequence files? Thanks, Prashant. -- Ted Dunning, CTO DeepDyve -- Zhong Wang
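A minimal sketch of the counter suggestion, using the old org.apache.hadoop.mapred API that these Hadoop releases ship with and assuming the sequence files have Text keys; the class and counter names are made up for illustration. The mapper bumps a custom counter for every input record while also emitting (key, 1) word-count style, so the driver can read the total record count from the job counters after completion, and the reduce side yields the unique-key count. Note that reading the aggregated counter from the reducers of the same job is not directly supported, which is why the TF-IDF case typically still needs a second job or a side channel for the document total.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class KeyCountMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, LongWritable> {

  // Custom counter, aggregated across all map tasks by the framework.
  public enum Records { TOTAL }

  private static final LongWritable ONE = new LongWritable(1);

  public void map(Text key, Text value,
                  OutputCollector<Text, LongWritable> output,
                  Reporter reporter) throws IOException {
    reporter.incrCounter(Records.TOTAL, 1);  // total number of input records across the job
    output.collect(key, ONE);                // word-count style: sum per key in the combiner/reducer
  }
}
After JobClient.runJob(conf) returns a RunningJob, something like job.getCounters().getCounter(KeyCountMapper.Records.TOTAL) retrieves the total record count, and the number of distinct keys that reach the reducers is the unique-key count asked about above.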