log files not found
Hi all,

I am running a series of jobs one after another. While executing the 4th job, it fails in the reducer: the progress sits at map 100%, reduce 99%, and then I get the following message:

10/04/01 01:04:15 INFO mapred.JobClient: Task Id : attempt_201003240138_0110_r_18_1, Status : FAILED
Task attempt_201003240138_0110_r_18_1 failed to report status for 602 seconds. Killing!

It makes several more attempts to execute the task, but each fails with a similar message. I couldn't get anything out of this error message and wanted to look at the logs (located in the default directory, ${HADOOP_HOME}/logs), but I don't find any files that match the timestamp of the job. I also did not find the history and userlogs directories in the logs folder. Should I look somewhere else for the logs? What could be the possible causes of the above error?

I am using Hadoop 0.20.2, and I am running it on a cluster with 14 nodes. Thank you.

Regards,
Raghava.
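The 602-second kill matches Hadoop's default task timeout: a task that neither reads input, writes output, nor reports status for mapred.task.timeout milliseconds (600000 by default, i.e. 10 minutes) is killed. If the reducer legitimately needs long quiet stretches, the timeout can be raised, though the better fix is usually to have the long-running code report progress periodically. A config sketch (the 30-minute value is only an illustration):

```xml
<!-- mapred-site.xml: allow 30 minutes without progress before killing a task -->
<property>
  <name>mapred.task.timeout</name>
  <value>1800000</value>
</property>
```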
Re: log
Along with the JobTracker maintaining history in ${hadoop.log.dir}/logs/history, in branch 0.20 the job history is also available in a user location. The user location can be specified via the configuration property "hadoop.job.history.user.location". By default, if nothing is specified, the history is created in the output directory of the job. User history can be disabled by specifying the value "none" for the property.

Gang, if you are not seeing the history for some of your jobs, there could be a couple of reasons:
1. Your job does not have an output directory. You can specify a different location for user history.
2. Job history got disabled because of some problem with the job's configuration. You can check the JobTracker logs and verify whether history got disabled.

Thanks
Amareshwari

On 4/1/10 12:13 AM, "Gang Luo" wrote:

Thanks Abhishek, but I observe that some of my job outputs have no such _log directory. Actually, I run a script which launches 100+ jobs, and I didn't find the log for any of the outputs. Any ideas?

Thanks,
-Gang

----- Original Message -----
From: abhishek sharma
To: common-user@hadoop.apache.org
Sent: 2010/3/31 (Wed) 1:15:48 PM
Subject: Re: log

Gang,

In the log/history directory, two files are created for each job -- one XML file that records the configuration, and the other file has log entries. These log entries have all the information about the individual map and reduce tasks related to a job -- which nodes they ran on, duration, input size, etc. A single log/history directory is created by Hadoop, and files related to all the jobs executed are stored there.

Abhishek

On Tue, Mar 30, 2010 at 8:50 PM, Gang Luo wrote:
> Hi all,
> I find there is a directory "_log/history/..." under the output directory of
> a mapreduce job. Is the file in that directory a log file? Is the information
> there sufficient to allow me to figure out which nodes the job runs on?
> Besides, not every job has such a directory. Is there a setting
> controlling this? Or are there other ways to get the nodes my job runs on?
>
> Thanks,
> -Gang
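Amareshwari's pointer as a config sketch (the property name is from the message above; the HDFS path value is a made-up example):

```xml
<!-- keep per-job user history in a fixed location instead of each job's
     output directory; the value "none" disables user-location history -->
<property>
  <name>hadoop.job.history.user.location</name>
  <value>/user/gang/job-history</value>
</property>
```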
hadoop communication problem with server
Hi Everyone,

My Hadoop install has been throwing such errors quite regularly now. After running for a while I get this error:

10/03/31 21:04:17 INFO mapred.JobClient: Communication problem with server: java.io.IOException: Call to hill.local/10.42.1.1:8021 failed on local exception: java.io.EOFException

This has been happening frequently on my cluster, and after this error things only get fixed by a restart. During this time, the web page for monitoring Hadoop doesn't open, and "hadoop job -list" hangs. Would really appreciate any ideas.

Thanks
H

Morpheus: Do you believe in fate, Neo? Neo: No. Morpheus: Why not? Neo: Because I don't like the idea that I'm not in control of my life.
Re: How to Recommission?
On 3/31/10 8:12 PM, "Zhanlei Ma" wrote:
> But how to recommission? Wish your help.

Take them out of dfs.exclude and run refreshNodes again.
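Concretely, the recommission path looks like this. A sketch: the excludes-file path below is an assumption (in a real cluster it is whatever dfs.hosts.exclude points at in your conf), and the dfsadmin call needs a running cluster, so it is guarded here:

```shell
# demo excludes file; in a real cluster, edit the file named by
# dfs.hosts.exclude in hdfs-site.xml (this path is an assumption)
EXCLUDES=/tmp/dfs.exclude.demo
printf 'node01.example.com\nnode02.example.com\n' > "$EXCLUDES"

# recommission node01 by removing it from the excludes file
grep -v '^node01\.example\.com$' "$EXCLUDES" > "$EXCLUDES.new"
mv "$EXCLUDES.new" "$EXCLUDES"

# have the namenode re-read the excludes file (guarded: needs a cluster),
# then restart the datanode on the recommissioned host
command -v hadoop >/dev/null && hadoop dfsadmin -refreshNodes || true

cat "$EXCLUDES"   # prints: node02.example.com
```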
Re: Hadoop DFS IO Performance measurement
I completely forgot that the RAID controller can be a bottleneck, as can the disk connection strategy. I don't remember what the PERCs top out at, and I don't recall the aggregate actual bandwidth available for the disks. I have a simple SSD that I get a steady 100MB/sec out of with SATA 1; I would guess that SATA 2 tops out at about 150MB/sec in the real world.

Exercise each of your components in isolation, e.g.:

dd if=/dev/MY_RAID0 of=/dev/null bs=64k count=10

to get an idea of what the disk subsystem can deliver. The DFS client passes everything through the socket layer, which adds additional copying and latency.

On Wed, Mar 31, 2010 at 6:31 PM, Jason Venner wrote:
> Unless you are getting all local IO, and/or you have better than GigE
> nic interfaces, 100MB/sec is your cap.
>
> For local IO the bound is going to be your storage subsystem.
> Decent drives in a RAID 0 configuration are going to cap out on those
> machines at about 400MB/sec, which is the buffer cache bandwidth on those
> processors/memory. Realistically you are going to see quite a bit less,
> but 200 should be doable.
>
> On Wed, Mar 31, 2010 at 2:55 PM, Sagar Naik wrote:
>> Hi Edson,
>>
>> Usual commodity machine:
>> 8GB RAM, 2.5GHz Intel Xeon,
>> 6 disks: 2 drives RAID 1, 4 RAID 0, using PERC 6
>> CentOS, ext4
>>
>> Datanodes configured to use the 4 RAID 0 drives.
>> OS and Hadoop installation on RAID 1.
>>
>> -Sagar
>> On Mar 30, 2010, at 4:21 PM, Edson Ramiro wrote:
>>
>>> Hi Sagar,
>>>
>>> What hardware did you run it on?
>>>
>>> Edson Ramiro
>>>
>>> On 30 March 2010 19:41, sagar naik wrote:
>>> Hi All,
>>> I am trying to get DFS IO performance. I used TestDFSIO from the Hadoop jars.
>>> The results were about 100Mbps read and write. I think it should be more than this.
>>> Please share some stats to compare.
>>> Either I am missing something like config params, or something else.
>>> -Sagar
>
> --
> Pro Hadoop, a book to guide you from beginner to hadoop mastery,
> http://www.amazon.com/dp/1430219424?tag=jewlerymall
> www.prohadoopbook.com a community for Hadoop Professionals

--
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals
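To put numbers on the suggestion above, measure the raw disk path and the DFS path separately. A sketch: the dd here writes ~100MB to /tmp so it is safe to run anywhere (point if=/of= at the real device to measure actual hardware), and the TestDFSIO call needs a running cluster, so it is guarded:

```shell
# ~100MB sequential write through the local filesystem (64k * 1600 blocks);
# point if=/of= at the real device or array to measure actual hardware
dd if=/dev/zero of=/tmp/ddtest.bin bs=64k count=1600 2>/tmp/dd.log
tail -n 1 /tmp/dd.log   # dd's summary line includes the throughput

# DFS-level throughput for comparison (0.20-era invocation; needs a cluster)
command -v hadoop >/dev/null && \
  hadoop jar "$HADOOP_HOME"/hadoop-*-test.jar TestDFSIO -write -nrFiles 10 -fileSize 100 \
  || true
```

Comparing the two numbers shows whether the cap is the disk subsystem or the DFS/network path.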
Re: Hadoop DFS IO Performance measurement
Unless you are getting all local IO, and/or you have better than GigE nic interfaces, 100MB/sec is your cap.

For local IO the bound is going to be your storage subsystem. Decent drives in a RAID 0 configuration are going to cap out on those machines at about 400MB/sec, which is the buffer cache bandwidth on those processors/memory. Realistically you are going to see quite a bit less, but 200 should be doable.

On Wed, Mar 31, 2010 at 2:55 PM, Sagar Naik wrote:
> Hi Edson,
>
> Usual commodity machine:
> 8GB RAM, 2.5GHz Intel Xeon,
> 6 disks: 2 drives RAID 1, 4 RAID 0, using PERC 6
> CentOS, ext4
>
> Datanodes configured to use the 4 RAID 0 drives.
> OS and Hadoop installation on RAID 1.
>
> -Sagar
> On Mar 30, 2010, at 4:21 PM, Edson Ramiro wrote:
>
>> Hi Sagar,
>>
>> What hardware did you run it on?
>>
>> Edson Ramiro
>>
>> On 30 March 2010 19:41, sagar naik wrote:
>>
>>> Hi All,
>>>
>>> I am trying to get DFS IO performance.
>>> I used TestDFSIO from the Hadoop jars.
>>> The results were about 100Mbps read and write.
>>> I think it should be more than this.
>>>
>>> Please share some stats to compare.
>>>
>>> Either I am missing something like config params, or something else.
>>>
>>> -Sagar

--
Pro Hadoop, a book to guide you from beginner to hadoop mastery,
http://www.amazon.com/dp/1430219424?tag=jewlerymall
www.prohadoopbook.com a community for Hadoop Professionals
Re: swapping on hadoop
On Linux, check out the 'swappiness' OS tunable -- you can turn this down from the default to reduce swapping at the expense of some system file cache. However, you want a decent chunk of RAM left for the system to cache files -- if it is all allocated and used by Hadoop there will be extra I/O.

For Java GC, if your -Xmx is above 600MB or so, try either changing -XX:NewRatio to a smaller number (the default is 2 for Sun JDK 6) or setting the -XX:MaxNewSize parameter to around 150MB to 250MB.

An example of Hadoop memory use scaling as -Xmx grows: let's say you have a Hadoop job with a 100MB map-side join and 250MB of Hadoop sort space. Both of these chunks of data will eventually get pushed to the tenured generation, so the actual heap required will end up close to: (size of young generation) + 100MB + 250MB + misc. The default size of the young generation is 1/3 of the heap, so at -Xmx750M this job will probably use a minimum of 600MB of Java heap, plus about 50MB non-heap if this is a pure Java job.

Now, perhaps due to some other jobs, you want to set -Xmx1200M. The above job will end up using about 150MB more, because the new space has grown, although the footprint is the same. A larger new space can improve performance, but with most typical Hadoop jobs it won't. Making sure it does not grow larger just because -Xmx is larger can help save a lot of memory. Additionally, a job that would have failed with an OOME at -Xmx1200M might pass at -Xmx1000M if the young generation takes 150MB instead of 400MB of the space.

If you are using a 64-bit JRE, you can also save space with the -XX:+UseCompressedOops option -- sometimes quite a bit of space.

On Mar 30, 2010, at 10:15 AM, Vasilis Liaskovitis wrote:

> Hi all,
>
> I've noticed swapping for a single terasort job on a small 8-node
> cluster using hadoop-0.20.1. The swapping doesn't happen repeatably; I
> can have back-to-back runs of the same job from the same HDFS input
> data and get swapping only on 1 out of 4 identical runs. I've noticed
> this swapping behaviour on both terasort jobs and Hive query jobs.
>
> - Focusing on a single job config, is there a rule of thumb about how
> much node memory should be left for use outside of child JVMs?
> I make sure that per node, there is free memory:
> (#maxmapTasksperTaskTracker + #maxreduceTasksperTaskTracker) *
> JVMHeapSize < PhysicalMemoryonNode
> The total JVM heap size per node per job from the above equation
> currently accounts for 65%-75% of the node's memory. (I've tried
> allocating a riskier 90% of the node's memory, with similar swapping
> observations.)
>
> - Could there be an issue with HDFS data or metadata taking up memory?
> I am not cleaning output or intermediate outputs from HDFS between
> runs. Is this possible?
>
> - Do people use any specific Java flags (particularly garbage
> collection flags) for production environments where one job runs (or
> possibly more jobs run simultaneously)?
>
> - What are the memory requirements for the jobtracker/namenode and
> tasktracker/datanode JVMs?
>
> - I am setting io.sort.mb to about half of the JVM heap size (half of
> -Xmx in javaopts). Should this be set to a different ratio? (This
> setting doesn't sound like it should be causing swapping in the first
> place.)
>
> - The buffer cache is cleaned before each run (flush and echo 3 >
> /proc/sys/vm/drop_caches)
>
> Any empirical advice and suggestions to solve this are appreciated.
> thanks,
>
> - Vasilis
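The GC advice above, applied to the task JVMs as a config sketch (the heap and new-generation sizes are illustrative, not recommendations for any particular cluster):

```xml
<!-- mapred-site.xml: cap the young generation so heap use does not grow
     just because -Xmx does; compressed oops saves space on a 64-bit JRE -->
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx750m -XX:MaxNewSize=200m -XX:+UseCompressedOops</value>
</property>
```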
Re: Hadoop DFS IO Performance measurement
Hi Edson,

Usual commodity machine:
8GB RAM, 2.5GHz Intel Xeon,
6 disks: 2 drives RAID 1, 4 RAID 0, using PERC 6
CentOS, ext4

Datanodes configured to use the 4 RAID 0 drives.
OS and Hadoop installation on RAID 1.

-Sagar

On Mar 30, 2010, at 4:21 PM, Edson Ramiro wrote:
> Hi Sagar,
>
> What hardware did you run it on?
>
> Edson Ramiro
>
> On 30 March 2010 19:41, sagar naik wrote:
>
>> Hi All,
>>
>> I am trying to get DFS IO performance.
>> I used TestDFSIO from the Hadoop jars.
>> The results were about 100Mbps read and write.
>> I think it should be more than this.
>>
>> Please share some stats to compare.
>>
>> Either I am missing something like config params, or something else.
>>
>> -Sagar
RE: is there any way we can limit Hadoop Datanode's disk usage?
> From: awittena...@linkedin.com
> To: common-user@hadoop.apache.org
> Subject: Re: is there any way we can limit Hadoop Datanode's disk usage?
> Date: Wed, 31 Mar 2010 18:09:04 +
>
> On 3/30/10 8:12 PM, "steven zhuang" wrote:
>
> > hi, guys,
> > we have some machines with a 1TB disk, some with a 100GB disk.
> > Is there any means by which we can limit the disk usage of
> > datanodes on those machines with smaller disks?
> > thanks!
>
> You can use dfs.datanode.du.reserved, but be aware that there are *no* limits on
> mapreduce's usage, other than what you can create with file system quotas.
>
> I've started recommending creating file system partitions in order to work
> around Hadoop's crazy space reservation ideas.

Hmmm. Our sysadmins decided to put each of the JBOD disks into their own volume group. Kind of makes sense if you want to limit any impact that Hadoop could cause (assuming someone forgot to set up dfs.datanode.du.reserved). But I do agree that, at a minimum, the file space used by Hadoop should be a partition and not on the '/' (root) disk space.
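For reference, the property discussed above goes in hdfs-site.xml on the datanodes; it reserves space per volume for non-DFS use (the 20GB value here is only an illustration):

```xml
<property>
  <name>dfs.datanode.du.reserved</name>
  <!-- bytes per volume to leave free for non-DFS use: 20 * 1024^3 -->
  <value>21474836480</value>
</property>
```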
OutOfMemoryError: Cannot create GC thread. Out of system resources
Hi all,

When I run the pi Hadoop sample I get this error:

10/03/31 15:46:13 WARN mapred.JobClient: Error reading task outputhttp://h04.ctinfra.ufpr.br:50060/tasklog?plaintext=true&taskid=attempt_201003311545_0001_r_02_0&filter=stdout
10/03/31 15:46:13 WARN mapred.JobClient: Error reading task outputhttp://h04.ctinfra.ufpr.br:50060/tasklog?plaintext=true&taskid=attempt_201003311545_0001_r_02_0&filter=stderr
10/03/31 15:46:20 INFO mapred.JobClient: Task Id : attempt_201003311545_0001_m_06_1, Status : FAILED
java.io.IOException: Task process exit with nonzero status of 134.
        at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:418)

Maybe it's because the datanode can't create more threads:

ram...@lcpad:~/hadoop-0.20.2$ cat logs/userlogs/attempt_201003311457_0001_r_01_2/stdout
#
# A fatal error has been detected by the Java Runtime Environment:
#
# java.lang.OutOfMemoryError: Cannot create GC thread. Out of system resources.
#
# Internal Error (gcTaskThread.cpp:38), pid=28840, tid=140010745776400
# Error: Cannot create GC thread. Out of system resources.
#
# JRE version: 6.0_17-b04
# Java VM: Java HotSpot(TM) 64-Bit Server VM (14.3-b01 mixed mode linux-amd64)
#
# An error report file with more information is saved as:
# /var-host/tmp/hadoop-ramiro/mapred/local/taskTracker/jobcache/job_201003311457_0001/attempt_201003311457_0001_r_01_2/work/hs_err_pid28840.log
#
# If you would like to submit a bug report, please visit:
# http://java.sun.com/webapps/bugreport/crash.jsp
#

I configured the limits below, but I'm still getting the same error:

fs.inmemory.size.mb = 100
mapred.child.java.opts = -Xmx128M

Which limit should I configure to fix it?

Thanks in advance,
Edson Ramiro
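"Cannot create GC thread. Out of system resources" means the JVM could not spawn a native thread, which usually points at an OS per-user process/thread ceiling rather than the Java heap, so lowering -Xmx or fs.inmemory.size.mb is unlikely to help. A quick diagnostic sketch (run as the user the tasks run as):

```shell
# the per-user limit on simultaneous processes/threads (the usual culprit)
ulimit -u
# how many processes the user is already running, for comparison;
# if this is close to the limit above, raise it in /etc/security/limits.conf
ps -u "$(id -un)" 2>/dev/null | wc -l
```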
Re: log
Thanks Abhishek, but I observe that some of my job outputs have no such _log directory. Actually, I run a script which launches 100+ jobs, and I didn't find the log for any of the outputs. Any ideas?

Thanks,
-Gang

----- Original Message -----
From: abhishek sharma
To: common-user@hadoop.apache.org
Sent: 2010/3/31 (Wed) 1:15:48 PM
Subject: Re: log

Gang,

In the log/history directory, two files are created for each job -- one XML file that records the configuration, and the other file has log entries. These log entries have all the information about the individual map and reduce tasks related to a job -- which nodes they ran on, duration, input size, etc. A single log/history directory is created by Hadoop, and files related to all the jobs executed are stored there.

Abhishek

On Tue, Mar 30, 2010 at 8:50 PM, Gang Luo wrote:
> Hi all,
> I find there is a directory "_log/history/..." under the output directory of
> a mapreduce job. Is the file in that directory a log file? Is the information
> there sufficient to allow me to figure out which nodes the job runs on?
> Besides, not every job has such a directory. Is there a setting
> controlling this? Or are there other ways to get the nodes my job runs on?
>
> Thanks,
> -Gang
Re: Redirecting hadoop log messages to a log file at client side
Just for information: you can call the same code from the `hadoop` command; the benefit is that a lot of Java config parameters are set up for you (for good or for bad). Run it via `hadoop` and set HADOOP_CLASSPATH to include your custom jar (no need to include the Hadoop jars this way).

It looks like there is a conflict between the resources being read in your case... But it looks like you know what you are doing.

Alex K

On Wed, Mar 31, 2010 at 4:24 AM, Pallavi Palleti <pallavi.pall...@corp.aol.com> wrote:
> Hi Alex,
>
> I created a jar including my client code (specified in the manifest) and the
> needed jar files like hadoop-20.jar, log4j.jar, commons-logging.jar, and ran
> the application as
> java -cp -jar above_jar needed-parameters-for-client-code.
>
> I will explore using commons-logging in my client code.
>
> Thanks
> Pallavi
>
> On 03/30/2010 10:14 PM, Alex Kozlov wrote:
>> Hi Pallavi,
>>
>> DFSClient uses log4j.properties for configuration. What is your classpath?
>> I need to know how exactly you invoke your program (java, hadoop script,
>> etc.). The log level and appender are driven by the hadoop.root.logger
>> config variable.
>>
>> I would also recommend using one logging system in the code, which would
>> be commons-logging in this case.
>>
>> Alex K
>>
>> On Tue, Mar 30, 2010 at 12:12 AM, Pallavi Palleti wrote:
>>
>>> Hi Alex,
>>>
>>> Thanks for the reply. I have already created a logger (from log4j.Logger)
>>> and configured it to log to a file, and it is logging all the log
>>> statements that I have in my client code. However, the error/info logs of
>>> DFSClient are going to stdout. The DFSClient code is using a log from
>>> commons-logging.jar. I am wondering how to redirect those logs (which are
>>> right now going to stdout) to append to the existing logger in the client
>>> code.
>>>
>>> Thanks
>>> Pallavi
>>>
>>> On 03/30/2010 12:06 PM, Alex Kozlov wrote:
>>>
>>>> Hi Pallavi,
>>>>
>>>> It depends what logging configuration you are using. If it's log4j, you
>>>> need to modify (or create) a log4j.properties file and point your code
>>>> (via the classpath) to it. A sample log4j.properties is in the conf
>>>> directory (either Apache or CDH distributions).
>>>>
>>>> Alex K
>>>>
>>>> On Mon, Mar 29, 2010 at 11:25 PM, Pallavi Palleti wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am copying certain data from a client machine (which is not part of
>>>>> the cluster) using DFSClient to HDFS. During this process, I am
>>>>> encountering some issues, and the error/info logs are going to stdout.
>>>>> Is there a way I can configure the property at the client side so that
>>>>> the error/info logs are appended to the existing log file (created
>>>>> using a logger in the client code) rather than written to stdout?
>>>>>
>>>>> Thanks
>>>>> Pallavi
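For completeness, a minimal client-side log4j.properties along the lines Alex describes, sending both the application's and DFSClient's output to one file (the file path and pattern layout here are arbitrary examples); it just has to be on the client classpath:

```properties
log4j.rootLogger=INFO, file
log4j.appender.file=org.apache.log4j.FileAppender
log4j.appender.file.File=/var/log/myapp/client.log
log4j.appender.file.layout=org.apache.log4j.PatternLayout
log4j.appender.file.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
```

Since DFSClient logs through commons-logging, which delegates to log4j when log4j is on the classpath, its messages land in the same appender as the application's own.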
Re: C++ pipes on full (nonpseudo) cluster
On 3/31/10 9:53 AM, "Keith Wiley" wrote: > I'm still trying to find a way to build the binary on my Mac instead of > logging into some remote Linux machine. I ought to be able to build a binary > on the Mac that runs on a Linux cluster if I use the right flags and link > against the right libraries, but so far I've had no success on that front. You'll need either a cross compiler or a virtual machine. You're probably better off logging into that remote box.
Re: swapping on hadoop
On 3/30/10 10:15 AM, "Vasilis Liaskovitis" wrote:
> - Focusing on a single job config, is there a rule of thumb about how
> much node memory should be left for use outside of child JVMs?
> I make sure that per node, there is free memory:
> (#maxmapTasksperTaskTracker + #maxreduceTasksperTaskTracker) *
> JVMHeapSize < PhysicalMemoryonNode
> The total JVM heap size per node per job from the above equation
> currently accounts for 65%-75% of the node's memory. (I've tried
> allocating a riskier 90% of the node's memory, with similar swapping
> observations.)

Java takes more RAM than just the heap size -- sometimes 2-3x as much. My general rule of thumb for general-purpose grids is to plan on having 3-4GB of free VM (swap + physical) space for the OS, monitoring, datanode, and tasktracker processes. After that, you can carve it up however you want. If you are running on machines with less than that, you are likely to face at least a little bit of swapping, or will have to run minimal monitoring, or whatever.
Re: is there any way we can limit Hadoop Datanode's disk usage?
On 3/30/10 8:12 PM, "steven zhuang" wrote:
> hi, guys,
> we have some machines with a 1TB disk, some with a 100GB disk.
> Is there any means by which we can limit the disk usage of
> datanodes on those machines with smaller disks?
> thanks!

You can use dfs.datanode.du.reserved, but be aware that there are *no* limits on mapreduce's usage, other than what you can create with file system quotas.

I've started recommending creating file system partitions in order to work around Hadoop's crazy space reservation ideas.
Re: Query over DFSClient
> However, I couldn't figure out where this exception is thrown.

After lastException is set, the exception is thrown the next time FSOutputStream.write or close is called.

Hairong

On 3/31/10 6:01 AM, "Ankur C. Goel" wrote:
> Pallavi,
> If a DFSClient is not able to write a block to any of the
> datanodes given by the namenode, it retries N times before aborting.
> See - https://issues.apache.org/jira/browse/HDFS-167
> This should be handled by the application, as it indicates that something is
> seriously wrong with your cluster.
>
> Hope this helps
> -...@nkur
>
> On 3/31/10 4:59 PM, "Pallavi Palleti" wrote:
>
> Hi,
>
> I am looking into the hadoop-20 source code for the issue below. From
> DFSClient, I could see that once the datanodes given by the namenode are not
> reachable, it sets the "lastException" variable to an error message saying
> "recovery from primary datanode failed N times, aborting..." (line 2546 in
> processDataNodeError). However, I couldn't figure out where this exception
> is thrown. I could see the throw statement in isClosed(), but I am not
> finding the exact sequence from the streamer exiting with lastException set
> to the isClosed() method call. It would be great if someone could shed some
> light on this. I am essentially looking at whether DFSClient approaches the
> namenode in the case of failure of all the datanodes that the namenode had
> previously given for a data block.
>
> Thanks
> Pallavi
>
> On 03/30/2010 05:01 PM, Pallavi Palleti wrote:
>> Hi,
>>
>> Could someone kindly let me know if the DFSClient takes care of
>> datanode failures and attempts to write to another datanode if the primary
>> datanode (and the replicated datanodes) fail? I looked into the source code
>> of DFSClient and figured out that it attempts to write to one of the
>> datanodes in the pipeline and fails if it fails to write to at least one
>> of them. However, I am not sure, as I haven't explored fully. If so, is
>> there a way of querying the namenode to provide different datanodes in the
>> case of failure? I am sure the mapper would be doing a similar
>> thing (attempting to fetch a different datanode from the namenode) if it
>> fails to write to the datanodes. Kindly let me know.
>>
>> Thanks
>> Pallavi
Re: log
Gang,

In the log/history directory, two files are created for each job -- one XML file that records the configuration, and the other file has log entries. These log entries have all the information about the individual map and reduce tasks related to a job -- which nodes they ran on, duration, input size, etc. A single log/history directory is created by Hadoop, and files related to all the jobs executed are stored there.

Abhishek

On Tue, Mar 30, 2010 at 8:50 PM, Gang Luo wrote:
> Hi all,
> I find there is a directory "_log/history/..." under the output directory of
> a mapreduce job. Is the file in that directory a log file? Is the information
> there sufficient to allow me to figure out which nodes the job runs on?
> Besides, not every job has such a directory. Is there a setting
> controlling this? Or are there other ways to get the nodes my job runs on?
>
> Thanks,
> -Gang
Re: C++ pipes on full (nonpseudo) cluster
I have solved the problem... I guess. If one looks back through this thread, one will observe that one of the initial errors I was receiving was the following:

stderr logs
bash: /data/disk2/hadoop/mapred/local/taskTracker/archive/mainclusternn.hipods.ihost.com/uwphysics/kwiley/mosaic/c++_bin/Mosaic/Mosaic: cannot execute binary file

I had misinterpreted this as an indication that the file did not have executable permissions. This misinterpretation was exacerbated by the observation that I could not set executable permissions on an HDFS file, so files in HDFS really truly are nonexecutable. However, it turns out that this is okay, because the Pipes library enables executable permissions before running a binary. The symptom nevertheless suggested my initial interpretation, that this was a file permissions problem.

The error I was receiving above was not a file permissions problem; it was an architecture mismatch between the binary and the cluster nodes. A tremendously labored quest has converged on a delicate build configuration which, for the time being, seems to work. I'm still trying to find a way to build the binary on my Mac instead of logging into some remote Linux machine. I ought to be able to build a binary on the Mac that runs on a Linux cluster if I use the right flags and link against the right libraries, but so far I've had no success on that front.

Thanks for your help. Cheers!

Keith Wiley kwi...@keithwiley.com www.keithwiley.com

"What I primarily learned in grad school is how much I *don't* know. Consequently, I left grad school with a higher ignorance to knowledge ratio than when I entered." -- Keith Wiley
Re: DeDuplication Techniques
Thanks everyone. I think we are going to go with HBase and use the hash structures there to keep uniqueness on our values (table.exists(value)), and see how it goes. Appreciate the insights. Being a developer, I do like the extended sort and join ideas in the MR, and will likely use them for other things. It seems like a lot of work just to get to the point of executing our business logic (time/resources, etc.). As the old saying goes, "sometimes you only need chopsticks to catch a fly" =8^)

Thanks again!!!

/*
Joe Stein
http://www.linkedin.com/in/charmalloc
*/

On Fri, Mar 26, 2010 at 3:26 PM, Jeyendran Balakrishnan wrote:
> Joe,
>
> This is what I use for a related problem, using pure HDFS [no HBase]:
>
> 1. Run a one-time map-reduce job where you input your current historical
> file of hashes [in some kind of flat file], using IdentityMapper, and where
> the output of your custom reducer is a key-value pair keyed on the hash
> (the value can even be null, to save space). The important thing is to use
> MapFileOutputFormat for the reducer output instead of the typical
> SequenceFileOutputFormat. Now you have a single look-up table which you can
> use for efficient lookups by hash key.
> Note down the HDFS path where you stored this mapfile; call it
> dedupMapFile.
>
> 2. In your incremental data update job, pass the HDFS path of dedupMapFile
> to your conf, then open the mapfile in your reducer's configure(), store
> the reference to the mapfile in the class, and close it in close().
> Inside your reduce(), use the mapfile reference to look up your hash key;
> if there is a hit, it is a dup.
>
> 3. Also, for your reducer in 2. above, you can use a custom multiple-output
> format, in which one of the outputs is your current output, and the other
> is a new dedup output sequencefile in the same key-value format as the
> dedupMapFile. So in reduce(), if the current key-value is a dup, discard
> it; else write it to both your regular output and the new dedup output.
>
> 4. After each incremental update job, run a new map-reduce job
> [IdentityMapper and IdentityReducer] to merge the new dedup file with your
> old dedupMapFile, resulting in the updated dedupMapFile.
>
> Some comments:
> * I didn't read your approach too closely, so I suspect you might be doing
> something essentially like this already.
> * All this stuff is basically what HBase does for free, where your
> dedupMapFile is now an HBase table, and you don't have to run step 4, since
> you can just write new [non-duplicate] hash keys to the HBase table in step
> 3, and in step 2 you just use table.exists(hash-key) to check if it is a
> dup. You still need step 1 to populate the table with your historical data.
>
> Hope this helps
>
> Cheers,
> jp
>
> -----Original Message-----
> From: Joseph Stein [mailto:crypt...@gmail.com]
> Sent: Thursday, March 25, 2010 11:35 AM
> To: common-user@hadoop.apache.org
> Subject: Re: DeDuplication Techniques
>
> The thing is I have to check historic data (meaning data I have
> already aggregated against), so I basically need to hold and read from
> a file of hashes.
>
> So within the current data set, yes, this would work, but I then have to
> open a file, loop through the values, and see that each is not there.
>
> If it is there, then throw it out; if not there, add it to the end.
>
> To me, this opening a file and checking for dups is a map/reduce task in
> itself.
>
> What I was thinking is having my mapper take the data I want to
> validate as unique. I then loop through the filter files. Each data
> point has a key that then allows me to get the file that has its
> data, e.g. part of the data partitions the hash of the data that each
> file holds. So my map job takes the data and breaks it into the
> key/value pair (the key allows me to look up my filter file).
>
> When it gets to the reducer, the key is the file I open up. I then
> open the file... loop through it... if the hash is there, throw the data
> away; if it is not there, then add the hash of my data to the filter
> file and then output (as the reduce output) the value of the unique
> record.
>
> This output of the uniques is then the data I aggregate on, and it also
> updates my historic filter so the next job (5 minutes later) sees it,
> etc.
>
> On Thu, Mar 25, 2010 at 2:25 PM, Mark Kerzner wrote:
>> Joe,
>>
>> what about this approach:
>>
>> use the hash values as your keys in MR maps. Since they are sorted by
>> keys, in the reducer you will get all duplicates together, so that you can
>> loop through them. As the simplest solution, you just take the first one.
>>
>> Sincerely,
>> Mark
>>
>> On Thu, Mar 25, 2010 at 1:09 PM, Joseph Stein wrote:
>>
>>> I have been researching ways to handle de-dupping data while running a
>>> map/reduce program (so as to not re-calculate/re-aggregate data that
>>> we have seen before [possibly months before]).
>>>
>>> The data sets we have are littered with repeats of data from mobile
>>> devices which continue to come in over time (so
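The lookup-then-merge cycle jp describes (steps 2 and 4) can be sketched at the command line with sorted hash files standing in for the mapfile; the file paths and hash values here are made-up demo data, not part of anyone's actual pipeline:

```shell
# historical hashes, one per line, kept sorted (stand-in for dedupMapFile)
printf 'aaa\nbbb\n' | sort > /tmp/seen.txt
# hashes from the current batch, de-duplicated within the batch
printf 'bbb\nccc\nccc\n' | sort -u > /tmp/new.txt

# steps 2/3: a hash already in the history is a dup; keep only the new ones
comm -13 /tmp/seen.txt /tmp/new.txt > /tmp/unique.txt

# step 4: merge the survivors back into the history for the next run
sort -m -u /tmp/seen.txt /tmp/unique.txt > /tmp/seen.next.txt

cat /tmp/unique.txt   # prints: ccc
```

Both files being sorted is what makes the single-pass comm/merge cheap; that is the same property the MapFile (and its merge job) provides at HDFS scale.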
Re: C++ pipes on full (nonpseudo) cluster
Could you please state exactly the steps you do in setting up the pipes run? The two critical things to watch are: a/ where did you load the executable on hdfs e.g., $ ls pipes-bin genreads pair-reads seqal $ hadoop dfs -put pipes-bin pipes-bin $ hadoop dfs -ls hdfs://host:53897/user/zag/pipes-bin Found 3 items -rw-r--r-- 3 zag supergroup480 2010-03-17 12:28 /user/zag/pipes-bin/genreads -rw-r--r-- 3 zag supergroup692 2010-03-17 15:02 /user/zag/pipes-bin/pair_reads -rw-r--r-- 3 zag supergroup477 2010-03-17 12:33 /user/zag/pipes-bin/seqal b/ how you started the program $ hadoop pipes -D hadoop.pipes.executable=pipes-bin/genreads -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -input input -output output hope this is clear enough. --gianluigi On Wed, 2010-03-31 at 06:57 -0700, Keith Wiley wrote: > On 2010, Mar 31, at 4:25 AM, Gianluigi Zanetti wrote: > > > What happens if you try this: > > > > $ hadoop fs -rmr HDFSPATH/output ; hadoop pipes -D > > hadoop.pipes.executable=EXECUTABLE -D > > hadoop.pipes.java.recordreader=true -D > > hadoop.pipes.java.recordwriter=true -input HDFSPATH/input -output > > HDFSPATH/output > > > Not good news. 
> This is what I got:
>
> $ hadoop pipes -D hadoop.pipes.executable=/Users/keithwiley/Astro_LSST/hadoop-0.20.1+152/Mosaic/clue/Mosaic/src/cpp/Mosaic -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -input /uwphysics/kwiley/mosaic/input -output /uwphysics/kwiley/mosaic/output
> Exception in thread "main" java.io.FileNotFoundException: File does not exist: /Users/keithwiley/Astro_LSST/hadoop-0.20.1+152/Mosaic/clue/Mosaic/src/cpp/Mosaic
>     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
>     at org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509)
>     at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:681)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:802)
>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:771)
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1290)
>     at org.apache.hadoop.mapred.pipes.Submitter.runJob(Submitter.java:248)
>     at org.apache.hadoop.mapred.pipes.Submitter.run(Submitter.java:479)
>     at org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:494)
>
> Incidentally, just in case you're wondering:
>
> $ ls -l /Users/keithwiley/Astro_LSST/hadoop-0.20.1+152/Mosaic/clue/Mosaic/src/cpp/
> total 800
> 368 -rwxr-xr-x  1 keithwiley  keithwiley  185184 Mar 29 19:08 Mosaic*
> ...other files...
>
> The path is obviously correct on my local machine. The only explanation is that Hadoop is looking for it on HDFS under that path.
>
> I'm desperate. I don't understand why I'm the only person who can get this working. Could you please describe to me the set of commands you use to run a pipes program on a fully distributed cluster?
>
> Keith Wiley    kwi...@keithwiley.com    keithwiley.com    music.keithwiley.com
>
> "The easy confidence with which I know another man's religion is folly teaches me to suspect that my own is also."
>   -- Mark Twain
Re: Query over DFSClient
Pallavi,

If a DFSClient is not able to write a block to any of the datanodes given by the namenode, it retries N times before aborting. See https://issues.apache.org/jira/browse/HDFS-167

This should be handled by the application, as it indicates that something is seriously wrong with your cluster.

Hope this helps
-...@nkur

On 3/31/10 4:59 PM, "Pallavi Palleti" wrote:
> Hi,
> I am looking into the hadoop-20 source code for the issue below. From DFSClient, I could see that once the datanodes given by the namenode are not reachable, it sets the "lastException" variable to an error message saying "recovery from primary datanode is failed N times, aborting.." (line no. 2546 in processDataNodeError). However, I couldn't figure out where this exception is thrown. I could see the throw statement in isClosed(), but I am not finding the exact sequence between the Streamer exiting with lastException set and the isClosed() method call. It would be great if someone could shed some light on this. I am essentially looking at whether DFSClient goes back to the namenode when all the datanodes that the namenode previously gave for a data block have failed.
>
> Thanks
> Pallavi
>
> On 03/30/2010 05:01 PM, Pallavi Palleti wrote:
>> Hi,
>> Could someone kindly let me know if the DFSClient takes care of datanode failures and attempts to write to another datanode if the primary datanode (and the replicated datanodes) fail. I looked into the source code of DFSClient and figured out that it attempts to write to one of the datanodes in the pipeline and fails if it failed to write to at least one of them. However, I am not sure, as I haven't explored it fully. If so, is there a way of querying the namenode to provide different datanodes in the case of failure? I am sure the Mapper would be doing a similar thing (attempting to fetch a different datanode from the namenode) if it fails to write to the datanodes. Kindly let me know.
>>
>> Thanks
>> Pallavi
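The retry-then-abort behaviour described above can be sketched generically. This is illustrative plain Java standing in for DFSClient's internal logic, not the real implementation; the class and method names are invented:

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Illustrative sketch of the retry-then-abort pattern discussed above: try an
// operation (such as writing a block) up to N times, and surface the last
// failure to the application once all attempts are exhausted, similar in
// spirit to DFSClient's lastException handling.
public class RetryThenAbort {
    public static <T> T withRetries(Callable<T> op, int maxAttempts) throws IOException {
        Exception lastException = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.call();
            } catch (Exception e) {
                lastException = e; // remember the most recent failure
            }
        }
        // All attempts failed: abort and let the application handle it.
        throw new IOException("operation failed " + maxAttempts + " times, aborting...", lastException);
    }

    public static void main(String[] args) throws Exception {
        // Example: the operation succeeds on the third attempt.
        final int[] calls = {0};
        String result = withRetries(() -> {
            if (++calls[0] < 3) throw new IOException("datanode unreachable");
            return "block written";
        }, 5);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

As the reply notes, exhausting the retries is a signal the application should treat as a cluster-level problem rather than silently ignore.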
Re: C++ pipes on full (nonpseudo) cluster
What happens if you try this:

$ hadoop fs -rmr HDFSPATH/output ; hadoop pipes -D hadoop.pipes.executable=EXECUTABLE -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -input HDFSPATH/input -output HDFSPATH/output

On Tue, 2010-03-30 at 15:05 -0700, Keith Wiley wrote:
> $ hadoop fs -rmr HDFSPATH/output ; hadoop pipes -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -input HDFSPATH/input -output HDFSPATH/output -program HDFSPATH/EXECUTABLE
> Deleted hdfs://mainclusternn.hipods.ihost.com/HDFSPATH/output
> 10/03/30 14:56:55 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
> 10/03/30 14:56:55 INFO mapred.JobClient: Total input paths to process : 1
> 10/03/30 14:57:05 INFO mapred.JobClient: Running job: job_201003241650_1076
> 10/03/30 14:57:06 INFO mapred.JobClient: map 0% reduce 0%
> ^C
> $
>
> At that point the terminal hung, so I eventually ctrl-Ced to break it. Now if I investigate the Hadoop task logs for the mapper, I see this:
>
> stderr logs
> bash: /data/disk2/hadoop/mapred/local/taskTracker/archive/mainclusternn.hipods.ihost.com/uwphysics/kwiley/mosaic/c++_bin/Mosaic/Mosaic: cannot execute binary file
>
> ...which makes perfect sense in light of the following:
>
> $ hd fs -ls /uwphysics/kwiley/mosaic/c++_bin
> Found 1 items
> -rw-r--r--   1 kwiley uwphysics  211808 2010-03-30 10:26 /uwphysics/kwiley/mosaic/c++_bin/Mosaic
> $ hd fs -chmod 755 /uwphysics/kwiley/mosaic/c++_bin/Mosaic
> $ hd fs -ls /uwphysics/kwiley/mosaic/c++_bin
> Found 1 items
> -rw-r--r--   1 kwiley uwphysics  211808 2010-03-30 10:26 /uwphysics/kwiley/mosaic/c++_bin/Mosaic
> $
>
> Note that this is all an attempt to run an executable that was uploaded to HDFS in advance. In this example I am not attempting to run an executable stored on my local machine. Any attempt to do that results in a file-not-found error:
>
> $ hadoop fs -rmr HDFSPATH/output ; hadoop pipes -D hadoop.pipes.java.recordreader=true -D hadoop.pipes.java.recordwriter=true -input HDFSPATH/input -output HDFSPATH/output -program LOCALPATH/EXECUTABLE
> Deleted hdfs://mainclusternn.hipods.ihost.com/uwphysics/kwiley/mosaic/output
> Exception in thread "main" java.io.FileNotFoundException: File does not exist: /Users/kwiley/hadoop-0.20.1+152/Mosaic/clue/Mosaic/src/cpp/Mosaic
>     at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:457)
>     at org.apache.hadoop.filecache.DistributedCache.getTimestamp(DistributedCache.java:509)
>     at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:681)
>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:802)
>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:771)
>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1290)
>     at org.apache.hadoop.mapred.pipes.Submitter.runJob(Submitter.java:248)
>     at org.apache.hadoop.mapred.pipes.Submitter.run(Submitter.java:479)
>     at org.apache.hadoop.mapred.pipes.Submitter.main(Submitter.java:494)
> $
>
> It's clearly looking for the executable in HDFS, not on the local system, thus the file-not-found error.
>
> Keith Wiley    kwi...@keithwiley.com    www.keithwiley.com
>
> "What I primarily learned in grad school is how much I *don't* know. Consequently, I left grad school with a higher ignorance to knowledge ratio than when I entered."
>   -- Keith Wiley
Re: Redirecting hadoop log messages to a log file at client side
Hi Alex,

I created a jar including my client code (specified in the manifest) and the needed jar files like hadoop-20.jar, log4j.jar and commons-logging.jar, and ran the application as java -cp -jar above_jar needed-parameters-for-client-code. I will explore using commons-logging in my client code.

Thanks
Pallavi

On 03/30/2010 10:14 PM, Alex Kozlov wrote:
> Hi Pallavi,
> DFSClient uses log4j.properties for configuration. What is your classpath? I need to know how exactly you invoke your program (java, hadoop script, etc.). The log level and appender are driven by the hadoop.root.logger config variable. I would also recommend using one logging system in the code, which will be commons-logging in this case.
> Alex K
>
> On Tue, Mar 30, 2010 at 12:12 AM, Pallavi Palleti <pallavi.pall...@corp.aol.com> wrote:
>> Hi Alex,
>> Thanks for the reply. I have already created a logger (from log4j.logger) and configured it to log to a file, and it is logging all the log statements that I have in my client code. However, the error/info logs of DFSClient are going to stdout. The DFSClient code uses the log from commons-logging.jar. I am wondering how to redirect those logs (which are right now going to stdout) so that they append to the existing logger in the client code.
>> Thanks
>> Pallavi
>>
>> On 03/30/2010 12:06 PM, Alex Kozlov wrote:
>>> Hi Pallavi,
>>> It depends what logging configuration you are using. If it's log4j, you need to modify (or create) a log4j.properties file and point your code (via the classpath) to it. A sample log4j.properties is in the conf directory (either apache or CDH distributions).
>>> Alex K
>>>
>>> On Mon, Mar 29, 2010 at 11:25 PM, Pallavi Palleti <pallavi.pall...@corp.aol.com> wrote:
>>>> Hi,
>>>> I am copying certain data from a client machine (which is not part of the cluster) to HDFS using DFSClient. During this process, I am encountering some issues, and the error/info logs are going to stdout. Is there a way I can configure the property at the client side so that the error/info logs are appended to the existing log file (being created by the logger in the client code) rather than written to stdout?
>>>> Thanks
>>>> Pallavi
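A minimal log4j.properties along the lines Alex describes might look like the sketch below. The file path, appender name, and size limits are assumptions for illustration; the file just needs to be on the client's classpath so that DFSClient's commons-logging output (which routes through log4j) lands in a file instead of stdout:

```
# Illustrative log4j.properties for the client's classpath.
# Appender name (RFA), file path, and rotation settings are examples only.
log4j.rootLogger=INFO, RFA

log4j.appender.RFA=org.apache.log4j.RollingFileAppender
log4j.appender.RFA.File=/var/log/myclient/client.log
log4j.appender.RFA.MaxFileSize=10MB
log4j.appender.RFA.MaxBackupIndex=5
log4j.appender.RFA.layout=org.apache.log4j.PatternLayout
log4j.appender.RFA.layout.ConversionPattern=%d{ISO8601} %p %c: %m%n
```

With this in place, both the client's own log4j logger and the Hadoop classes' commons-logging calls share the same root logger configuration.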
Re: swapping on hadoop
Hi,

>> (#maxmapTasksperTaskTracker + #maxreduceTasksperTaskTracker) * JVMHeapSize < PhysicalMemoryonNode

The tasktracker and datanode daemons also take up memory, 1GB each by default I think. Is that accounted for?

>> Could there be an issue with HDFS data or metadata taking up memory?

Is the namenode a separate machine, or does it participate in the compute nodes too?

>> What are the memory requirements for the jobtracker, namenode and tasktracker, datanode JVMs?

See above; there was a thread on this on the forum sometime back about tuning these values for the TT and DN.

>> this setting doesn't sound like it should be causing swapping in the first place (io.sort.mb)

I think so too :) Just yesterday I read a tweet on machine configs for Hadoop, hope it helps you: http://bit.ly/cphF7R

Amogh

On 3/30/10 10:45 PM, "Vasilis Liaskovitis" wrote:
> Hi all,
> I've noticed swapping for a single terasort job on a small 8-node cluster using hadoop-0.20.1. The swapping doesn't happen repeatably; I can have back-to-back runs of the same job from the same hdfs input data and get swapping on only 1 out of 4 identical runs. I've noticed this swapping behaviour on both terasort jobs and hive query jobs.
>
> - Focusing on a single job config: is there a rule of thumb about how much node memory should be left for use outside the Child JVMs? I make sure that per node there is free memory:
> (#maxmapTasksperTaskTracker + #maxreduceTasksperTaskTracker) * JVMHeapSize < PhysicalMemoryonNode
> The total JVM heap size per node per job from the above equation currently accounts for 65%-75% of the node's memory. (I've tried allocating a riskier 90% of the node's memory, with similar swapping observations.)
>
> - Could there be an issue with HDFS data or metadata taking up memory? I am not cleaning output or intermediate outputs from HDFS between runs. Is this possible?
> - Do people use any specific java flags (particularly garbage collection flags) for production environments where one job runs (or possibly more jobs run simultaneously)?
> - What are the memory requirements for the jobtracker, namenode and tasktracker, datanode JVMs?
> - I am setting io.sort.mb to about half of the JVM heap size (half of -Xmx in javaopts). Should this be set to a different ratio? (This setting doesn't sound like it should be causing swapping in the first place.)
> - The buffer cache is cleaned before each run (flush and echo 3 > /proc/sys/vm/drop_caches).
>
> Any empirical advice and suggestions to solve this are appreciated.
> thanks,
> - Vasilis
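The rule of thumb quoted in the thread, extended with Amogh's point about the tasktracker and datanode daemons taking roughly 1GB each, can be written as a quick budget check. All the numbers below are made-up examples, not recommendations:

```java
// Quick sanity check of the memory rule of thumb from the thread:
// (maxMapTasks + maxReduceTasks) * childHeap + daemon overhead < physical RAM.
// Slot counts, heap sizes, and the 1GB-per-daemon figure are illustrative.
public class MemoryBudget {
    // Total expected memory use on a worker node, in MB.
    static long expectedUseMb(int maxMapTasks, int maxReduceTasks,
                              long childHeapMb, long daemonOverheadMb) {
        long childHeaps = (long) (maxMapTasks + maxReduceTasks) * childHeapMb;
        return childHeaps + daemonOverheadMb;
    }

    public static void main(String[] args) {
        long physicalMb = 16 * 1024;   // e.g. a 16 GB node
        long daemonsMb = 2 * 1024;     // tasktracker + datanode, ~1 GB each
        // 6 map slots + 2 reduce slots, 1 GB child heap each
        long use = expectedUseMb(6, 2, 1024, daemonsMb);
        System.out.println("expected use: " + use + " MB of " + physicalMb + " MB");
        System.out.println("fits: " + (use < physicalMb));
    }
}
```

Note this still ignores the OS, buffer cache, and io.sort.mb buffers living inside each child heap, which is one reason the thread suggests keeping the child heaps well under the physical total.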
Re: java.io.IOException: Function not implemented
Edson Ramiro wrote:
> May be it's a bug.
> I'm not the admin : ( so I'll talk to him, and maybe he'll install 2.6.32.9 on another node to test : )
>
> Thanks
> Edson Ramiro
>
> On 30 March 2010 20:00, Todd Lipcon wrote:
>> Hi Edson,
>> I noticed that only the h01 nodes are running 2.6.32.9; the other broken DNs are 2.6.32.10. Is there some reason you are running a kernel that is literally 2 weeks old? I wouldn't be at all surprised if there were a bug here, or some issue with your Debian "unstable" distribution...

If you are running the SCM trunk of the OS, you are part of the dev team. They will be grateful for the bugs you find and fix, but you get to find and fix them. In Ant, one bug report was about touch no longer setting dates in the past; it turned out that on the Debian nightly builds you couldn't touch any file into the past.

-steve
[ANN] Eclipse GIT plugin beta version released
GIT is one of the most popular distributed version control systems. In the hope that more Java developers may want to explore the world of easy branching, merging and patch management, I'd like to inform you that a beta version of the upcoming Eclipse GIT plugin is available:

http://www.infoq.com/news/2010/03/egit-released
http://aniszczyk.org/2010/03/22/the-start-of-an-adventure-egitjgit-0-7-1/

Maybe, one day, some apache / hadoop projects will use GIT... :-) (Yes, I know git.apache.org.)

Best regards,
Thomas Koch, http://www.koch.ro