Re: combine two map tasks
The ChainMapper class introduced in Hadoop 0.19 gives you the ability to run an arbitrary number of map tasks one after the other, in the context of a single job. The one issue to be aware of is that each mapper in the chain only sees the output of the previous mapper in the chain. There is a nice discussion of this in chapter 8 of Pro Hadoop, from Apress. On Sun, Jun 28, 2009 at 5:04 AM, bharath vissapragada bharathvissapragada1...@gmail.com wrote: See this .. hope this answers your question . http://developer.yahoo.com/hadoop/tutorial/module4.html#tips On Sun, Jun 28, 2009 at 5:28 PM, bonito bonito.pe...@gmail.com wrote: Hello! I am a new hadoop user and my question may sound naive.. However, I would like to ask if there is a way to combine the results of two map tasks that may run simultaneously. I use the MultipleInputs class and thus I have two different mappers. I want the result/output of one map (associated with one input file) to be used in the processing of the second map (associated with the second input file). I have thought of storing the map1 output in HDFS and retrieving it from map2. However, I have no clue whether this is possible. I mean... what about execution-timing issues? map2 has to wait until map1 is completed... Executing them in a serial manner is not what I really want... Any suggestion would be appreciated. Thank you in advance :) -- View this message in context: http://www.nabble.com/combine-two-map-tasks-tp24240928p24240928.html Sent from the Hadoop core-user mailing list archive at Nabble.com. -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
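Following up on the ChainMapper suggestion above, here is a minimal sketch against the old mapred API; the two mapper classes are made up purely for illustration, and the addMapper signature is worth double-checking against the 0.19 javadoc for org.apache.hadoop.mapred.lib.ChainMapper:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.ChainMapper;

public class ChainExample {

  // First link in the chain: sees the raw input records and splits lines into words.
  public static class TokenizeMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value,
        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        out.collect(new Text(itr.nextToken()), value);
      }
    }
  }

  // Second link: sees only the (word, line) pairs emitted by TokenizeMapper.
  public static class LowercaseMapper extends MapReduceBase
      implements Mapper<Text, Text, Text, Text> {
    public void map(Text key, Text value,
        OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      out.collect(new Text(key.toString().toLowerCase()), value);
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(ChainExample.class);
    job.setJobName("chained maps");
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setNumReduceTasks(0);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    ChainMapper.addMapper(job, TokenizeMapper.class,
        LongWritable.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));
    ChainMapper.addMapper(job, LowercaseMapper.class,
        Text.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));

    JobClient.runJob(job);
  }
}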
Re: Scaling out/up or a mix
How about multi-threaded mappers? Multi-threaded mappers are ideal for map tasks that are I/O bound against many distinct non-local endpoints. You can also control the thread count on a per-job basis. On Sat, Jun 27, 2009 at 8:26 AM, Marcus Herou marcus.he...@tailsweep.com wrote: The argument currently against increasing num-mappers is that the machines will get into OOM, and since a lot of the jobs are crawlers I need more IP numbers so I don't get banned :) Thing is that we currently have Solr on the very same machines and data-nodes as well, so I can only give the MR nodes about 1G memory since I need Solr to have 4G... Now I see that I should get some obvious and just critique about the layout of this arch, but I'm a little limited in budget and so then is the arch :) However, is it wise to have the MR tasks on the same nodes as the data-nodes, or should I split the arch ? I mean the data-nodes perhaps need more disk I/O and the MR more memory and CPU ? Trying to find a sweet-spot hardware spec for those two roles. //Marcus On Sat, Jun 27, 2009 at 4:24 AM, Brian Bockelman bbock...@cse.unl.edu wrote: Hey Marcus, Are you recording the data rates coming out of HDFS? Since you have such low CPU utilization, I'd look at boxes utterly packed with big hard drives (also, why are you using RAID1 for Hadoop??). You can get 1U boxes with 4 drive bays or 2U boxes with 12 drive bays. Based on the data rates you see, make the call. On the other hand, what's the argument against running 3x more mappers per box? It seems that your boxes still have more overhead to use -- there's no I/O wait. Brian On Jun 26, 2009, at 4:43 PM, Marcus Herou wrote: Hi. We have a deployment of 10 hadoop servers and I now need more mapping capability (no, not just add more mappers per instance) since I have so many jobs running. Now I am wondering what I should aim for... Memory, CPU or disk... How long is a rope, perhaps you would say ? A typical server is currently using about 15-20% CPU today on a quad-core 2.4GHz, 8GB RAM machine with 2 RAID1 SATA 500GB disks. Some specs below. mpstat 2 5 Linux 2.6.24-19-server (mapreduce2) 06/26/2009 11:36:13 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s 11:36:15 PM all 22.82 0.00 3.24 1.37 0.62 2.49 0.00 69.45 8572.50 11:36:17 PM all 13.56 0.00 1.74 1.99 0.62 2.61 0.00 79.48 8075.50 11:36:19 PM all 14.32 0.00 2.24 1.12 1.12 2.24 0.00 78.95 9219.00 11:36:21 PM all 14.71 0.00 0.87 1.62 0.25 1.75 0.00 80.80 8489.50 11:36:23 PM all 12.69 0.00 0.87 1.24 0.50 0.75 0.00 83.96 5495.00 Average: all 15.62 0.00 1.79 1.47 0.62 1.97 0.00 78.53 7970.30 What I am thinking is... Is it wiser to go for many of these cheap boxes with 8GB of RAM, or should I for instance focus on machines which can give more I/O throughput ? I know that these things are hard, but perhaps someone has drawn some conclusions before going the pragmatic way. Kindly //Marcus -- Marcus Herou CTO and co-founder Tailsweep AB +46702561312 marcus.he...@tailsweep.com http://www.tailsweep.com/ -- Marcus Herou CTO and co-founder Tailsweep AB +46702561312 marcus.he...@tailsweep.com http://www.tailsweep.com/ -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
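On the multi-threaded mapper point, a hedged sketch of the per-job configuration with the old mapred API (the thread-count property name is from memory and worth verifying against your release):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

public class CrawlerJobTuning {
  public static void enableThreadedMaps(JobConf conf) {
    // Run the map() calls of each task on a pool of threads; the mapper must be thread-safe.
    // This pays off when the mapper mostly waits on remote endpoints (HTTP fetches etc.).
    conf.setMapRunnerClass(MultithreadedMapRunner.class);
    // Per-job thread count.
    conf.setInt("mapred.map.multithreadedrunner.threads", 10);
  }
}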
Re: Are .bz2 extensions supported in Hadoop 18.3
I believe the Cloudera 0.18.3 distribution supports bzip2. On Wed, Jun 24, 2009 at 3:45 AM, Usman Waheed usm...@opera.com wrote: Hi All, Can I map/reduce logs that have the .bz2 extension in Hadoop 18.3? I tried, but interestingly the output was not what I expected versus what I got when my data was in uncompressed format. Thanks, Usman -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
Re: CompositeInputFormat scalbility
The join package does a streaming merge sort between each part-XXXXX in your input directories: part-00000 will be handled in a single task, part-00001 will be handled in a single task, and so on. These jobs are essentially I/O bound, and hard to beat for performance. On Wed, Jun 24, 2009 at 2:09 PM, pmg parmod.me...@gmail.com wrote: I have two files, FileA (with 600K records) and FileB (with 2 million records). FileA has a key which is the same for all the records 123 724101722493 123 781676672721 FileB has the same key as FileA 123 5026328101569 123 5026328001562 Using the hadoop join package I can create an output file with tuples from the cross product of FileA and FileB. 123 [724101722493,5026328101569] 123 [724101722493,5026328001562] 123 [781676672721,5026328101569] 123 [781676672721,5026328001562] How does CompositeInputFormat scale when we want to join 600K with 2 million records? Does it run on one node with a single map/reduce? Also, how can I avoid writing the result into a file, and instead split the result across different nodes where I can compare the tuples, e.g. comparing 724101722493 with 5026328101569 using some heuristics. thanks -- View this message in context: http://www.nabble.com/CompositeInputFormat-scalbility-tp24192957p24192957.html Sent from the Hadoop core-user mailing list archive at Nabble.com. -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
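For reference, a rough sketch of wiring up such a map-side join with the old mapred API; the input paths are placeholders and the join expression syntax should be checked against the org.apache.hadoop.mapred.join javadoc for your release:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class JoinJobSetup {
  public static void configureJoin(JobConf conf) {
    // Both directories must be sorted on the join key and have identically
    // partitioned part files, as described above.
    conf.setInputFormat(CompositeInputFormat.class);
    conf.set("mapred.join.expr", CompositeInputFormat.compose(
        "inner", KeyValueTextInputFormat.class,
        new Path("/data/fileA"), new Path("/data/fileB")));
    // Each map() call then receives the join key and a TupleWritable holding
    // one value from each input source, which is where per-tuple heuristics can run.
  }
}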
Re: CompositeInputFormat scalbility
The input split size is Long.MAX_VALUE, and in actual fact the contents of each directory are sorted separately. The number of directory entries in each input directory has to be identical, and all files at index position I, where I varies from 0 to the number of files in a directory, become the input to one task. My book goes into this in some detail with examples. Without patches, map-side join can only handle 32 directories. On Wed, Jun 24, 2009 at 9:09 PM, pmg parmod.me...@gmail.com wrote: And what decides the part-00000, part-00001 input split, block size? So for example, for 1G of data on HDFS with a 64M block size, do we get 16 blocks mapped to different map tasks? jason hadoop wrote: The join package does a streaming merge sort between each part-XXXXX in your input directories: part-00000 will be handled in a single task, part-00001 will be handled in a single task, and so on. These jobs are essentially I/O bound, and hard to beat for performance. On Wed, Jun 24, 2009 at 2:09 PM, pmg parmod.me...@gmail.com wrote: I have two files, FileA (with 600K records) and FileB (with 2 million records). FileA has a key which is the same for all the records 123 724101722493 123 781676672721 FileB has the same key as FileA 123 5026328101569 123 5026328001562 Using the hadoop join package I can create an output file with tuples from the cross product of FileA and FileB. 123 [724101722493,5026328101569] 123 [724101722493,5026328001562] 123 [781676672721,5026328101569] 123 [781676672721,5026328001562] How does CompositeInputFormat scale when we want to join 600K with 2 million records? Does it run on one node with a single map/reduce? Also, how can I avoid writing the result into a file, and instead split the result across different nodes where I can compare the tuples, e.g. comparing 724101722493 with 5026328101569 using some heuristics. thanks -- View this message in context: http://www.nabble.com/CompositeInputFormat-scalbility-tp24192957p24192957.html Sent from the Hadoop core-user mailing list archive at Nabble.com. -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals -- View this message in context: http://www.nabble.com/CompositeInputFormat-scalbility-tp24192957p24196664.html Sent from the Hadoop core-user mailing list archive at Nabble.com. -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
Re: Does balancer ensure a file's replication is satisfied?
The namenode is constantly receiving block reports describing which blocks each datanode holds, and it schedules re-replication whenever a block becomes under-replicated. On Tue, Jun 23, 2009 at 6:18 PM, Stuart White stuart.whi...@gmail.com wrote: In my Hadoop cluster, I've had several drives fail lately (and they've been replaced). Each time a new empty drive is placed in the cluster, I run the balancer. I understand that the balancer will redistribute the load of file blocks across the nodes. My question is: will the balancer also look at the desired replication of a file, and if the actual replication of a file is less than the desired (because the file had blocks stored on the lost drive), will the balancer re-replicate those lost blocks? If not, is there another tool that will ensure the desired replication factor of files is satisfied? If this functionality doesn't exist, I'm concerned that I'm slowly, silently losing my files as I replace drives, and I may not even realize it. Thoughts? -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
Re: Determining input record directory using Streaming...
I happened to have a copy of 0.18.1 lying about, and the JobConf is added to the per-process runtime environment in 0.18.1. The entire configuration from the JobConf object is added to the environment, with the jobconf key names transformed slightly: any character in the key name that is not one of [a-zA-Z0-9] is replaced by an '_' character. On Tue, Jun 23, 2009 at 10:29 AM, Bo Shi b...@deadpuck.net wrote: Jason, do you know offhand when this feature was introduced? .18.x? Thanks, Bo On Mon, Jun 22, 2009 at 10:58 PM, jason hadoop jason.had...@gmail.com wrote: Check the process environment for your streaming tasks; generally the configuration variables are exported into the process environment. The mapper's input file is normally stored under some variant of mapred.input.file. The reducer's input is the mapper output for that reduce, so the input file is not relevant there. On Mon, Jun 22, 2009 at 7:21 PM, C G parallel...@yahoo.com wrote: Hi All: Is there any way using Hadoop Streaming to determine the directory from which an input record is being read? This is straightforward in Hadoop using InputFormats, but I am curious if the same concept can be applied to streaming. The goal here is to read in data from 2 directories, say A/ and B/, and make decisions about what to do based on where the data is rooted. Thanks for any help...CG -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
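To make the key-name transformation concrete, here is a small sketch of a streaming mapper that reads the input file name from its environment (written in Java only to match the other examples here; any executable works as a streaming task). The variable name mapred_input_file is an assumption derived from the rule above, i.e. mapred.input.file with the dots replaced by underscores:

import java.io.BufferedReader;
import java.io.InputStreamReader;

public class TaggingStreamMapper {
  public static void main(String[] args) throws Exception {
    // The framework exports the job configuration into the task environment.
    String inputFile = System.getenv("mapred_input_file");
    BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
    String line;
    while ((line = in.readLine()) != null) {
      // Tag every record with the file (and hence directory) it came from.
      System.out.println(inputFile + "\t" + line);
    }
  }
}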
Re: Strange Exeception
The directory specified by the configuration parameter mapred.system.dir, defaulting to /tmp/hadoop/mapred/system, doesn't exist. Most likely your tmp cleaner task has removed it, and I am guessing it is only created at cluster start time. On Mon, Jun 22, 2009 at 6:19 PM, akhil1988 akhilan...@gmail.com wrote: Hi All! I have been running Hadoop jobs through my user account on a cluster, for a while now. But now I am getting this strange exception when I try to execute a job. If anyone knows, please let me know why this is happening. [akhil1...@altocumulus WordCount]$ hadoop jar wordcount_classes_dir.jar org.uiuc.upcrc.extClasses.WordCount /home/akhil1988/input /home/akhil1988/output JO 09/06/22 19:19:01 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same. org.apache.hadoop.ipc.RemoteException: java.io.FileNotFoundException: /hadoop/tmp/hadoop/mapred/local/jobTracker/job_200906111015_0167.xml (Read-only file system) at java.io.FileOutputStream.open(Native Method) at java.io.FileOutputStream.init(FileOutputStream.java:179) at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.init(RawLocalFileSystem.java:187) at org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileOutputStream.init(RawLocalFileSystem.java:183) at org.apache.hadoop.fs.RawLocalFileSystem.create(RawLocalFileSystem.java:241) at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSOutputSummer.init(ChecksumFileSystem.java:327) at org.apache.hadoop.fs.ChecksumFileSystem.create(ChecksumFileSystem.java:360) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:487) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:468) at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:375) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:208) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:142) at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1214) at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1195) at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:212) at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:2230) at sun.reflect.GeneratedMethodAccessor22.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:452) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:892) at org.apache.hadoop.ipc.Client.call(Client.java:696) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) at org.apache.hadoop.mapred.$Proxy1.submitJob(Unknown Source) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:828) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1127) at org.uiuc.upcrc.extClasses.WordCount.main(WordCount.java:70) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:165) at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68) Thanks, Akhil -- View this message in context: http://www.nabble.com/Strange-Exeception-tp24158395p24158395.html Sent from 
the Hadoop core-user mailing list archive at Nabble.com. -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
Re: When is configure and close run
configure and close are run once for each task, for both mappers and reducers. configure and close are NOT run on the combiner class. On Mon, Jun 22, 2009 at 9:23 AM, Saptarshi Guha saptarshi.g...@gmail.com wrote: Hello, In a mapreduce job, a given map JVM will run N map tasks. Are the configure and close methods executed for every one of these N tasks? Or is configure executed once when the JVM starts and the close method executed once when all N have been completed? I have the same question for the reduce task. Will it be run before every reduce task? And is close run when all the values for a given key have been processed? We can assume there isn't a combiner. Regards Saptarshi -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
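To spell out where those calls land, a minimal sketch with the old mapred API (the record counter is just an illustration):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class LifecycleMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private long records;

  @Override
  public void configure(JobConf job) {
    // Called once per task attempt, before the first map() call,
    // even when the JVM is being reused across several tasks.
    records = 0;
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, LongWritable> output, Reporter reporter) throws IOException {
    records++;
    output.collect(value, new LongWritable(1));
  }

  @Override
  public void close() throws IOException {
    // Called once per task attempt, after the last map() call for this task.
    System.err.println("processed " + records + " records");
  }
}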
Re: Determining input record directory using Streaming...
Check the process environment for your streaming tasks; generally the configuration variables are exported into the process environment. The mapper's input file is normally stored under some variant of mapred.input.file. The reducer's input is the mapper output for that reduce, so the input file is not relevant there. On Mon, Jun 22, 2009 at 7:21 PM, C G parallel...@yahoo.com wrote: Hi All: Is there any way using Hadoop Streaming to determine the directory from which an input record is being read? This is straightforward in Hadoop using InputFormats, but I am curious if the same concept can be applied to streaming. The goal here is to read in data from 2 directories, say A/ and B/, and make decisions about what to do based on where the data is rooted. Thanks for any help...CG -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
Re: Too many open files error, which gets resolved after some time
HDFS/DFS client uses quite a few file descriptors for each open file. Many application developers (but not the hadoop core) rely on the JVM finalizer methods to close open files. This combination, especially when many HDFS files are open, can result in very large demands for file descriptors for Hadoop clients. We as a general rule never run a cluster with nofile less than 64k, and for larger clusters with demanding applications have had it set 10x higher. I also believe there was a set of JVM versions that leaked file descriptors used for NIO in the HDFS core. I do not recall the exact details. On Sun, Jun 21, 2009 at 5:27 AM, Stas Oskin stas.os...@gmail.com wrote: Hi. After tracing some more with the lsof utility, I managed to stop the growth on the DataNode process, but still have issues with my DFS client. It seems that my DFS client opens hundreds of pipes and eventpolls. Here is a small part of the lsof output: java 10508 root 387w FIFO 0,6 6142565 pipe java 10508 root 388r FIFO 0,6 6142565 pipe java 10508 root 389u 0,10 0 6142566 eventpoll java 10508 root 390u FIFO 0,6 6135311 pipe java 10508 root 391r FIFO 0,6 6135311 pipe java 10508 root 392u 0,10 0 6135312 eventpoll java 10508 root 393r FIFO 0,6 6148234 pipe java 10508 root 394w FIFO 0,6 6142570 pipe java 10508 root 395r FIFO 0,6 6135857 pipe java 10508 root 396r FIFO 0,6 6142570 pipe java 10508 root 397r 0,10 0 6142571 eventpoll java 10508 root 398u FIFO 0,6 6135319 pipe java 10508 root 399w FIFO 0,6 6135319 pipe I'm using FSDataInputStream and FSDataOutputStream, so this might be related to pipes? So, my questions are: 1) What causes these pipes/epolls to appear? 2) More important, how can I prevent their accumulation and growth? Thanks in advance! 2009/6/21 Stas Oskin stas.os...@gmail.com Hi. I have an HDFS client and an HDFS datanode running on the same machine. When I'm trying to access a dozen files at once from the client, several times in a row, I'm starting to receive the following errors on the client, and the HDFS browse function. HDFS Client: Could not get block locations. Aborting... HDFS browse: Too many open files I can increase the maximum number of files that can be opened, as I have it set to the default 1024, but would like to first solve the problem, as a larger value just means it would run out of files again later on. So my questions are: 1) Does the HDFS datanode keep any files opened, even after the HDFS client has already closed them? 2) Is it possible to find out who keeps the opened files - datanode or client (so I could pin-point the source of the problem). Thanks in advance! -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
Re: Too many open files error, which gets resolved after some time
Just to be clear, I second Brian's opinion. Relying on finalizes is a very good way to run out of file descriptors. On Sun, Jun 21, 2009 at 9:32 AM, brian.lev...@nokia.com wrote: IMHO, you should never rely on finalizers to release scarce resources since you don't know when the finalizer will get called, if ever. -brian -Original Message- From: ext jason hadoop [mailto:jason.had...@gmail.com] Sent: Sunday, June 21, 2009 11:19 AM To: core-user@hadoop.apache.org Subject: Re: Too many open files error, which gets resolved after some time HDFS/DFS client uses quite a few file descriptors for each open file. Many application developers (but not the hadoop core) rely on the JVM finalizer methods to close open files. This combination, expecially when many HDFS files are open can result in very large demands for file descriptors for Hadoop clients. We as a general rule never run a cluster with nofile less that 64k, and for larger clusters with demanding applications have had it set 10x higher. I also believe there was a set of JVM versions that leaked file descriptors used for NIO in the HDFS core. I do not recall the exact details. On Sun, Jun 21, 2009 at 5:27 AM, Stas Oskin stas.os...@gmail.com wrote: Hi. After tracing some more with the lsof utility, and I managed to stop the growth on the DataNode process, but still have issues with my DFS client. It seems that my DFS client opens hundreds of pipes and eventpolls. Here is a small part of the lsof output: java10508 root 387w FIFO0,6 6142565 pipe java10508 root 388r FIFO0,6 6142565 pipe java10508 root 389u 0,100 6142566 eventpoll java10508 root 390u FIFO0,6 6135311 pipe java10508 root 391r FIFO0,6 6135311 pipe java10508 root 392u 0,100 6135312 eventpoll java10508 root 393r FIFO0,6 6148234 pipe java10508 root 394w FIFO0,6 6142570 pipe java10508 root 395r FIFO0,6 6135857 pipe java10508 root 396r FIFO0,6 6142570 pipe java10508 root 397r 0,100 6142571 eventpoll java10508 root 398u FIFO0,6 6135319 pipe java10508 root 399w FIFO0,6 6135319 pipe I'm using FSDataInputStream and FSDataOutputStream, so this might be related to pipes? So, my questions are: 1) What happens these pipes/epolls to appear? 2) More important, how I can prevent their accumation and growth? Thanks in advance! 2009/6/21 Stas Oskin stas.os...@gmail.com Hi. I have HDFS client and HDFS datanode running on same machine. When I'm trying to access a dozen of files at once from the client, several times in a row, I'm starting to receive the following errors on client, and HDFS browse function. HDFS Client: Could not get block locations. Aborting... HDFS browse: Too many open files I can increase the maximum number of files that can opened, as I have it set to the default 1024, but would like to first solve the problem, as larger value just means it would run out of files again later on. So my questions are: 1) Does the HDFS datanode keeps any files opened, even after the HDFS client have already closed them? 2) Is it possible to find out, who keeps the opened files - datanode or client (so I could pin-point the source of the problem). Thanks in advance! -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
Re: Too many open files error, which gets resolved after some time
Yes. Otherwise the file descriptors will flow away like water. I also strongly suggest having at least 64k file descriptors as the open file limit. On Sun, Jun 21, 2009 at 12:43 PM, Stas Oskin stas.os...@gmail.com wrote: Hi. Thanks for the advice. So you advice explicitly closing each and every file handle that I receive from HDFS? Regards. 2009/6/21 jason hadoop jason.had...@gmail.com Just to be clear, I second Brian's opinion. Relying on finalizes is a very good way to run out of file descriptors. On Sun, Jun 21, 2009 at 9:32 AM, brian.lev...@nokia.com wrote: IMHO, you should never rely on finalizers to release scarce resources since you don't know when the finalizer will get called, if ever. -brian -Original Message- From: ext jason hadoop [mailto:jason.had...@gmail.com] Sent: Sunday, June 21, 2009 11:19 AM To: core-user@hadoop.apache.org Subject: Re: Too many open files error, which gets resolved after some time HDFS/DFS client uses quite a few file descriptors for each open file. Many application developers (but not the hadoop core) rely on the JVM finalizer methods to close open files. This combination, expecially when many HDFS files are open can result in very large demands for file descriptors for Hadoop clients. We as a general rule never run a cluster with nofile less that 64k, and for larger clusters with demanding applications have had it set 10x higher. I also believe there was a set of JVM versions that leaked file descriptors used for NIO in the HDFS core. I do not recall the exact details. On Sun, Jun 21, 2009 at 5:27 AM, Stas Oskin stas.os...@gmail.com wrote: Hi. After tracing some more with the lsof utility, and I managed to stop the growth on the DataNode process, but still have issues with my DFS client. It seems that my DFS client opens hundreds of pipes and eventpolls. Here is a small part of the lsof output: java10508 root 387w FIFO0,6 6142565 pipe java10508 root 388r FIFO0,6 6142565 pipe java10508 root 389u 0,100 6142566 eventpoll java10508 root 390u FIFO0,6 6135311 pipe java10508 root 391r FIFO0,6 6135311 pipe java10508 root 392u 0,100 6135312 eventpoll java10508 root 393r FIFO0,6 6148234 pipe java10508 root 394w FIFO0,6 6142570 pipe java10508 root 395r FIFO0,6 6135857 pipe java10508 root 396r FIFO0,6 6142570 pipe java10508 root 397r 0,100 6142571 eventpoll java10508 root 398u FIFO0,6 6135319 pipe java10508 root 399w FIFO0,6 6135319 pipe I'm using FSDataInputStream and FSDataOutputStream, so this might be related to pipes? So, my questions are: 1) What happens these pipes/epolls to appear? 2) More important, how I can prevent their accumation and growth? Thanks in advance! 2009/6/21 Stas Oskin stas.os...@gmail.com Hi. I have HDFS client and HDFS datanode running on same machine. When I'm trying to access a dozen of files at once from the client, several times in a row, I'm starting to receive the following errors on client, and HDFS browse function. HDFS Client: Could not get block locations. Aborting... HDFS browse: Too many open files I can increase the maximum number of files that can opened, as I have it set to the default 1024, but would like to first solve the problem, as larger value just means it would run out of files again later on. So my questions are: 1) Does the HDFS datanode keeps any files opened, even after the HDFS client have already closed them? 2) Is it possible to find out, who keeps the opened files - datanode or client (so I could pin-point the source of the problem). Thanks in advance! 
-- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
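To illustrate the explicit-close advice above, a minimal sketch that never leans on the finalizer (the path argument is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ExplicitClose {
  public static int firstByte(Configuration conf, String file) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    FSDataInputStream in = fs.open(new Path(file));
    try {
      return in.read();   // use the stream
    } finally {
      in.close();         // always release the descriptors, even on error paths
    }
  }
}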
Re: Restrict output of mappers to reducers running on same node?
Yes, you are correct. I had not thought about sharing a file handle through multiple tasks via jvm reuse. On Thu, Jun 18, 2009 at 9:43 AM, Tarandeep Singh tarand...@gmail.comwrote: Jason, correct me if I am wrong- opening Sequence file in the configure (or setup method in 0.20) and writing to it is same as doing output.collect( ), unless you mean I should make the sequence file writer static variable and set reuse jvm flag to -1. In that case the subsequent mappers might be run in the same JVM and they can use the same writer and hence produce one file. But in that case I need to add a hook to close the writer - may be use the shutdown hook. Jothi, the idea of combine input format is good. But I guess I have to write somethign of my own to make it work in my case. Thanks guys for the suggestions... but I feel we should have some support from the framework to merge the output of mapper only job so that we don't get a lot number of smaller files. Sometimes you just don't want to run reducers and unnecessarily transfer a whole lot of data across the network. Thanks, Tarandeep On Wed, Jun 17, 2009 at 7:57 PM, jason hadoop jason.had...@gmail.com wrote: You can open your sequence file in the mapper configure method, write to it in your map, and close it in the mapper close method. Then you end up with 1 sequence file per map. I am making an assumption that each key,value to your map some how represents a single xml file/item. On Wed, Jun 17, 2009 at 7:29 PM, Jothi Padmanabhan joth...@yahoo-inc.com wrote: You could look at CombineFileInputFormat to generate a single split out of several files. Your partitioner would be able to assign keys to specific reducers, but you would not have control on which node a given reduce task will run. Jothi On 6/18/09 5:10 AM, Tarandeep Singh tarand...@gmail.com wrote: Hi, Can I restrict the output of mappers running on a node to go to reducer(s) running on the same node? Let me explain why I want to do this- I am converting huge number of XML files into SequenceFiles. So theoretically I don't even need reducers, mappers would read xml files and output Sequencefiles. But the problem with this approach is I will end up getting huge number of small output files. To avoid generating large number of smaller files, I can Identity reducers. But by running reducers, I am unnecessarily transfering data over network. I ran some test case using a small subset of my data (~90GB). With map only jobs, my cluster finished conversion in only 6 minutes. But with map and Identity reducers job, it takes around 38 minutes. I have to process close to a terabyte of data. So I was thinking of a faster alternatives- * Writing a custom OutputFormat * Somehow restrict output of mappers running on a node to go to reducers running on the same node. May be I can write my own partitioner (simple) but not sure how Hadoop's framework assigns partitions to reduce tasks. Any pointers ? Or this is not possible at all ? Thanks, Tarandeep -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
Re: Nor OOM Java Heap Space neither GC OverHead Limit Exeeceded
You can pass the -D mapred.child.java.opts=-Xmx[some value] as part of your job, or set it in your job conf before your task is submitted. THen the per task jvm's will use that string as part of the jvm initialization paramter set The distributed cache is used for making files and archives that are stored in hdfs, available in the local file system working area of your tasks. The GenericOptionsParser class that most Hadoop user interfaces use, provides a couple of command line arguments that allow you to specify local file system files which are copied into hdfs and then made avilable as stated above -files and libjars are the to arguments. My book has a solid discussion and example set for the distributed cache in chapter 5. On Thu, Jun 18, 2009 at 1:45 PM, akhil1988 akhilan...@gmail.com wrote: Hi Jason! I finally found out that there was some problem in reserving the HEAPSIZE which I have resolved now. Actually we cannot change the HADOOP_HEAPSIZE using export from our user account, after we have started the Hadoop. It has to changed by the root. I have a user account on the cluster and I was trying to change the Hadoop_heapsize from my user account using 'export' which had no effect. So I had to request my cluster administrator to increase the HADOOP_HEAPSIZE in hadoop-env.sh and then restart hadoop. Now the program is running absolutely fine. Thanks for your help. One thing that I would like to ask you is that can we use DistributerCache for transferring directories to the local cache of the tasks? Thanks, Akhil akhil1988 wrote: Hi Jason! Thanks for going with me to solve my problem. To restate things and make it more easier to understand: I am working in local mode in the directory which contains the job jar and also the Config and Data directories. I just removed the following three statements from my code: DistributedCache.addCacheFile(new URI(/home/akhil1988/Ner/OriginalNer/Data/), conf); DistributedCache.addCacheFile(new URI(/home/akhil1988/Ner/OriginalNer/Config/), conf); DistributedCache.createSymlink(conf); The program executes till the same point as before now also and terminates. That means the above three statements are of no use while working in local mode. In local mode, the working directory for the mapreduce tasks becomes the current woking direcotry in which you started the hadoop command to execute the job. Since I have removed the DistributedCache.add. statements there should be no issue whether I am giving a file name or a directory name as argument to it. Now it seems to me that there is some problem in reading the binary file using binaryRead. Please let me know if I am going wrong anywhere. Thanks, Akhil jason hadoop wrote: I have only ever used the distributed cache to add files, including binary files such as shared libraries. It looks like you are adding a directory. The DistributedCache is not generally used for passing data, but for passing file names. The files must be stored in a shared file system (hdfs for simplicity) already. The distributed cache makes the names available to the tasks, and the the files are extracted from hdfs and stored in the task local work area on each task tracker node. It looks like you may be storing the contents of your files in the distributed cache. On Wed, Jun 17, 2009 at 6:56 AM, akhil1988 akhilan...@gmail.com wrote: Thanks Jason. I went inside the code of the statement and found out that it eventually makes some binaryRead function call to read a binary file and there it strucks. 
Do you know whether there is any problem in giving a binary file for addition to the distributed cache. In the statement DistributedCache.addCacheFile(new URI(/home/akhil1988/Ner/OriginalNer/Data/), conf); Data is a directory which contains some text as well as some binary files. In the statement Parameters.readConfigAndLoadExternalData(Config/allLayer1.config); I can see(in the output messages) that it is able to read the text files but it gets struck at the binary files. So, I think here the problem is: it is not able to read the binary files which either have not been transferred to the cache or a binary file cannot be read. Do you know the solution to this? Thanks, Akhil jason hadoop wrote: Something is happening inside of your (Parameters. readConfigAndLoadExternalData(Config/allLayer1.config);) code, and the framework is killing the job for not heartbeating for 600 seconds On Tue, Jun 16, 2009 at 8:32 PM, akhil1988 akhilan...@gmail.com wrote: One more thing, finally it terminates there (after some time) by giving the final Exception: java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217) at LbjTagger.NerTagger.main(NerTagger.java
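Pulling the two suggestions from the reply above together, a hedged sketch (the heap size and HDFS paths are placeholder values, not recommendations):

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class JobSetup {
  public static void configure(JobConf conf) throws Exception {
    // Every task JVM is launched with these options.
    conf.set("mapred.child.java.opts", "-Xmx2048m");

    // The cached file must already be in HDFS; the framework copies it to each
    // tasktracker's local work area, and with symlinks enabled it appears in the
    // task working directory under the name given after the '#'.
    DistributedCache.addCacheFile(
        new URI("/user/akhil/Config/allLayer1.config#allLayer1.config"), conf);
    DistributedCache.createSymlink(conf);
  }
}

On the command line, tools that use GenericOptionsParser accept roughly: hadoop jar myjob.jar MyDriver -files someLocalFile -libjars someLocal.jar -D mapred.child.java.opts=-Xmx2048m <input> <output>, with the generic options placed before the tool's own arguments.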
Re: JobControl for Pipes?
http://www.cascading.org/ https://issues.apache.org/jira/browse/HADOOP-5303 (oozie) On Thu, Jun 18, 2009 at 6:19 AM, Roshan James roshan.james.subscript...@gmail.com wrote: Can you give me a url or so to both? I cant seem to find either one after a couple of basic web searches. Also, when you say JobControl is coming to Hadoop - I can already see the Java JobControl classes that lets one express dependancies between jobs. So I assume this already works in Java - does it not? I was asking if this functionality is exposed via Pipes in some way. Roshan On Wed, Jun 17, 2009 at 10:59 PM, jason hadoop jason.had...@gmail.com wrote: Job control is coming with the Hadoop WorkFlow manager, in the mean time there is cascade by chris wensel. I do not have any personal experience with either. I do not know how pipes interacts with either. On Wed, Jun 17, 2009 at 12:43 PM, Roshan James roshan.james.subscript...@gmail.com wrote: Hello, Is there any way to express dependencies between map-reduce jobs (such as in org.apache.hadoop.mapred.jobcontrol) for pipes jobs? The provided header Pipes.hh does not seem to reflect any such capabilities. best, Roshan -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
Re: Getting Task ID inside a Mapper
The task id is readily available, if you override the configure method. The MapReduceBase class in the Pro Hadoop Book examples does this and makes the taskId available as a class field. On Thu, Jun 18, 2009 at 7:33 AM, Mark Desnoyer mdesno...@gmail.com wrote: Thanks! I'll try that. -Mark On Thu, Jun 18, 2009 at 10:27 AM, Jingkei Ly jingkei...@detica.com wrote: I think you can use job.getInt(mapred.task.partition,-1) to get the mapper ID, which should be the same for the mapper across task reruns. -Original Message- From: Piotr Praczyk [mailto:piotr.prac...@gmail.com] Sent: 18 June 2009 15:19 To: core-user@hadoop.apache.org Subject: Re: Getting Task ID inside a Mapper Hi Why don't you provide a seed of random generator generated outside the task ? Then when the task fails, you can provide the same value stored somewhere outside. You could use the task configuration to do so. I don't know anything about obtaining the task ID from within. regards Piotr 2009/6/18 Mark Desnoyer mdesno...@gmail.com Hi, I was wondering if it's possible to get a hold of the task id inside a mapper? I cant' seem to find a way by trolling through the API reference. I'm trying to implement a Map Reduce version of Latent Dirichlet Allocation and I need to be able to initialize a random number generator in a task specific way so that if the task fails and is rerun elsewhere, the results are the same. Thanks in advance. Cheers, Mark Desnoyer This message should be regarded as confidential. If you have received this email in error please notify the sender and destroy it immediately. Statements of intent shall only become binding when confirmed in hard copy by an authorised signatory. The contents of this email may relate to dealings with other companies within the Detica Group plc group of companies. Detica Limited is registered in England under No: 1337451. Registered offices: Surrey Research Park, Guildford, Surrey, GU2 7YP, England. -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
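For reference, a minimal sketch of capturing the task identity in configure with the old mapred API; whether the seed is derived from mapred.task.id or mapred.task.partition is a design choice, and mapred.task.partition is the one that stays stable across re-executions of the same task (the my.job.seed property is a made-up job setting for illustration):

import java.util.Random;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public abstract class TaskAwareBase extends MapReduceBase {
  protected String attemptId;
  protected int partition;
  protected Random random;

  @Override
  public void configure(JobConf job) {
    attemptId = job.get("mapred.task.id");               // e.g. attempt_200906..._m_000003_0
    partition = job.getInt("mapred.task.partition", -1); // same value for reruns of this task
    random = new Random(job.getLong("my.job.seed", 0L) + partition);
  }
}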
Re: Practical limit on emitted map/reduce values
In general, if the values become very large, it becomes simpler to store them out of line in HDFS, and just pass the HDFS path for the item as the value in the map/reduce task. This greatly reduces the amount of I/O done, and doesn't blow up the sort space on the reducer. You lose the magic of data locality, but given the item size you gain the I/O back by not having to pass the full values to the reducer, or handle them when sorting the map outputs. On Thu, Jun 18, 2009 at 8:45 AM, Leon Mergen l.p.mer...@solatis.com wrote: Hello Owen, Keys and values can be large. They are certainly capped above by Java's 2GB limit on byte arrays. More practically, you will have problems running out of memory with keys or values of 100 MB. There is no restriction that a key/value pair fits in a single hdfs block, but performance would suffer. (In particular, the FileInputFormats split at block sized chunks, which means you will have maps that scan an entire block without processing anything.) Thanks for the quick reply. Could you perhaps elaborate on that 100 MB limit ? Is that due to a limit that is caused by the Java VM heap size ? If so, could that, for example, be increased to 512MB by setting mapred.child.java.opts to '-Xmx512m' ? Regards, Leon Mergen -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
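A rough sketch of that pattern; the side-data directory and the per-record file naming are placeholders, and a real job would need to guarantee unique names per record and clean the files up afterwards:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class OutOfLineValuesMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {

  private JobConf job;

  @Override
  public void configure(JobConf job) {
    this.job = job;
  }

  public void map(Text key, Text hugeValue,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // Write the large payload to its own HDFS file...
    Path side = new Path("/tmp/side-data", key.toString());
    FileSystem fs = side.getFileSystem(job);
    FSDataOutputStream out = fs.create(side);
    try {
      out.write(hugeValue.getBytes(), 0, hugeValue.getLength());
    } finally {
      out.close();
    }
    // ...and only ship the path through the shuffle and sort.
    output.collect(key, new Text(side.toString()));
  }
}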
Re: Nor OOM Java Heap Space neither GC OverHead Limit Exeeceded
I have only ever used the distributed cache to add files, including binary files such as shared libraries. It looks like you are adding a directory. The DistributedCache is not generally used for passing data, but for passing file names. The files must be stored in a shared file system (hdfs for simplicity) already. The distributed cache makes the names available to the tasks, and the the files are extracted from hdfs and stored in the task local work area on each task tracker node. It looks like you may be storing the contents of your files in the distributed cache. On Wed, Jun 17, 2009 at 6:56 AM, akhil1988 akhilan...@gmail.com wrote: Thanks Jason. I went inside the code of the statement and found out that it eventually makes some binaryRead function call to read a binary file and there it strucks. Do you know whether there is any problem in giving a binary file for addition to the distributed cache. In the statement DistributedCache.addCacheFile(new URI(/home/akhil1988/Ner/OriginalNer/Data/), conf); Data is a directory which contains some text as well as some binary files. In the statement Parameters.readConfigAndLoadExternalData(Config/allLayer1.config); I can see(in the output messages) that it is able to read the text files but it gets struck at the binary files. So, I think here the problem is: it is not able to read the binary files which either have not been transferred to the cache or a binary file cannot be read. Do you know the solution to this? Thanks, Akhil jason hadoop wrote: Something is happening inside of your (Parameters. readConfigAndLoadExternalData(Config/allLayer1.config);) code, and the framework is killing the job for not heartbeating for 600 seconds On Tue, Jun 16, 2009 at 8:32 PM, akhil1988 akhilan...@gmail.com wrote: One more thing, finally it terminates there (after some time) by giving the final Exception: java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217) at LbjTagger.NerTagger.main(NerTagger.java:109) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:165) at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68) akhil1988 wrote: Thank you Jason for your reply. My Map class is an inner class and it is a static class. Here is the structure of my code. 
public class NerTagger {
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
        private Text word = new Text();
        private static NETaggerLevel1 tagger1 = new NETaggerLevel1();
        private static NETaggerLevel2 tagger2 = new NETaggerLevel2();

        Map() {
            System.out.println("HI2\n");
            Parameters.readConfigAndLoadExternalData("Config/allLayer1.config");
            System.out.println("HI3\n");
            Parameters.forceNewSentenceOnLineBreaks = Boolean.parseBoolean("true");
            System.out.println("loading the tagger");
            tagger1 = (NETaggerLevel1) Classifier.binaryRead(Parameters.pathToModelFile + ".level1");
            System.out.println("HI5\n");
            tagger2 = (NETaggerLevel2) Classifier.binaryRead(Parameters.pathToModelFile + ".level2");
            System.out.println("Done- loading the tagger");
        }

        public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
            String inputline = value.toString();
            /* Processing of the input pair is done here */
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(NerTagger.class);
        conf.setJobName("NerTagger");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        conf.setNumReduceTasks(0);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        conf.set("mapred.job.tracker", "local");
        conf.set("fs.default.name", "file:///");
        DistributedCache.addCacheFile(new
Re: [ADV] Blatant marketing of the book Pro Hadoop. In honor of the 09 summit here is a 50% off coupon corrected code is LUCKYOU
You can purchase the ebook from www.apress.com. The final copy is now available. There is a 50% off coupon good for a few more days, LUCKYOU. You can try prohadoopbook.ning.com as an alternative to www.prohadoopbook.com, or www.prohadoop.com. What error do you receive when you try to visit www.prohadoopbook.com ? 2009/6/17 zjffdu zjf...@gmail.com Hi Jason, Where can I download your book's Alpha Chapters? I am very interested in your book about Hadoop. And I cannot visit the link www.prohadoopbook.com -Original Message- From: jason hadoop [mailto:jason.had...@gmail.com] Sent: June 9, 2009 20:47 To: core-user@hadoop.apache.org Subject: [ADV] Blatant marketing of the book Pro Hadoop. In honor of the 09 summit here is a 50% off coupon corrected code is LUCKYOU http://eBookshop.apress.com CODE LUCKYOU -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
Re: Restrict output of mappers to reducers running on same node?
You can open your sequence file in the mapper configure method, write to it in your map, and close it in the mapper close method. Then you end up with 1 sequence file per map. I am making an assumption that each key,value to your map some how represents a single xml file/item. On Wed, Jun 17, 2009 at 7:29 PM, Jothi Padmanabhan joth...@yahoo-inc.comwrote: You could look at CombineFileInputFormat to generate a single split out of several files. Your partitioner would be able to assign keys to specific reducers, but you would not have control on which node a given reduce task will run. Jothi On 6/18/09 5:10 AM, Tarandeep Singh tarand...@gmail.com wrote: Hi, Can I restrict the output of mappers running on a node to go to reducer(s) running on the same node? Let me explain why I want to do this- I am converting huge number of XML files into SequenceFiles. So theoretically I don't even need reducers, mappers would read xml files and output Sequencefiles. But the problem with this approach is I will end up getting huge number of small output files. To avoid generating large number of smaller files, I can Identity reducers. But by running reducers, I am unnecessarily transfering data over network. I ran some test case using a small subset of my data (~90GB). With map only jobs, my cluster finished conversion in only 6 minutes. But with map and Identity reducers job, it takes around 38 minutes. I have to process close to a terabyte of data. So I was thinking of a faster alternatives- * Writing a custom OutputFormat * Somehow restrict output of mappers running on a node to go to reducers running on the same node. May be I can write my own partitioner (simple) but not sure how Hadoop's framework assigns partitions to reduce tasks. Any pointers ? Or this is not possible at all ? Thanks, Tarandeep -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
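A rough sketch of that approach; the output directory is a placeholder, and a production version would write under the task's own output/work directory so speculative or re-executed tasks don't collide:

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class XmlToSeqMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private SequenceFile.Writer writer;

  @Override
  public void configure(JobConf job) {
    try {
      FileSystem fs = FileSystem.get(job);
      // One sequence file per map task; the task id keeps the name unique.
      Path out = new Path("/converted/" + job.get("mapred.task.id") + ".seq");
      writer = SequenceFile.createWriter(fs, job, out, Text.class, Text.class);
    } catch (IOException e) {
      throw new RuntimeException("could not open per-map sequence file", e);
    }
  }

  public void map(LongWritable key, Text xmlRecord,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // Write directly, bypassing the normal collector; with zero reduces nothing is shuffled.
    writer.append(new Text("doc"), xmlRecord);
  }

  @Override
  public void close() throws IOException {
    writer.close();   // the file is sealed when this map task finishes
  }
}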
Re: JobControl for Pipes?
Job control is coming with the Hadoop workflow manager; in the meantime there is Cascading, by Chris Wensel. I do not have any personal experience with either. I do not know how Pipes interacts with either. On Wed, Jun 17, 2009 at 12:43 PM, Roshan James roshan.james.subscript...@gmail.com wrote: Hello, Is there any way to express dependencies between map-reduce jobs (such as in org.apache.hadoop.mapred.jobcontrol) for pipes jobs? The provided header Pipes.hh does not seem to reflect any such capabilities. best, Roshan -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
Re: Can I share datas for several map tasks?
In the examples for my book is a jvm reuse with static data shared between jvm's example On Tue, Jun 16, 2009 at 1:08 AM, Hello World snowlo...@gmail.com wrote: Thanks for your reply. Can you do me a favor to make a check? I modified mapred-default.xml as follows: 540 property 541 namemapred.job.reuse.jvm.num.tasks/name 542 value-1/value 543 descriptionHow many tasks to run per jvm. If set to -1, there is 544 no limit. 545 /description 546 /property And execute bin/stop-all.sh; bin/start-all.sh to restart hadoop; This is my program: 17 public class WordCount { 18 19 public static class TokenizerMapper 20extends MapperObject, Text, Text, IntWritable{ 21 22 private final static IntWritable one = new IntWritable(1); 23 private Text word = new Text(); 24 public static int[] ToBeSharedData = new int[1024 * 1024 * 16]; 25 26 protected void setup(Context context 27 ) throws IOException, InterruptedException { 28 //Init shared data 29 ToBeSharedData[0] = 12345; 30 System.out.println(setup shared data[0] = + ToBeSharedData[0]); 31 } 32 33 public void map(Object key, Text value, Context context 34 ) throws IOException, InterruptedException { 35 StringTokenizer itr = new StringTokenizer(value.toString()); 36 while (itr.hasMoreTokens()) { 37 word.set(itr.nextToken()); 38 context.write(word, one); 39 } 40 System.out.println(read shared data[0] = + ToBeSharedData[0]); 41 } 42 } First, can you tell me how to make sure jvm reuse is taking effect, for I didn't see anything different from before. I use top command under linux and see the same number of java processes and same memory usage. Second, can you tell me how to make the ToBeSharedData be inited only once and can be read from other MapTasks on the same node? Or this is not a suitable programming style for map-reduce? By the way, I'm using hadoop-0.20.0, in pseudo-distributed mode on a single-node. thanks in advance On Tue, Jun 16, 2009 at 1:48 PM, Sharad Agarwal shara...@yahoo-inc.com wrote: snowloong wrote: Hi, I want to share some data structures for the map tasks on a same node(not through files), I mean, if one map task has already initialized some data structures (e.g. an array or a list), can other map tasks share these memorys and directly access them, for I don't want to reinitialize these datas and I want to save some memory. Can hadoop help me do this? You can enable jvm reuse across tasks. See mapred.job.reuse.jvm.num.tasks in mapred-default.xml for usage. Then you can cache the data in a static variable in your mapper. - Sharad -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
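Along those lines, a minimal sketch of static data shared across tasks in a reused JVM; the data size and contents mirror the question above and are purely illustrative:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class SharedDataMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  // Visible to every task that runs inside this JVM once reuse is enabled
  // (mapred.job.reuse.jvm.num.tasks = -1 in the job configuration).
  private static int[] sharedData;

  @Override
  public void configure(JobConf job) {
    synchronized (SharedDataMapper.class) {
      if (sharedData == null) {          // only the first task in the JVM pays the cost
        sharedData = new int[1024 * 1024 * 16];
        sharedData[0] = 12345;
      }
    }
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    output.collect(value, new IntWritable(sharedData[0]));
  }
}

One simple way to confirm reuse is in effect is to log a static counter (or the JVM name from ManagementFactory.getRuntimeMXBean().getName()) in configure and check that consecutive task attempts on a node report the same JVM.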
Re: Datanodes fail to start
I often find myself editing the src/saveVersion.sh to fake out the version numbers, when I build a hadoop jar for the first time, and have to deploy it on an an already running cluster. On Mon, Jun 15, 2009 at 11:57 PM, Ian jonhson jonhson@gmail.com wrote: If you rebuilt the hadoop, following the wikipage of HowToRelease may reduce the trouble occurred. On Sat, May 16, 2009 at 7:20 AM, Pankil Doshiforpan...@gmail.com wrote: I got the solution.. Namespace IDs where some how incompatible.So I had to clean data dir and temp dir ,format the cluster and make a fresh start Pankil On Fri, May 15, 2009 at 2:25 AM, jason hadoop jason.had...@gmail.com wrote: There should be a few more lines at the end. We only want the part from last the STARTUP_MSG to the end On one of mine a successfull start looks like this: STARTUP_MSG: Starting DataNode STARTUP_MSG: host = at/192.168.1.119 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.19.1-dev STARTUP_MSG: build = -r ; compiled by 'jason' on Tue Mar 17 04:03:57 PDT 2009 / 2009-03-17 03:08:11,884 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Registered FSDatasetStatusMBean 2009-03-17 03:08:11,886 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Opened info server at 50010 2009-03-17 03:08:11,889 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Balancing bandwith is 1048576 bytes/s 2009-03-17 03:08:12,142 INFO org.mortbay.http.HttpServer: Version Jetty/5.1.4 2009-03-17 03:08:12,155 INFO org.mortbay.util.Credential: Checking Resource aliases 2009-03-17 03:08:12,518 INFO org.mortbay.util.Container: Started org.mortbay.jetty.servlet.webapplicationhand...@1e184cb 2009-03-17 03:08:12,578 INFO org.mortbay.util.Container: Started WebApplicationContext[/static,/static] 2009-03-17 03:08:12,721 INFO org.mortbay.util.Container: Started org.mortbay.jetty.servlet.webapplicationhand...@1d9e282 2009-03-17 03:08:12,722 INFO org.mortbay.util.Container: Started WebApplicationContext[/logs,/logs] 2009-03-17 03:08:12,878 INFO org.mortbay.util.Container: Started org.mortbay.jetty.servlet.webapplicationhand...@14a75bb 2009-03-17 03:08:12,884 INFO org.mortbay.util.Container: Started WebApplicationContext[/,/] 2009-03-17 03:08:12,951 INFO org.mortbay.http.SocketListener: Started SocketListener on 0.0.0.0:50075 2009-03-17 03:08:12,951 INFO org.mortbay.util.Container: Started org.mortbay.jetty.ser...@1358f03 2009-03-17 03:08:12,957 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=DataNode, sessionId=null 2009-03-17 03:08:13,242 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=DataNode, port=50020 2009-03-17 03:08:13,264 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting 2009-03-17 03:08:13,304 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 50020: starting 2009-03-17 03:08:13,343 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 50020: starting 2009-03-17 03:08:13,343 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: dnRegistration = DatanodeRegistration(192.168.1.119:50010, storageID=DS-540597485-192.168.1.119-50010-1237022386925, infoPort=50075, ipcPort=50020) 2009-03-17 03:08:13,344 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 50020: starting 2009-03-17 03:08:13,344 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 50020: starting 2009-03-17 03:08:13,351 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration( 192.168.1.119:50010, storageID=DS-540597485-192.168.1.119-50010-1237022386925, infoPort=50075, 
ipcPort=50020)In DataNode.run, data = FSDataset{dirpath='/tmp/hadoop-0.19.0-jason/dfs/data/current'} 2009-03-17 03:08:13,352 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: using BLOCKREPORT_INTERVAL of 360msec Initial delay: 0msec 2009-03-17 03:08:13,391 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: BlockReport of 14 blocks got processed in 27 msecs 2009-03-17 03:08:13,392 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Starting Periodic block scanner. On Thu, May 14, 2009 at 9:51 PM, Pankil Doshi forpan...@gmail.com wrote: This is log from datanode. 2009-05-14 00:36:14,559 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 82 blocks got processed in 12 msecs 2009-05-14 01:36:15,768 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 82 blocks got processed in 8 msecs 2009-05-14 02:36:13,975 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 82 blocks got processed in 9 msecs 2009-05-14 03:36:15,189 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 82 blocks got processed in 12 msecs 2009-05-14 04:36:13,384 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 82 blocks got processed in 9
Re: Debugging Map-Reduce programs
When you are running in local mode you have two basic choices if you want to interact with a debugger. You can launch from within Eclipse or another IDE, or you can set up a Java debugger transport as part of the mapred.child.java.opts variable and attach to the running JVM. By far the simplest is launching via Eclipse. Your other alternative is to inform the framework to retain the job files via keep.failed.task.files (be careful here, you will fill your disk with old dead data) and use the IsolationRunner to debug. Examples in my book :) On Mon, Jun 15, 2009 at 6:49 PM, bharath vissapragada bharathvissapragada1...@gmail.com wrote: I am running in local mode. Can you tell me how to set those breakpoints or how to access those files so that I can debug the program. The program is generating = java.lang.NumberFormatException: For input string: But that particular string is the one which is the input to the map class. So I think that it is not reading my input correctly .. But when I try to print the same .. it isn't printing to the STDOUT .. I am using the FileInputFormat class FileInputFormat.addInputPath(conf, new Path(/home/rip/Desktop/hadoop-0.18.3/input)); FileOutputFormat.setOutputPath(conf, new Path(/home/rip/Desktop/hadoop-0.18.3/output)); input and output are folders for input and output. It is generating these warnings also 09/06/16 12:38:32 WARN fs.FileSystem: local is a deprecated filesystem name. Use file:/// instead. Thanks in advance On Tue, Jun 16, 2009 at 3:50 AM, Aaron Kimball aa...@cloudera.com wrote: On Mon, Jun 15, 2009 at 10:01 AM, bharath vissapragada bhara...@students.iiit.ac.in wrote: Hi all, When running hadoop in local mode .. can we use print statements to print something to the terminal ... Yes. In distributed mode, each task will write its stdout/stderr to files which you can access through the web-based interface. Also I am not sure whether the program is reading my input files ... If I keep print statements it isn't displaying any .. can anyone tell me how to solve this problem. Is it generating exceptions? Are the files present? If you're running in local mode, you can use a debugger; set a breakpoint in your map() method and see if it gets there. How are you configuring the input files for your job? Thanks in advance, -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.amazon.com/dp/1430219424?tag=jewlerymall www.prohadoopbook.com a community for Hadoop Professionals
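A minimal sketch of the debugger-transport option above, assuming the 0.18/0.19 JobConf API; the helper class name, heap size, and port are illustrative, and suspend=y should only be used on a tiny test job because every child JVM will stop and wait on that port until a debugger (for example an Eclipse "Remote Java Application" session) attaches:

import org.apache.hadoop.mapred.JobConf;

public class DebugOpts {
  // Call from the job driver before submitting.
  public static void enableChildDebugging(JobConf conf) {
    // Each task JVM listens on port 8000 and waits for a debugger to attach.
    conf.set("mapred.child.java.opts",
        "-Xmx512m -Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=y,address=8000");
  }
}

In true local mode the whole job runs inside the submitting JVM, so launching that JVM directly from Eclipse (the first option above) is usually the easier route.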
Re: Neither OOM Java Heap Space nor GC Overhead Limit Exceeded
Is it possible that your map class is an inner class and not static? On Tue, Jun 16, 2009 at 10:51 AM, akhil1988 akhilan...@gmail.com wrote: Hi All, I am running my mapred program in local mode by setting mapred.jobtracker.local to local mode so that I can debug my code. The mapred program is a direct porting of my original sequential code. There is no reduce phase. Basically, I have just put my program in the map class. My program takes around 1-2 min. in instantiating the data objects which are present in the constructor of Map class(it loads some data model files, therefore it takes some time). After the instantiation part in the constrcutor of Map class the map function is supposed to process the input split. The problem is that the data objects do not get instantiated completely and in between(whlie it is still in constructor) the program stops giving the exceptions pasted at bottom. The program runs fine without mapreduce and does not require more than 2GB memory, but in mapreduce even after doing export HADOOP_HEAPSIZE=2500(I am working on machines with 16GB RAM), the program fails. I have also set HADOOP_OPTS=-server -XX:-UseGCOverheadLimit as sometimes I was getting GC Overhead Limit Exceeded exceptions also. Somebody, please help me with this problem: I have trying to debug it for the last 3 days, but unsuccessful. Thanks! java.lang.OutOfMemoryError: Java heap space at sun.misc.FloatingDecimal.toJavaFormatString(FloatingDecimal.java:889) at java.lang.Double.toString(Double.java:179) at java.text.DigitList.set(DigitList.java:272) at java.text.DecimalFormat.format(DecimalFormat.java:584) at java.text.DecimalFormat.format(DecimalFormat.java:507) at java.text.NumberFormat.format(NumberFormat.java:269) at org.apache.hadoop.util.StringUtils.formatPercent(StringUtils.java:110) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1147) at LbjTagger.NerTagger.main(NerTagger.java:109) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:165) at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68) 09/06/16 12:34:41 WARN mapred.LocalJobRunner: job_local_0001 java.lang.RuntimeException: java.lang.reflect.InvocationTargetException at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:81) at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:328) at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138) Caused by: java.lang.reflect.InvocationTargetException at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:79) ... 
5 more Caused by: java.lang.ThreadDeath at java.lang.Thread.stop(Thread.java:715) at org.apache.hadoop.mapred.LocalJobRunner.killJob(LocalJobRunner.java:310) at org.apache.hadoop.mapred.JobClient$NetworkedJob.killJob(JobClient.java:315) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1224) at LbjTagger.NerTagger.main(NerTagger.java:109) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:165) at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68) -- View this message in context:
Re: Neither OOM Java Heap Space nor GC Overhead Limit Exceeded
Something is happening inside of your (Parameters. readConfigAndLoadExternalData(Config/allLayer1.config);) code, and the framework is killing the job for not heartbeating for 600 seconds On Tue, Jun 16, 2009 at 8:32 PM, akhil1988 akhilan...@gmail.com wrote: One more thing, finally it terminates there (after some time) by giving the final Exception: java.io.IOException: Job failed! at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217) at LbjTagger.NerTagger.main(NerTagger.java:109) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:165) at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68) akhil1988 wrote: Thank you Jason for your reply. My Map class is an inner class and it is a static class. Here is the structure of my code. public class NerTagger { public static class Map extends MapReduceBase implements MapperLongWritable, Text, Text, Text{ private Text word = new Text(); private static NETaggerLevel1 tagger1 = new NETaggerLevel1(); private static NETaggerLevel2 tagger2 = new NETaggerLevel2(); Map(){ System.out.println(HI2\n); Parameters.readConfigAndLoadExternalData(Config/allLayer1.config); System.out.println(HI3\n); Parameters.forceNewSentenceOnLineBreaks=Boolean.parseBoolean(true); System.out.println(loading the tagger); tagger1=(NETaggerLevel1)Classifier.binaryRead(Parameters.pathToModelFile+.level1); System.out.println(HI5\n); tagger2=(NETaggerLevel2)Classifier.binaryRead(Parameters.pathToModelFile+.level2); System.out.println(Done- loading the tagger); } public void map(LongWritable key, Text value, OutputCollectorText, Text output, Reporter reporter ) throws IOException { String inputline = value.toString(); /* Processing of the input pair is done here */ } public static void main(String [] args) throws Exception { JobConf conf = new JobConf(NerTagger.class); conf.setJobName(NerTagger); conf.setOutputKeyClass(Text.class); conf.setOutputValueClass(IntWritable.class); conf.setMapperClass(Map.class); conf.setNumReduceTasks(0); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(TextOutputFormat.class); conf.set(mapred.job.tracker, local); conf.set(fs.default.name, file:///); DistributedCache.addCacheFile(new URI(/home/akhil1988/Ner/OriginalNer/Data/), conf); DistributedCache.addCacheFile(new URI(/home/akhil1988/Ner/OriginalNer/Config/), conf); DistributedCache.createSymlink(conf); conf.set(mapred.child.java.opts,-Xmx4096m); FileInputFormat.setInputPaths(conf, new Path(args[0])); FileOutputFormat.setOutputPath(conf, new Path(args[1])); System.out.println(HI1\n); JobClient.runJob(conf); } Jason, when the program executes HI1 and HI2 are printed but it does not reaches HI3. In the statement Parameters.readConfigAndLoadExternalData(Config/allLayer1.config); it is able to access Config/allLayer1.config file (as while executing this statement, it prints some messages like which data it is loading, etc.) but it gets stuck there(while loading some classifier) and never reaches HI3. This program runs fine when executed normally(without mapreduce). 
Thanks, Akhil jason hadoop wrote: Is it possible that your map class is an inner class and not static? On Tue, Jun 16, 2009 at 10:51 AM, akhil1988 akhilan...@gmail.com wrote: Hi All, I am running my mapred program in local mode by setting mapred.jobtracker.local to local mode so that I can debug my code. The mapred program is a direct porting of my original sequential code. There is no reduce phase. Basically, I have just put my program in the map class. My program takes around 1-2 min. in instantiating the data objects which are present in the constructor of Map class(it loads some
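A hedged sketch of one way around the heartbeat timeout described above, assuming the 0.18/0.19 API; the class name, sleep interval, and timeout value are illustrative. Either raise mapred.task.timeout in the driver, or defer the expensive loading to the first map() call, where a Reporter is available and a small keep-alive thread can report progress while the models load:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SlowStartMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // Alternative, in the driver: conf.setLong("mapred.task.timeout", 30 * 60 * 1000L);

  private volatile boolean loaded = false;

  private void loadModels(final Reporter reporter) {
    Thread keepAlive = new Thread() {
      public void run() {
        while (!loaded) {
          reporter.progress();                  // tell the framework this task is alive
          try { Thread.sleep(10000); } catch (InterruptedException e) { return; }
        }
      }
    };
    keepAlive.setDaemon(true);
    keepAlive.start();
    // ... the expensive loading that was in the Map constructor goes here ...
    loaded = true;
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    if (!loaded) {
      loadModels(reporter);                     // lazy init on the first record
    }
    // ... per-record processing ...
  }
}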
Re: java.lang.ClassNotFoundException
Your class is not in your jar, or your jar is not available in the hadoop class path. On Mon, Jun 15, 2009 at 2:39 AM, bharath vissapragada bhara...@students.iiit.ac.in wrote: Hi all, When I try to run my own program (jar file) I get the following error. java.lang.ClassNotFoundException : file-Name.class at java.net.URLClassLoader$1.run(URLClassLoader.java:217) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:205) at java.lang.ClassLoader.loadClass(ClassLoader.java:323) at java.lang.ClassLoader.loadClass(ClassLoader.java:268) at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:336) at java.lang.Class.forName0(Native Method) at java.lang.Class.forName(Class.java:264) at org.apache.hadoop.util.RunJar.main(RunJar.java:148) at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68) Can anyone tell me the reason(s) for this .. Thanks in advance -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
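A minimal driver sketch showing the usual cure, assuming the 0.18/0.19 API; the class and path names are illustrative, and IdentityMapper/IdentityReducer merely stand in for your own classes, which must be packed inside the jar you submit:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class MyJob {
  public static void main(String[] args) throws Exception {
    // Passing a class from your jar lets Hadoop locate the jar to ship to the cluster.
    JobConf conf = new JobConf(MyJob.class);
    conf.setJobName("classpath-check");
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}

Check that the jar actually contains the class (jar tf myjob.jar) and submit with the fully qualified class name rather than a file name, e.g. bin/hadoop jar myjob.jar MyJob /input /output.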
Re: get number of values for a key
It would be nice if there were an interface-compliant way. Perhaps it will become available in the 0.20 and later APIs. On Sat, Jun 13, 2009 at 3:40 PM, Rares Vernica rvern...@gmail.com wrote: Hello, In Reduce, can I get the number of values for the current key without iterating over them? Does Hadoop have this number? Or, at least the total number of pairs that will be processed by the current Reduce instance. I am pretty sure that Hadoop already knows this number because it sorted them. BTW, the iterators given to Reduce are one-time use iterators, right? Thanks! Rares -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
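The usual workaround in the 0.18/0.19 API is simply to count while iterating; the values are merge-streamed from the sorted map outputs, so the framework does not know the group size up front, and the iterator is indeed single pass. A minimal sketch (the class name is illustrative):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class CountingReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, IntWritable> {
  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int count = 0;
    while (values.hasNext()) {
      values.next();            // single-pass iterator: a value is gone once consumed
      count++;
    }
    output.collect(key, new IntWritable(count));
  }
}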
Re: 2009 Hadoop Summit West - was wonderful
This is the best I have at present: http://www.cs.cmu.edu/~priya/ On Sat, Jun 13, 2009 at 11:05 AM, zsongbo zson...@gmail.com wrote: Hi Jason, Could you please post more information about Priya Narasimhan's toolset for automated fault detection in hadoop clusters? Such as a url or other pointers. Thanks. Schubert On Thu, Jun 11, 2009 at 11:26 AM, jason hadoop jason.had...@gmail.com wrote: I had a great time, schmoozing with people, and enjoyed a couple of the talks. I would love to see more from Priya Narasimhan, and hope their toolset for automated fault detection in hadoop clusters becomes generally available. Zookeeper rocks on! HBase is starting to look really good; in 0.20 the master node as a single point of failure and configuration headache goes away and Zookeeper takes over. Owen O'Malley gave a solid presentation on the new Hadoop API's and the reasons for the changes. It was good to hang with everyone, see you all next year! I even got to spend a little time chatting with Tom White, and got a signed copy of his book, thanks Tom! -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: The behavior of HashPartitioner
You can always write something simple to hand-call the HashPartitioner. Jython works for quick tests. But the code in HashPartitioner is essentially (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks. Since nothing else is in play, I suspect there is an incorrect assumption somewhere: for IntWritable keys the hash code is the int value itself, so two keys whose values are congruent modulo the number of reducers land on the same reducer, while a reducer whose index matches no key's value modulo the reducer count receives nothing. On Fri, Jun 12, 2009 at 11:25 AM, Zhengguo 'Mike' SUN zhengguo...@yahoo.com wrote: Hi, The intermediate key generated by my Mappers is IntWritable. I tested with different numbers of Reducers. When the number of Reducers is the same as the number of different keys of intermediate output, it partitions perfectly. Each Reducer receives one input group. When these two numbers are different, the partitioning function becomes difficult to understand. For example, when the number of keys is less than the number of Reducers, I am expecting that each Reducer receives at most one input group. But it turns out that many Reducers receive more than one input group. On the other hand, when the number of keys is larger than the number of Reducers, I am expecting that each Reducer receives at least one input group. But it turns out that some Reducers receive nothing to process. The expectation I had is from the implementation of the HashPartitioner class, which just uses the modulo operator with the number of Reducers to generate partitions. Does anyone have any insights into this? -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
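A quick way to hand-call the partitioner, as suggested above; this is a plain Java sketch (the class name is illustrative) rather than Jython, and it leans on the fact that IntWritable.hashCode() is just the int value:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.lib.HashPartitioner;

public class PartitionCheck {
  public static void main(String[] args) {
    int numReduces = args.length > 0 ? Integer.parseInt(args[0]) : 4;
    // HashPartitioner keeps no state from the JobConf, so it can be called directly.
    HashPartitioner<IntWritable, NullWritable> p =
        new HashPartitioner<IntWritable, NullWritable>();
    for (int k : new int[] {1, 3, 5, 8, 13, 21}) {
      System.out.println("key " + k + " -> reducer "
          + p.getPartition(new IntWritable(k), NullWritable.get(), numReduces));
    }
  }
}

With four reduces, keys 1, 5, 13 and 21 all land on reducer 1 while reducer 2 gets nothing, which is exactly the kind of skew described in the question.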
Re: HDFS data transfer!
Also check the IO wait time on your datanodes, if the io wait time is high, you can't win. On Fri, Jun 12, 2009 at 11:24 AM, Brian Bockelman bbock...@cse.unl.eduwrote: What's your replication factor? What aggregate I/O rates do you see in Ganglia? Is the I/O spikey, or has it plateaued? We can hit close to network rate (1Gbps) per node locally, and have pretty similar hardware. Brian On Jun 12, 2009, at 9:03 AM, Scott wrote: I ran the put command on 3 of the nodes simultaneously to copy files that were local on those machines into the hdfs. Brian Bockelman wrote: What'd you do for the tests? Was it a single stream or a multiple stream test? Brian On Jun 12, 2009, at 6:48 AM, Scott wrote: So is ~ 1GB/minute transfer rate a reasonable performance benchmark? Our test cluster consists of 4 quad core xeon machines with 2 non-raided drives each. My initial tests show a transfer rate of around 1GB/minute, and that was slower that I expected it to be. Thanks, Scott Brian Bockelman wrote: Hey Sugandha, Transfer rates depend on the quality/quantity of your hardware and the quality of your client disk that is generating the data. I usually say that you should expect near-hardware-bottleneck speeds for an otherwise idle cluster. There should be no make it fast required (though you should reviewi the logs for errors if it's going slow). I would expect a 5GB file to take around 3-5 minutes to write on our cluster, but it's a well-tuned and operational cluster. As Todd (I think) mentioned before, we can't help any when you say I want to make it faster. You need to provide diagnostic information - logs, Ganglia plots, stack traces, something - that folks can look at. Brian On Jun 10, 2009, at 2:25 AM, Sugandha Naolekar wrote: But if I want to make it fast, then??? I want to place the data in HDFS and reoplicate it in fraction of seconds. Can that be possible. and How? On Wed, Jun 10, 2009 at 2:47 PM, kartik saxena kartik@gmail.com wrote: I would suppose about 2-3 hours. It took me some 2 days to load a 160 Gb file. Secura On Wed, Jun 10, 2009 at 11:56 AM, Sugandha Naolekar sugandha@gmail.comwrote:It Hello! If I try to transfer a 5GB VDI file from a remote host(not a part of hadoop cluster) into HDFS, and get it back, how much time is it supposed to take? No map-reduce involved. Simply Writing files in and out from HDFS through a simple code of java (usage of API's). -- Regards! Sugandha -- Regards! Sugandha -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: Hadoop streaming - No room for reduce task error
The reduce input will spill to disk during the sort unless the machine/JVM has a huge allowed memory space, and here the expected reduce input is larger than the partition's free space. If I did my math correctly, you are trying to push ~2TB through the single reduce. As for the part- files, if you have the number of reduces set to zero, you will get N part files, where N is the number of map tasks. If you absolutely must have it all go to one reduce, you will need to increase the free disk space. I think 19.1 preserves compression for the map output, so you could try enabling compression for map output. If you have many nodes, you can set the number of reduces to some number and then use sort -M on the part files to merge sort them, assuming your reduce preserves ordering. Try adding these parameters to your job line: -D mapred.compress.map.output=true -D mapred.output.compression.type=BLOCK BTW, /bin/cat works fine as an identity mapper or an identity reducer. On Wed, Jun 10, 2009 at 5:31 PM, Todd Lipcon t...@cloudera.com wrote: Hey Scott, It turns out that Alex's answer was mistaken - your error is actually coming from lack of disk space on the TT that has been assigned the reduce task. Specifically, there is not enough space in mapred.local.dir. You'll need to change your mapred.local.dir to point to a partition that has enough space to contain your reduce output. As for why this is the case, I hope someone will pipe up. It seems to me that reduce output can go directly to the target filesystem without using space on mapred.local.dir. Thanks -Todd On Wed, Jun 10, 2009 at 4:58 PM, Alex Loddengaard a...@cloudera.com wrote: What is mapred.child.ulimit set to? This configuration option specifies how much memory child processes are allowed to have. You may want to up this limit and see what happens. Let me know if that doesn't get you anywhere. Alex On Wed, Jun 10, 2009 at 9:40 AM, Scott skes...@weather.com wrote: Complete newbie map/reduce question here. I am using hadoop streaming as I come from a Perl background, and am trying to prototype/test a process to load/clean-up ad server log lines from multiple input files into one large file on the hdfs that can then be used as the source of a hive db table. I have a perl map script that reads an input line from stdin, does the needed cleanup/manipulation, and writes back to stdout. I don't really need a reduce step, as I don't care what order the lines are written in, and there is no summary data to produce. When I run the job with -reducer NONE I get valid output, however I get multiple part-x files rather than one big file. So I wrote a trivial 'reduce' script that reads from stdin and simply splits the key/value, and writes the value back to stdout. I am executing the code as follows: ./hadoop jar ../contrib/streaming/hadoop-0.19.1-streaming.jar -mapper /usr/bin/perl /home/hadoop/scripts/map_parse_log_r2.pl -reducer /usr/bin/perl /home/hadoop/scripts/reduce_parse_log.pl -input /logs/*.log -output test9 The code I have works when given a small set of input files. However, I get the following error when attempting to run the code on a large set of input files: hadoop-hadoop-jobtracker-testdw0b00.log.2009-06-09:2009-06-09 15:43:00,905 WARN org.apache.hadoop.mapred.JobInProgress: No room for reduce task. 
Node tracker_testdw0b00:localhost.localdomain/127.0.0.1:53245 has 2004049920 bytes free; but we expect reduce input to take 22138478392 I assume this is because the all the map output is being buffered in memory prior to running the reduce step? If so, what can I change to stop the buffering? I just need the map output to go directly to one large file. Thanks, Scott -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
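For reference, a hedged sketch of where those flags would sit on the streaming command line quoted in this thread (same jar, scripts and paths as the original posting; the quotes keep each perl invocation as a single argument, and very old streaming releases may need -jobconf name=value instead of -D):

./hadoop jar ../contrib/streaming/hadoop-0.19.1-streaming.jar \
  -D mapred.compress.map.output=true \
  -D mapred.output.compression.type=BLOCK \
  -mapper "/usr/bin/perl /home/hadoop/scripts/map_parse_log_r2.pl" \
  -reducer "/usr/bin/perl /home/hadoop/scripts/reduce_parse_log.pl" \
  -input "/logs/*.log" \
  -output test9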
Re: Large size Text file split
There is always NLineInputFormat. You specify the number of lines per split. The key is the position of the line start in the file, the value is the line itself. The parameter mapred.line.input.format.linespermap controls the number of lines per split. On Wed, Jun 10, 2009 at 5:27 AM, Harish Mallipeddi harish.mallipe...@gmail.com wrote: On Wed, Jun 10, 2009 at 5:36 PM, Wenrui Guo wenrui@ericsson.com wrote: Hi, all I have a large csv file ( larger than 10 GB ), I'd like to use a certain InputFormat to split it into smaller parts so that each Mapper can deal with a piece of the csv file. However, as far as I know, FileInputFormat only cares about the byte size of the file, that is, the class may divide the csv file into many parts, and some parts may not be well-formed CSV files. For example, one line of the CSV file is not terminated with CRLF, or maybe some text is trimmed. How to ensure each FileSplit is a smaller valid CSV file using a proper InputFormat? BR/anderson If all you care about is the splits occurring at line boundaries, then TextInputFormat will work. http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/TextInputFormat.html If not I guess you can write your own InputFormat class. -- Harish Mallipeddi http://blog.poundbang.in -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
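A minimal driver sketch, assuming the 0.19-era org.apache.hadoop.mapred.lib.NLineInputFormat; the class name, the 1000-lines-per-map figure, and the use of IdentityMapper in place of the real CSV-processing mapper are all illustrative:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.NLineInputFormat;

public class CsvSplitJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CsvSplitJob.class);
    conf.setJobName("csv-split");
    conf.setInputFormat(NLineInputFormat.class);          // key = byte offset, value = one whole line
    conf.setInt("mapred.line.input.format.linespermap", 1000);
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);
    conf.setMapperClass(IdentityMapper.class);            // stand-in for the real CSV mapper
    conf.setNumReduceTasks(0);                            // map-only; every split ends on a line boundary
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf);
  }
}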
2009 Hadoop Summit West - was wonderful
I had a great time, schmoozing with people, and enjoyed a couple of the talks. I would love to see more from Priya Narasimhan, and hope their toolset for automated fault detection in hadoop clusters becomes generally available. Zookeeper rocks on! HBase is starting to look really good; in 0.20 the master node as a single point of failure and configuration headache goes away and Zookeeper takes over. Owen O'Malley gave a solid presentation on the new Hadoop API's and the reasons for the changes. It was good to hang with everyone, see you all next year! I even got to spend a little time chatting with Tom White, and got a signed copy of his book, thanks Tom! -- Pro Hadoop, a book to guide you from beginner to hadoop mastery, http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: [ADV] Blatant marketing of the book Pro Hadoop. In honor of the 09 summit here is a 50% off coupon,
I just sent a note to the publisher, hopefully they will fix it, especially since I just printed up 100 flyers to give out at the hadoop summit! On Tue, Jun 9, 2009 at 7:37 PM, Burt Beckwith b...@burtbeckwith.com wrote: Does this apply to the printed book or eBook? I tried it with the eBook but it didn't work: ERROR: The promotional code 'LUCKYYOU' does not exist. Burt On Tuesday 09 June 2009 10:15:24 pm jason hadoop wrote: In honor of the Hadoop Summit on June 10th(tomorrow), Apress has agreed to provide some conference swag, in the form of a 50% off coupon Purchase the book at http://eBookshop.apress.com and use code LUCKYYOU, for 50% off the list price. The coupon has a short valid time so don't delay your purchase :) -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: [ADV] Blatant marketing of the book Pro Hadoop. In honor of the 09 summit here is a 50% off coupon,
CORRECTED CODE, LUCKYOU I miss read the flyer. On Tue, Jun 9, 2009 at 8:45 PM, jason hadoop jason.had...@gmail.com wrote: I just sent a note to the publisher, hopefully they will fix it, especially since I just printed up 100 flyers to give out at the hadoop summit! On Tue, Jun 9, 2009 at 7:37 PM, Burt Beckwith b...@burtbeckwith.comwrote: Does this apply to the printed book or eBook? I tried it with the eBook but it didn't work: ERROR: The promotional code 'LUCKYYOU' does not exist. Burt On Tuesday 09 June 2009 10:15:24 pm jason hadoop wrote: In honor of the Hadoop Summit on June 10th(tomorrow), Apress has agreed to provide some conference swag, in the form of a 50% off coupon Purchase the book at http://eBookshop.apress.com and use code LUCKYYOU, for 50% off the list price. The coupon has a short valid time so don't delay your purchase :) -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: What should I do to implements writable?
A Writable basically needs to implement two methods: /** * Serialize the fields of this object to out. * * @param out DataOutput to serialize this object into. * @throws IOException */ void write(DataOutput out) throws IOException; /** * Deserialize the fields of this object from in. * * For efficiency, implementations should attempt to re-use storage in the * existing object where possible. * * @param in DataInput to deserialize this object from. * @throws IOException */ void readFields(DataInput in) throws IOException; These use the serialization primitives to pack and unpack the object. These are from the hadoop Text class in 0.19.1: /** deserialize */ public void readFields(DataInput in) throws IOException { int newLength = WritableUtils.readVInt(in); setCapacity(newLength, false); in.readFully(bytes, 0, newLength); length = newLength; } /** serialize * write this object to out * length uses zero-compressed encoding * @see Writable#write(DataOutput) */ public void write(DataOutput out) throws IOException { WritableUtils.writeVInt(out, length); out.write(bytes, 0, length); } If you look at the various implementors of Writable, you will find plenty of examples in the source tree. For very complex objects, the simplest thing to do is to serialize them to an ObjectOutputStream that is backed by a ByteArrayOutputStream, then use the write/read byte array methods on the DataOutput/DataInput. In my code I check to see if in/out implement the ObjectXStream interfaces and use them directly, or else use an intermediate byte array. On Tue, Jun 9, 2009 at 9:23 AM, HRoger hanxianyongro...@163.com wrote: hello: I am writing a class containing a TreeSet to implement Writable; how should I implement the two methods? Any help would be appreciated! -- View this message in context: http://www.nabble.com/What-should-I-do-to-implements-writable--tp23946397p23946397.html Sent from the Hadoop core-user mailing list archive at Nabble.com. -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
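Applied to the original question, a hedged sketch of a Writable that carries a TreeSet of strings, written as a count followed by the elements (the class name is illustrative):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.util.TreeSet;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableUtils;

public class StringSetWritable implements Writable {
  private TreeSet<String> items = new TreeSet<String>();

  public TreeSet<String> get() { return items; }

  public void write(DataOutput out) throws IOException {
    WritableUtils.writeVInt(out, items.size());   // element count first
    for (String s : items) {
      Text.writeString(out, s);                   // then each element, in sorted order
    }
  }

  public void readFields(DataInput in) throws IOException {
    items.clear();                                // re-use the existing object's storage
    int n = WritableUtils.readVInt(in);
    for (int i = 0; i < n; i++) {
      items.add(Text.readString(in));
    }
  }
}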
Re: Map-Reduce!
A very common one is processing large quantities of log files and producing summary date. Another use is simply as a way of distributing large jobs across multiple computers. In a previous job, we used Map/Reduce for distributed bulk web crawling, and for distributed media file processing. On Mon, Jun 8, 2009 at 3:54 AM, Sugandha Naolekar sugandha@gmail.comwrote: Hello! As far as I have read the forums of Map-reduce, it is basically used to process large amount of data speedily. right?? But, can you please give me some instances or examples wherein, I can use map-reduce..??? -- Regards! Sugandha -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: Is there any way to debug the hadoop job in eclipse
The chapters are available for download now. On Sat, Jun 6, 2009 at 3:33 AM, zhang jianfeng zjf...@gmail.com wrote: Is there any resource on the internet that I can get as soon as possible? On Fri, Jun 5, 2009 at 6:43 PM, jason hadoop jason.had...@gmail.com wrote: chapter 7 of my book goes into details of how to debug with eclipse On Fri, Jun 5, 2009 at 3:40 AM, zhang jianfeng zjf...@gmail.com wrote: Hi all, Some jobs I submit to hadoop failed, but I can not see what the problem is. So is there any way to debug the hadoop job in eclipse, such as remote debug, or other ways to find the reason the job failed? I did not find enough information in the job tracker. Thank you. Jeff Zhang -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: Is there any way to debug the hadoop job in eclipse
chapter 7 of my book goes into details of how to debug with eclipse On Fri, Jun 5, 2009 at 3:40 AM, zhang jianfeng zjf...@gmail.com wrote: Hi all, Some jobs I submit to hadoop failed, but I can not see what the problem is. So is there any way to debug the hadoop job in eclipse, such as remote debug, or other ways to find the reason the job failed? I did not find enough information in the job tracker. Thank you. Jeff Zhang -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: Task files in _temporary not getting promoted out
Are your tasks failing or completing successfully? Failed tasks have the output directory wiped; only successfully completed tasks have the files moved up. I don't recall if the FileOutputCommitter class appeared in 0.18. On Wed, Jun 3, 2009 at 6:43 PM, Ian Soboroff ian.sobor...@nist.gov wrote: Ok, help. I am trying to create local task outputs in my reduce job, and they get created, then go poof when the job's done. My first take was to use FileOutputFormat.getWorkOutputPath, and create directories in there for my outputs (which are Lucene indexes). Exasperated, I then wrote a small OutputFormat/RecordWriter pair to write the indexes. In each case, I can see directories being created in attempt_foo/_temporary, but when the task is over they're gone. I've stared at TextOutputFormat and I can't figure out why its files survive and mine don't. Help! Again, this is 0.18.3. Thanks, Ian -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: problem getting map input filename
You can always dump the entire property space and work it out that way. I haven't used the 0.20 APIs yet, so I can't speak to them. On Tue, Jun 2, 2009 at 10:52 AM, Rares Vernica rvern...@gmail.com wrote: On 6/2/09, randy...@comcast.net randy...@comcast.net wrote: Your Map class needs to have a configure method that can access the JobConf. Like this: In Hadoop 0.20 the new Mapper class no longer has configure and JobConf has been replaced with Job. In the Mapper methods, you now get a Context object. public void configure(JobConf conf) { System.out.println(conf.get(map.input.file)); } -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: Reduce() time takes ~4x Map()
At the minimal level, enable map output compression, it may make some difference, mapred.compress.map.output. Sorting is very expensive when there are many keys and the values are large. Are you quite certain your keys are unique. Also, do you need them sorted by document id? On Thu, May 28, 2009 at 8:51 PM, Jothi Padmanabhan joth...@yahoo-inc.comwrote: Hi David, If you go to JobTrackerHistory and then click on this job and then do Analyse This Job, you should be able to get the split up timings for the individual phases of the map and reduce tasks, including the average, best and worst times. Could you provide those numbers so that we can get a better idea of how the job progressed. Jothi On 5/28/09 10:11 PM, David Batista dsbati...@xldb.di.fc.ul.pt wrote: Hi everyone, I'm processing XML files, around 500MB each with several documents, for the map() function I pass a document from the XML file, which takes some time to process depending on the size - I'm applying NER to texts. Each document has a unique identifier, so I'm using that identifier as a key and the results of parsing the document in one string as the output: so at the end of the map function(): output.collect( new Text(identifier), new Text(outputString)); usually the outputString is around 1k-5k size reduce(): public void reduce(Text key, IteratorText values, OutputCollectorText, Text output, Reporter reporter) { while (values.hasNext()) { Text text = values.next(); try { output.collect(key, text); } catch (IOException e) { // TODO Auto-generated catch block e.printStackTrace(); } } } I did a test using only 1 machine with 8 cores, and only 1 XML file, it took around 3 hours to process all maps and ~12hours for the reduces! the XML file has 139 945 documents I set the jobconf for 1000 maps() and 200 reduces() I did took a look at graphs on the web interface during the reduce phase, and indeed its the copy phase that's taking much of the time, the sort and reduce phase are done almost instantly. Why does the copy phase takes so long? I understand that the copies are made using HTTP, and the data was in really small chunks 1k-5k size, but even so, being everything in the same physical machine should have been faster, no? Any suggestions on what might be causing the copies in reduce to take so long? -- ./david -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
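A minimal sketch of turning that on in the driver, assuming the 0.18/0.19 JobConf API; the helper class name and the codec choice are illustrative:

import org.apache.hadoop.io.compress.DefaultCodec;
import org.apache.hadoop.mapred.JobConf;

public class CompressionSettings {
  // Call from the driver before JobClient.runJob(conf).
  public static void enableMapOutputCompression(JobConf conf) {
    conf.setCompressMapOutput(true);                       // mapred.compress.map.output=true
    conf.setMapOutputCompressorClass(DefaultCodec.class);  // zlib; gzip or LZO codecs are alternatives
  }
}

Smaller map output means less data to pull over HTTP during the copy phase, which is where this job is reported to be spending its time.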
Re: Efficient algorithm for many-to-many reduce-side join?
Use the mapside join stuff, if I understand your problem it provides a good solution but requires getting over the learning hurdle. Well described in chapter 8 of my book :) On Thu, May 28, 2009 at 8:29 AM, Chris K Wensel ch...@wensel.net wrote: I believe PIG, and I know Cascading use a kind of 'spillable' list that can be re-iterated across. PIG's version is a bit more sophisticated last I looked. that said, if you were using either one of them, you wouldn't need to write your own many-to-many join. cheers, ckw On May 28, 2009, at 8:14 AM, Todd Lipcon wrote: One last possible trick to consider: If you were to subclass SequenceFileRecordReader, you'd have access to its seek method, allowing you to rewind the reducer input. You could then implement a block hash join with something like the following pseudocode: ahash = new HashMapKey, Val(); while (i have ram available) { read a record if the record is from table B, break put the record into ahash } nextAPos = reader.getPos() while (current record is an A record) { skip to next record } firstBPos = reader.getPos() while (current record has current key) { read and join against ahash process joined result } if firstBPos nextAPos { seek(nextAPos) go back to top } On Thu, May 28, 2009 at 8:05 AM, Todd Lipcon t...@cloudera.com wrote: Hi Stuart, It seems to me like you have a few options. Option 1: Just use a lot of RAM. Unless you really expect many millions of entries on both sides of the join, you might be able to get away with buffering despite its inefficiency. Option 2: Use LocalDirAllocator to find some local storage to spill all of the left table's records to disk in a MapFile format. Then as you iterate over the right table, do lookups in the MapFile. This is really the same as option 1, except that you're using disk as an extension of RAM. Option 3: Convert this to a map-side merge join. Basically what you need to do is sort both tables by the join key, and partition them with the same partitioner into the same number of columns. This way you have an equal number of part-N files for both tables, and within each part-N file they're ordered by join key. In each map task, you open both tableA/part-N and tableB/part-N and do a sequential merge to perform the join. I believe the CompositeInputFormat class helps with this, though I've never used it. Option 4: Perform the join in several passes. Whichever table is smaller, break into pieces that fit in RAM. Unless my relational algebra is off, A JOIN B = A JOIN (B1 UNION B2) = (A JOIN B1 UNION A JOIN B2) if B = B1 UNION B2. Hope that helps -Todd On Thu, May 28, 2009 at 5:02 AM, Stuart White stuart.whi...@gmail.com wrote: I need to do a reduce-side join of two datasets. It's a many-to-many join; that is, each dataset can can multiple records with any given key. Every description of a reduce-side join I've seen involves constructing your keys out of your mapper such that records from one dataset will be presented to the reducers before records from the second dataset. I should hold on to the value from the one dataset and remember it as I iterate across the values from the second dataset. This seems like it only works well for one-to-many joins (when one of your datasets will only have a single record with any given key). This scales well because you're only remembering one value. 
In a many-to-many join, if you apply this same algorithm, you'll need to remember all values from one dataset, which of course will be problematic (and won't scale) when dealing with large datasets with large numbers of records with the same keys. Does an efficient algorithm exist for a many-to-many reduce-side join? -- Chris K Wensel ch...@concurrentinc.com http://www.concurrentinc.com -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
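If the data can be prepared for it, a hedged sketch of the map-side join mentioned above, using the 0.19 org.apache.hadoop.mapred.join classes; it assumes both inputs are SequenceFiles with Text keys, already sorted by the join key and partitioned into the same number of part files, and the class names are illustrative:

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;
import org.apache.hadoop.mapred.join.TupleWritable;

public class ManyToManyJoin {

  public static class JoinMapper extends MapReduceBase
      implements Mapper<Text, TupleWritable, Text, Text> {
    public void map(Text key, TupleWritable value,
                    OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      // value.get(0) is the record from the first input, value.get(1) from the second.
      output.collect(key, new Text(value.get(0) + "\t" + value.get(1)));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ManyToManyJoin.class);
    conf.setJobName("map-side-join");
    conf.setInputFormat(CompositeInputFormat.class);
    // Inner join over two pre-sorted, identically partitioned SequenceFile inputs.
    conf.set("mapred.join.expr", CompositeInputFormat.compose(
        "inner", SequenceFileInputFormat.class, new Path(args[0]), new Path(args[1])));
    conf.setMapperClass(JoinMapper.class);
    conf.setNumReduceTasks(0);               // the join happens entirely on the map side
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(conf, new Path(args[2]));
    JobClient.runJob(conf);
  }
}

If I read the join framework correctly, each map() call receives one pairing of matching records (the full cross product for a many-to-many key), so nothing has to be buffered in a reducer.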
Re: avoid custom crawler getting blocked
Random ordering of the URLs helps; per-thread delays based on how recently a domain was hit also help. On Wed, May 27, 2009 at 6:47 AM, Ken Krugler kkrugler_li...@transpac.com wrote: My current project is to gather stats from a lot of different documents. We are not indexing, just getting quite specific stats for each document. We gather 12 different stats from each document. Our requirements have changed somewhat now; originally it was working on documents from our own servers, but now it needs to fetch other ones from quite a large variety of sources. My approach up to now was to have the map function simply take each filepath (or now URL) in turn, fetch the document, calculate the stats and output those stats. My new problem is some of the locations we are now visiting don't like their IP being hit multiple times in a row. Is it possible to check a URL against a visited list of IPs and if recently visited either wait for a certain amount of time or push it back onto the input stack so it will be processed later in the queue? Or is there a better way? Your use case is very similar to what we've been doing with Bixo. See http://bixo.101tec.com, and also http://bixo.101tec.com/wp-content/uploads/2009/05/bixo-intro.pdf Short answer is that we group URLs by paid-level domain in a map (actually using a Cascading GroupBy operation), and use per-domain queues with multi-threaded fetchers to efficiently load pages in a reduce (a Cascading Buffer operation). -- Ken -- Ken Krugler +1 530-210-6378 -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: Specifying NameNode externally to hadoop-site.xml
if you launch your jobs via bin/hadoop jar jar_file [main class] [options] you can simply specify -fs hdfs://host:port before the jar_file On Sun, May 24, 2009 at 3:02 PM, Stas Oskin stas.os...@gmail.com wrote: Hi. I'm looking to move the Hadoop NameNode URL outside the hadoop-site.xml file, so I could set it at the run-time. Any idea how to do it? Or perhaps there is another configuration that can be applied to the FileSystem object? Regards. -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: Specifying NameNode externally to hadoop-site.xml
conf.set(fs.default.name, hdfs://host:port); where conf is the JobConf object of your job, before you submit it. On Mon, May 25, 2009 at 10:16 AM, Stas Oskin stas.os...@gmail.com wrote: Hi. Thanks for the tip, but is it possible to set this in dynamic way via code? Thanks. 2009/5/25 jason hadoop jason.had...@gmail.com if you launch your jobs via bin/hadoop jar jar_file [main class] [options] you can simply specify -fs hdfs://host:port before the jar_file On Sun, May 24, 2009 at 3:02 PM, Stas Oskin stas.os...@gmail.com wrote: Hi. I'm looking to move the Hadoop NameNode URL outside the hadoop-site.xml file, so I could set it at the run-time. Any idea how to do it? Or perhaps there is another configuration that can be applied to the FileSystem object? Regards. -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
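A small sketch tying the two suggestions together, assuming a Tool-based driver (the class name is illustrative): ToolRunner parses the generic -fs option into the configuration, and the same property can also be overridden in code before the job is submitted:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    JobConf conf = new JobConf(getConf(), MyDriver.class);
    // Any -fs hdfs://host:port given on the command line is already reflected here.
    System.out.println("fs.default.name = " + conf.get("fs.default.name"));
    // Programmatic override is equally possible before submission:
    // conf.set("fs.default.name", "hdfs://namenode:9000");
    // ... configure mapper/reducer/paths and call JobClient.runJob(conf) ...
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // e.g. bin/hadoop jar myjob.jar MyDriver -fs hdfs://namenode:9000 ...
    System.exit(ToolRunner.run(new MyDriver(), args));
  }
}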
Re: Circumventing Hadoop's data placement policy
Can you give your machines multiple IP addresses, and bind the grid server to a different IP than the datanode? With Solaris you could put it in a different zone. On Sat, May 23, 2009 at 10:13 AM, Brian Bockelman bbock...@math.unl.edu wrote: Hey all, Had a problem I wanted to ask advice on. The Caltech site I work with currently has a few GridFTP servers which are on the same physical machines as the Hadoop datanodes, and a few that aren't. The GridFTP server has a libhdfs backend which writes incoming network data into HDFS. They've found that the GridFTP servers which are co-located with an HDFS datanode have poor performance because data is incoming at a much faster rate than the HDD can handle. The standalone GridFTP servers, however, push data out to multiple nodes at once, and can handle the incoming data just fine (200MB/s). Is there any way to turn off the preference for the local node? Can anyone think of a good workaround to trick HDFS into thinking the client isn't on the same node? Brian -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: Could only be replicated to 0 nodes, instead of 1
It does not appear that any datanodes have connected to your namenode. on the datanode machines look in the hadoop logs directory at the datanode log files. There should be some information there that helps you diagnose the problem. chapter 4 of my book provides some detail on work with this problem On Thu, May 21, 2009 at 4:29 AM, ashish pareek pareek...@gmail.com wrote: Hi , I have two suggestion i)Choose a right version ( Hadoop- 0.18 is good) ii)replication should be 3 as ur having 3 modes.( Indirectly see to it that ur configuration is correct !!) Hey even i am just suggesting this as i am also a new to hadoop Ashish Pareek On Thu, May 21, 2009 at 2:41 PM, Stas Oskin stas.os...@gmail.com wrote: Hi. I'm testing Hadoop in our lab, and started getting the following message when trying to copy a file: Could only be replicated to 0 nodes, instead of 1 I have the following setup: * 3 machines, 2 of them with only 80GB of space, and 1 with 1.5GB * Two clients are copying files all the time (one of them is the 1.5GB machine) * The replication is set on 2 * I let the space on 2 smaller machines to end, to test the behavior Now, one of the clients (the one located on 1.5GB) works fine, and the other one - the external, unable to copy and displays the error + the exception below Any idea if this expected on my scenario? Or how it can be solved? Thanks in advance. 09/05/21 10:51:03 WARN dfs.DFSClient: NotReplicatedYetException sleeping /test/test.bin retries left 1 09/05/21 10:51:06 WARN dfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /test/test.bin could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1123 ) at org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:330) at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25 ) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:481) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:890) at org.apache.hadoop.ipc.Client.call(Client.java:716) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39 ) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25 ) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82 ) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59 ) at org.apache.hadoop.dfs.$Proxy0.addBlock(Unknown Source) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2450 ) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2333 ) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1800(DFSClient.java:1745 ) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1922 ) 09/05/21 10:51:06 WARN dfs.DFSClient: Error Recovery for block null bad datanode[0] java.io.IOException: Could not get block locations. Aborting... 
at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:2153 ) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1400(DFSClient.java:1745 ) at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1899 ) -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: Multipleoutput file
setInputPaths will take an array, or variable arguments. or you can simply provide the directory that the individual files reside in, and the individual files will be added. If there are other files in the directory, you may need to specify a custom input path filter via FileInputFormat.setInputPathFilter. 2009/5/21 皮皮 pi.bingf...@gmail.com yes , but how can i get the commaSeperatedPaths? As i can't specify it handy. it's not practicable to do that: commaSeperatedPaths_1 = MAPPINGOUTPUT-r-1; commaSeperatedPaths_2 = MAPPINGOUTPUT-r-2; FileInputFormat.setInputPaths(job, commaSeperatedPaths_1); FileInputFormat.setInputPaths(job, commaSeperatedPaths_2); 2009/4/7 Brian MacKay brian.mac...@medecision.com Not sure about your question: seems like you'd like to do this...? After you run job, your output may be like MAPPINGOUTPUT-r-1, MAPPINGOUTPUT-r-2, etc. You'd need to set them as multiple inputs. FileInputFormat.setInputPaths(job, commaSeperatedPaths); Brian -Original Message- From: 皮皮 [mailto:pi.bingf...@gmail.com] Sent: Tuesday, April 07, 2009 3:30 AM To: core-user@hadoop.apache.org Subject: Re: Multiple k,v pairs from a single map - possible? could any body tell me how to get one of the multipleoutput file in another jobconfig? 2009/4/3 皮皮 pi.bingf...@gmail.com thank you very much . this is what i am looking for. 2009/3/27 Brian MacKay brian.mac...@medecision.com Amandeep, Add this to your driver. MultipleOutputs.addNamedOutput(conf, PHONE,TextOutputFormat.class, Text.class, Text.class); MultipleOutputs.addNamedOutput(conf, NAME, TextOutputFormat.class, Text.class, Text.class); And in your reducer private MultipleOutputs mos; public void reduce(Text key, IteratorText values, OutputCollectorText, Text output, Reporter reporter) { // namedOutPut = either PHONE or NAME while (values.hasNext()) { String value = values.next().toString(); mos.getCollector(namedOutPut, reporter).collect( new Text(value), new Text(othervals)); } } @Override public void configure(JobConf conf) { super.configure(conf); mos = new MultipleOutputs(conf); } public void close() throws IOException { mos.close(); } By the way, have you had a change to post your Oracle fix to DBInputFormat ? If so, what is the Jira tag #? Brian -Original Message- From: Amandeep Khurana [mailto:ama...@gmail.com] Sent: Friday, March 27, 2009 5:46 AM To: core-user@hadoop.apache.org Subject: Multiple k,v pairs from a single map - possible? Is it possible to output multiple key value pairs from a single map function run? For example, the mapper outputing name,phone and name, address simultaneously... Can I write multiple output.collect(...) commands? Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you received this message in error, please contact the sender and delete the material from any computer. _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. 
-- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
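A hedged sketch of the easy ways to pick up all the part files from a previous job in the next job's configuration (0.18/0.19 API; the directory, glob, and filter class names are illustrative, and the filter is only needed when unrelated files share the directory):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class NextJobInput {
  public static void configureInput(JobConf conf, String prevOutputDir) {
    // 1) Point at the whole directory: every (non-hidden) file inside becomes an input.
    FileInputFormat.setInputPaths(conf, new Path(prevOutputDir));

    // 2) Or use a glob that matches only the named outputs, e.g.:
    //    FileInputFormat.setInputPaths(conf, new Path(prevOutputDir + "/MAPPINGOUTPUT-r-*"));

    // 3) Or keep the directory but filter out everything else.
    FileInputFormat.setInputPathFilter(conf, MappingOutputFilter.class);
  }

  public static class MappingOutputFilter implements PathFilter {
    public boolean accept(Path path) {
      return path.getName().startsWith("MAPPINGOUTPUT");
    }
  }
}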
Re: Randomize input file?
The last time I had to do something like this, in the map phase, I made the key a random value, md5 of the key, and built a new value that had the real key embedded. Then in the reduce phase I received the records in random order and could do what I wanted. By using a stable but differently sorting value for the key, my reduce still grouped correctly but I received the calls to reduce in a random order compared to the normal sort order of the data. On Thu, May 21, 2009 at 12:25 PM, Alex Loddengaard a...@cloudera.comwrote: Bhupesh, I forgot to say that the concatenation phase of my plan would concatenate randomly. As I mentioned, this wouldn't be a good way to randomize, but it'd be pretty easy. Anyway, your solution is much more clever and does a better job randomizing. Good thinking! Thanks, Alex On Thu, May 21, 2009 at 11:36 AM, Bhupesh Bansal bban...@linkedin.com wrote: Hmm , IMHO running a mapper only job will give you an output file With same order. You should write a custom map-reduce job Where map emits (Key:Integer.random() , value:line) And reducer Output (key:NOTHING , value:line) Reducer will sort on Integer.random() giving you a random ordering for your input file. Best Bhupesh On 5/21/09 11:15 AM, Alex Loddengaard a...@cloudera.com wrote: Hi John, I don't know of a built-in way to do this. Depending on how well you want to randomize, you could just run a MapReduce job with at least one map (the more maps, the more random) and no reduces. When you run a job with no reduces, the shuffle phase is skipped entirely, and the intermediate outputs from the mappers are stored directly to HDFS. Though I think each mapper will create one HDFS file, so you'll have to concatenate all files into a single file. The above isn't a very good way to randomize, but it's fairly easy to implement and should run pretty quickly. Hope this helps. Alex On Thu, May 21, 2009 at 7:18 AM, John Clarke clarke...@gmail.com wrote: Hi, I have a need to randomize my input file before processing. I understand I can chain Hadoop jobs together so the first could take the input file randomize it and then the second could take the randomized file and do the processing. The input file has one entry per line and I want to mix up the lines before the main processing. Is there an inbuilt ability I have missed or will I have to try and write a Hadoop program to shuffle my input file? Cheers, John -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
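A minimal sketch of that trick in the 0.18/0.19 API (class names are illustrative): the map keys each line by its MD5, the shuffle sorts on the digest rather than on the original order, and the reducer strips the key. A java.util.Random key, as also suggested in this thread, works the same way if stability of the ordering does not matter.

import java.io.IOException;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class ShuffleLines {

  // Driver settings: setMapOutputKeyClass(Text), setMapOutputValueClass(Text),
  // setOutputKeyClass(NullWritable), setOutputValueClass(Text).

  public static class ShuffleMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
      try {
        // Key each line by its MD5: stable, but unrelated to the input order.
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        md5.update(line.getBytes(), 0, line.getLength());
        StringBuilder hex = new StringBuilder();
        for (byte b : md5.digest()) {
          hex.append(String.format("%02x", b));
        }
        out.collect(new Text(hex.toString()), line);
      } catch (NoSuchAlgorithmException e) {
        throw new IOException("MD5 not available: " + e);
      }
    }
  }

  public static class StripKeyReducer extends MapReduceBase
      implements Reducer<Text, Text, NullWritable, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<NullWritable, Text> out, Reporter reporter)
        throws IOException {
      while (values.hasNext()) {
        out.collect(NullWritable.get(), values.next());   // emit the line by itself
      }
    }
  }
}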
Re: Optimal Filesystem (and Settings) for HDFS
I always disable atime and it's ilk The deadline scheduler helps with the (non xfs hanging) du datanode timeout issues, but not much. Ultimately that is a caching failure in the kernel, due to the hadoop io patterns. Anshu, any luck getting off the PAE kernels? Is this the xfs lockup, or just the du taking to long? At one point, sagar and I talked about replacing the du call with a script that used the df as a rapid and close proxy, to get rid of the du calls, the block report was another problem On Tue, May 19, 2009 at 3:59 PM, Anshuman Sachdeva asachd...@attributor.com wrote: Hi Brian, thanks for the mail. I have an issue when we use xfs. hadoop runs du -sk after every 10 min on my cluster and some times it goes in the loop and machine hangs. Have you seen this issue or its only me? I'll really appreciate if some one can put some light on this Anshuman - Original Message - From: Bryan Duxbury br...@rapleaf.com To: core-user@hadoop.apache.org Sent: Tuesday, May 19, 2009 2:50:57 PM GMT -08:00 US/Canada Pacific Subject: Re: Optimal Filesystem (and Settings) for HDFS We use XFS for our data drives, and we've had somewhat mixed results. One of the biggest pros is that XFS has more free space than ext3, even with the reserved space settings turned all the way to 0. Another is that you can format a 1TB drive as XFS in about 0 seconds, versus minutes for ext3. This makes it really fast to kickstart our worker nodes. We have seen some weird stuff happen though when machines run out of memory, apparently because the XFS driver does something odd with kernel memory. When this happens, we end up having to do some fscking before we can get that node back online. As far as outright performance, I actually *did* do some tests of xfs vs ext3 performance on our cluster. If you just look at a single machine's local disk speed, you can write and read noticeably faster when using XFS instead of ext3. However, the reality is that this extra disk performance won't have much of an effect on your overall job completion performance, since you will find yourself network bottlenecked well in advance of even ext3's performance. The long and short of it is that we use XFS to speed up our new machine deployment, and that's it. -Bryan On May 18, 2009, at 10:31 AM, Alex Loddengaard wrote: I believe Yahoo! uses ext3, though I know other people have said that XFS has performed better in various benchmarks. We use ext3, though we haven't done any benchmarks to prove its worth. This question has come up a lot, so I think it'd be worth doing a benchmark and writing up the results. I haven't been able to find a detailed analysis / benchmark writeup comparing various filesystems, unfortunately. Hope this helps, Alex On Mon, May 18, 2009 at 8:54 AM, Bob Schulze b.schu...@ecircle.com wrote: We are currently rebuilding our cluster - has anybody recommendations on the underlaying file system? Just standard Ext3? I could imagine that the block size could be larger than its default... Thx for any tips, Bob -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: FSDataOutputStream flush() not working?
When you open a file you have the option, blockSize /** * Opens an FSDataOutputStream at the indicated Path with write-progress * reporting. * @param f the file name to open * @param permission * @param overwrite if a file with this name already exists, then if true, * the file will be overwritten, and if false an error will be thrown. * @param bufferSize the size of the buffer to be used. * @param replication required block replication for the file. * @param blockSize * @param progress * @throws IOException * @see #setPermission(Path, FsPermission) */ public abstract FSDataOutputStream create(Path f, FsPermission permission, boolean overwrite, int bufferSize, short replication, long blockSize, Progressable progress) throws IOException; On Fri, May 15, 2009 at 12:44 PM, Sasha Dolgy sdo...@gmail.com wrote: Hi Todd, Reading through the JIRA, my impression is that data will be written out to hdfs only once it has reached a certain size in the buffer. Is it possible to define the size of that buffer? Or is this a future enhancement? -sasha On Fri, May 15, 2009 at 6:14 PM, Todd Lipcon t...@cloudera.com wrote: Hi Sasha, What version are you running? Up until very recent versions, sync() was not implemented. Even in the newest releases, sync isn't completely finished, and you may find unreliable behavior. For now, if you need this kind of behavior, your best bet is to close each file and then open the next every N minutes. For example, if you're processing logs every 5 minutes, simply close log file log.00223 and round robin to log.00224 right before you need the data to be available to readers. If you're collecting data at a low rate, these files may end up being rather small, and you should probably look into doing merges on the hour/day/etc to avoid small-file proliferation. If you want to track the work being done around append and sync, check out HADOOP-5744 and the issues referenced therein: http://issues.apache.org/jira/browse/HADOOP-5744 Hope that helps, -Todd On Fri, May 15, 2009 at 6:35 AM, Sasha Dolgy sdo...@gmail.com wrote: Hi there, forgive the repost: Right now data is received in parallel and is written to a queue, then a single thread reads the queue and writes those messages to a FSDataOutputStream which is kept open, but the messages never get flushed. Tried flush() and sync() with no joy. 1. outputStream.writeBytes(rawMessage.toString()); 2. log.debug(Flushing stream, size = + s.getOutputStream().size()); s.getOutputStream().sync(); log.debug(Flushed stream, size = + s.getOutputStream().size()); or log.debug(Flushing stream, size = + s.getOutputStream().size()); s.getOutputStream().flush(); log.debug(Flushed stream, size = + s.getOutputStream().size()); The size() remains the same after performing this action. 
2009-05-12 12:42:17,470 DEBUG [Thread-7] (FSStreamManager.java:28) hdfs.HdfsQueueConsumer: Thread 19 getting an output stream 2009-05-12 12:42:17,470 DEBUG [Thread-7] (FSStreamManager.java:49) hdfs.HdfsQueueConsumer: Re-using existing stream 2009-05-12 12:42:17,472 DEBUG [Thread-7] (FSStreamManager.java:63) hdfs.HdfsQueueConsumer: Flushing stream, size = 1986 2009-05-12 12:42:17,472 DEBUG [Thread-7] (DFSClient.java:3013) hdfs.DFSClient: DFSClient flush() : saveOffset 1613 bytesCurBlock 1986 lastFlushOffset 1731 2009-05-12 12:42:17,472 DEBUG [Thread-7] (FSStreamManager.java:66) hdfs.HdfsQueueConsumer: Flushed stream, size = 1986 2009-05-12 12:42:19,586 DEBUG [Thread-7] (HdfsQueueConsumer.java:39) hdfs.HdfsQueueConsumer: Consumer writing event 2009-05-12 12:42:19,587 DEBUG [Thread-7] (FSStreamManager.java:28) hdfs.HdfsQueueConsumer: Thread 19 getting an output stream 2009-05-12 12:42:19,588 DEBUG [Thread-7] (FSStreamManager.java:49) hdfs.HdfsQueueConsumer: Re-using existing stream 2009-05-12 12:42:19,589 DEBUG [Thread-7] (FSStreamManager.java:63) hdfs.HdfsQueueConsumer: Flushing stream, size = 2235 2009-05-12 12:42:19,589 DEBUG [Thread-7] (DFSClient.java:3013) hdfs.DFSClient: DFSClient flush() : saveOffset 2125 bytesCurBlock 2235 lastFlushOffset 1986 2009-05-12 12:42:19,590 DEBUG [Thread-7] (FSStreamManager.java:66) hdfs.HdfsQueueConsumer: Flushed stream, size = 2235 So although the Offset is changing as expected, the output stream isn't being flushed or cleared out and isn't being written to file unless the stream is closed() ... is this the expected behaviour? -sd -- Sasha Dolgy sasha.do...@gmail.com -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
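Until append/sync matured, the close-and-roll approach Todd describes was the usual workaround. Here is a rough sketch, not taken from Sasha's code (class and path names are invented), of a writer that a timer thread could roll every few minutes:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RollingWriter {
  private final FileSystem fs;
  private final Path dir;
  private FSDataOutputStream current;
  private int seq = 0;

  public RollingWriter(Configuration conf, Path dir) throws IOException {
    this.fs = FileSystem.get(conf);
    this.dir = dir;
    roll();
  }

  public synchronized void write(String record) throws IOException {
    current.writeBytes(record);
  }

  // Called from a timer every few minutes: close the old file (making its
  // contents visible to readers) and start the next one in the sequence.
  public synchronized void roll() throws IOException {
    if (current != null) {
      current.close();
    }
    Path next = new Path(dir, String.format("log.%05d", seq++));
    current = fs.create(next);
  }
}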
Re: Datanodes fail to start
got processed in 11 msecs 2009-05-14 13:36:15,454 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 82 blocks got processed in 12 msecs 2009-05-14 14:36:13,662 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 82 blocks got processed in 9 msecs 2009-05-14 15:36:14,930 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 82 blocks got processed in 13 msecs 2009-05-14 16:36:16,151 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 82 blocks got processed in 12 msecs 2009-05-14 17:36:14,407 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 82 blocks got processed in 9 msecs 2009-05-14 18:36:15,659 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 82 blocks got processed in 10 msecs 2009-05-14 19:27:02,188 WARN org.apache.hadoop.dfs.DataNode: java.io.IOException: Call to hadoopmaster.utdallas.edu/10.110.95.61:9000failed on local except$ at org.apache.hadoop.ipc.Client.wrapException(Client.java:751) at org.apache.hadoop.ipc.Client.call(Client.java:719) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:216) at org.apache.hadoop.dfs.$Proxy4.sendHeartbeat(Unknown Source) at org.apache.hadoop.dfs.DataNode.offerService(DataNode.java:690) at org.apache.hadoop.dfs.DataNode.run(DataNode.java:2967) at java.lang.Thread.run(Thread.java:619) Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:500) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:442) 2009-05-14 19:27:06,198 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: hadoopmaster.utdallas.edu/10.110.95.61:9000. Already tried 0 time(s). 2009-05-14 19:27:06,436 INFO org.apache.hadoop.dfs.DataNode: SHUTDOWN_MSG: / SHUTDOWN_MSG: Shutting down DataNode at Slave1/127.0.1.1 / 2009-05-14 19:27:21,737 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG: / STARTUP_MSG: Starting DataNode STARTUP_MSG: host = Slave1/127.0.1.1 On Thu, May 14, 2009 at 11:43 PM, jason hadoop jason.had...@gmail.com wrote: The data node logs are on the datanode machines in the log directory. You may wish to buy my book and read chapter 4 on hdfs management. On Thu, May 14, 2009 at 9:39 PM, Pankil Doshi forpan...@gmail.com wrote: Can u guide me where can I find datanode log files? As I cannot find it in $hadoop/logs and so. I can only find following files in logs folder :- hadoop-hadoop-namenode-hadoopmaster.log hadoop-hadoop-namenode-hadoopmaster.out hadoop-hadoop-namenode-hadoopmaster.out.1 hadoop-hadoop-secondarynamenode-hadoopmaster.log hadoop-hadoop-secondarynamenode-hadoopmaster.out hadoop-hadoop-secondarynamenode-hadoopmaster.out.1 history Thanks Pankil On Thu, May 14, 2009 at 11:27 PM, jason hadoop jason.had...@gmail.com wrote: You have to examine the datanode log files the namenode does not start the datanodes, the start script does. The name node passively waits for the datanodes to connect to it. On Thu, May 14, 2009 at 6:43 PM, Pankil Doshi forpan...@gmail.com wrote: Hello Everyone, Actually I had a cluster which was up. But i stopped the cluster as i wanted to format it.But cant start it back. 
1)when i give start-dfs.sh I get following on screen starting namenode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-namenode-hadoopmaster.out slave1.local: starting datanode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave1.out slave3.local: starting datanode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave3.out slave4.local: starting datanode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave4.out slave2.local: starting datanode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave2.out slave5.local: starting datanode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave5.out slave6.local: starting datanode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave6.out slave9.local: starting datanode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave9.out slave8.local: starting datanode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave8.out slave7.local: starting datanode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave7.out slave10.local: starting datanode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave10.out hadoopmaster.local: starting
Re: hadoop streaming binary input / image processing
My apologies Piotr, I was referring to the streaming case and then pulling the file out of a shared file systems, not using an input split that contains the image data as you suggest. On Thu, May 14, 2009 at 11:50 PM, Piotr Praczyk piotr.prac...@gmail.comwrote: Depends what API do you use. When writing an InputSplit implementation, it is possible to specify on which nodes does the data reside. I am new to Hadoop, but as far as I know, doing this should enable the support for data locality. Moreover, implementing a subclass of TextInputFormat and adding some encoding on the fly should not impact any locality properties. Piotr 2009/5/15 jason hadoop jason.had...@gmail.com A downside of this approach is that you will not likely have data locality for the data on shared file systems, compared with data coming from an input split. That being said, from your script, *hadoop dfs -get FILE -* will write the file to standard out. On Thu, May 14, 2009 at 10:01 AM, Piotr Praczyk piotr.prac...@gmail.com wrote: just in addition to my previous post... You don't have to store the enceded files in a file system of course since you can write your own InoutFormat which wil do this on the fly... the overhead should not be that big. Piotr 2009/5/14 Piotr Praczyk piotr.prac...@gmail.com Hi If you want to read the files form HDFS and can not pass the binary data, you can do some encoding of it (base 64 for example, but you can think about sth more efficient since the range of characters accprable in the input string is wider than that used by BASE64). It should solve the problem until some king of binary input is supported ( is it going to happen? ). Piotr 2009/5/14 openresearch qiming...@openresearchinc.com All, I have read some recommendation regarding image (binary input) processing using Hadoop-streaming which only accept text out-of-box for now. http://hadoop.apache.org/core/docs/current/streaming.html https://issues.apache.org/jira/browse/HADOOP-1722 http://markmail.org/message/24woaqie2a6mrboc However, I have not got any straight answer. One recommendation is to put image data on HDFS, but we have to do hdf -get for each file/dir and process it locally which is every expensive. Another recommendation is to ...put them in a centralized place where all the hadoop nodes can access them (via .e.g, NFS mount)... Obviously, IO will becomes bottleneck and it defeat the purpose of distributed processing. I also notice some enhancement ticket is open for hadoop-core. Is it committed to any svn (0.21) branch? can somebody show me an example how to take *.jpg files (from HDFS), and process files in a distributed fashion using streaming? Many thanks -Qiming -- View this message in context: http://www.nabble.com/hadoop-streaming-binary-input---image-processing-tp23544344p23544344.html Sent from the Hadoop core-user mailing list archive at Nabble.com. -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
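If you go the encoding route Piotr mentions, one simple pre-processing step is to flatten each image into a single line of text that streaming can treat like any other record. A small sketch, assuming a Java 8 or later client (the class name and the filename-TAB-base64 layout are only illustrative):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

public class ImageToText {
  public static void main(String[] args) throws Exception {
    // One output line per image: "filename<TAB>base64(bytes)".
    for (String name : args) {
      byte[] bytes = Files.readAllBytes(Paths.get(name));
      System.out.println(name + "\t" + Base64.getEncoder().encodeToString(bytes));
    }
  }
}

Copy the resulting text file into HDFS and run the streaming job over it; the mapper script splits each line on the tab and base64-decodes the second field back into the image bytes.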
Re: Is Mapper's map method thread safe?
Ultimately it depends on how you write the Mapper.map method. The framework supports a MultithreadedMapRunner which lets you set the number of threads running your map method simultaneously. Chapter 5 of my book covers this. On Wed, May 13, 2009 at 11:10 PM, Shengkai Zhu geniusj...@gmail.com wrote: Each mapper instance will be executed in a separate JVM. On Thu, May 14, 2009 at 2:04 PM, imcaptor imcap...@gmail.com wrote: Dear all: Does anyone know whether Mapper's map method is thread safe? Thank you! imcaptor -- 朱盛凯 Jash Zhu 复旦大学软件学院 Software School, Fudan University -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
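For reference, wiring up the multi-threaded runner in the old mapred API looks roughly like this in the job driver (the driver class is a placeholder and the thread-count property name is from memory, so verify it against your release):

JobConf conf = new JobConf(MyJob.class);
conf.setMapRunnerClass(org.apache.hadoop.mapred.lib.MultithreadedMapRunner.class);
conf.setInt("mapred.map.multithreadedrunner.threads", 10);  // threads per map task
// With this runner your Mapper.map really is called concurrently, so it must be thread safe.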
Re: Selective output based on keys
The customary practice is to have your Reducer.reduce method handle the filtering if you are reducing your output, or the Mapper.map method if you are not. On Wed, May 13, 2009 at 1:57 PM, Asim linka...@gmail.com wrote: Hi, I wish to output only selective records to the output files based on keys. Is it possible to selectively write keys by setting setOutputKeyClass? Kindly let me know. Regards, Asim -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
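A minimal sketch of that filtering pattern, assuming KeyValueTextInputFormat-style Text keys and a made-up "keep-" prefix as the selection rule:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SelectiveMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, Text> {
  public void map(Text key, Text value,
                  OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    // Records that fail the test are simply never collected, so they never reach the output.
    if (key.toString().startsWith("keep-")) {
      out.collect(key, value);
    }
  }
}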
Re: Setting up another machine as secondary node
any machine put in the conf/masters file becomes a secondary namenode. At some point there was confusion on the safety of more than one machine, which I believe was settled, as many are safe. The secondary namenode takes a snapshot at 5 minute (configurable) intervals, rebuilds the fsimage and sends that back to the namenode. There is some performance advantage of having it on the local machine, and some safety advantage of having it on an alternate machine. Could someone who remembers speak up on the single vrs multiple secondary namenodes? On Thu, May 14, 2009 at 6:07 AM, David Ritch david.ri...@gmail.com wrote: First of all, the secondary namenode is not a what you might think a secondary is - it's not failover device. It does make a copy of the filesystem metadata periodically, and it integrates the edits into the image. It does *not* provide failover. Second, you specify its IP address in hadoop-site.xml. This is where you can override the defaults set in hadoop-default.xml. dbr On Thu, May 14, 2009 at 9:03 AM, Rakhi Khatwani rakhi.khatw...@gmail.com wrote: Hi, I wanna set up a cluster of 5 nodes in such a way that node1 - master node2 - secondary namenode node3 - slave node4 - slave node5 - slave How do we go about that? there is no property in hadoop-env where i can set the ip-address for secondary name node. if i set node-1 and node-2 in masters, and when we start dfs, in both the m/cs, the namenode n secondary namenode processes r present. but i think only node1 is active. n my namenode fail over operation fails. ny suggesstions? Regards, Rakhi -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
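For the original question, the wiring is roughly this (host names are examples, and the property names should be checked against hadoop-default.xml for your release):

# conf/masters on the node you run start-dfs.sh from; every host listed
# here runs a secondary namenode:
node2

# optional overrides in hadoop-site.xml:
<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value>   <!-- seconds between checkpoints -->
</property>
<property>
  <name>dfs.http.address</name>
  <value>node1:50070</value>   <!-- namenode web address the secondary fetches the image from -->
</property>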
Re: Map-side join: Sort order preserved?
Sort order is preserved if your Mapper doesn't change the key ordering in output. Partition name is not preserved. What I have done is to manually work out what the partition number of the output file should be for each map task, by calling the partitioner on an input key, and then renaming the output in the close method. Conceptually the place for this dance is in the OutputCommitter, but I haven't used them in production code, and my mapside join examples come from before they were available. the Hadoop join framework handles setting the split size to Long.MAX_VALUE for you. If you put up a discussion question on www.prohadoopbook.com, I will fill in the example on how to do this. On Thu, May 14, 2009 at 8:04 AM, Stuart White stuart.whi...@gmail.comwrote: I'm implementing a map-side join as described in chapter 8 of Pro Hadoop. I have two files that have been partitioned using the TotalOrderPartitioner on the same key into the same number of partitions. I've set mapred.min.split.size to Long.MAX_VALUE so that one Mapper will handle an entire partition. I want the output to be written in the same partitioned, total sort order. If possible, I want to accomplish this by setting my NumReducers to 0 and having the output of my Mappers written directly to HDFS, thereby skipping the partition/sort step. My question is this: Am I guaranteed that the Mapper that processes part-0 will have its output written to the output file named part-0, the Mapper that processes part-1 will have its output written to part-1, etc... ? If so, then I can preserve the partitioning/sort order of my input files without re-partitioning and re-sorting. Thanks. -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
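For context, the job wiring for the contrib join framework mentioned above looks roughly like this in the driver (paths and the driver class are illustrative; the 'inner' expression joins the two pre-partitioned, pre-sorted inputs):

// in the driver's run() method; imports from org.apache.hadoop.mapred and mapred.join
JobConf conf = new JobConf(MapSideJoinDriver.class);   // placeholder driver class
conf.setInputFormat(org.apache.hadoop.mapred.join.CompositeInputFormat.class);
conf.set("mapred.join.expr", org.apache.hadoop.mapred.join.CompositeInputFormat.compose(
    "inner", org.apache.hadoop.mapred.KeyValueTextInputFormat.class,
    new org.apache.hadoop.fs.Path("/data/partitioned-a"),
    new org.apache.hadoop.fs.Path("/data/partitioned-b")));
conf.setNumReduceTasks(0);   // map output goes straight to HDFS, no re-sort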
Re: How to replace the storage on a datanode without formatting the namenode?
You can decommission the datanode, and then un-decommission it. On Thu, May 14, 2009 at 7:44 AM, Alexandra Alecu alexandra.al...@gmail.comwrote: Hi, I want to test how Hadoop and HBase are performing. I have a cluster with 1 namenode and 4 datanodes. I use Hadoop 0.19.1 and HBase 0.19.2. I first ran a few tests when the 4 datanodes use local storage specified in dfs.data.dir. Now, I want to see what is the tradeoff if I switch from local storage to network mounted storage (I know it sounds like a crazy idea but unfortunately I have to explore this possibility). I would like to be able to change the dfs.data.dir and maybe in two steps be able to switch to the network mounted storage. What I had in mind was the following steps : 0. Assume initial status is a working cluster with local storage, e.g. dfs.data.dir set to local_storage_path. 1. Stop cluster: bin/stop-dfs 2. Change dfs.data.dir by adding the network_storage_path to the local storage_path. 3. Start cluster: bin/start-dfs (this will format the new network locations, which is nice) 4. Perform some sort of directed balancing of all the data towards the network storage location 5. Stop cluster: bin/stop-dfs 6. Change dfs.data.dir parameter to only contain local_storage_path 7. Start cluster and live happily ever after :-). The problem is , I don;t know if there is a command or an option to achieve step 4. Do you have any suggestions ? I found some info on how to add datanodes, but there is not much info on how to remove safely (without losing data etc) datanodes or storage locations on a particular node. Is this possible? Many thanks, Alexandra. -- View this message in context: http://www.nabble.com/How-to-replace-the-storage-on-a-datanode-without-formatting-the-namenode--tp23542127p23542127.html Sent from the Hadoop core-user mailing list archive at Nabble.com. -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
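A sketch of the usual decommission/recommission sequence (file locations are examples):

1) Point dfs.hosts.exclude in hadoop-site.xml at an exclude file, e.g. /path/to/conf/excludes, and list the datanode's hostname in it.
2) Run: hadoop dfsadmin -refreshNodes    (the namenode starts re-replicating that node's blocks)
3) Wait until the node shows as Decommissioned in the web UI or in hadoop dfsadmin -report.
4) Remove the host from the exclude file, run hadoop dfsadmin -refreshNodes again, and restart the datanode with the new dfs.data.dir.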
Re: Map-side join: Sort order preserved?
In the map-side join, the input file name is not visible, as the input is actually a composite of a large number of files. I have started answering at www.prohadoopbook.com On Thu, May 14, 2009 at 1:19 PM, Stuart White stuart.whi...@gmail.com wrote: On Thu, May 14, 2009 at 10:25 AM, jason hadoop jason.had...@gmail.com wrote: If you put up a discussion question on www.prohadoopbook.com, I will fill in the example on how to do this. Done. Thanks! http://www.prohadoopbook.com/forum/topics/preserving-partition-file -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: How to replace the storage on a datanode without formatting the namenode?
You can have separate configuration files for the different datanodes. If you are willing to deal with the complexity you can manually start them with altered properties from the command line. rsync or other means of sharing identical configs is simple and common. Raghu, your technique will only work well if you can complete steps 1-4 in less than the datanode timeout interval, which may be valid for Alexandria. I believe the timeout is 10 minutes. If you pass the timeout interval the namenode will start to rebalance the blocks, and when the datanode comes back it will delete all of the blocks it has rebalanced. On Thu, May 14, 2009 at 11:35 AM, Raghu Angadi rang...@yahoo-inc.comwrote: Along these lines, even simpler approach I would think is : 1) set data.dir to local and create the data. 2) stop the datanode 3) rsync local_dir network_dir 4) start datanode with data.dir with network_dir There is no need to format or rebalnace. This way you can switch between local and network multiple times (without needing to rsync data, if there are no changes made in the tests) Raghu. Alexandra Alecu wrote: Another possibility I am thinking about now, which is suitable for me as I do not actually have much data stored in the cluster when I want to perform this switch is to set the replication level really high and then simply remove the local storage locations and restart the cluster. With a bit of luck the high level of replication will allow a full recovery of the cluster on restart. Is this something that you would advice? Many thanks, Alexandra. -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: hadoop streaming binary input / image processing
A downside of this approach is that you will not likely have data locality for the data on shared file systems, compared with data coming from an input split. That being said, from your script, *hadoop dfs -get FILE -* will write the file to standard out. On Thu, May 14, 2009 at 10:01 AM, Piotr Praczyk piotr.prac...@gmail.comwrote: just in addition to my previous post... You don't have to store the enceded files in a file system of course since you can write your own InoutFormat which wil do this on the fly... the overhead should not be that big. Piotr 2009/5/14 Piotr Praczyk piotr.prac...@gmail.com Hi If you want to read the files form HDFS and can not pass the binary data, you can do some encoding of it (base 64 for example, but you can think about sth more efficient since the range of characters accprable in the input string is wider than that used by BASE64). It should solve the problem until some king of binary input is supported ( is it going to happen? ). Piotr 2009/5/14 openresearch qiming...@openresearchinc.com All, I have read some recommendation regarding image (binary input) processing using Hadoop-streaming which only accept text out-of-box for now. http://hadoop.apache.org/core/docs/current/streaming.html https://issues.apache.org/jira/browse/HADOOP-1722 http://markmail.org/message/24woaqie2a6mrboc However, I have not got any straight answer. One recommendation is to put image data on HDFS, but we have to do hdf -get for each file/dir and process it locally which is every expensive. Another recommendation is to ...put them in a centralized place where all the hadoop nodes can access them (via .e.g, NFS mount)... Obviously, IO will becomes bottleneck and it defeat the purpose of distributed processing. I also notice some enhancement ticket is open for hadoop-core. Is it committed to any svn (0.21) branch? can somebody show me an example how to take *.jpg files (from HDFS), and process files in a distributed fashion using streaming? Many thanks -Qiming -- View this message in context: http://www.nabble.com/hadoop-streaming-binary-input---image-processing-tp23544344p23544344.html Sent from the Hadoop core-user mailing list archive at Nabble.com. -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: Datanodes fail to start
You have to examine the datanode log files the namenode does not start the datanodes, the start script does. The name node passively waits for the datanodes to connect to it. On Thu, May 14, 2009 at 6:43 PM, Pankil Doshi forpan...@gmail.com wrote: Hello Everyone, Actually I had a cluster which was up. But i stopped the cluster as i wanted to format it.But cant start it back. 1)when i give start-dfs.sh I get following on screen starting namenode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-namenode-hadoopmaster.out slave1.local: starting datanode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave1.out slave3.local: starting datanode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave3.out slave4.local: starting datanode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave4.out slave2.local: starting datanode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave2.out slave5.local: starting datanode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave5.out slave6.local: starting datanode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave6.out slave9.local: starting datanode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave9.out slave8.local: starting datanode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave8.out slave7.local: starting datanode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave7.out slave10.local: starting datanode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave10.out hadoopmaster.local: starting secondarynamenode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-secondarynamenode-hadoopmaster.out 2) from log file named hadoop-hadoop-namenode-hadoopmaster.log I get following 2009-05-14 20:28:23,515 INFO org.apache.hadoop.dfs.NameNode: STARTUP_MSG: / STARTUP_MSG: Starting NameNode STARTUP_MSG: host = hadoopmaster/127.0.0.1 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.18.3 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r 736250; compiled by 'ndaley' on Thu Jan 22 23:12:08 UTC 2009 / 2009-05-14 20:28:23,717 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=9000 2009-05-14 20:28:23,728 INFO org.apache.hadoop.dfs.NameNode: Namenode up at: hadoopmaster.local/192.168.0.1:9000 2009-05-14 20:28:23,733 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null 2009-05-14 20:28:23,743 INFO org.apache.hadoop.dfs.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext 2009-05-14 20:28:23,856 INFO org.apache.hadoop.fs.FSNamesystem: fsOwner=hadoop,hadoop,adm,dialout,fax,cdrom,floppy,tape,audio,dip,video,plugdev,fuse,lpadmin,admin,sambashare 2009-05-14 20:28:23,856 INFO org.apache.hadoop.fs.FSNamesystem: supergroup=supergroup 2009-05-14 20:28:23,856 INFO org.apache.hadoop.fs.FSNamesystem: isPermissionEnabled=true 2009-05-14 20:28:23,883 INFO org.apache.hadoop.dfs.FSNamesystemMetrics: Initializing FSNamesystemMeterics using context object:org.apache.hadoop.metrics.spi.NullContext 2009-05-14 20:28:23,885 INFO org.apache.hadoop.fs.FSNamesystem: Registered FSNamesystemStatusMBean 2009-05-14 20:28:23,964 INFO org.apache.hadoop.dfs.Storage: Number of files = 1 2009-05-14 20:28:23,971 INFO org.apache.hadoop.dfs.Storage: Number of files under 
construction = 0 2009-05-14 20:28:23,971 INFO org.apache.hadoop.dfs.Storage: Image file of size 80 loaded in 0 seconds. 2009-05-14 20:28:23,972 INFO org.apache.hadoop.dfs.Storage: Edits file edits of size 4 edits # 0 loaded in 0 seconds. 2009-05-14 20:28:23,974 INFO org.apache.hadoop.fs.FSNamesystem: Finished loading FSImage in 155 msecs 2009-05-14 20:28:23,976 INFO org.apache.hadoop.fs.FSNamesystem: Total number of blocks = 0 2009-05-14 20:28:23,988 INFO org.apache.hadoop.fs.FSNamesystem: Number of invalid blocks = 0 2009-05-14 20:28:23,988 INFO org.apache.hadoop.fs.FSNamesystem: Number of under-replicated blocks = 0 2009-05-14 20:28:23,988 INFO org.apache.hadoop.fs.FSNamesystem: Number of over-replicated blocks = 0 2009-05-14 20:28:23,988 INFO org.apache.hadoop.dfs.StateChange: STATE* Leaving safe mode after 0 secs. *2009-05-14 20:28:23,989 INFO org.apache.hadoop.dfs.StateChange: STATE* Network topology has 0 racks and 0 datanodes* 2009-05-14 20:28:23,989 INFO org.apache.hadoop.dfs.StateChange: STATE* UnderReplicatedBlocks has 0 blocks 2009-05-14 20:28:29,128 INFO org.mortbay.util.Credential: Checking Resource aliases 2009-05-14 20:28:29,243 INFO org.mortbay.http.HttpServer: Version
Re: Datanodes fail to start
The data node logs are on the datanode machines in the log directory. You may wish to buy my book and read chapter 4 on hdfs management. On Thu, May 14, 2009 at 9:39 PM, Pankil Doshi forpan...@gmail.com wrote: Can u guide me where can I find datanode log files? As I cannot find it in $hadoop/logs and so. I can only find following files in logs folder :- hadoop-hadoop-namenode-hadoopmaster.log hadoop-hadoop-namenode-hadoopmaster.out hadoop-hadoop-namenode-hadoopmaster.out.1 hadoop-hadoop-secondarynamenode-hadoopmaster.log hadoop-hadoop-secondarynamenode-hadoopmaster.out hadoop-hadoop-secondarynamenode-hadoopmaster.out.1 history Thanks Pankil On Thu, May 14, 2009 at 11:27 PM, jason hadoop jason.had...@gmail.com wrote: You have to examine the datanode log files the namenode does not start the datanodes, the start script does. The name node passively waits for the datanodes to connect to it. On Thu, May 14, 2009 at 6:43 PM, Pankil Doshi forpan...@gmail.com wrote: Hello Everyone, Actually I had a cluster which was up. But i stopped the cluster as i wanted to format it.But cant start it back. 1)when i give start-dfs.sh I get following on screen starting namenode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-namenode-hadoopmaster.out slave1.local: starting datanode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave1.out slave3.local: starting datanode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave3.out slave4.local: starting datanode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave4.out slave2.local: starting datanode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave2.out slave5.local: starting datanode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave5.out slave6.local: starting datanode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave6.out slave9.local: starting datanode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave9.out slave8.local: starting datanode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave8.out slave7.local: starting datanode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave7.out slave10.local: starting datanode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-datanode-Slave10.out hadoopmaster.local: starting secondarynamenode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-secondarynamenode-hadoopmaster.out 2) from log file named hadoop-hadoop-namenode-hadoopmaster.log I get following 2009-05-14 20:28:23,515 INFO org.apache.hadoop.dfs.NameNode: STARTUP_MSG: / STARTUP_MSG: Starting NameNode STARTUP_MSG: host = hadoopmaster/127.0.0.1 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.18.3 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r 736250; compiled by 'ndaley' on Thu Jan 22 23:12:08 UTC 2009 / 2009-05-14 20:28:23,717 INFO org.apache.hadoop.ipc.metrics.RpcMetrics: Initializing RPC Metrics with hostName=NameNode, port=9000 2009-05-14 20:28:23,728 INFO org.apache.hadoop.dfs.NameNode: Namenode up at: hadoopmaster.local/192.168.0.1:9000 2009-05-14 20:28:23,733 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=NameNode, sessionId=null 2009-05-14 20:28:23,743 INFO org.apache.hadoop.dfs.NameNodeMetrics: Initializing NameNodeMeterics using context object:org.apache.hadoop.metrics.spi.NullContext 2009-05-14 20:28:23,856 INFO 
org.apache.hadoop.fs.FSNamesystem: fsOwner=hadoop,hadoop,adm,dialout,fax,cdrom,floppy,tape,audio,dip,video,plugdev,fuse,lpadmin,admin,sambashare 2009-05-14 20:28:23,856 INFO org.apache.hadoop.fs.FSNamesystem: supergroup=supergroup 2009-05-14 20:28:23,856 INFO org.apache.hadoop.fs.FSNamesystem: isPermissionEnabled=true 2009-05-14 20:28:23,883 INFO org.apache.hadoop.dfs.FSNamesystemMetrics: Initializing FSNamesystemMeterics using context object:org.apache.hadoop.metrics.spi.NullContext 2009-05-14 20:28:23,885 INFO org.apache.hadoop.fs.FSNamesystem: Registered FSNamesystemStatusMBean 2009-05-14 20:28:23,964 INFO org.apache.hadoop.dfs.Storage: Number of files = 1 2009-05-14 20:28:23,971 INFO org.apache.hadoop.dfs.Storage: Number of files under construction = 0 2009-05-14 20:28:23,971 INFO org.apache.hadoop.dfs.Storage: Image file of size 80 loaded in 0 seconds. 2009-05-14 20:28:23,972 INFO org.apache.hadoop.dfs.Storage: Edits file edits of size 4 edits # 0 loaded in 0 seconds. 2009-05-14 20:28:23,974
Re: hadoop streaming reducer values
You may wish to set the separator to the string comma space ', ' for your example. chapter 7 of my book goes into this in some detail, and I posted a graphic that visually depicts the process and the values about a month ago. The original post was titled 'Changing key/value separator in hadoop streaming' and I have attached the graphic. On Tue, May 12, 2009 at 7:55 PM, Alan Drew drewsk...@yahoo.com wrote: Hi, I have a question about the key, values that the reducer gets in Hadoop Streaming. I wrote a simple mapper.sh, reducer.sh script files: mapper.sh : #!/bin/bash while read data do #tokenize the data and output the values word, 1 echo $data | awk '{token=0; while(++token=NF) print $token\t1}' done reducer.sh : #!/bin/bash while read data do echo -e $data done The mapper tokenizes a line of input and outputs word, 1 pairs to standard output. The reducer just outputs what it gets from standard input. I have a simple input file: cat in the hat ate my mat the I was expecting the final output to be something like: the 1 1 1 cat 1 etc. but instead each word has its own line, which makes me think that key,value is being given to the reducer and not key, values which is default for normal Hadoop (in Java) right? the 1 the 1 the 1 cat 1 Is there any way to get key, values for the reducer and not a bunch of key, value pairs? I looked into the -reducer aggregate option, but there doesn't seem to be a way to customize what the reducer does with the key, values other than max,min functions. Thanks. -- View this message in context: http://www.nabble.com/hadoop-streaming-reducer-values-tp23514523p23514523.html Sent from the Hadoop core-user mailing list archive at Nabble.com. -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: How can I get the actual time for one write operation in HDFS?
Close the file after you write one block; the close is synchronous. On Tue, May 12, 2009 at 11:50 PM, Xie, Tao xietao1...@gmail.com wrote: DFSOutputStream.writeChunk() enqueues packets into the data queue and after that it returns, so the write is asynchronous. I want to know the total actual time HDFS takes to execute the write operation (from writeChunk() to the time that each replica is written to disk). How can I get that time? Thanks. -- View this message in context: http://www.nabble.com/How-can-I-get-the-actual-time-for-one-write-operation-in-HDFS--tp23516363p23516363.html Sent from the Hadoop core-user mailing list archive at Nabble.com. -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
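In other words, measure from create() to close(); a tiny sketch, assuming fs is an already-open FileSystem and buffer already holds roughly a block's worth of data:

long start = System.currentTimeMillis();
FSDataOutputStream out = fs.create(new Path("/tmp/timing-test"));  // example path
out.write(buffer, 0, buffer.length);
out.close();   // blocks until the write pipeline has finished
long elapsedMs = System.currentTimeMillis() - start;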
Re: hadoop streaming reducer values
Thanks chuck, I didn't read the post and focused on the commas On Wed, May 13, 2009 at 2:38 PM, Chuck Lam chuck@gmail.com wrote: The behavior you saw in Streaming (list of key,value instead of key, list of values) is indeed intentional, and it's part of the design differences between Streaming and Hadoop Java. That is, in Streaming your reducer is responsible for grouping values of the same key, whereas in Java the grouping is done for you. However, the input to your reducer is still sorted (and partitioned) on the key, so all key/value pairs of the same key will arrive at your reducer in one contiguous chunk. Your reducer can keep a last_key variable to track whether all records of the same key have been read in. In Python a reducer that sums up all values of a key is like this: #!/usr/bin/env python import sys (last_key, sum) = (None, 0.0) for line in sys.stdin: (key, val) = line.split(\t) if last_key and last_key != key: print last_key + \t + str(sum) sum = 0.0 last_key = key sum += float(val) print last_key + \t + str(sum) Streaming is covered in all 3 upcoming Hadoop books. The above is an example from mine ;) http://www.manning.com/lam/ . Tom White has the definite guide from O'Reilly - http://www.hadoopbook.com/ . Jason has http://www.apress.com/book/view/9781430219422 On Tue, May 12, 2009 at 7:55 PM, Alan Drew drewsk...@yahoo.com wrote: Hi, I have a question about the key, values that the reducer gets in Hadoop Streaming. I wrote a simple mapper.sh, reducer.sh script files: mapper.sh : #!/bin/bash while read data do #tokenize the data and output the values word, 1 echo $data | awk '{token=0; while(++token=NF) print $token\t1}' done reducer.sh : #!/bin/bash while read data do echo -e $data done The mapper tokenizes a line of input and outputs word, 1 pairs to standard output. The reducer just outputs what it gets from standard input. I have a simple input file: cat in the hat ate my mat the I was expecting the final output to be something like: the 1 1 1 cat 1 etc. but instead each word has its own line, which makes me think that key,value is being given to the reducer and not key, values which is default for normal Hadoop (in Java) right? the 1 the 1 the 1 cat 1 Is there any way to get key, values for the reducer and not a bunch of key, value pairs? I looked into the -reducer aggregate option, but there doesn't seem to be a way to customize what the reducer does with the key, values other than max,min functions. Thanks. -- View this message in context: http://www.nabble.com/hadoop-streaming-reducer-values-tp23514523p23514523.html Sent from the Hadoop core-user mailing list archive at Nabble.com. -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: sub 60 second performance
You would need to read the data, and store it in an internal data structure, or copy the file to a local file system file and mmap it if you didn't want to store it in java heap space. Your map then has to deal with the fact that the data isn't being passed in directly. This is not straight forward to do. On Sun, May 10, 2009 at 4:53 PM, Matt Bowyer mattbowy...@googlemail.comwrote: Thanks Jason, how can I get access to the particular block? do you mean create a static map inside the task (add the values).. and check if populated on the next run? or is there a more elegant/triedtested solution? thanks again On Mon, May 11, 2009 at 12:41 AM, jason hadoop jason.had...@gmail.com wrote: You can cache the block in your task, in a pinned static variable, when you are reusing the jvms. On Sun, May 10, 2009 at 2:30 PM, Matt Bowyer mattbowy...@googlemail.com wrote: Hi, I am trying to do 'on demand map reduce' - something which will return in reasonable time (a few seconds). My dataset is relatively small and can fit into my datanode's memory. Is it possible to keep a block in the datanode's memory so on the next job the response will be much quicker? The majority of the time spent during the job run appears to be during the 'HDFS_BYTES_READ' part of the job. I have tried using the setNumTasksToExecutePerJvm but the block still seems to be cleared from memory after the job. thanks! -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
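For what it's worth, here is a rough sketch of the pinned-static-variable idea, assuming JVM reuse is enabled in the driver with conf.setNumTasksToExecutePerJvm(-1); the loadLookupData helper is hypothetical and stands in for however the data actually gets read the first time:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CachingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  // Survives across tasks that share a JVM when JVM reuse is enabled.
  private static Map<String, String> cache;

  public void configure(JobConf conf) {
    synchronized (CachingMapper.class) {
      if (cache == null) {
        cache = loadLookupData(conf);   // pays the load cost only once per JVM
      }
    }
  }

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    String hit = cache.get(line.toString());
    if (hit != null) {
      out.collect(line, new Text(hit));
    }
  }

  // Hypothetical placeholder for reading the side data (HDFS, local file, ...).
  private static Map<String, String> loadLookupData(JobConf conf) {
    return new HashMap<String, String>();
  }
}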
Re: Re-Addressing a cluster
Now that I think about it, the reverse lookups in my clusters work. On Mon, May 11, 2009 at 3:07 AM, Steve Loughran ste...@apache.org wrote: jason hadoop wrote: You should be able to relocate the cluster's IP space by stopping the cluster, modifying the configuration files, resetting the dns and starting the cluster. Be best to verify connectivity with the new IP addresses before starting the cluster. to the best of my knowledge the namenode doesn't care about the ip addresses of the datanodes, only what blocks they report as having. The namenode does care about loosing contact with a connected datanode, replicating the blocks that are now under replicated. I prefer IP addresses in my configuration files but that is a personal preference not a requirement. I do deployments on to Virtual clusters without fully functional reverse DNS, things do work badly in that situation. Hadoop assumes that if a machine looks up its hostname, it can pass that to peers and they can resolve it, the well managed network infrastructure assumption. -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: sub 60 second performance
You can cache the block in your task, in a pinned static variable, when you are reusing the jvms. On Sun, May 10, 2009 at 2:30 PM, Matt Bowyer mattbowy...@googlemail.comwrote: Hi, I am trying to do 'on demand map reduce' - something which will return in reasonable time (a few seconds). My dataset is relatively small and can fit into my datanode's memory. Is it possible to keep a block in the datanode's memory so on the next job the response will be much quicker? The majority of the time spent during the job run appears to be during the 'HDFS_BYTES_READ' part of the job. I have tried using the setNumTasksToExecutePerJvm but the block still seems to be cleared from memory after the job. thanks! -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: large files vs many files
You must create unique file names, I don't believe (but I do not know) that the append could will allow multiple writers. Are you writing from within a task, or as an external application writing into hadoop. You may try using UUID, http://java.sun.com/j2se/1.5.0/docs/api/java/util/UUID.html, as part of your filename. Without knowing more about your goals, environment and constraints it is hard to offer any more detailed suggestions. You could also have an application aggregate the streams and write out chunks, with one or more writers, one per output file. On Sat, May 9, 2009 at 12:15 AM, Sasha Dolgy sdo...@gmail.com wrote: yes, that is the problem. two or hundreds...data streams in very quickly. On Fri, May 8, 2009 at 8:42 PM, jason hadoop jason.had...@gmail.com wrote: Is it possible that two tasks are trying to write to the same file path? On Fri, May 8, 2009 at 11:46 AM, Sasha Dolgy sdo...@gmail.com wrote: Hi Tom (or anyone else), Will SequenceFile allow me to avoid problems with concurrent writes to the file? I stll continue to get the following exceptions/errors in hdfs: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /foo/bar/aaa.bbb.ccc.ddd.xxx for DFSClient_-1821265528 on client 127.0.0.1 because current leaseholder is trying to recreate file. Only happens when two processes are trying to write at the same time. Now ideally I don't want to buffer the data that's coming in and i want to get it out and into the file asap to avoid any data loss...am i missing something here? is there some sort of factory i can implement to help in writing a lot of simultaneous data streams? thanks in advance for any suggestions -sasha On Wed, May 6, 2009 at 9:40 AM, Tom White t...@cloudera.com wrote: Hi Sasha, As you say, HDFS appends are not yet working reliably enough to be suitable for production use. On the other hand, having lots of little files is bad for the namenode, and inefficient for MapReduce (see http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/), so it's best to avoid this too. I would recommend using SequenceFile as a storage container for lots of small pieces of data. Each key-value pair would represent one of your little files (you can have a null key, if you only need to store the contents of the file). You can also enable compression (use block compression), and SequenceFiles are designed to work well with MapReduce. Cheers, Tom On Wed, May 6, 2009 at 12:34 AM, Sasha Dolgy sasha.do...@gmail.com wrote: hi there, working through a concept at the moment and was attempting to write lots of data to few files as opposed to writing lots of data to lots of little files. what are the thoughts on this? When I try and implement outputStream = hdfs.append(path); there doesn't seem to be any locking mechanism in place ... or there is and it doesn't work well enough for many writes per second? i have read and seen that the property dfs.support.append is not meant for production use. still, if millions of little files are as good or better --- or no difference -- to a few massive files then i suppose append isn't something i really need. I do see a lot of stack traces with messages like: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /foo/bar/aaa.bbb.ccc.ddd.xxx for DFSClient_-1821265528 on client 127.0.0.1 because current leaseholder is trying to recreate file. i hope this make sense. still a little bit confused. 
thanks in advance -sd -- Sasha Dolgy sasha.do...@gmail.com -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals -- Sasha Dolgy sasha.do...@gmail.com -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
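The UUID suggestion above boils down to something like this (directory and prefix are made up; fs is an already-opened FileSystem):

Path unique = new Path("/incoming/events-" + java.util.UUID.randomUUID());
FSDataOutputStream out = fs.create(unique);   // every writer gets its own file, so no lease conflict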
Re: ClassNotFoundException
rel is short for the Hadoop version you are using: 0.18.x, 0.19.x, 0.20.x, etc. You must make all of the required jars available to all of your tasks. You can either install them on all the tasktracker machines and set up the tasktracker classpath to include them, or distribute them via the distributed cache. Chapter 5 of my book goes into this in some detail, and is available now as a download. http://www.apress.com/book/view/9781430219422 On Fri, May 8, 2009 at 4:22 PM, georgep p09...@gmail.com wrote: Sorry, I misspelled your name, Jason. George georgep wrote: Hi Joe, Thank you for the reply, but do I need to include every supporting jar file in the application path? What is the -rel-? George jason hadoop wrote: 1) when running under windows, include the cygwin bin directory in your windows path environment variable 2) eclipse is not so good at submitting supporting jar files; in your application launch path add a -libjars path/hadoop-rel-examples.jar. -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals -- View this message in context: http://www.nabble.com/ClassNotFoundException-tp23441528p23455206.html Sent from the Hadoop core-user mailing list archive at Nabble.com. -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
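As an example of the -libjars form (the jar names are placeholders, and it only takes effect if the job's main class goes through ToolRunner/GenericOptionsParser):

bin/hadoop jar myjob.jar com.example.MyJob \
    -libjars /path/to/dep1.jar,/path/to/dep2.jar input output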
Re: Error when start hadoop cluster.
looks like you have different versions of the jars, or perhaps a someone has run ant in one of your installation directories. On Fri, May 8, 2009 at 7:54 PM, nguyenhuynh.mr nguyenhuynh...@gmail.comwrote: Hi all! I cannot start hdfs successful. I checked log file and found following message: 2009-05-09 08:17:55,026 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG: / STARTUP_MSG: Starting DataNode STARTUP_MSG: host = haris1.asnet.local/192.168.1.180 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.18.2 STARTUP_MSG: build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18 -r 709042; compiled by 'ndaley' on Thu Oct 30 01:07:18 UTC 2008 / 2009-05-09 08:17:55,302 ERROR org.apache.hadoop.dfs.DataNode: java.io.IOException: Incompatible namespaceIDs in /tmp/hadoop-root/dfs/data: namenode namespaceID = 880518114; datanode namespaceID = 461026751 at org.apache.hadoop.dfs.DataStorage.doTransition(DataStorage.java:226) at org.apache.hadoop.dfs.DataStorage.recoverTransitionRead(DataStorage.java:141) at org.apache.hadoop.dfs.DataNode.startDataNode(DataNode.java:306) at org.apache.hadoop.dfs.DataNode.init(DataNode.java:223) at org.apache.hadoop.dfs.DataNode.makeInstance(DataNode.java:3031) at org.apache.hadoop.dfs.DataNode.instantiateDataNode(DataNode.java:2986) at org.apache.hadoop.dfs.DataNode.createDataNode(DataNode.java:2994) at org.apache.hadoop.dfs.DataNode.main(DataNode.java:3116) 2009-05-09 08:17:55,303 INFO org.apache.hadoop.dfs.DataNode: SHUTDOWN_MSG: / SHUTDOWN_MSG: Shutting down DataNode at haris1.asnet.local/192.168.1.180 / Please help me! P/s: I's using Hadoop-0.18.2 and Hbase 0.18.1 Thanks, Best. Nguyen -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: Most efficient way to support shared content among all mappers
Thanks Jeff! On Sat, May 9, 2009 at 1:31 PM, Jeff Hammerbacher ham...@cloudera.comwrote: Hey, For a more detailed discussion of how to use memcached for this purpose, see the paper Low-Latency, High-Throughput Access to Static Global Resources within the Hadoop Framework: http://www.umiacs.umd.edu/~jimmylin/publications/Lin_etal_TR2009.pdfhttp://www.umiacs.umd.edu/%7Ejimmylin/publications/Lin_etal_TR2009.pdf . Regards, Jeff On Fri, May 8, 2009 at 2:49 PM, jason hadoop jason.had...@gmail.com wrote: Most of the people with this need are using some variant of memcached, or other distributed hash table. On Fri, May 8, 2009 at 10:07 AM, Joe joe_...@yahoo.com wrote: Hi, As a newcomer to Hadoop, I wonder any efficient way to support shared content among all mappers. For example, to implement an neural network algorithm, I want the NN data structure accessible by all mappers. Thanks for your comments! - Joe -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: Re-Addressing a cluster
You should be able to relocate the cluster's IP space by stopping the cluster, modifying the configuration files, resetting the dns and starting the cluster. Be best to verify connectivity with the new IP addresses before starting the cluster. to the best of my knowledge the namenode doesn't care about the ip addresses of the datanodes, only what blocks they report as having. The namenode does care about loosing contact with a connected datanode, replicating the blocks that are now under replicated. I prefer IP addresses in my configuration files but that is a personal preference not a requirement. On Sat, May 9, 2009 at 11:51 AM, John Kane john.k...@kane.net wrote: I have a situation that I have not been able to find in the mail archives. I have an active cluster that was built on a private switch with private IP address space (192.168.*.*) I need to relocate the cluster into real address space. Assuming that I have working DNS, is there an issue? Do I just need to be sure that I utilize hostnames for everything and then be fat, dumb and happy? Or are IP Addresses tracked by the namenode, etc? Thanks -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: large files vs many files
Is it possible that two tasks are trying to write to the same file path? On Fri, May 8, 2009 at 11:46 AM, Sasha Dolgy sdo...@gmail.com wrote: Hi Tom (or anyone else), Will SequenceFile allow me to avoid problems with concurrent writes to the file? I stll continue to get the following exceptions/errors in hdfs: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /foo/bar/aaa.bbb.ccc.ddd.xxx for DFSClient_-1821265528 on client 127.0.0.1 because current leaseholder is trying to recreate file. Only happens when two processes are trying to write at the same time. Now ideally I don't want to buffer the data that's coming in and i want to get it out and into the file asap to avoid any data loss...am i missing something here? is there some sort of factory i can implement to help in writing a lot of simultaneous data streams? thanks in advance for any suggestions -sasha On Wed, May 6, 2009 at 9:40 AM, Tom White t...@cloudera.com wrote: Hi Sasha, As you say, HDFS appends are not yet working reliably enough to be suitable for production use. On the other hand, having lots of little files is bad for the namenode, and inefficient for MapReduce (see http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/), so it's best to avoid this too. I would recommend using SequenceFile as a storage container for lots of small pieces of data. Each key-value pair would represent one of your little files (you can have a null key, if you only need to store the contents of the file). You can also enable compression (use block compression), and SequenceFiles are designed to work well with MapReduce. Cheers, Tom On Wed, May 6, 2009 at 12:34 AM, Sasha Dolgy sasha.do...@gmail.com wrote: hi there, working through a concept at the moment and was attempting to write lots of data to few files as opposed to writing lots of data to lots of little files. what are the thoughts on this? When I try and implement outputStream = hdfs.append(path); there doesn't seem to be any locking mechanism in place ... or there is and it doesn't work well enough for many writes per second? i have read and seen that the property dfs.support.append is not meant for production use. still, if millions of little files are as good or better --- or no difference -- to a few massive files then i suppose append isn't something i really need. I do see a lot of stack traces with messages like: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException: failed to create file /foo/bar/aaa.bbb.ccc.ddd.xxx for DFSClient_-1821265528 on client 127.0.0.1 because current leaseholder is trying to recreate file. i hope this make sense. still a little bit confused. thanks in advance -sd -- Sasha Dolgy sasha.do...@gmail.com -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: Setting thread stack size for child JVM
You can set the mapred.child.java.opts on a per-job basis, either via -D mapred.child.java.opts=<java options> or via conf.set("mapred.child.java.opts", "<java options>"). Note: the conf.set must be done before the job is submitted. On Fri, May 8, 2009 at 11:57 AM, Philip Zeyliger phi...@cloudera.com wrote: You could add -Xss<n> to the mapred.child.java.opts configuration setting. That's controlling the Java stack size, which I think is the relevant bit for you. Cheers, -- Philip
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx200m</value>
  <description>Java opts for the task tracker child processes. The following symbol, if present, will be interpolated: @taskid@ is replaced by current TaskID. Any other occurrences of '@' will go unchanged. For example, to enable verbose gc logging to a file named for the taskid in /tmp and to set the heap maximum to be a gigabyte, pass a 'value' of: -Xmx1024m -verbose:gc -Xloggc:/tmp/@tas...@.gc The configuration variable mapred.child.ulimit can be used to control the maximum virtual memory of the child processes.</description>
</property>
On Fri, May 8, 2009 at 11:16 AM, Ken Krugler kkrugler_li...@transpac.com wrote: Hi there, For a very specific type of reduce task, we currently need to use a large number of threads. To avoid running out of memory, I'd like to constrain the Linux stack size via a ulimit -s xxx shell script command before starting up the JVM. I could do this for the entire system at boot time, but it would be better to have it for just the Hadoop JVM(s). Any suggestions for how best to handle this? Thanks, -- Ken -- Ken Krugler +1 530-210-6378 -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
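Concretely, a per-job override might look like this (the -Xss value is only an example, and the driver class is a placeholder):

JobConf conf = new JobConf(MyJob.class);
conf.set("mapred.child.java.opts", "-Xmx200m -Xss512k");   // set before submitting the job
// or, on the command line of a ToolRunner-based job:
//   -D mapred.child.java.opts="-Xmx200m -Xss512k"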
Re: Most efficient way to support shared content among all mappers
Most of the people with this need are using some variant of memcached, or another distributed hash table. On Fri, May 8, 2009 at 10:07 AM, Joe joe_...@yahoo.com wrote: Hi, As a newcomer to Hadoop, I wonder whether there is an efficient way to support shared content among all mappers. For example, to implement a neural network algorithm, I want the NN data structure to be accessible by all mappers. Thanks for your comments! - Joe -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
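If the shared structure is read-only and small enough to load on each node, the distributed cache is another common option. A hedged sketch follows; the HDFS path of the model file and its line-per-entry format are assumptions, not part of the original thread:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class SharedModelMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private List<String> model = new ArrayList<String>();

  // Driver side: ship the (hypothetical) model file to every tasktracker.
  public static void addModel(JobConf conf) throws IOException {
    DistributedCache.addCacheFile(URI.create("/shared/nn-model.txt"), conf);
  }

  public void configure(JobConf conf) {
    try {
      Path[] cached = DistributedCache.getLocalCacheFiles(conf);
      BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
      String line;
      while ((line = in.readLine()) != null) {
        model.add(line);  // build the in-memory structure once per task JVM
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("failed to load cached model", e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // The model is available read-only here for every k,v pair.
    output.collect(value, new Text(Integer.toString(model.size())));
  }
}

The structure is rebuilt once per task JVM in configure(), so every map() call on that node reuses it; it is never write-shared across tasks.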
Re: All keys went to single reducer in WordCount program
Most likely the 3rd mapper ran as a speculative execution, and it is possible that all of your keys hashed to a single partition. Also, if you don't specify otherwise, the default is to run a single reduce task. From JobConf:
/**
 * Get configured the number of reduce tasks for this job. Defaults to
 * <code>1</code>.
 *
 * @return the number of reduce tasks for this job.
 */
public int getNumReduceTasks() { return getInt("mapred.reduce.tasks", 1); }
On Thu, May 7, 2009 at 3:54 AM, Miles Osborne mi...@inf.ed.ac.uk wrote: With such a small data set who knows what will happen: you are probably hitting minimal limits of some kind. Repeat this with more data. Miles 2009/5/7 Foss User foss...@gmail.com: I have two reducers running on two different machines. I ran the example word count program with some of my own System.out.println() statements to see what is going on. There were 2 slaves, each running a datanode as well as a tasktracker. There was one namenode and one jobtracker. I know this is a very elaborate setup for such a small cluster, but I did it only to learn. I gave two input files, a.txt and b.txt, with a few lines of English text. Now, here are my questions. (1) I found that three mapper tasks ran, all on the first slave. The first task processed the first file. The second task processed the second file. The third task didn't process anything. Why is it that the third task did not process anything? Why was this task created in the first place? (2) I found only one reducer task, on the second slave. It processed all the values for all keys. The keys were words, of Text type in this case. I tried printing out key.hashCode() for each key and some of them were even and some of them were odd. I was expecting the keys with even hashcodes to go to one slave and the others to go to another slave. Why didn't this happen? -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
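To see keys spread across both machines, the number of reduce tasks has to be raised explicitly; a hedged driver sketch (the class name and paths are placeholders):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TwoReducerWordCount {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(TwoReducerWordCount.class);
    conf.setJobName("wordcount-two-reducers");
    // Without this the job falls back to mapred.reduce.tasks = 1,
    // so every key ends up in a single output partition.
    conf.setNumReduceTasks(2);
    // The default HashPartitioner assigns a key to partition
    // (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks,
    // which is not the same as "even hash codes to one reducer, odd to the other".
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    // mapper/reducer classes omitted; the identity classes are used by default
    JobClient.runJob(conf);
  }
}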
Re: how to improve the Hadoop's capability of dealing with small files
The way I typically address that is to write a zip file using the zip utilities, commonly for output. HDFS is not optimized for low latency, but for high throughput for bulk operations. 2009/5/7 Edward Capriolo edlinuxg...@gmail.com 2009/5/7 Jeff Hammerbacher ham...@cloudera.com: Hey, You can read more about why small files are difficult for HDFS at http://www.cloudera.com/blog/2009/02/02/the-small-files-problem. Regards, Jeff 2009/5/7 Piotr Praczyk piotr.prac...@gmail.com If you want to use many small files, they probably have the same purpose and structure? Why not use HBase instead of raw HDFS? Many small files would be packed together and the problem would disappear. cheers Piotr 2009/5/7 Jonathan Cao jonath...@rockyou.com There are at least two design choices in Hadoop that have implications for your scenario. 1. All the HDFS metadata is stored in namenode memory -- the memory size is one limitation on how many small files you can have. 2. The efficiency of the map/reduce paradigm dictates that each mapper/reducer job has enough work to offset the overhead of spawning the job. It relies on each task reading a contiguous chunk of data (typically 64MB); your small file situation will change those efficient sequential reads into a larger number of inefficient random reads. Of course, "small" is a relative term. Jonathan 2009/5/6 陈桂芬 chenguifen...@163.com Hi: In my application, there are many small files. But Hadoop is designed to deal with many large files. I want to know why Hadoop doesn't support small files very well and where the bottleneck is. And what can I do to improve Hadoop's capability of dealing with small files? Thanks. When the small file problem comes up, most of the talk centers around the inode table being in memory. The Cloudera blog points out something: Furthermore, HDFS is not geared up to efficiently accessing small files: it is primarily designed for streaming access of large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file, all of which is an inefficient data access pattern. My application attempted to load 9,000 6 KB files using a single-threaded application and FSDataOutputStream objects to write directly to Hadoop files. My plan was to have Hadoop merge these files in the next step. I had to abandon this plan because this process was taking hours. I knew HDFS had a small file problem, but I never realized that I could not do this the 'old fashioned way'. I merged the files locally, and uploading a few merged files gave great throughput. Small files are not just a permanent storage issue; they are a serious optimization concern. -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
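Along the lines of the SequenceFile suggestion earlier in this digest, one way to do that merge step is to pack a local directory of small files into a single SequenceFile instead of uploading them individually. A hedged sketch; the directory and target path are placeholders:

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
  // Pack every file in a local directory into one SequenceFile:
  // key = original file name, value = raw file contents.
  public static void pack(Configuration conf, File localDir, Path target) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, target, Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK);
    try {
      for (File f : localDir.listFiles()) {
        if (!f.isFile()) continue;
        byte[] buf = new byte[(int) f.length()];
        FileInputStream in = new FileInputStream(f);
        try {
          IOUtils.readFully(in, buf, 0, buf.length);
        } finally {
          in.close();
        }
        writer.append(new Text(f.getName()), new BytesWritable(buf));
      }
    } finally {
      writer.close();
    }
  }
}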
Re: Large number of map output keys and performance issues.
It may simply be that your JVMs are spending their time doing garbage collection instead of running your tasks. My book, in chapter 6, has a section on how to tune your jobs, and how to determine what to tune. That chapter is available now as an alpha. On Wed, May 6, 2009 at 1:29 PM, Todd Lipcon t...@cloudera.com wrote: Hi Tiago, Here are a couple of thoughts: 1) How much data are you outputting? Obviously there is a certain amount of IO involved in actually outputting data versus not ;-) 2) Are you using a reduce phase in this job? If so, since you're cutting off the data at map output time, you're also avoiding a whole sort computation which involves significant network IO, etc. 3) What version of Hadoop are you running? Thanks -Todd On Wed, May 6, 2009 at 12:23 PM, Tiago Macambira macamb...@gmail.com wrote: I am developing a MR application with Hadoop that generates a really large number of output keys during its map phase, and it is having abysmal performance. While just reading the said data takes 20 minutes, and processing it but not outputting anything from the map takes around 30 minutes, running the full application takes around 4 hours. Is this a known or expected issue? Cheers. Tiago Alves Macambira -- I may be drunk, but in the morning I will be sober, while you will still be stupid and ugly. -Winston Churchill -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: Using multiple FileSystems in hadoop input
I have used multiple file systems in jobs, but not used Har as one of them. Worked for me in 18 On Wed, May 6, 2009 at 4:07 AM, Tom White t...@cloudera.com wrote: Hi Ivan, I haven't tried this combination, but I think it should work. If it doesn't it should be treated as a bug. Tom On Wed, May 6, 2009 at 11:46 AM, Ivan Balashov ibalas...@iponweb.net wrote: Greetings to all, Could anyone suggest if Paths from different FileSystems can be used as input of Hadoop job? Particularly I'd like to find out whether Paths from HarFileSystem can be mixed with ones from DistributedFileSystem. Thanks, -- Kind regards, Ivan -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
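For what it's worth, mixing file systems in the input just means adding fully qualified paths to the same job; a hedged sketch where the namenode host/port and the har:// URI layout are illustrative only:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class MixedInputPaths {
  // Add one plain HDFS directory and one path inside a Hadoop archive as inputs.
  public static void addInputs(JobConf conf) {
    FileInputFormat.addInputPath(conf, new Path("hdfs://namenode:9000/data/current"));
    FileInputFormat.addInputPath(conf, new Path("har://hdfs-namenode:9000/archives/old.har/data"));
  }
}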
Re: multi-line records and file splits
Hey Tom, I had no luck using the StreamingXmlRecordReader for non-XML files. Are there any parameters that you need to add in? I was testing with 0.19.0. On Wed, May 6, 2009 at 5:25 AM, Sharad Agarwal shara...@yahoo-inc.com wrote: The split doesn't need to be at the record boundary. If a mapper gets a partial record, it will seek to another split to get the full record. - Sharad -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: specific block size for a file
Please try -D dfs.block.size=4096000. The specification must be in bytes. On Tue, May 5, 2009 at 4:47 AM, Christian Ulrik Søttrup soett...@nbi.dk wrote: Hi all, I have a job that creates very big local files, so I need to split it across as many mappers as possible. Now the DFS block size I'm using means that this job is only split across 3 mappers. I don't want to change the HDFS-wide block size because it works for my other jobs. Is there a way to give a specific file a different block size? The documentation says there is, but does not explain how. I've tried: hadoop dfs -D dfs.block.size=4M -put file /dest/ But that does not work. Any help would be appreciated. Cheers, Chrulle -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
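The block size can also be set per file from Java when the file is created, which avoids touching dfs.block.size at all. A hedged sketch using an overload of FileSystem.create; the buffer size lookup simply mirrors the usual default:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SmallBlockCopy {
  // Create one file with its own block size (in bytes), leaving the cluster default alone.
  public static FSDataOutputStream createWithBlockSize(Configuration conf, Path dest, long blockSize)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    int bufferSize = conf.getInt("io.file.buffer.size", 4096);
    short replication = fs.getDefaultReplication();
    // create(Path, overwrite, bufferSize, replication, blockSize)
    return fs.create(dest, true, bufferSize, replication, blockSize);
  }
}

For example, createWithBlockSize(conf, new Path("/dest/file"), 4096000L) would reproduce the -D example above for a single file.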
Re: Sorting on several columns using KeyFieldSeparator and Paritioner
You must only have 3 fields in your keys. Try this - it is my best guess based on your code. Appendix A of my book has a detailed discussion of these fields and the gotchas, and the example code has test classes that allow you to try different keys with different input to see how the parts are actually broken out of the keys. jobConf.set("mapred.text.key.partitioner.options", "-k2.2 -k1.1 -k3.3"); jobConf.set("mapred.text.key.comparator.options", "-k2.2r -k1.1rn -k3.3n"); On Tue, May 5, 2009 at 2:05 AM, Min Zhou coderp...@gmail.com wrote: I came across the same failure. Can anyone solve this problem? On Sun, Jan 18, 2009 at 9:06 AM, Saptarshi Guha saptarshi.g...@gmail.com wrote: Hello, I have a file with n columns, some of which are text and some numeric. Given a sequence of indices, I would like to sort on those indices, i.e. first on Index1, then within Index2, and so on. In the example code below, I have 3 columns: numeric, text, numeric, space separated. Sort on 2 (reverse), then 1 (reverse, numeric) and lastly 3. Though my code runs locally (and gives wrong results: col 2 is sorted in reverse, and within that col 3, which is treated as text, and then col 1), when distributed I get a merge error - my guess is that fixing the latter fixes the former. This is the error:
java.io.IOException: Final merge failed
 at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.createKVIterator(ReduceTask.java:2093)
 at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.access$400(ReduceTask.java:457)
 at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:380)
 at org.apache.hadoop.mapred.Child.main(Child.java:155)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 562
 at org.apache.hadoop.io.WritableComparator.compareBytes(WritableComparator.java:128)
 at org.apache.hadoop.mapred.lib.KeyFieldBasedComparator.compareByteSequence(KeyFieldBasedComparator.java:109)
 at org.apache.hadoop.mapred.lib.KeyFieldBasedComparator.compare(KeyFieldBasedComparator.java:85)
 at org.apache.hadoop.mapred.Merger$MergeQueue.lessThan(Merger.java:308)
 at org.apache.hadoop.util.PriorityQueue.downHeap(PriorityQueue.java:144)
 at org.apache.hadoop.util.PriorityQueue.adjustTop(PriorityQueue.java:103)
 at org.apache.hadoop.mapred.Merger$MergeQueue.adjustPriorityQueue(Merger.java:270)
 at org.apache.hadoop.mapred.Merger$MergeQueue.next(Merger.java:285)
 at org.apache.hadoop.mapred.Merger.writeFile(Merger.java:108)
 at org.apache.hadoop.mapred.ReduceTask$ReduceCopier.createKVIterator(ReduceTask.java:2087)
 ...
3 more
Thanks for your time. And the code (not too big) is:
==CODE==
public class RMRSort extends Configured implements Tool {
  static class RMRSortMap extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      output.collect(value, value);
    }
  }
  static class RMRSortReduce extends MapReduceBase implements Reducer<Text, Text, NullWritable, Text> {
    public void reduce(Text key, Iterator<Text> values, OutputCollector<NullWritable, Text> output, Reporter reporter) throws IOException {
      NullWritable n = NullWritable.get();
      while (values.hasNext()) output.collect(n, values.next());
    }
  }
  static JobConf createConf(String rserveport, String uid, String infolder, String outfolder) {
    Configuration defaults = new Configuration();
    JobConf jobConf = new JobConf(defaults, RMRSort.class);
    jobConf.setJobName("Sorter: " + uid);
    jobConf.addResource(new Path(System.getenv("HADOOP_CONF_DIR") + "/hadoop-site.xml"));
    // jobConf.set("mapred.job.tracker", "local");
    jobConf.setMapperClass(RMRSortMap.class);
    jobConf.setReducerClass(RMRSortReduce.class);
    jobConf.set("map.output.key.field.separator", fsep);
    jobConf.setPartitionerClass(KeyFieldBasedPartitioner.class);
    jobConf.set("mapred.text.key.partitioner.options", "-k2,2 -k1,1 -k3,3");
    jobConf.setOutputKeyComparatorClass(KeyFieldBasedComparator.class);
    jobConf.set("mapred.text.key.comparator.options", "-k2r,2r -k1rn,1rn -k3n,3n");
    // infolder, outfolder information removed
    jobConf.setMapOutputKeyClass(Text.class);
    jobConf.setMapOutputValueClass(Text.class);
    jobConf.setOutputKeyClass(NullWritable.class);
    return (jobConf);
  }
  public int run(String[] args) throws Exception {
    return (1);
  }
}
-- Saptarshi Guha - saptarshi.g...@gmail.com -- My research interests are distributed systems, parallel computing and bytecode based virtual machine. My profile: http://www.linkedin.com/in/coderplay My blog: http://coderplay.javaeye.com --
Re: specific block size for a file
Trade-off between HDFS efficiency and data locality. On Tue, May 5, 2009 at 9:37 AM, Arun C Murthy a...@yahoo-inc.com wrote: On May 5, 2009, at 4:47 AM, Christian Ulrik Søttrup wrote: Hi all, I have a job that creates very big local files, so I need to split it across as many mappers as possible. Now the DFS block size I'm using means that this job is only split across 3 mappers. I don't want to change the HDFS-wide block size because it works for my other jobs. I would rather keep the big files on HDFS and use -Dmapred.min.split.size to get more maps to process your data http://hadoop.apache.org/core/docs/r0.20.0/mapred_tutorial.html#Job+Input Arun -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: Global object for a map task
If it is relatively small you can pass it via the JobConf object, storing a serialized version of your dataset. If it is larger you can pass a serialized version via the distributed cache. Your map task will need to deserialize the object in the configure method. None of the above methods gives you an object that is write-shared between map tasks. Please remember that the map tasks execute in separate JVMs on distinct machines in the normal MapReduce environment. On Sat, May 2, 2009 at 10:59 PM, Amandeep Khurana ama...@gmail.com wrote: How can I create a global variable for each node running my map task? For example, a common ArrayList that my map function can access for every k,v pair it works on. It doesn't really need to create the ArrayList every time. If I create it in the main function of the job, the map function gets a null pointer exception. Where else can this be created? Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422
Re: Global object for a map task
Define some key of yours, say my.app.array, and serialize the object to a string, say myObjectString. Then conf.set("my.app.array", myObjectString). Then in your Mapper.configure() method, String myObjectString = conf.get("my.app.array"), and deserialize your object. A Google search on java object serialization to string will provide you with many examples, such as http://www.velocityreviews.com/forums/showpost.php?p=3185744&postcount=10. On Sat, May 2, 2009 at 11:56 PM, Amandeep Khurana ama...@gmail.com wrote: Thanks Jason. My object is relatively small. But how do I pass it via the JobConf object? Can you elaborate a bit... Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Sat, May 2, 2009 at 11:53 PM, jason hadoop jason.had...@gmail.com wrote: If it is relatively small you can pass it via the JobConf object, storing a serialized version of your dataset. If it is larger you can pass a serialized version via the distributed cache. Your map task will need to deserialize the object in the configure method. None of the above methods gives you an object that is write-shared between map tasks. Please remember that the map tasks execute in separate JVMs on distinct machines in the normal MapReduce environment. On Sat, May 2, 2009 at 10:59 PM, Amandeep Khurana ama...@gmail.com wrote: How can I create a global variable for each node running my map task? For example, a common ArrayList that my map function can access for every k,v pair it works on. It doesn't really need to create the ArrayList every time. If I create it in the main function of the job, the map function gets a null pointer exception. Where else can this be created? Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422
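A hedged sketch of that round trip, assuming standard Java serialization and the commons-codec Base64 that ships with Hadoop; the key name my.app.array is arbitrary:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.ArrayList;
import org.apache.commons.codec.binary.Base64;
import org.apache.hadoop.mapred.JobConf;

public class ConfSerializer {

  // Driver side: serialize the list and stash it in the JobConf as a Base64 string.
  public static void put(JobConf conf, String key, ArrayList<String> list) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(bytes);
    out.writeObject(list);
    out.close();
    conf.set(key, new String(Base64.encodeBase64(bytes.toByteArray()), "US-ASCII"));
  }

  // Task side: rebuild the list from the configuration.
  @SuppressWarnings("unchecked")
  public static ArrayList<String> get(JobConf conf, String key)
      throws IOException, ClassNotFoundException {
    byte[] raw = Base64.decodeBase64(conf.get(key).getBytes("US-ASCII"));
    ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(raw));
    return (ArrayList<String>) in.readObject();
  }
}

In the mapper, configure(JobConf conf) would call ConfSerializer.get(conf, "my.app.array") once and keep the result in a field for use by every map() call.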
Re: Sequence of Streaming Jobs
If you are using sh or bash, the variable $? holds the exit status of the last command executed:
hadoop jar streaming.jar ...
if [ $? -ne 0 ]; then
  echo "My job failed" 2>&1
  exit 1
fi
Caution: $? is the exit status of the very last command to execute. It is easy to run another command before testing and then test the wrong command's exit status. On Sat, May 2, 2009 at 11:56 AM, Mayuran Yogarajah mayuran.yogara...@casalemedia.com wrote: Billy Pearson wrote: I did this with an array of commands for the jobs in a PHP script, checking the return of each job to tell if it failed or not. Billy I have this same issue.. How do you check if a job failed or not? You mentioned checking the return code? How are you doing that? thanks Dan Milstein dmilst...@hubteam.com wrote in message news:58d66a11-b59c-49f8-b72f-7507482c3...@hubteam.com... If I've got a sequence of streaming jobs, each of which depends on the output of the previous one, is there a good way to launch that sequence? Meaning, I want step B to only start once step A has finished. From within Java JobClient code, I can do submitJob/runJob, but is there any sort of clean way to do this for a sequence of streaming jobs? Thanks, -Dan Milstein -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422
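For the Java side of Dan's question, JobClient.runJob() blocks until the job completes and throws on failure, so dependent jobs can simply be run back to back; a hedged sketch where buildFirstJob/buildSecondJob are placeholders for your own configuration code:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class JobSequence {
  public static void main(String[] args) throws Exception {
    JobConf first = buildFirstJob(args);
    RunningJob job1 = JobClient.runJob(first);   // throws IOException if the job fails
    if (!job1.isSuccessful()) {
      System.err.println("first job failed, stopping");
      System.exit(1);
    }
    JobConf second = buildSecondJob(args);       // configured to read the output of the first job
    JobClient.runJob(second);
  }

  private static JobConf buildFirstJob(String[] args) { /* configure job A here */ return new JobConf(); }
  private static JobConf buildSecondJob(String[] args) { /* configure job B here */ return new JobConf(); }
}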
Re: cannot open an hdfs file in O_RDWR mode
In Hadoop 0.19.1 (and 0.19.0), libhdfs (which is used by the fuse package for HDFS access) explicitly denies open requests that pass O_RDWR. If you have binary applications that pass the flag but would work correctly given the limitations of HDFS, you may alter the code in src/c++/libhdfs/hdfs.c to allow it, or build a shared library that you preload that changes the flags passed to the real open. Hacking hdfs.c is much simpler. Line 407 of hdfs.c:
jobject jFS = (jobject)fs;
if (flags & O_RDWR) {
  fprintf(stderr, "ERROR: cannot open an hdfs file in O_RDWR mode\n");
  errno = ENOTSUP;
  return NULL;
}
On Fri, May 1, 2009 at 6:34 PM, Philip Zeyliger phi...@cloudera.com wrote: HDFS does not allow you to overwrite bytes of a file that have already been written. The only operations it supports are read (an existing file), write (a new file), and (in newer versions, not always enabled) append (to an existing file). -- Philip On Fri, May 1, 2009 at 5:56 PM, Robert Engel enge...@ligo.caltech.edu wrote: Hello, I am using Hadoop on a small storage cluster (x86_64, CentOS 5.3, Hadoop-0.19.1). The HDFS is mounted using fuse and everything seemed to work just fine so far. However, I noticed that I cannot: 1) use svn to check out files on the mounted hdfs partition 2) request that stdout and stderr of Globus jobs is written to the hdfs partition In both cases I see the following error message in /var/log/messages: fuse_dfs: ERROR: could not connect open file fuse_dfs.c:1364 When I run fuse_dfs in debugging mode I get: ERROR: cannot open an hdfs file in O_RDWR mode unique: 169, error: -5 (Input/output error), outsize: 16 My question is whether this is a general limitation of Hadoop or if this operation is just not currently supported? I searched Google and JIRA but could not find an answer. Thanks, Robert -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422
Re: unable to see anything in stdout
Less work: it skips setting up the input splits, distributing the job jar files, scheduling the map tasks on the task trackers, collecting the task status results, then starting all the reduce tasks, collecting all the results, sorting them, feeding them to the reduce tasks, and then writing them to HDFS. No JVM forking, etc. The overhead of coordination is not small, but it pays off when the jobs are large and the number of machines is large. Local mode works great for jobs that really only need the resources of a single machine and only require a single reduce task. On Fri, May 1, 2009 at 6:09 PM, Asim linka...@gmail.com wrote: Thanks Aaron. That worked! However, when I run everything as local, I see everything executing much faster locally as compared to a single node. Is there any reason for this? -Asim On Thu, Apr 30, 2009 at 9:23 AM, Aaron Kimball aa...@cloudera.com wrote: First thing I would do is to run the job in the local jobrunner (as a single process on your local machine without involving the cluster):
JobConf conf = ...;  // set other params, mapper, etc. here
conf.set("mapred.job.tracker", "local");  // use the local job runner
conf.set("fs.default.name", "file:///");  // read from the local hard disk instead of HDFS
JobClient.runJob(conf);
This will actually print stdout, stderr, etc. to your local terminal. Try this on a single input file. This will let you confirm that it does, in fact, write to stdout. - Aaron On Thu, Apr 30, 2009 at 9:00 AM, Asim linka...@gmail.com wrote: Hi, I am not able to see any job output in userlogs/task_id/stdout. It remains empty even though I have many println statements. Are there any steps to debug this problem? Regards, Asim -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422
Re: What's the local heap size of Hadoop? How to increase it?
You can also specify it on the command line of your Hadoop job: hadoop jar jarfile [main class] -D mapred.child.java.opts=-Xmx800M [other arguments]. Note: there is a space between the -D and mapred, and the -D has to come after the main class specification. This parameter may also be specified via conf.set("mapred.child.java.opts", "-Xmx800m"); before submitting the job. On Wed, Apr 29, 2009 at 2:59 PM, Bhupesh Bansal bban...@linkedin.com wrote: Hey, try adding
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx800M -server</value>
</property>
with the right JVM size in your hadoop-site.xml. You will have to copy this to all mapred nodes and restart the cluster. Best Bhupesh On 4/29/09 2:03 PM, Jasmine (Xuanjing) Huang xjhu...@cs.umass.edu wrote: Hi, there, What's the local heap size of Hadoop? I have tried to load a local cache file which is composed of 500,000 short phrases, but the task failed. The output of Hadoop looks like this (com.aliasi.dict.ExactDictionaryChunker is a third-party jar package, and the records are organized as a trie structure): java.lang.OutOfMemoryError: Java heap space at java.util.HashMap.addEntry(HashMap.java:753) at java.util.HashMap.put(HashMap.java:385) at com.aliasi.dict.ExactDictionaryChunker$TrieNode.getOrCreateDaughter(ExactDictionaryChunker.java:476) at com.aliasi.dict.ExactDictionaryChunker$TrieNode.add(ExactDictionaryChunker.java:484) When I reduce the total record number to 30,000, my MapReduce job succeeds. So, I have a question: what is the heap size of Hadoop's Java Virtual Machine? How do I increase it? Best, Jasmine -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422
Re: Load .so library error when Hadoop calls JNI interfaces
You need to make sure that the shared library is available on the tasktracker nodes, either by installing it, or by pushing it around via the distributed cache. On Wed, Apr 29, 2009 at 8:19 PM, Ian jonhson jonhson@gmail.com wrote: Dear all, I wrote plugin code for Hadoop, which calls the interfaces in a C++-built .so library. The plugin code is written in Java, so I prepared a JNI class to encapsulate the C interfaces. The Java code executes successfully when I compile it and run it standalone. However, it does not work when embedded in Hadoop. The exception shown is (found in the Hadoop logs): screen dump - # grep myClass logs/* -r logs/hadoop-hadoop-tasktracker-testbed0.container.org.out:Exception in thread "JVM Runner jvm_200904261632_0001_m_-1217897050 spawned." java.lang.UnsatisfiedLinkError: org.apache.hadoop.mapred.myClass.myClassfsMount(Ljava/lang/String;)I logs/hadoop-hadoop-tasktracker-testbed0.container.org.out: at org.apache.hadoop.mapred.myClass.myClassfsMount(Native Method) logs/hadoop-hadoop-tasktracker-testbed0.container.org.out:Exception in thread "JVM Runner jvm_200904261632_0001_m_-1887898624 spawned." java.lang.UnsatisfiedLinkError: org.apache.hadoop.mapred.myClass.myClassfsMount(Ljava/lang/String;)I logs/hadoop-hadoop-tasktracker-testbed0.container.org.out: at org.apache.hadoop.mapred.myClass.myClassfsMount(Native Method) ... It seems the library cannot be loaded in Hadoop. My code (myClass.java) looks like: --- myClass.java ---
public class myClass {
  public static final Log LOG = LogFactory.getLog("org.apache.hadoop.mapred.myClass");
  public myClass() {
    try {
      // System.setProperty("java.library.path", "/usr/local/lib");
      /* The above line does not work, so I have to do something
       * like the following line. */
      addDir(new String("/usr/local/lib"));
      System.loadLibrary("myclass");
    } catch (UnsatisfiedLinkError e) {
      LOG.info("Cannot load library:\n" + e.toString());
    } catch (IOException ioe) {
      LOG.info("IO error:\n" + ioe.toString());
    }
  }

  /* Since System.setProperty() does not work, I have to add the following
   * function to force the path to be added to java.library.path. */
  public static void addDir(String s) throws IOException {
    try {
      Field field = ClassLoader.class.getDeclaredField("usr_paths");
      field.setAccessible(true);
      String[] paths = (String[]) field.get(null);
      for (int i = 0; i < paths.length; i++) {
        if (s.equals(paths[i])) {
          return;
        }
      }
      String[] tmp = new String[paths.length + 1];
      System.arraycopy(paths, 0, tmp, 0, paths.length);
      tmp[paths.length] = s;
      field.set(null, tmp);
    } catch (IllegalAccessException e) {
      throw new IOException("Failed to get permissions to set library path");
    } catch (NoSuchFieldException e) {
      throw new IOException("Failed to get field handle to set library path");
    }
  }

  public native int myClassfsMount(String subsys);
  public native int myClassfsUmount(String subsys);
}
I don't know what is missing in my code and am wondering whether there are any rules in Hadoop I should obey to achieve my goal. FYI, myClassfsMount() and myClassfsUmount() open a socket to call services from a daemon. I hope this design is not the cause of the failure. Any comments? Thanks in advance, Ian -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422
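A hedged sketch of the distributed cache route for a native library; the HDFS path, the symlink name, and the java.library.path trick are assumptions (the .so must already have been copied into HDFS):

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class NativeLibShipper {
  public static void addNativeLib(JobConf conf) throws Exception {
    // The "#libmyclass.so" fragment asks the tasktracker to create a symlink
    // with that name in the task's working directory.
    DistributedCache.addCacheFile(
        new URI("hdfs://namenode:9000/libs/libmyclass.so#libmyclass.so"), conf);
    DistributedCache.createSymlink(conf);
    // Point the child JVM's library path at the task working directory so
    // System.loadLibrary("myclass") can find the symlinked file.
    conf.set("mapred.child.java.opts",
        conf.get("mapred.child.java.opts", "-Xmx200m") + " -Djava.library.path=.");
  }
}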
Re: streaming but no sorting
It may be simpler to just have a post-processing step that uses something like multi-file input to aggregate the results. As a completely sideways-thinking solution: I suspect you have far more map tasks than physical machines. Instead of writing your output via output.collect, your tasks could open a 'side effect file' and append to it; since these are in the local file system, you actually have the ability to append to them. You will need to play some interesting games with the OutputCommitter though. An alternative would be to write N output records, where N is the number of reduces, where each of the N keys is guaranteed to go to a unique reduce task, and the value of the record is the local file name and the host name. The side effect files would need to be written into the job working area or some public area on the node, rather than the task output area; or the output committer could place them in the proper place (that way failed tasks are handled correctly). The reduce then reads the keys it has, opens and concatenates the files that are on its machine, and very little sorting happens. Each reduce then collects the side effect files that are local to it. 2009/4/28 Dmitry Pushkarev u...@stanford.edu Hi. I'm writing streaming-based tasks that involve running thousands of mappers; after that I want to put all these outputs into a small number (say 30) of output files, mainly so that disk space is used more efficiently. The way I'm doing it right now is using /bin/cat as the reducer and setting the number of reducers to the desired value. This involves two highly inefficient (for this task) steps - sorting and fetching. Is there a way to get around that? Ideally I'd want all mapper outputs to be written to one file, one record per line. Thanks. --- Dmitry Pushkarev +1-650-644-8988 -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422
Re: streaming but no sorting
There has to be a simpler way :) On Tue, Apr 28, 2009 at 9:22 PM, jason hadoop jason.had...@gmail.com wrote: It may be simpler to just have a post-processing step that uses something like multi-file input to aggregate the results. As a completely sideways-thinking solution: I suspect you have far more map tasks than physical machines. Instead of writing your output via output.collect, your tasks could open a 'side effect file' and append to it; since these are in the local file system, you actually have the ability to append to them. You will need to play some interesting games with the OutputCommitter though. An alternative would be to write N output records, where N is the number of reduces, where each of the N keys is guaranteed to go to a unique reduce task, and the value of the record is the local file name and the host name. The side effect files would need to be written into the job working area or some public area on the node, rather than the task output area; or the output committer could place them in the proper place (that way failed tasks are handled correctly). The reduce then reads the keys it has, opens and concatenates the files that are on its machine, and very little sorting happens. Each reduce then collects the side effect files that are local to it. 2009/4/28 Dmitry Pushkarev u...@stanford.edu Hi. I'm writing streaming-based tasks that involve running thousands of mappers; after that I want to put all these outputs into a small number (say 30) of output files, mainly so that disk space is used more efficiently. The way I'm doing it right now is using /bin/cat as the reducer and setting the number of reducers to the desired value. This involves two highly inefficient (for this task) steps - sorting and fetching. Is there a way to get around that? Ideally I'd want all mapper outputs to be written to one file, one record per line. Thanks. --- Dmitry Pushkarev +1-650-644-8988 -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422
Re: How to write large string to file in HDFS
How about new InputStreamReader( new StringReader( String ), UTF-8 )? Replace UTF-8 with an appropriate charset. On Tue, Apr 28, 2009 at 7:47 PM, nguyenhuynh.mr nguyenhuynh...@gmail.com wrote: Hi all! I have a large String and I want to write it into a file in HDFS. (The large string has 100,000 lines.) Currently, I use the copyBytes method of class org.apache.hadoop.io.IOUtils. But copyBytes requires an InputStream for the content. Therefore, I have to convert the String to an InputStream, something like: InputStream in = new ByteArrayInputStream(sb.toString().getBytes()); (sb is a StringBuffer.) It does not work with the line above. :( There is the error: Exception in thread "main" java.lang.OutOfMemoryError: Java heap space at java.lang.StringCoding$StringEncoder.encode(StringCoding.java:232) at java.lang.StringCoding.encode(StringCoding.java:272) at java.lang.String.getBytes(String.java:947) at asnet.haris.mapred.jobs.Test.main(Test.java:32) Please give me a good solution! Thanks, Best regards, Nguyen -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422
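A hedged sketch of one way to avoid materializing the whole byte array (the chunk size is arbitrary); it streams the StringBuffer into the HDFS file through a Writer instead of calling getBytes() on the full string, which is what doubles the memory footprint and triggers the OutOfMemoryError above:

import java.io.BufferedWriter;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LargeStringWriter {
  public static void write(Configuration conf, Path dest, StringBuffer sb) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    Writer out = new BufferedWriter(new OutputStreamWriter(fs.create(dest), "UTF-8"));
    try {
      // Write in modest chunks so no second full copy of the data is made.
      int chunk = 64 * 1024;
      for (int i = 0; i < sb.length(); i += chunk) {
        int end = Math.min(sb.length(), i + chunk);
        out.write(sb.substring(i, end));
      }
    } finally {
      out.close();
    }
  }
}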
Re: Storing data-node content to other machine
There is no requirement that your HDFS and mapred clusters share an installation directory; it is just done that way because it is simple and most people have a datanode and a tasktracker on each slave node. Simply have two configuration directories on your cluster machines, use the bin/start-dfs.sh script in one and the bin/start-mapred.sh script in the other, and maintain different slaves files in the two directories. You will lose the benefit of data locality for tasktrackers that do not reside on the datanode machines. On Sun, Apr 26, 2009 at 10:06 PM, Vishal Ghawate vishal_ghaw...@persistent.co.in wrote: Hi, I want to store the contents of all the client machines (datanodes) of the Hadoop cluster on a centralized machine with high storage capacity, so that the tasktracker is on the client machine but the contents are stored on the centralized machine. Can anybody help me with this, please? -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422 www.prohadoopbook.com a community for Hadoop Professionals
Re: IO Exception in Map Tasks
The JVM had a hard failure and crashed. On Sun, Apr 26, 2009 at 11:34 PM, Rakhi Khatwani rakhi.khatw...@gmail.com wrote: Hi, In one of the map tasks, I get the following exception: java.io.IOException: Task process exit with nonzero status of 255. at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:424) java.io.IOException: Task process exit with nonzero status of 255. at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:424) What could be the reason? Thanks, Raakhi -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422