sub
Hello, hadoop~ -- from feng :)
Blog: www.jferic.com/blog
Email: cjfsmart...@gmail.com
Studio: ws.nju.edu.cn
33 Days left to Berlin Buzzwords 2011
Hey folks,

Berlin Buzzwords 2011 is close: only 33 days left until the big Search, Store and Scale open source crowd gathers in Berlin. The conference again focuses on search, data analysis and NoSQL, and takes place on June 6th/7th 2011 in Berlin.

We are looking forward to two awesome keynote speakers who shaped the world of open source data analysis: Doug Cutting (founder of Apache Lucene and Hadoop) as well as Ted Dunning (Chief Application Architect at MapR Technologies and active developer at Apache Hadoop and Mahout).

We are amazed by the amount and quality of the talk submissions we got. As a result, this year we have added one more track to the main conference. If you haven't done so already, make sure to book your ticket now: early bird tickets have been sold out since April 7th, and there might not be many tickets left.

As we would like to give visitors to our main conference a reason to stay in town for the whole week, we have been talking to local co-working spaces and companies, asking them for free space and WiFi to host Hackathons right after the main conference, that is on June 8th through 10th. If you would like to gather with fellow developers and users of your project, fix bugs together, hack on new features or give users a hands-on introduction to your tools, please submit your workshop proposal to our wiki: http://berlinbuzzwords.de/node/428

Please note that slots are assigned on a first come, first served basis. We are doing our best to get you connected; however, space is limited. The deal is simple: we get you in touch with a conference room provider, and your event gets promoted in our schedule. Coordination, however, is completely up to you: make sure to provide an interesting abstract and a Hackathon registration area. See the Barcamp page for a good example: http://berlinbuzzwords.de/wiki/barcamp

Attending Hackathons requires a Berlin Buzzwords ticket and (then free) registration at the Hackathon in question.

Hope to see you all in Berlin,
Simon
Change block size from 64M to 128M does not work on Hadoop-0.21
Hi all,

I met a problem when changing the block size from 64M to 128M. I am sure I modified the correct configuration file (hdfs-site.xml), because I can change the replication factor through it correctly; however, the block size change does not take effect.

For example, I set dfs.block.size to 134217728 bytes, upload a file of about 128M, and use fsck to see how many blocks the file has. It shows:

/user/file1/file 134217726 bytes, 2 block(s): OK
0. blk_xx len=67108864 repl=2 [192.168.0.3:50010, 192.168.0.32:50010]
1. blk_xx len=67108862 repl=2 [192.168.0.9:50010, 192.168.0.8:50010]

The Hadoop version is 0.21. Any suggestion will be appreciated!

Thanks,
Chen
Re: How do I create per-reducer temporary files?
Hi Bryan,

These are called side effect files, and I use them extensively: O'Reilly Hadoop, 2nd Edition, p. 187; Pro Hadoop, p. 279.

You get the path to save the file(s) using:
http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/FileOutputFormat.html#getWorkOutputPath%28org.apache.hadoop.mapred.JobConf%29

The output committer moves these files from the work directory to the output directory when the task completes. That way you don't have duplicate files due to speculative execution. You should also generate a unique name for each of your output files to prevent file name collisions, using this function:
http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/FileOutputFormat.html#getUniqueName%28org.apache.hadoop.mapred.JobConf,%20java.lang.String%29

Hope this helps,
Matt

On 5/4/11 12:18 PM, Bryan Keller brya...@gmail.com wrote:

Right. What I am struggling with is how to retrieve the path/drive that the reducer is using, so I can use the same path for local temp files.

On May 4, 2011, at 9:03 AM, Robert Evans wrote:

Bryan,

I believe that map/reduce gives you a single drive to write to so that your reducer has less of an impact on other reducers/mappers running on the same box. If you want to write to more drives, I thought the idea would then be to increase the number of reducers you have and let mapred assign each one a drive to use, instead of having one reducer eating up I/O bandwidth from all of the drives.

--Bobby Evans

On 5/4/11 7:11 AM, Bryan Keller brya...@gmail.com wrote:

I too am looking for the best place to put local temp files I create during reduce processing. I am hoping there is a variable or property someplace that defines a per-reducer temp directory. The mapred.child.tmp property is by default simply the relative directory ./tmp, so it isn't useful on its own. I have 5 drives being used in mapred.local.dir, and I was hoping to use them all for writing temp files, rather than specifying a single temp directory that all my reducers use.

On Apr 9, 2011, at 2:40 AM, Harsh J wrote:

Hello,

On Tue, Apr 5, 2011 at 2:53 AM, W.P. McNeill bill...@gmail.com wrote:

If I try:

storePath = FileOutputFormat.getPathForWorkFile(context, "my-file", ".seq");
writer = SequenceFile.createWriter(FileSystem.getLocal(configuration), configuration, storePath, IntWritable.class, itemClass);
...
reader = new SequenceFile.Reader(FileSystem.getLocal(configuration), storePath, configuration);

I get an exception about a mismatch in file systems when trying to read from the file. Alternately, if I try:

storePath = new Path(SequenceFileOutputFormat.getUniqueFile(context, "my-file", ".seq"));
writer = SequenceFile.createWriter(FileSystem.get(configuration), configuration, storePath, IntWritable.class, itemClass);
...
reader = new SequenceFile.Reader(FileSystem.getLocal(configuration), storePath, configuration);

FileOutputFormat.getPathForWorkFile will give back HDFS paths. And since you are looking to create local temporary files to be used only by the task within itself, you shouldn't really worry about unique filenames (stuff can go wrong). You're looking for the tmp/ directory locally created in the FS where the Task is running (at ${mapred.child.tmp}, which defaults to ./tmp). You can create a regular file there using vanilla Java APIs for files, or using RawLocalFS + your own created Path (not derived via OutputFormat/etc.).
storePath = new Path(new Path(context.getConfiguration().get("mapred.child.tmp")), "my-file.seq");
writer = SequenceFile.createWriter(FileSystem.getLocal(configuration), configuration, storePath, IntWritable.class, itemClass);
...
reader = new SequenceFile.Reader(FileSystem.getLocal(configuration), storePath, configuration);

The above should work, I think (haven't tried, but the idea is to use mapred.child.tmp). Also see: http://hadoop.apache.org/common/docs/r0.20.0/mapred_tutorial.html#Directory+Structure

--
Harsh J
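For reference, a minimal sketch (not from the thread) of the side-effect-file pattern Matt describes at the top of this message, using the old org.apache.hadoop.mapred API that his links point to. The class name, method name and the "side-data" basename are illustrative.

  import java.io.IOException;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.mapred.FileOutputFormat;
  import org.apache.hadoop.mapred.JobConf;

  public class SideEffectFiles {
    /** Call from map()/reduce() with the task's JobConf. */
    public static Path writeSideFile(JobConf job) throws IOException {
      // Task-attempt work directory; the OutputCommitter promotes files written
      // here into the real output directory only if the attempt commits, so
      // speculative or failed attempts leave no duplicates behind.
      Path workDir = FileOutputFormat.getWorkOutputPath(job);
      // Per-task unique basename, so parallel tasks cannot collide.
      Path sideFile = new Path(workDir, FileOutputFormat.getUniqueName(job, "side-data"));
      FileSystem fs = sideFile.getFileSystem(job);
      FSDataOutputStream out = fs.create(sideFile);
      // ... write the side-effect payload here ...
      out.close();
      return sideFile;
    }
  }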
don't want to output anything
Hi,

I use MapReduce to process and output my own stuff, in a customized way. I don't use context.write to output anything, and thus I don't want the empty part-r-x files on my fs. Is there some way to eliminate the output?

Thanks.
-Gang
Re: Change block size from 64M to 128M does not work on Hadoop-0.21
Your client (put) machine must have the same block size configuration during upload as well. Alternatively, you may do something explicit like `hadoop dfs -Ddfs.block.size=size -put file file`.

On Thu, May 5, 2011 at 12:59 AM, He Chen airb...@gmail.com wrote: [...]

--
Harsh J
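The block size can also be fixed per file programmatically; here is a hedged sketch (not from the thread) using the FileSystem API to upload a local file with an explicit 128MB block size, independent of the cluster-wide default. The paths and replication factor below are illustrative.

  import java.io.FileInputStream;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IOUtils;

  public class PutWithBlockSize {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      FileSystem fs = FileSystem.get(conf);
      long blockSize = 128L * 1024 * 1024;   // 134217728 bytes
      short replication = 2;
      int bufferSize = conf.getInt("io.file.buffer.size", 4096);
      // Create the HDFS file with an explicit 128MB block size.
      FSDataOutputStream out = fs.create(
          new Path("/user/file1/file"), true, bufferSize, replication, blockSize);
      FileInputStream in = new FileInputStream("file");  // local source file
      IOUtils.copyBytes(in, out, conf, true);             // copies and closes both streams
    }
  }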
Re: Change block size from 64M to 128M does not work on Hadoop-0.21
Hi Harsh,

Thank you for the reply. Actually, the Hadoop directory is on my NFS server; every node reads the same file from the NFS server, so I think that is not the problem.

I like your second solution, but I am not sure whether the namenode will divide those 128MB blocks into smaller ones in the future or not.

Chen

On Wed, May 4, 2011 at 3:00 PM, Harsh J ha...@cloudera.com wrote: [...]
Re: Change block size from 64M to 128M does not work on Hadoop-0.21
Tried the second solution. It does not work; the file still ends up as two 64M blocks.

On Wed, May 4, 2011 at 3:16 PM, He Chen airb...@gmail.com wrote: [...]
Re: don't want to output anything
Exactly what I want. Thanks, Harsh J.

-Gang

----- Original Message -----
From: Harsh J ha...@cloudera.com
To: common-user@hadoop.apache.org
Sent: 2011/5/4 (Wed) 4:03:35 PM
Subject: Re: don't want to output anything

Hello Gang,

On Thu, May 5, 2011 at 1:22 AM, Gang Luo lgpub...@yahoo.com.cn wrote:

Hi, I use MapReduce to process and output my own stuff, in a customized way. I don't use context.write to output anything, and thus I don't want the empty part-r-x files on my fs. Is there some way to eliminate the output?

You're looking for the NullOutputFormat: http://search-hadoop.com/?q=nulloutputformat

--
Harsh J
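A minimal driver-side sketch of what Harsh suggests, using the new mapreduce API (the class and job names are illustrative, and the input/mapper/reducer setup is elided):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

  public class NoOutputDriver {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      Job job = new Job(conf, "process-without-output");
      // ... set input path, mapper, reducer, etc. as usual ...
      // NullOutputFormat discards anything passed to context.write(),
      // so no empty part-r-* files are created.
      job.setOutputFormatClass(NullOutputFormat.class);
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

With NullOutputFormat there is also no output directory to set on the job.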
(nfs) outputdir
Hello,

I'm using a small, fully distributed Hadoop cluster. All Hadoop daemons run as the "hadoop" user, and I submit jobs as "user". I ran into a couple of problems when I set mapred.output.dir to an (NFS) file:// location.

1. The output dir gets created, but it belongs to "hadoop". It sort of makes sense: the processes writing the output files run as the "hadoop" user. I would like the resulting output dir to belong to "user" (same as when setting mapred.output.dir=hdfs://...).

2. My job driver creates a report file in the output dir after the job is complete. However, the job driver is run by "user", which doesn't have permission to write in the output dir (the output dir belongs to "hadoop").

Forcing the users to run jobs as "hadoop" (the Hadoop admin user) is a poor option. Do I have any other choices?

Thank you very much,
Gabriel Balan
Re: Change block size from 64M to 128M does not work on Hadoop-0.21
Got it. Thank you, Harsh. BTW, it is `hadoop dfs -Ddfs.blocksize=size -put file file`; no dot between "block" and "size".

On Wed, May 4, 2011 at 3:18 PM, He Chen airb...@gmail.com wrote: [...]
Re: Cluster hard drive ratios
Hey Matt, we are using the same Dell boxes, and we can get 2 GB/s per node (read and write) without problems.

On Wed, May 4, 2011 at 8:43 AM, Matt Goeke msg...@gmail.com wrote:

I have been reviewing quite a few presentations on the web from various businesses, in addition to the ones I watched first hand at the Cloudera data summit last week, and I am curious about others' thoughts on hard drive ratios. Various sources, including Cloudera, have cited 1 HDD x 2 cores x 4 GB ECC, but this makes me wonder what the upper bound for HDDs is in this ratio. We have specced out various machines from Dell, and it is possible to get dual hexacores with 14 drives (2 RAIDed for the OS and 12x2TB), but this seems to conflict with that original ratio and with some of the specs I have seen in presentations (which are mostly 4-drive configurations). I would assume all you incur is additional complexity and more potential for hardware failure on a specific machine, but I have seen little to no data stating at what point there is a plateau in write speed performance.

Can anyone give personal experience with this type of setup? If we accept that we are incurring the negatives I stated above but gain higher data density in the cluster, is this setup fine, or are we overlooking something?

Thanks,
Matt
bin/start-dfs/mapred.sh with input slave file
Hi all,

I see that there is an option to provide a slaves file as input to bin/start-dfs.sh and bin/start-mapred.sh, so that slaves are parsed from this input file rather than the default conf/slaves. Can someone please help me with the syntax for this? I am not able to figure it out.

Thanks,
Matthew John
Re: bin/start-dfs/mapred.sh with input slave file
Keep two configuration directories with different slaves files (say conf.dfs/ and conf.mr/) and use `hadoop-daemons.sh --config {conf dir path} start {daemon}` to start up the DN/TT daemons.

On Thu, May 5, 2011 at 8:06 AM, Matthew John tmatthewjohn1...@gmail.com wrote: [...]

--
Harsh J
Re: How do I create per-reducer temporary files?
Bryan,

Not sure you should be concerned with whether the output is on local storage vs. HDFS. I wouldn't think there would be much of a performance difference if you are doing streaming output (append) in both cases. Hadoop already uses local storage wherever possible (including for the task working directories, as far as I know). I've never had performance problems with side effect files, as long as the correct setup is used.

Definitely, if multiple mounts are available locally where the tasks are running, you can add a comma-delimited list to mapreduce.cluster.local.dir in mapred-site.xml on those machines: http://hadoop.apache.org/common/docs/current/cluster_setup.html#mapred-site.xml

Theoretically you can use the methods I listed below to create unique files/paths under /tmp or any other mount point you wish. However, it is much better to let Hadoop manage where the files are stored (i.e. use the work directory given to you). If you add multiple paths to mapreduce.cluster.local.dir, Hadoop will spread the I/O from multiple mappers/reducers across these paths. Likewise, you can mount a RAID 0 (stripe) of multiple drives to get the same effect, and a single RAID 0 keeps the mapred-site.xml uniform. RAID 0 is fine since speculative execution takes care of it if a disk fails.

It would be helpful to know your use case, since the primary option is normally to create multiple outputs from a reducer: http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html

Most likely you should try that before going into the realm of side effect files (or messing with local temp on the task nodes). Try the multiple outputs if you are dealing with streaming data. If you absolutely cannot get it to work, then you may have to cross-check the other more complex options.

Cheers,
Matt

On 5/4/11 1:07 PM, Bryan Keller brya...@gmail.com wrote:

Am I mistaken, or are side-effect files on HDFS? I need my temp files to be on the local filesystem. Also, the Java working directory is not the reducer's local processing directory, so ./tmp doesn't get me what I'm after. As it stands now I'm using java.io.tmpdir, which is not a long-term solution for me. I am looking to use the reducer's task-specific local directory, which should be balanced across my local drives.

On May 4, 2011, at 12:31 PM, Matt Pouttu-Clarke wrote: [...]
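Since Matt points to MultipleOutputs as the first thing to try, here is a minimal hedged sketch of that API (new mapreduce API, Hadoop 0.21 era; not from the thread). The "summary" named output, class names and key/value types are illustrative; the named output must also be declared in the driver with MultipleOutputs.addNamedOutput(job, "summary", TextOutputFormat.class, Text.class, IntWritable.class).

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

  public class MultiOutputReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private MultipleOutputs<Text, IntWritable> mos;

    @Override
    protected void setup(Context context) {
      mos = new MultipleOutputs<Text, IntWritable>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      // Written to files named summary-r-NNNNN alongside the normal part files.
      mos.write("summary", key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context)
        throws IOException, InterruptedException {
      mos.close();  // flush and close all named outputs
    }
  }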