Re: Need help in hdfs configuration fully distributed way in Mac OSX...
Hi: You need to configure your nodes to ensure that node 1 can connect to node 2 without a password.

On Tue, Sep 16, 2008 at 2:04 PM, souravm <[EMAIL PROTECTED]> wrote:
> Hi All,
>
> I'm facing a problem in configuring hdfs in a fully distributed way in Mac
> OSX.
>
> [... quoted text trimmed; the full original message appears below ...]

--
[EMAIL PROTECTED] Institute of Computing Technology, Chinese Academy of Sciences, Beijing.
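The passwordless-SSH setup the reply refers to is normally done with an SSH key pair. A minimal sketch — the scratch key path and the user@machine2 address are illustrative, not from the thread:

```shell
# 1. On machine 1 (the namenode), generate a passphrase-less key pair.
#    A scratch path is used here; the conventional location is ~/.ssh/id_rsa.
rm -f /tmp/hadoop_demo_key /tmp/hadoop_demo_key.pub
ssh-keygen -t rsa -N "" -f /tmp/hadoop_demo_key -q

# 2. Append the public key to machine 2's authorized_keys, e.g.:
#      ssh-copy-id -i /tmp/hadoop_demo_key.pub user@machine2
#    (left as a comment since it needs a reachable machine 2)

# The generated key pair:
ls -l /tmp/hadoop_demo_key /tmp/hadoop_demo_key.pub
```

After step 2, `ssh machine2` from machine 1 should no longer prompt for a password, so start-dfs.sh can start the remote datanode unattended.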
Need help in hdfs configuration fully distributed way in Mac OSX...
Hi All,

I'm facing a problem in configuring HDFS in a fully distributed way on Mac OS X.

Here is the topology:

1. The namenode is on machine 1.
2. There is one datanode on machine 2.

Now when I execute start-dfs.sh from machine 1, it connects to machine 2 (after it asks for the password for connecting to machine 2) and starts the datanode on machine 2 (as the console message says).

However:

1. When I go to http://machine1:50070 it does not show the datanode at all. It says 0 datanodes configured.
2. In the log file on machine 2 what I see is:

  STARTUP_MSG: Starting DataNode
  STARTUP_MSG:   host = rc0902b-dhcp169.apple.com/17.229.22.169
  STARTUP_MSG:   args = []
  STARTUP_MSG:   version = 0.17.2.1
  STARTUP_MSG:   build = https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.17 -r 684969; compiled by 'oom' on Wed Aug 20 22:29:32 UTC 2008

  2008-09-15 18:54:44,626 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: /17.229.23.77:9000. Already tried 1 time(s).
  [the same retry message repeats each second, through "Already tried 10 time(s)"]
  2008-09-15 18:54:54,641 INFO org.apache.hadoop.ipc.RPC: Server at /17.229.23.77:9000 not available yet, Zzzzz...

and this retrying goes on repeating.

The hadoop-site.xml files are like this:

1. On machine 1:

  <configuration>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
    </property>
    <property>
      <name>dfs.name.dir</name>
      <value>/Users/souravm/hdpn</value>
    </property>
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:9001</value>
    </property>
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
  </configuration>

2. On machine 2:

  <configuration>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://<machine 1 ip>:9000</value>
    </property>
    <property>
      <name>dfs.data.dir</name>
      <value>/Users/nirdosh/hdfsd1</value>
    </property>
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
  </configuration>

The slaves file on machine 1 has a single entry: <user>@<machine 2 ip>

The exact steps I did:

1. Reformat the namenode on machine 1.
2. Execute start-dfs.sh on machine 1.
3. Then I try to see whether the datanode is registered, through http://<machine 1 ip>:50070.

Any pointer to resolve this issue would be appreciated.

Regards,
Sourav

CAUTION - Disclaimer: This e-mail contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended solely for the use of the addressee(s). If you are not the intended recipient, please notify the sender by e-mail and delete the original message. Further, you are not to copy, disclose, or distribute this e-mail or its contents to any other person and any such actions are unlawful. This e-mail may contain viruses. Infosys has taken every reasonable precaution to minimize this risk, but is not liable for any damage you may sustain as a result of any virus in this e-mail. You should carry out your own virus checks before opening the e-mail or attachment. Infosys reserves the right to monitor and review the content of all messages sent to or from this e-mail address. Messages sent to or from this e-mail address may be stored on the Infosys e-mail system. ***INFOSYS End of Disclaimer INFOSYS***
[HELP] "CreateProcess error=193" When running "ant -Dcompile.c++=yes examples"
Hello,

I got a problem when compiling Hadoop's Pipes example (under Cygwin). I searched a lot and found nobody who has met the same problem. Could you please give me some suggestions? Thanks very much.

When I run

  $ ant
  $ ant examples

both end with BUILD SUCCESSFUL. But when I run

  $ ant -Dcompile.c++=yes examples

the error info is listed below:

  Buildfile: build.xml
  clover.setup:
  clover.info:
    [echo] Clover not found. Code coverage reports disabled.
  clover:
  init:
    [touch] Creating c:\DOCUME~1\GODBLE~1\LOCALS~1\Temp\null1553129825
    [delete] Deleting: c:\DOCUME~1\GODBLE~1\LOCALS~1\Temp\null1553129825
    [exec] src/saveVersion.sh: line 22: svn: command not found
    [exec] src/saveVersion.sh: line 23: svn: command not found
  record-parser:
  compile-rcc-compiler:
  compile-core-classes:
    [javac] Compiling 1 source file to E:\cygwin\home\godblesswho\hadoopinstall\hadoop-0.18.0\build\classes
  compile-core-native:
  check-c++-makefiles:
  create-c++-pipes-makefile:

  BUILD FAILED
  E:\cygwin\home\godblesswho\hadoopinstall\hadoop-0.18.0\build.xml:1115: Execute failed: java.io.IOException: Cannot run program "E:\cygwin\home\godblesswho\hadoopinstall\hadoop-0.18.0\src\c++\pipes\configure" (in directory "E:\cygwin\home\godblesswho\hadoopinstall\hadoop-0.18.0\build\c++-build\Windows_XP-x86-32\pipes"): CreateProcess error=193, %1 is not a valid Win32 application
Lots of connections in TIME_WAIT
Hi all,

We have a 10-node cluster and I noticed that there are a large number of connections in TIME_WAIT status, over 400 to be more precise. Does anyone else see so many unclosed connections, or have I done something wrong in my config? They appear to be coming from and going to other datanodes. I was running the javasort benchmark when I noticed it.

Regards
Damien
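For what it's worth, a quick way to tally connection states is to group the state column of `netstat -an` output. A self-contained sketch, using canned sample lines in place of live netstat output:

```shell
# Sample lines standing in for `netstat -an` output on a datanode.
sample='tcp4  0  0  10.0.0.1.50010  10.0.0.2.41231  TIME_WAIT
tcp4  0  0  10.0.0.1.50010  10.0.0.3.41232  TIME_WAIT
tcp4  0  0  10.0.0.1.9000   10.0.0.2.41233  ESTABLISHED'

# Tally connections per state (column 6 holds the TCP state).
printf '%s\n' "$sample" | awk '{print $6}' | sort | uniq -c

# On a live node the equivalent would be:
#   netstat -an | awk '{print $6}' | sort | uniq -c
```

Note that TIME_WAIT is usually harmless: it is the normal lingering state of an already-closed TCP connection (typically a minute or two), not an unclosed one, so hundreds of entries on a busy datanode are not by themselves a sign of misconfiguration.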
Re: Small Filesizes
Hi,

I'm just working on the situation you described, with millions of small files sized around 10KB. My idea is to compact these files into big ones and create indexes for them. This is a file system over a file system, and it supports append, update, and lazy delete. Hope this helps.

--
[EMAIL PROTECTED] Institute of Computing Technology, Chinese Academy of Sciences, Beijing.
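The compact-and-index idea can be sketched in a few lines: concatenate the small files into one pack file and keep an (offset, length) index entry per name. This is only an illustration of the approach described, with made-up helper names, not the poster's actual system:

```python
import os
import tempfile

def pack_files(filenames, pack_path):
    """Append each file's bytes to pack_path; return {name: (offset, length)}."""
    index = {}
    with open(pack_path, "wb") as pack:
        for name in filenames:
            with open(name, "rb") as f:
                data = f.read()
            index[name] = (pack.tell(), len(data))
            pack.write(data)
    return index

def read_member(pack_path, index, name):
    """Random access to one packed member via its (offset, length) entry."""
    offset, length = index[name]
    with open(pack_path, "rb") as pack:
        pack.seek(offset)
        return pack.read(length)

# Demo with two throwaway "small files".
tmpdir = tempfile.mkdtemp()
paths = []
for fname, content in [("a.txt", b"hello"), ("b.txt", b"world!")]:
    p = os.path.join(tmpdir, fname)
    with open(p, "wb") as f:
        f.write(content)
    paths.append(p)

pack_path = os.path.join(tmpdir, "pack.bin")
index = pack_files(paths, pack_path)
print(read_member(pack_path, index, paths[1]))  # b'world!'
```

Deletes can then be lazy: drop the entry from the index and reclaim the bytes in a later compaction pass, which matches the append/lazy-delete behaviour described above.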
Re: How to manage a large cluster?
We use an EC2 image onto which we install Java, Ant, Hadoop, etc. To make it simple, pull those from S3 buckets. That provides a flexible pattern for managing the frameworks involved, more so than needing to re-do an EC2 image whenever you want to add a patch to Hadoop.

Given that approach, you can add your Hadoop application code similarly. Just upload the current stable build out of SVN, Git, whatever, to an S3 bucket.

We use a set of Python scripts to manage a daily, (mostly) automated launch of 100+ EC2 nodes for a Hadoop cluster. We also run a listener on a local server, so that the Hadoop job can send notification when it completes, and allow the local server to initiate download of results.

Overall, that minimizes the need for having a sysadmin dedicated to the Hadoop jobs -- a small dev team can handle it, while focusing on algorithm development and testing.

> Or on EC2 and its competitors, just build a new image whenever you
> need to update Hadoop itself.
Re: How to manage a large cluster?
Sorry, but I can't open it:
http://wiki.smartfrog.org/wiki/display/sf/Patterns+of+Hadoop+Deployment

2008/9/13 Steve Loughran <[EMAIL PROTECTED]>

> James Moore wrote:
>> On Thu, Sep 11, 2008 at 5:46 AM, Allen Wittenauer <[EMAIL PROTECTED]> wrote:
>>> On 9/11/08 2:39 AM, "Alex Loddengaard" <[EMAIL PROTECTED]> wrote:
>>>> I've never dealt with a large cluster, though I'd imagine it is managed
>>>> the same way as small clusters:
>>> Maybe. :)
>
> Depends how often you like to be paged, doesn't it :)
>
>>> Instead, use a real system configuration management package such as
>>> bcfg2, smartfrog, puppet, cfengine, etc. [Steve, you owe me for the plug. :) ]
>
> Yes Allen, I owe you beer at the next ApacheCon we are both at.
> Actually, I think Y! were one of the sponsors at the UK event, so we owe
> you for that too.
>
>> Or on EC2 and its competitors, just build a new image whenever you
>> need to update Hadoop itself.
>
> 1. It's still good to have as much automation of your image build as you
> can; if you can build new machine images on demand you can have fun/make a
> mess of things. Look at http://instalinux.com to see the web GUI for
> creating Linux images on demand that is used inside HP.
>
> 2. When you try to bring up everything from scratch, you have a
> choreography problem. DNS needs to be up early, and your authentication
> system, then the management tools, then the other parts of the system. If
> you have a project where Hadoop is integrated with the front-end site, for
> example, your app servers have to stay offline until HDFS is live. So it
> does get complex.
>
> 3. The Hadoop nodes are good here in that you aren't required to bring up
> the namenode first; the datanodes will wait; same for the tasktrackers and
> the jobtracker. But if you, say, need to point everything at a new hostname
> for the namenode, well, that's a config change that needs to be pushed out,
> somehow.
>
> I'm adding some stuff on different ways to deploy Hadoop here:
>
> http://wiki.smartfrog.org/wiki/display/sf/Patterns+of+Hadoop+Deployment
>
> -steve

--
Sorry for my English!! 明 Please help me to correct my English expression and errors in syntax
Re: Thinking about retriving DFS metadata from datanodes!!!
Thanks, Steve Loughran. I learned something from you!

2008/9/13 Steve Loughran <[EMAIL PROTECTED]>

> 叶双明 wrote:
>> Thanks for paying attention to my tentative idea!
>>
>> What I thought about isn't how to store the metadata, but the final (or
>> last-resort) way to recover valuable data in the cluster when the worst
>> happens (something that destroys the metadata on all of the multiple
>> NameNodes), e.g. a terrorist attack or a natural disaster destroying half
>> of the cluster nodes including all NameNodes. We could recover as much
>> data as possible by this mechanism, and have a big chance to recover the
>> entire data of the cluster because of the original replication.
>
> If you want to survive any event that loses a datacentre, you need to
> mirror the data off site, choosing that second site with an up-to-date
> fault-line map of the city, geological knowledge of where recent eruptions
> ended up, etc. Which is why nobody builds datacentres in Enumclaw WA that
> I'm aware of; the spec for the fabs in/near Portland is that they ought to
> withstand 1-2m of volcanic ash landing on them (what they'd have got if
> there'd been an easterly wind when Mount Saint Helens went). Then, once you
> have some safe location for the second site, talk to your telco about how
> the high-bandwidth backbones in your city flow (Metropolitan Area Ethernet
> and the like), and try to find somewhere that meets your requirements.
>
> Then: come up with a protocol that efficiently keeps the two sites up to
> date. And reliably: S3 went down last month because they'd been using a
> Gossip-style update protocol but weren't checksumming everything, because
> there's no need on a LAN; but of course on a cross-city network more things
> can go wrong, and for them it did.
>
> Something to keep multiple Hadoop filesystems synchronised efficiently and
> reliably across sites could be very useful to many people.
>
> -steve

--
Sorry for my English!! 明 Please help me to correct my English expression and errors in syntax
Re: guaranteeing disk space?
On Sep 15, 2008, at 11:24 AM, Kayla Jay wrote:

> How does one do a check or guarantee there's enough disk space when running
> a hadoop job that you're not sure how much it will produce in its results
> (temp files, etc)?

In 0.19 there is new code that waits until the first N% of maps have run and estimates the amount of space required for each of the following tasks. You can see the discussion here:

https://issues.apache.org/jira/browse/HADOOP-657

You can also set the mapred.local.dir.minspacestart variable, which controls the minimum amount of local disk space that must be free before the tasktracker will ask for a new task.

> Or, what if you run out of disk space on the HDFS if you are running large
> jobs with large outputs? The job just fails... but how can one assess this
> resource allocation of disk space while running your jobs?

Map/Reduce works by re-executing tasks that fail, including tasks that fail for lack of disk space. If a task fails, its partial results are erased on the assumption that it will be run again later. The tasks that finish will have their output in the output directory, even if the job fails.

-- Owen
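The tasktracker guard mentioned above is an ordinary hadoop-site.xml property. A sketch — the 1 GB value is an arbitrary example; the property takes bytes:

```xml
<property>
  <name>mapred.local.dir.minspacestart</name>
  <!-- If free space in mapred.local.dir falls below this many bytes,
       do not ask for more tasks. 0 (the default) disables the check. -->
  <value>1073741824</value>
</property>
```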
Re: guaranteeing disk space?
I'm not sure if this completely answers your question, but I don't think there is anything built into Hadoop that automates cleanup (between MR phases). You may have to do it yourself afterwards.

Are you dealing with multiple map/reduce phases? If so, your intermediary files (which are stored in the intermediary directory of your choice) can just be deleted afterwards. When I'm running multiple jobs I usually write a wrapper script that runs a job and removes the intermediary files before it runs the next job:

  #!/bin/bash
  hadoop jar myjar.jar myjob input intermediate output
  hadoop fs -rmr intermediate
  hadoop jar myjar.jar myjob input2 intermediate output2

I had disk space errors when I was running my job on a previous machine. The problem was that too much was filling up in the logs (so I took out some stdout statements that I was using for debugging purposes), and then I added the whole -rmr thing to my scripts. But what will definitely work is getting a machine with more disk.

Good luck,
-SM

On Mon, Sep 15, 2008 at 1:24 PM, Kayla Jay <[EMAIL PROTECTED]> wrote:
> How does one do a check or guarantee there's enough disk space when running
> a hadoop job that you're not sure how much it will produce in its results
> (temp files, etc)?
>
> I.e. when you run a hadoop job and you're not exactly sure how much disk
> space it will eat up (given temp dirs), the job will fail if it does run
> out.
>
> How do you guarantee while your job is running that there's enough disk
> space on the nodes, and kick off cleanup (so the job won't fail) if you're
> running into low disk space?
>
> For example, if your maps are failing since there isn't enough temporary
> disk space on your nodes while you run a job, how can you fix that up front
> prior to running, or better yet while the job is running, to keep it from
> causing a failed job? The outputs of maps are stored on the local disk of
> the nodes on which they were executed, and if your nodes don't have enough
> while running jobs, how can you fix this at run time? Can I catch this
> condition at all?
>
> Is there a way to fix this at run time? How do others solve this issue
> when running jobs that you're not sure how much disk space they will
> consume?
>
> ---
> Or, what if you run out of disk space on the HDFS if you are running large
> jobs with large outputs? The job just fails... but how can one assess this
> resource allocation of disk space while running your jobs?
>
> If you run out of HDFS disk space, and you know you want the results of
> job X, is there a way to find out while running so that you can do some
> smart cleanup so as to not lose what data could've been produced by job X?
guaranteeing disk space?
How does one do a check or guarantee there's enough disk space when running a hadoop job that you're not sure how much it will produce in its results (temp files, etc)?

I.e. when you run a hadoop job and you're not exactly sure how much disk space it will eat up (given temp dirs), the job will fail if it does run out.

How do you guarantee while your job is running that there's enough disk space on the nodes, and kick off cleanup (so the job won't fail) if you're running into low disk space?

For example, if your maps are failing since there isn't enough temporary disk space on your nodes while you run a job, how can you fix that up front prior to running, or better yet while the job is running, to keep it from causing a failed job? The outputs of maps are stored on the local disk of the nodes on which they were executed, and if your nodes don't have enough while running jobs, how can you fix this at run time? Can I catch this condition at all?

Is there a way to fix this at run time? How do others solve this issue when running jobs that you're not sure how much disk space they will consume?

---

Or, what if you run out of disk space on the HDFS if you are running large jobs with large outputs? The job just fails... but how can one assess this resource allocation of disk space while running your jobs?

If you run out of HDFS disk space, and you know you want the results of job X, is there a way to find out while running so that you can do some smart cleanup so as to not lose what data could've been produced by job X?
Re: Small Filesizes
Peter,

You are likely to hit memory limitations on the namenode. With 100 million small files it will need to support 200 million objects, which will require roughly 30 GB of RAM on the namenode. You may also consider Hadoop archives, or present your files as a collection of records and use Pig, Hive, etc.

--Konstantin

Brian Vargas wrote:
> Peter,
>
> In my testing with files of that size (well, larger, but still well below
> the block size) it was impossible to achieve any real throughput on the
> data because of the overhead of looking up the locations of all those
> files on the NameNode. Your application spends so much time looking up
> file names that most of the CPUs sit idle.
>
> A simple solution is to just load all of the small files into a sequence
> file, and process the sequence file instead.
>
> Brian
>
> Peter McTaggart wrote:
>> Hi All,
>>
>> I am considering using HDFS for an application that potentially has many
>> small files – ie 10-100 million files with an estimated average file size
>> of 50-100 KB (perhaps smaller) and is an online interactive application.
>>
>> All of the documentation I have seen suggests that a block size of
>> 64-128 MB works best for Hadoop/HDFS, and that it is best used for
>> batch-oriented applications.
>>
>> Does anyone have any experience using it for files of this size in an
>> online application environment? Is it worth pursuing HDFS for this type
>> of application?
>>
>> Thanks
>> Peter
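Konstantin's 30 GB figure follows from back-of-the-envelope arithmetic; the per-object heap cost below (~150 bytes) is a commonly quoted rough estimate for namenodes of that era, used here as an assumption:

```python
files = 100_000_000        # 100 million small files
objects = 2 * files        # each file contributes a file object plus ~1 block object
bytes_per_object = 150     # assumed rough namenode heap cost per object

heap_gb = objects * bytes_per_object / 2**30
print(f"~{heap_gb:.0f} GB of namenode heap")  # ~28 GB, i.e. on the order of 30 GB
```

The estimate scales linearly with file count, which is why packing many small files into a few large ones (sequence files, archives) relieves the namenode so dramatically.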
Re: SequenceFiles and binary data
On Sep 14, 2008, at 7:15 PM, John Howland wrote:

> If I want to read values out of input files as binary data, is this what
> BytesWritable is for?

Yes.

> I've successfully run my first task that uses a SequenceFile for output.
> Are there any examples of SequenceFile usage out there? I'd like to see
> the full range of what SequenceFile can do.

If you want serious usage, I'd suggest pulling up Nutch. Distcp also uses sequence files as its input. You should also probably look at the TFile package that Hong is writing: https://issues.apache.org/jira/browse/HADOOP-3315 Once it is ready, it will likely be exactly what you are looking for.

> What are the trade-offs between record compression and block compression?

You pretty much always want block compression. The only place where record compression is ok is if your value is web pages or some other huge chunk of text.

> What are the limits on the key and value sizes?

Large. I think I've seen keys and/or values of around 50-100 MB. It certainly can't be bigger than 1 GB. I believe the TFile limit on keys may be 64 KB.

> How do you use the per-file metadata?

It is just an application-specific string-to-string map in the header of the file.

> My intended use is to read files on a local filesystem into a SequenceFile,
> with the value of each record being the contents of each file. I hacked
> MultiFileWordCount to get the basic concept working...

You should also look at the Hadoop archives: http://hadoop.apache.org/core/docs/r0.18.0/hadoop_archives.html

> but I'd appreciate any advice from the experts. In particular, what's the
> most efficient way to read data from an InputStreamReader/BufferedReader
> into a BytesWritable object?

The easiest way is the way you've done it. You probably want to use lzo compression too.

-- Owen
Re: Implementing own InputFormat and RecordReader
On Sep 15, 2008, at 6:13 AM, Juho Mäkinen wrote:

> 1) The FileInputFormat.getSplits() returns InputSplit[] array. If my input
> file is 128MB and my HDFS block size is 64MB, will it return one InputSplit
> or two InputSplits?

Your InputFormat needs to define:

  protected boolean isSplitable(FileSystem fs, Path filename) {
    return false;
  }

which tells FileInputFormat.getSplits not to split files. You will end up with a single split for each file.

> 2) If my file is split into two or more filesystem blocks, how will hadoop
> handle the reading of those blocks? As the file must be read in sequence,
> will hadoop first copy every block to a machine (if the blocks aren't
> already there) and then start the mapper on that machine? Do I need to
> handle the reading and opening of multiple blocks, or will hadoop provide
> me a simple stream interface which I can use to read the entire file
> without worrying whether the file is larger than the HDFS block size?

HDFS transparently handles the data motion for you. You can just use FileSystem.open(path) and HDFS will pull the file from the closest location. It doesn't actually move the block to your local disk, just gives it to the application. Basically, you don't need to worry about it.

There are two downsides to unsplittable files. The first is that if they are large, the map times can be very long. The second is that the map/reduce scheduler tries to place tasks close to the data, which it can't do very well if the data spans blocks. Of course, if the data isn't splittable, you don't have a choice.

-- Owen
Re: streaming question
If testlink is a package, it should be:

  hadoop jar streaming/hadoop-0.17.0-streaming.jar -input store -output cout \
    -mapper MyProg -combiner testlink.combiner -reducer testlink.reduce \
    -file /home/hadoop/MyProg -cacheFile /shared/part-0#in.cl \
    -cacheArchive /related/MyJar.jar#testlink

If it is not a package, remove the testlink part.

Dennis

Christian Ulrik Søttrup wrote:
> Ok, so I added the JAR to the cacheArchive option and my command looks like
> this:
>
>   hadoop jar streaming/hadoop-0.17.0-streaming.jar -input /store/ -output /cout/ \
>     -mapper MyProg -combiner testlink/combiner.class -reducer testlink/reduce.class \
>     -file /home/hadoop/MyProg -cacheFile /shared/part-0#in.cl \
>     -cacheArchive /related/MyJar.jar#testlink
>
> Now it fails because it cannot find the combiner. The cacheArchive option
> creates a symlink in the local running directory, correct? Just like the
> cacheFile option? If not, how can I then specify which class to use?
>
> cheers, Christian
>
> Amareshwari Sriramadasu wrote:
>> Dennis Kubes wrote:
>>> If I understand what you are asking, you can use -cacheArchive with the
>>> path to the jar to include the jar file in the classpath of your
>>> streaming job.
>>>
>>> Dennis
>>
>> You can also use the -cacheArchive option to include a jar file and
>> symlink the unjarred directory from the cwd by providing the URI as
>> hdfs://<path>#link. You have to provide the -reducer and -combiner options
>> as appropriate paths in the unjarred directory.
>>
>> Thanks
>> Amareshwari
>>
>>> Christian Søttrup wrote:
>>>> Hi all, I have an application that I used to run with the "hadoop jar"
>>>> command. I have now written an optimized version of the mapper in C. I
>>>> have run this using the streaming library and everything looks OK (using
>>>> num.reducers=0). Now I want to use this mapper together with the
>>>> combiner and reducer from my old .jar file. How do I do this? How can I
>>>> distribute the jar and run the reducer and combiner from it, while also
>>>> running the C program as the mapper in streaming mode?
>>>>
>>>> cheers, Christian
Implementing own InputFormat and RecordReader
I'm trying to implement my own InputFormat and RecordReader for my data, and I'm stuck as I can't find enough documentation about the details. My input format is tightly packed binary data consisting of individual event packets. Each event packet contains its length, and the packets are simply appended to the end of a file. Thus the file must be read as a stream and it cannot be split. FileInputFormat seems like a reasonable place to start, but I immediately ran into problems and unanswered questions:

1) FileInputFormat.getSplits() returns an InputSplit[] array. If my input file is 128MB and my HDFS block size is 64MB, will it return one InputSplit or two InputSplits?

2) If my file is split into two or more filesystem blocks, how will hadoop handle the reading of those blocks? As the file must be read in sequence, will hadoop first copy every block to a machine (if the blocks aren't already there) and then start the mapper on that machine? Do I need to handle the reading and opening of multiple blocks, or will hadoop provide me a simple stream interface which I can use to read the entire file without worrying whether the file is larger than the HDFS block size?

- Juho Mäkinen
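Once a whole file is handed to a single mapper, the RecordReader mostly has to walk length-prefixed packets off a stream. A sketch assuming a 4-byte big-endian length header, since the post doesn't specify the actual packet layout:

```python
import io
import struct

def read_packets(stream):
    """Yield payloads from a stream of [4-byte big-endian length][payload] records."""
    while True:
        header = stream.read(4)
        if not header:                    # clean end of stream
            return
        if len(header) < 4:
            raise EOFError("truncated length header")
        (length,) = struct.unpack(">I", header)
        payload = stream.read(length)
        if len(payload) != length:
            raise EOFError("truncated packet body")
        yield payload

# Demo with two packets in an in-memory stream.
raw = struct.pack(">I", 3) + b"abc" + struct.pack(">I", 5) + b"hello"
print(list(read_packets(io.BytesIO(raw))))  # [b'abc', b'hello']
```

A RecordReader built on this pattern would open the file with FileSystem.open() and emit one packet per next() call, with the no-split decision left to isSplitable() as discussed elsewhere in the thread.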
Re: Small Filesizes
-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160

Peter,

In my testing with files of that size (well, larger, but still well below the block size) it was impossible to achieve any real throughput on the data because of the overhead of looking up the locations of all those files on the NameNode. Your application spends so much time looking up file names that most of the CPUs sit idle.

A simple solution is to just load all of the small files into a sequence file, and process the sequence file instead.

Brian

Peter McTaggart wrote:
> Hi All,
>
> I am considering using HDFS for an application that potentially has many
> small files – ie 10-100 million files with an estimated average file size
> of 50-100 KB (perhaps smaller) and is an online interactive application.
>
> All of the documentation I have seen suggests that a block size of
> 64-128 MB works best for Hadoop/HDFS, and that it is best used for
> batch-oriented applications.
>
> Does anyone have any experience using it for files of this size in an
> online application environment?
>
> Is it worth pursuing HDFS for this type of application?
>
> Thanks
>
> Peter

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (MingW32)
Comment: What is this? http://pgp.ardvaark.net

iD8DBQFIzlOt3YdPnMKx1eMRA18fAJ48voMDWLRiKPZHcBxAFAM1Kktk8wCguSDX
dIHsqlePzQHQYFr9AwhkI3I=
=gmAj
-----END PGP SIGNATURE-----
Re: streaming question
Ok, so I added the JAR to the cacheArchive option and my command looks like this:

  hadoop jar streaming/hadoop-0.17.0-streaming.jar -input /store/ -output /cout/ \
    -mapper MyProg -combiner testlink/combiner.class -reducer testlink/reduce.class \
    -file /home/hadoop/MyProg -cacheFile /shared/part-0#in.cl \
    -cacheArchive /related/MyJar.jar#testlink

Now it fails because it cannot find the combiner. The cacheArchive option creates a symlink in the local running directory, correct? Just like the cacheFile option? If not, how can I then specify which class to use?

cheers, Christian

Amareshwari Sriramadasu wrote:
> Dennis Kubes wrote:
>> If I understand what you are asking, you can use -cacheArchive with the
>> path to the jar to include the jar file in the classpath of your streaming
>> job.
>>
>> Dennis
>
> You can also use the -cacheArchive option to include a jar file and symlink
> the unjarred directory from the cwd by providing the URI as hdfs://<path>#link.
> You have to provide the -reducer and -combiner options as appropriate paths
> in the unjarred directory.
>
> Thanks
> Amareshwari
>
>> Christian Søttrup wrote:
>>> Hi all, I have an application that I used to run with the "hadoop jar"
>>> command. I have now written an optimized version of the mapper in C. I
>>> have run this using the streaming library and everything looks OK (using
>>> num.reducers=0). Now I want to use this mapper together with the combiner
>>> and reducer from my old .jar file. How do I do this? How can I distribute
>>> the jar and run the reducer and combiner from it, while also running the
>>> C program as the mapper in streaming mode?
>>>
>>> cheers, Christian
Re: Why can't Hadoop be used for online applications ?
James Moore wrote:
> On Fri, Sep 12, 2008 at 12:28 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:
>> Hey Camilo,
>
> Most of the time, I'm working on code where we're just using SQL as a
> second-rate way to serialize and deserialize objects.

Aah, O/R mapping... technology to hide SQL from developers who aren't deemed competent to understand it. Often seen in conjunction with WS-* toolkits to hide XML from the same developers.

I am co-authoring a set of slides/a paper on this whole topic that I will stick up online soon; bits of it surfaced at a ThoughtWorks event in London last week. The key concepts are:

1. The past. The whole architecture of Java Enterprise Edition is optimised for a deployment of applications over a small cluster of application servers, with a (clustered) relational database behind the scenes to glue everything together.

2. The present. Social applications have a scale, and a need for large data-mining activities for the end users, management, operations and affiliates, that cannot be met by this architecture. Hadoop and a distributed filesystem provide some of the solution, but not all.

3. The future. We propose "Java Cloud Computing Edition", which would be an application architecture for future Java applications designed for this world. It would start with dynamic machine deployment of a distributed filesystem, run Hadoop, HBase and the like on top, and have a front end designed to bind to this and scale out. And it would all be open source, to avoid getting lost in JSR committees and sabotaged by the app server and database vendors.

Like I said, I'm still working on the idea. But I like it. And I don't think trying to push EJB and similar into a world with data kept on HBase and HDFS is the right approach.

-steve