Re: Multiple k,v pairs from a single map - possible?
Thank you very much. This is what I am looking for. 2009/3/27 Brian MacKay > > Amandeep, > > Add this to your driver. > > MultipleOutputs.addNamedOutput(conf, "PHONE",TextOutputFormat.class, > Text.class, Text.class); > > MultipleOutputs.addNamedOutput(conf, "NAME", >TextOutputFormat.class, Text.class, Text.class); > > > > And in your reducer > > private MultipleOutputs mos; > > public void reduce(Text key, Iterator values, >OutputCollector output, Reporter reporter) { > > > // namedOutPut = either PHONE or NAME > >while (values.hasNext()) { >String value = values.next().toString(); >mos.getCollector(namedOutPut, reporter).collect( >new Text(value), new Text(othervals)); >} >} > >@Override >public void configure(JobConf conf) { >super.configure(conf); >mos = new MultipleOutputs(conf); >} > >public void close() throws IOException { >mos.close(); >} > > > > By the way, have you had a chance to post your Oracle fix to > DBInputFormat ? > If so, what is the Jira tag #? > > Brian > > -Original Message- > From: Amandeep Khurana [mailto:ama...@gmail.com] > Sent: Friday, March 27, 2009 5:46 AM > To: core-user@hadoop.apache.org > Subject: Multiple k,v pairs from a single map - possible? > > Is it possible to output multiple key value pairs from a single map > function > run? > > For example, the mapper outputing and > simultaneously... > > Can I write multiple output.collect(...) commands? > > Amandeep > > Amandeep Khurana > Computer Science Graduate Student > University of California, Santa Cruz
Re: Using HDFS to serve www requests
can you please explain exactly adding NIO bridge means what and how it can be done , what could be advantages in this case ? Steve Loughran wrote: > > Edward Capriolo wrote: >> It is a little more natural to connect to HDFS from apache tomcat. >> This will allow you to skip the FUSE mounts and just use the HDFS-API. >> >> I have modified this code to run inside tomcat. >> http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample >> >> I will not testify to how well this setup will perform under internet >> traffic, but it does work. >> > > If someone adds an NIO bridge to hadoop filesystems then it would be > easier; leaving you only with the performance issues. > > -- View this message in context: http://www.nabble.com/Using-HDFS-to-serve-www-requests-tp22725659p22862098.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
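For anyone wanting a concrete starting point, the HDFS-API approach Edward describes boils down to something like the following minimal, untested sketch (the class name and configuration value are made up; it assumes the 0.18/0.19-era org.apache.hadoop.fs API, and the OutputStream would typically be a servlet response stream inside Tomcat):

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsFileServer {
    // Streams a file stored in HDFS to the given output stream.
    public static void copyToStream(String hdfsUri, String file, OutputStream out)
            throws IOException {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", hdfsUri);         // e.g. "hdfs://namenode:9000" (assumed)
        FileSystem fs = FileSystem.get(conf);
        InputStream in = fs.open(new Path(file));
        try {
            IOUtils.copyBytes(in, out, conf, false);  // false = leave the streams open
        } finally {
            in.close();
        }
    }
}

The NIO bridge Steve mentions would let such code go through the standard java.io/java.nio file APIs instead of the Hadoop-specific FileSystem calls above; the performance question stays the same either way.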
Re: skip setting output path for a sequential MR job..
Removing the file programmatically is doing the trick for me. Thank you all for your answers and help :-) On Tue, Mar 31, 2009 at 12:25 AM, some speed wrote: > Hello everyone, > > Is it necessary to redirect the output of reduce to a file? When I am trying > to run the same M-R job more than once, it throws an error that the output > file already exists. I don't want to use command line args so I hard-coded > the file name into the program. > > So, is there a way I could delete a file on HDFS programmatically? > Or can I skip setting an output file path and just have my output print to > console? > Or can I just append to an existing file? > > > Any help is appreciated. Thanks. > > -Sharath >
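For the archives, the programmatic delete amounts to a few FileSystem calls; a minimal, untested sketch (class and method names are made up, the API is the standard 0.18-era org.apache.hadoop.fs):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OutputCleaner {
    // Removes the previous output directory, if any, so the same job can be re-run.
    public static void deleteIfExists(Configuration conf, String outputDir) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(outputDir);
        if (fs.exists(path)) {
            fs.delete(path, true);   // true = recursive, removes the directory and its part files
        }
    }
}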
Re: HDFS data block clarification
The last block of an HDFS file only occupies the required space. So a 4k file only consumes 4k on disk. -- Owen On Apr 2, 2009, at 18:44, javateck javateck wrote: Can someone tell whether a file will occupy one or more blocks? For example, the default block size is 64MB, and if I save a 4k file to HDFS, will the 4K file occupy the whole 64MB block alone? So in this case, do I need to configure the block size to 10k if most of my files are less than 10K? thanks,
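A quick way to confirm this from the client side is the FileStatus API; a minimal, untested sketch (class name made up) that prints a file's actual length next to its nominal block size:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path(args[0]));
        // A 4k file still reports the 64MB block size, but its single block
        // only occupies about 4k of physical disk on the datanode.
        System.out.println("length = " + status.getLen()
                + ", block size = " + status.getBlockSize());
    }
}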
Re: HDFS data block clarification
HDFS only allocates as much physical disk space as is required for a block, up to the block size for the file (+ some header data). So if you write a 4k file, the single block for that file will be around 4k. If you write a 65M file, there will be two blocks, one of roughly 64M, and one of roughly 1M. You can verify this yourself by, on a datanode, running *find ${dfs.data.dir} -iname blk'*' -type f -ls* Note: the above command will only work as expected if a single directory is defined for dfs block storage, and ${dfs.data.dir} is replaced with the effective value of the configuration parameter dfs.data.dir from your hadoop configuration. dfs.data.dir is commonly defined as ${hadoop.tmp.dir}/dfs/data. The following rather insane bash shell command will print out the value of dfs.data.dir on the local machine. It must be run from the hadoop installation directory, and makes 2 temporary names in /tmp/f.PID.input and /tmp/f.PID.output. This little ditty relies on the fact that the configuration parameters are pushed into the process environment for streaming jobs. Streaming Rocks! B=/tmp/f.$$; date > ${B}.input; rmdir ${B}.output; bin/hadoop jar contrib/streaming/hadoop-*-streaming.jar -D fs.default.name=file:/// -jt local -input ${B}.input -output ${B}.output -numReduceTasks 0 -mapper env; grep dfs.data.dir ${B}.output/part-0; rm ${B}.input; rm -rf ${B}.output On Thu, Apr 2, 2009 at 6:44 PM, javateck javateck wrote: > Can someone tell whether a file will occupy one or more blocks? for > example, the default block size is 64MB, and if I save a 4k file to HDFS, > will the 4K file occupy the whole 64MB block alone? so in this case, do I > do > need to configure the block size to 10k if most of my files are less than > 10K? > > thanks, > -- Alpha Chapters of my book on Hadoop are available http://www.apress.com/book/view/9781430219422
HDFS data block clarification
Can someone tell whether a file will occupy one or more blocks? For example, the default block size is 64MB, and if I save a 4k file to HDFS, will the 4K file occupy the whole 64MB block alone? So in this case, do I need to configure the block size to 10k if most of my files are less than 10K? thanks,
Re: Checking if a streaming job failed
Here is how I do it (in Perl). Hadoop streaming is actually called by a shell script, which in this case expects compressed input and produces compressed output. But you get the idea: (the mailer had messed up the formatting somewhat) > sub runStreamingCompInCompOut { my $mapper = shift @_; my $reducer = shift @_; my $inDir = shift @_; my $outDir = shift @_; my $numMappers = shift @_; my $numReducers = shift @_; my $jobName = $runName . ":" . shift @_; my $cmd = "sh runStreamingCompInCompOut.sh $mapper $reducer $inDir $outDir $jobName $numMappers $numReducers &> /tmp/.trace"; print STDERR "Running: $cmd\n"; system $cmd; open IN, "/tmp/.trace" or die "can't open streaming trace"; while(!eof(IN)){ my $line = <IN>; (my $date,my $time,my $status) = split(/\s+/,$line); if ($status eq "ERROR") { print STDERR "command: $cmd failed\n"; exit(-1); } } } 2009/4/3 Mayuran Yogarajah : > Hello, does anyone know how I can check if a streaming job (in Perl) has > failed or succeeded? The only way I can see at the moment is to check > the web interface for that jobID and parse out the '*Status:*' value. > > Is it not possible to do this using 'hadoop job -status' ? I see there is a > count > for failed map/reduce tasks, but map/reduce tasks failing is normal (or so > I thought). I am under the impression that if a task fails it will simply > be > reassigned to a different node. Is this not the case? If this is normal > then I > can't reliably use this count to check if the job as a whole failed or > succeeded. > > Any feedback is greatly appreciated. > > thanks, > M > -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
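If you would rather not parse a trace file or the web UI, the Java client API exposes the overall outcome of a job directly; a minimal, untested sketch (class name made up, the argument is the job ID printed when the streaming job was submitted):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class JobStatusCheck {
    public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());
        RunningJob job = client.getJob(args[0]);   // e.g. "job_200904011612_0025" (made up)
        if (job == null) {
            System.err.println("unknown job id: " + args[0]);
        } else if (!job.isComplete()) {
            System.out.println("still running");
        } else {
            // isSuccessful() reflects the job as a whole, not individual task
            // attempts that failed and were retried on another node.
            System.out.println(job.isSuccessful() ? "SUCCEEDED" : "FAILED");
        }
    }
}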
Re: RPM spec file for 0.19.1
Hey Ian, we are totally fine with this - the only reason we didn't contribute the SPEC file is that it is the output of our internal build system, and we don't have the bandwidth to properly maintain multiple RPMs. That said, we chatted about this a bit today, and were wondering if the community would like us to host RPMs for all releases in our "devel" repository. We can't stand behind these from a reliability angle the same way we can with our "blessed" RPMs, but it's a manageable amount of additional work to have our build system spit those out as well. If you'd like us to do this, please add a "me too" to this page: http://www.getsatisfaction.com/cloudera/topics/should_we_release_host_rpms_for_all_releases We could even skip the branding on the "devel" releases :-) Cheers, Christophe On Thu, Apr 2, 2009 at 12:46 PM, Ian Soboroff wrote: > > I created a JIRA (https://issues.apache.org/jira/browse/HADOOP-5615) > with a spec file for building a 0.19.1 RPM. > > I like the idea of Cloudera's RPM file very much. In particular, it has > nifty /etc/init.d scripts and RPM is nice for managing updates. > However, it's for an older, patched version of Hadoop. > > This spec file is actually just Cloudera's, with suitable edits. The > spec file does not contain an explicit license... if Cloudera have > strong feelings about it, let me know and I'll pull the JIRA attachment. > > The JIRA includes instructions on how to roll the RPMs yourself. I > would have attached the SRPM but they're too big for JIRA. I can offer > noarch RPMs build with this spec file if someone wants to host them. > > Ian > >
Checking if a streaming job failed
Hello, does anyone know how I can check if a streaming job (in Perl) has failed or succeeded? The only way I can see at the moment is to check the web interface for that jobID and parse out the '*Status:*' value. Is it not possible to do this using 'hadoop job -status' ? I see there is a count for failed map/reduce tasks, but map/reduce tasks failing is normal (or so I thought). I am under the impression that if a task fails it will simply be reassigned to a different node. Is this not the case? If this is normal then I can't reliably use this count to check if the job as a whole failed or succeeded. Any feedback is greatly appreciated. thanks, M
Re: Hardware - please sanity check?
> > > I've been assuming that RAID is generally a good idea (disks fail quite > often, and it's cheaper to hotswap a drive than to rebuild an entire box). > Hadoop data nodes are often configured without RAID (i.e., "JBOD" = Just a Bunch of Disks)--HDFS already provides for the data redundancy. Also, if you stripe across disks, you're liable to be as slow as the slowest of your disks, so data nodes are typically configured to point to multiple disks. -- Philip
Question about upgrading
Hello, I have a 5 node cluster with one master node. I am upgrading from 16.4 to 18.3 but am a little confused about whether I am doing it the right way. I read up on the documentation and how to use the -upgrade switch but want to make sure I haven't missed any step. First I took down the cluster by issuing stop-all.sh on the master node. I installed the new hadoop by untarring the tar ball and then copied the config files from the old setup 16.4/conf/* into 18.3/conf/*. Changed some symlinks to point to the new version. Performed this step on all the master and slave nodes. Then I went on the master and started the master node only with the -upgrade switch using the command in the new hadoop version directory. Waited for everything to go smoothly. No errors were reported; I didn't change any setting in the config files and just copied from the old version conf directory. Then I started the other data nodes, so it should work, right? Did I miss anything? I am the sysadmin for this setup, so I want to make sure I do this right and don't have to reformat the file system; I can't afford to lose the data on every upgrade. I want to keep the file system as is and upgrade from 16.4 to 18.3. If I missed any important detail or steps please advise. Thanks, Usman
Re: Hardware - please sanity check?
I had a similar curiosity, but more regarding disk speed. Can I assume linear improvement between 7200rpm -> 10k rpm -> 15k rpm? How much of a bottleneck is disk access? Another question is regarding hardware redundancy. What is the relative value of the following: - RAID / hot-swappable drives - dual NICs - redundant backplane - redundant power supply - UPS I've been assuming that RAID is generally a good idea (disks fail quite often, and it's cheaper to hotswap a drive than to rebuild an entire box). Dual NICs are also good, as both can be used at the same time. Everything else is not necessary in a Hadoop cluster. On Thu, Apr 2, 2009 at 11:33 AM, tim robertson wrote: > Thanks Miles, > > Thus far most of my work has been on EC2 large instances and *mostly* > my code is not memory intensive (I sometimes do joins against polygons > and hold Geospatial indexes in memory, but am aware of keeping things > within the -Xmx for this). > I am mostly looking to move routine data processing and > transformation (lots of distinct, count and group by operations) off a > chunky mysql DB (200million rows and growing) which gets all locked > up. > > We have gigabit switches. > > Cheers > > Tim > > > > On Thu, Apr 2, 2009 at 4:15 PM, Miles Osborne wrote: > > make sure you also have a fast switch, since you will be transmitting > > data across your network and this will come to bite you otherwise > > > > (roughly, you need one core per hadoop-related job, each mapper, task > > tracker etc; the per-core memory may be too small if you are doing > > anything memory-intensive. we have 8-core boxes with 50 -- 33 GB RAM > > and 8 x 1 TB disks on each one; one box however just has 16 GB of RAM > > and it routinely falls over when we run jobs on it) > > > > Miles > > > > 2009/4/2 tim robertson : > >> Hi all, > >> > >> I am not a hardware guy but about to set up a 10 node cluster for some > >> processing of (mostly) tab files, generating various indexes and > >> researching HBase, Mahout, pig, hive etc. > >> > >> Could someone please sanity check that these specs look sensible? > >> [I know 4 drives would be better but price is a factor (second hand > >> not an option, hosting is not either as there is very good bandwidth > >> provided)] > >> > >> Something along the lines of: > >> > >> Dell R200 (8GB is max memory) > >> Quad Core Intel® Xeon® X3360, 2.83GHz, 2x6MB Cache, 1333MHz FSB > >> 8GB Memory, DDR2, 800MHz (4x2GB Dual Ranked DIMMs) > >> 2x 500GB 7.200 rpm 3.5-inch SATA Hard Drive > >> > >> > >> Dell R300 (can be expanded to 24GB RAM) > >> Quad Core Intel® Xeon® X3363, 2.83GHz, 2x6M Cache, 1333MHz FS > >> 8GB Memory, DDR2, 667MHz (2x4GB Dual Ranked DIMMs) > >> 2x 500GB 7.200 rpm 3.5-inch SATA Hard Drive > >> > >> > >> If there is a major flaw please can you let me know. > >> > >> Thanks, > >> > >> Tim > >> (not a hardware guy ;o) > >> > > > > > > > > -- > > The University of Edinburgh is a charitable body, registered in > > Scotland, with registration number SC005336. > > >
Lost TaskTracker Errors
Hey Folks, For the last 2-3 days I have been seeing many of these errors popping up in our hadoop cluster: Task attempt_200904011612_0025_m_000120_0 failed to report status for 604 seconds. Killing. The JobTracker logs don't have any more info, and the task tracker logs are clean. The failures occurred with these symptoms 1. Datanodes will start timing out 2. hdfs will get extremely slow (hdfs ls will take like 2 mins vs 1s in normal mode) The datanode logs on failing tasktracker nodes are filled up with 2009-04-02 11:39:46,828 WARN org.apache.hadoop.dfs.DataNode: DatanodeRegistration(172.16.216.64:50010, storageID=DS-707090154-172.16.216.64-50010-1223506297192, infoPort=50075, ipcPort=50020):Failed to transfer blk_-7774359493260170883_282858 to 172.16.216.62:50010 got java.net.SocketTimeoutException: 48 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/172.16.216.64:36689 remote=/172.16.216.62:50010] at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:185) at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159) at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198) at org.apache.hadoop.dfs.DataNode$BlockSender.sendChunks(DataNode.java:1873) at org.apache.hadoop.dfs.DataNode$BlockSender.sendBlock(DataNode.java:1967) at org.apache.hadoop.dfs.DataNode$DataTransfer.run(DataNode.java:2855) at java.lang.Thread.run(Thread.java:619) We are running a 10 Node cluster (hadoop-0.18.1) on Dual Quad core boxes (8G RAM) with these properties 1. mapred.child.java.opts = -Xmx600M 2. mapred.tasktracker.map.tasks.maximum = 8 3. mapred.tasktracker.reduce.tasks.maximum = 4 4. dfs.datanode.handler.count = 10 5. dfs.datanode.du.reserved = 10240 6. dfs.datanode.max.xcievers = 512 The map jobs write a ton of data for each record; will increasing "dfs.datanode.handler.count" help in this case? What other configuration change can I try? Best Bhupesh
Re: Running MapReduce without setJar
I did all of them i.e. I used setMapClass, setReduceClass and new JobConf(MapReduceWork.class) but still it cannot run the job without a jar file. I understand the reason that it looks for those classes inside a jar but I think there should be some better way to find those classes without using a jar. But I am not sure whether it is possible at all. On Thu, Apr 2, 2009 at 2:56 PM, Rasit OZDAS wrote: > You can point to them by using > conf.setMapClass(..) and conf.setReduceClass(..) - or something > similar, I don't have the source nearby. > > But something weird has happened to my code. It runs locally when I > start it as java process (tries to find input path locally). I'm now > using trunk, maybe something has changed with new version. With > version 0.19 it was fine. > Can somebody point out a clue? > > Rasit > -- Mohammad Farhan Husain Research Assistant Department of Computer Science Erik Jonsson School of Engineering and Computer Science University of Texas at Dallas
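For what it's worth, the one setup I know of that genuinely needs no job jar is the local job runner, because the mapper and reducer classes are loaded from the current classpath; on a real cluster the classes still have to reach the tasktrackers somehow, which normally means a jar (or installing them on every node's classpath). A minimal, untested sketch using the stock identity classes (substitute your own mapper/reducer; paths are made up):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class LocalRunnerExample {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(LocalRunnerExample.class);
        conf.set("mapred.job.tracker", "local");    // local runner: no jar is shipped anywhere
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);
        conf.setOutputKeyClass(LongWritable.class); // TextInputFormat keys are byte offsets
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}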
Re: Running MapReduce without setJar
You can point to them by using conf.setMapClass(..) and conf.setReduceClass(..) - or something similar, I don't have the source nearby. But something weird has happened to my code. It runs locally when I start it as java process (tries to find input path locally). I'm now using trunk, maybe something has changed with new version. With version 0.19 it was fine. Can somebody point out a clue? Rasit
Re: Multiple k,v pairs from a single map - possible?
Here's the JIRA for the Oracle fix. https://issues.apache.org/jira/browse/HADOOP-5616 Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Fri, Mar 27, 2009 at 5:18 AM, Brian MacKay wrote: > > Amandeep, > > Add this to your driver. > > MultipleOutputs.addNamedOutput(conf, "PHONE",TextOutputFormat.class, > Text.class, Text.class); > > MultipleOutputs.addNamedOutput(conf, "NAME", >TextOutputFormat.class, Text.class, Text.class); > > > > And in your reducer > > private MultipleOutputs mos; > > public void reduce(Text key, Iterator values, >OutputCollector output, Reporter reporter) { > > > // namedOutPut = either PHONE or NAME > >while (values.hasNext()) { >String value = values.next().toString(); >mos.getCollector(namedOutPut, reporter).collect( >new Text(value), new Text(othervals)); >} >} > >@Override >public void configure(JobConf conf) { >super.configure(conf); >mos = new MultipleOutputs(conf); >} > >public void close() throws IOException { >mos.close(); >} > > > > By the way, have you had a chance to post your Oracle fix to > DBInputFormat ? > If so, what is the Jira tag #? > > Brian > > -Original Message- > From: Amandeep Khurana [mailto:ama...@gmail.com] > Sent: Friday, March 27, 2009 5:46 AM > To: core-user@hadoop.apache.org > Subject: Multiple k,v pairs from a single map - possible? > > Is it possible to output multiple key value pairs from a single map > function > run? > > For example, the mapper outputing and > simultaneously... > > Can I write multiple output.collect(...) commands? > > Amandeep > > Amandeep Khurana > Computer Science Graduate Student > University of California, Santa Cruz
RPM spec file for 0.19.1
I created a JIRA (https://issues.apache.org/jira/browse/HADOOP-5615) with a spec file for building a 0.19.1 RPM. I like the idea of Cloudera's RPM file very much. In particular, it has nifty /etc/init.d scripts and RPM is nice for managing updates. However, it's for an older, patched version of Hadoop. This spec file is actually just Cloudera's, with suitable edits. The spec file does not contain an explicit license... if Cloudera have strong feelings about it, let me know and I'll pull the JIRA attachment. The JIRA includes instructions on how to roll the RPMs yourself. I would have attached the SRPM but they're too big for JIRA. I can offer noarch RPMs build with this spec file if someone wants to host them. Ian
Re: Amazon Elastic MapReduce
Kevin, The API accepts any arguments you can pass in the standard jobconf for Hadoop 18.3, it is pretty easy to convert over an existing jobflow to a JSON job description that will run on the service. -Pete On Thu, Apr 2, 2009 at 2:44 PM, Kevin Peterson wrote: > So if I understand correctly, this is an automated system to bring up a > hadoop cluster on EC2, import some data from S3, run a job flow, write the > data back to S3, and bring down the cluster? > > This seems like a pretty good deal. At the pricing they are offering, > unless > I'm able to keep a cluster at more than about 80% capacity 24/7, it'll be > cheaper to use this new service. > > Does this use an existing Hadoop job control API, or do I need to write my > flows to conform to Amazon's API? > -- Peter N. Skomoroch 617.285.8348 http://www.datawrangling.com http://delicious.com/pskomoroch http://twitter.com/peteskomoroch
Re: HELP: I wanna store the output value into a list not write to the disk
It seems like the InMemoryFileSystem class has been deprecated in Hadoop 0.19.1. Why? I want to reuse the result of reduce as the next time map's input. Cascading does not work, because the data of each step is dependent. I set each timestep mapreduce job as synchronization. If the InMemoryFileSystem is deprecated. How can I reduce the I/O for each timestep's mapreduce job. 2009/4/2 Farhan Husain > Is there a way to implement some OutputCollector that can do what Andy > wants > to do? > > On Thu, Apr 2, 2009 at 10:21 AM, Rasit OZDAS wrote: > > > Andy, I didn't try this feature. But I know that Yahoo had a > > performance record with this file format. > > I came across a file system included in hadoop code (probably that > > one) when searching the source code. > > Luckily I found it: org.apache.hadoop.fs.InMemoryFileSystem > > But if you have a lot of big files, this approach won't be suitable I > > think. > > > > Maybe someone can give further info. > > > > 2009/4/2 andy2005cst : > > > > > > thanks for your reply. Let me explain more clearly, since Map Reduce is > > just > > > one step of my program, I need to use the output of reduce for furture > > > computation, so i do not need to want to wirte the output into disk, > but > > > wanna to get the collection or list of the output in RAM. if it > directly > > > wirtes into disk, I have to read it back into RAM again. > > > you have mentioned a special file format, will you please show me what > is > > > it? and give some example if possible. > > > > > > thank you so much. > > > > > > > > > Rasit OZDAS wrote: > > >> > > >> Hi, hadoop is normally designed to write to disk. There are a special > > file > > >> format, which writes output to RAM instead of disk. > > >> But I don't have an idea if it's what you're looking for. > > >> If what you said exists, there should be a mechanism which sends > output > > as > > >> objects rather than file content across computers, as far as I know > > there > > >> is > > >> no such feature yet. > > >> > > >> Good luck. > > >> > > >> 2009/4/2 andy2005cst > > >> > > >>> > > >>> I need to use the output of the reduce, but I don't know how to do. > > >>> use the wordcount program as an example if i want to collect the > > >>> wordcount > > >>> into a hashtable for further use, how can i do? > > >>> the example just show how to let the result onto disk. > > >>> myemail is : andy2005...@gmail.com > > >>> looking forward your help. thanks a lot. > > >>> -- > > >>> View this message in context: > > >>> > > > http://www.nabble.com/HELP%3A-I-wanna-store-the-output-value-into-a-list-not-write-to-the-disk-tp22844277p22844277.html > > >>> Sent from the Hadoop core-user mailing list archive at Nabble.com. > > >>> > > >>> > > >> > > >> > > >> -- > > >> M. Raşit ÖZDAŞ > > >> > > >> > > > > > > -- > > > View this message in context: > > > http://www.nabble.com/HELP%3A-I-wanna-store-the-output-value-into-a-list-not-write-to-the-disk-tp22844277p22848070.html > > > Sent from the Hadoop core-user mailing list archive at Nabble.com. > > > > > > > > > > > > > > -- > > M. Raşit ÖZDAŞ > > > > > > -- > Mohammad Farhan Husain > Research Assistant > Department of Computer Science > Erik Jonsson School of Engineering and Computer Science > University of Texas at Dallas > -- Chen He RCF CSE Dept. University of Nebraska-Lincoln US
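Whatever happens with the in-memory filesystem, the usual pattern for time-stepped jobs is to chain them in the driver: JobClient.runJob blocks until a step finishes (your synchronization point), and the next step's input path is simply the previous step's output path, so only HDFS I/O sits between steps. A minimal, untested sketch (class, path, and argument names are made up; per-step mapper/reducer setup omitted):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class IterativeDriver {
    public static void main(String[] args) throws Exception {
        String input = args[0];                      // initial input directory
        String workDir = args[1];                    // base directory for intermediate output
        int timesteps = Integer.parseInt(args[2]);
        for (int t = 0; t < timesteps; t++) {
            JobConf conf = new JobConf(IterativeDriver.class);
            conf.setJobName("timestep-" + t);
            // ... set mapper, reducer and output types for this timestep here ...
            FileInputFormat.setInputPaths(conf, new Path(input));
            String output = workDir + "/step-" + t;
            FileOutputFormat.setOutputPath(conf, new Path(output));
            JobClient.runJob(conf);                  // blocks until the step completes
            input = output;                          // this step's output feeds the next map phase
        }
    }
}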
Re: Amazon Elastic MapReduce
So if I understand correctly, this is an automated system to bring up a hadoop cluster on EC2, import some data from S3, run a job flow, write the data back to S3, and bring down the cluster? This seems like a pretty good deal. At the pricing they are offering, unless I'm able to keep a cluster at more than about 80% capacity 24/7, it'll be cheaper to use this new service. Does this use an existing Hadoop job control API, or do I need to write my flows to conform to Amazon's API?
Re: HELP: I wanna store the output value into a list not write to the disk
Is there a way to implement some OutputCollector that can do what Andy wants to do? On Thu, Apr 2, 2009 at 10:21 AM, Rasit OZDAS wrote: > Andy, I didn't try this feature. But I know that Yahoo had a > performance record with this file format. > I came across a file system included in hadoop code (probably that > one) when searching the source code. > Luckily I found it: org.apache.hadoop.fs.InMemoryFileSystem > But if you have a lot of big files, this approach won't be suitable I > think. > > Maybe someone can give further info. > > 2009/4/2 andy2005cst : > > > > thanks for your reply. Let me explain more clearly, since Map Reduce is > just > > one step of my program, I need to use the output of reduce for furture > > computation, so i do not need to want to wirte the output into disk, but > > wanna to get the collection or list of the output in RAM. if it directly > > wirtes into disk, I have to read it back into RAM again. > > you have mentioned a special file format, will you please show me what is > > it? and give some example if possible. > > > > thank you so much. > > > > > > Rasit OZDAS wrote: > >> > >> Hi, hadoop is normally designed to write to disk. There are a special > file > >> format, which writes output to RAM instead of disk. > >> But I don't have an idea if it's what you're looking for. > >> If what you said exists, there should be a mechanism which sends output > as > >> objects rather than file content across computers, as far as I know > there > >> is > >> no such feature yet. > >> > >> Good luck. > >> > >> 2009/4/2 andy2005cst > >> > >>> > >>> I need to use the output of the reduce, but I don't know how to do. > >>> use the wordcount program as an example if i want to collect the > >>> wordcount > >>> into a hashtable for further use, how can i do? > >>> the example just show how to let the result onto disk. > >>> myemail is : andy2005...@gmail.com > >>> looking forward your help. thanks a lot. > >>> -- > >>> View this message in context: > >>> > http://www.nabble.com/HELP%3A-I-wanna-store-the-output-value-into-a-list-not-write-to-the-disk-tp22844277p22844277.html > >>> Sent from the Hadoop core-user mailing list archive at Nabble.com. > >>> > >>> > >> > >> > >> -- > >> M. Raşit ÖZDAŞ > >> > >> > > > > -- > > View this message in context: > http://www.nabble.com/HELP%3A-I-wanna-store-the-output-value-into-a-list-not-write-to-the-disk-tp22844277p22848070.html > > Sent from the Hadoop core-user mailing list archive at Nabble.com. > > > > > > > > -- > M. Raşit ÖZDAŞ > -- Mohammad Farhan Husain Research Assistant Department of Computer Science Erik Jonsson School of Engineering and Computer Science University of Texas at Dallas
Re: A bizarre problem in reduce method
Thanks Rasit for your suggestion. Actually, I should have let the group know earlier that I solved the problem and it had nothing to do with the reduce method. I used my reducer class as the combiner too which is not appropriate in this case. I just got rid of the combiner and everything works fine now. I think the Map/Reduce tutorial in hadoop's website should talk more about the combiner. In the word count example the reducer can work as a combiner but not in all other problems. This should be highlighted a little bit more in the tutorial. On Thu, Apr 2, 2009 at 8:50 AM, Rasit OZDAS wrote: > Hi, Husain, > > 1. You can use a boolean control in your code. > boolean hasAlreadyOned = false; >int iCount = 0; > String sValue; > while (values.hasNext()) { > sValue = values.next().toString(); > iCount++; >if (sValue.equals("1")) > hasAlreadyOned = true; > > if (!hasAlreadyOned) > sValues += "\t" + sValue; > } > ... > > 2. You're actually controlling for 3 elements, not 2. You should use if > (iCount == 1) > > 2009/4/1 Farhan Husain > > > Hello All, > > > > I am facing some problems with a reduce method I have written which I > > cannot > > understand. Here is the method: > > > >@Override > >public void reduce(Text key, Iterator values, > > OutputCollector output, Reporter reporter) > >throws IOException { > >String sValues = ""; > >int iCount = 0; > >String sValue; > >while (values.hasNext()) { > >sValue = values.next().toString(); > >iCount++; > >sValues += "\t" + sValue; > > > >} > >sValues += "\t" + iCount; > >//if (iCount == 2) > >output.collect(key, new Text(sValues)); > >} > > > > The output of the code is like the following: > > > > D0U0:GraduateStudent0lehigh:GraduateStudent11 > 1 > > D0U0:GraduateStudent1lehigh:GraduateStudent11 > 1 > > D0U0:GraduateStudent10lehigh:GraduateStudent11 > 1 > > D0U0:GraduateStudent100lehigh:GraduateStudent11 > > 1 > > D0U0:GraduateStudent101lehigh:GraduateStudent1 > > D0U0:GraduateCourse0121 > > D0U0:GraduateStudent102lehigh:GraduateStudent11 > > 1 > > D0U0:GraduateStudent103lehigh:GraduateStudent11 > > 1 > > D0U0:GraduateStudent104lehigh:GraduateStudent11 > > 1 > > D0U0:GraduateStudent105lehigh:GraduateStudent11 > > 1 > > > > The problem is there cannot be so many 1's in the output value. The > output > > which I expect should be like this: > > > > D0U0:GraduateStudent0lehigh:GraduateStudent1 > > D0U0:GraduateStudent1lehigh:GraduateStudent1 > > D0U0:GraduateStudent10lehigh:GraduateStudent1 > > D0U0:GraduateStudent100lehigh:GraduateStudent1 > > D0U0:GraduateStudent101lehigh:GraduateStudent > > D0U0:GraduateCourse02 > > D0U0:GraduateStudent102lehigh:GraduateStudent1 > > D0U0:GraduateStudent103lehigh:GraduateStudent1 > > D0U0:GraduateStudent104lehigh:GraduateStudent1 > > D0U0:GraduateStudent105lehigh:GraduateStudent1 > > > > If I do not append the iCount variable to sValues string, I get the > > following output: > > > > D0U0:GraduateStudent0lehigh:GraduateStudent > > D0U0:GraduateStudent1lehigh:GraduateStudent > > D0U0:GraduateStudent10lehigh:GraduateStudent > > D0U0:GraduateStudent100lehigh:GraduateStudent > > D0U0:GraduateStudent101lehigh:GraduateStudent > > D0U0:GraduateCourse0 > > D0U0:GraduateStudent102lehigh:GraduateStudent > > D0U0:GraduateStudent103lehigh:GraduateStudent > > D0U0:GraduateStudent104lehigh:GraduateStudent > > D0U0:GraduateStudent105lehigh:GraduateStudent > > > > This confirms that there is no 1's after each of those values (which I > > already know from the intput data). 
I do not know why the output is > > distorted like that when I append the iCount to sValues (like the given > > code). Can anyone help in this regard? > > > > Now comes the second problem which is equally perplexing. Actually, the > > reduce method which I want to run is like the following: > > > >@Override > >public void reduce(Text key, Iterator values, > > OutputCollector output, Reporter reporter) > >throws IOException { > >String sValues = ""; > >int iCount = 0; > >String sValue; > >while (values.hasNext()) { > >sValue = values.next().toString(); > >iCount++; > >sValues += "\t" + sValue; > > > >} > >s
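To make the combiner point concrete, here is a minimal, untested word-count driver built from the stock lib classes, where reusing the reducer as the combiner happens to be safe; the contrast with the job above is that LongSumReducer is associative and consumes and emits the same key/value types, whereas a reducer that concatenates values and appends a count must not be registered as a combiner:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.LongSumReducer;
import org.apache.hadoop.mapred.lib.TokenCountMapper;

public class SafeCombinerExample {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SafeCombinerExample.class);
        conf.setJobName("wordcount-with-combiner");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(LongWritable.class);
        conf.setMapperClass(TokenCountMapper.class);
        // Safe here only because summing partial counts is associative;
        // omit setCombinerClass entirely when the reduce logic is not.
        conf.setCombinerClass(LongSumReducer.class);
        conf.setReducerClass(LongSumReducer.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}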
Re: Running MapReduce without setJar
Does this class need to have the mapper and reducer classes too? On Wed, Apr 1, 2009 at 1:52 PM, javateck javateck wrote: > you can run from java program: > >JobConf conf = new JobConf(MapReduceWork.class); > >// setting your params > >JobClient.runJob(conf); > > > On Wed, Apr 1, 2009 at 11:42 AM, Farhan Husain wrote: > > > Can I get rid of the whole jar thing? Is there any way to run map reduce > > programs without using a jar? I do not want to use "hadoop jar ..." > either. > > > > On Wed, Apr 1, 2009 at 1:10 PM, javateck javateck > >wrote: > > > > > I think you need to set a property (mapred.jar) inside hadoop-site.xml, > > > then > > > you don't need to hardcode in your java code, and it will be fine. > > > But I don't know if there is any way that we can set multiple jars, > since > > a > > > lot of times our own mapreduce class needs to reference other jars. > > > > > > On Wed, Apr 1, 2009 at 10:57 AM, Farhan Husain > > wrote: > > > > > > > Hello, > > > > > > > > Can anyone tell me if there is any way running a map-reduce job from > a > > > java > > > > program without specifying the jar file by JobConf.setJar() method? > > > > > > > > Thanks, > > > > > > > > -- > > > > Mohammad Farhan Husain > > > > Research Assistant > > > > Department of Computer Science > > > > Erik Jonsson School of Engineering and Computer Science > > > > University of Texas at Dallas > > > > > > > > > > > > > > > -- > > Mohammad Farhan Husain > > Research Assistant > > Department of Computer Science > > Erik Jonsson School of Engineering and Computer Science > > University of Texas at Dallas > > > -- Mohammad Farhan Husain Research Assistant Department of Computer Science Erik Jonsson School of Engineering and Computer Science University of Texas at Dallas
Re: HELP: I wanna store the output value into a list not write to the disk
I don't really see what the downside of reading it from disk is. A list of word counts should be pretty small on disk so it shouldn't take long to read it into a HashMap. Doing anything else is going to cause you to go a long way out of your way to end up with the same result. -Bryan On Apr 2, 2009, at 2:41 AM, andy2005cst wrote: I need to use the output of the reduce, but I don't know how to do. use the wordcount program as an example if i want to collect the wordcount into a hashtable for further use, how can i do? the example just show how to let the result onto disk. myemail is : andy2005...@gmail.com looking forward your help. thanks a lot. -- View this message in context: http://www.nabble.com/HELP%3A-I-wanna- store-the-output-value-into-a-list-not-write-to-the-disk- tp22844277p22844277.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
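Reading the counts back is indeed only a few lines; a minimal, untested sketch (class and method names made up) that loads every part file of a TextOutputFormat output directory into a HashMap, assuming tab-separated word/count lines:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadCounts {
    public static Map<String, Long> load(Configuration conf, String outputDir) throws IOException {
        Map<String, Long> counts = new HashMap<String, Long>();
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus stat : fs.listStatus(new Path(outputDir))) {
            if (!stat.getPath().getName().startsWith("part-")) {
                continue;   // skip _logs and anything else in the output directory
            }
            BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(stat.getPath())));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] fields = line.split("\t");   // TextOutputFormat writes key<TAB>value
                    counts.put(fields[0], Long.parseLong(fields[1]));
                }
            } finally {
                in.close();
            }
        }
        return counts;
    }
}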
Re: HELP: I wanna store the output value into a list not write to the disk
That seems interesting, we have 3 replications as default. Is there a way to define, lets say, 1 replication for only job-specific files? 2009/4/2 Owen O'Malley : > > On Apr 2, 2009, at 2:41 AM, andy2005cst wrote: > >> >> I need to use the output of the reduce, but I don't know how to do. >> use the wordcount program as an example if i want to collect the wordcount >> into a hashtable for further use, how can i do? > > You can use an output format and then an input format that uses a database, > but in practice, the cost of writing to hdfs and reading it back is not a > problem, especially if you set the replication of the output files to 1. > (You'll need to re-run the job if you lose a node, but it will be fast.) > > -- Owen > -- M. Raşit ÖZDAŞ
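Two possibilities, both untested sketches with made-up class and method names: set dfs.replication in the job's configuration so its output files are created with replication 1, or lower the replication of existing files afterwards with FileSystem.setReplication:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;

public class ReplicationTweaks {
    // Option 1: files written by this job are created with a single replica.
    public static void writeOutputWithOneReplica(JobConf conf) {
        conf.setInt("dfs.replication", 1);
    }

    // Option 2: drop the replication of an existing output directory's files to 1.
    public static void lowerReplication(Configuration conf, String outputDir) throws IOException {
        FileSystem fs = FileSystem.get(conf);
        for (FileStatus stat : fs.listStatus(new Path(outputDir))) {
            if (!stat.isDir()) {
                fs.setReplication(stat.getPath(), (short) 1);
            }
        }
    }
}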
Re: HELP: I wanna store the output value into a list not write to the disk
On Apr 2, 2009, at 2:41 AM, andy2005cst wrote: I need to use the output of the reduce, but I don't know how to do. use the wordcount program as an example if i want to collect the wordcount into a hashtable for further use, how can i do? You can use an output format and then an input format that uses a database, but in practice, the cost of writing to hdfs and reading it back is not a problem, especially if you set the replication of the output files to 1. (You'll need to re-run the job if you lose a node, but it will be fast.) -- Owen
Re: HadoopConfig problem -Datanode not able to connect to the server
I have no idea, but there are many "use hostname instead of IP" issues. Try once hostname instead of IP. 2009/3/26 mingyang : > check you iptable is off > > 2009/3/26 snehal nagmote > >> hello, >> We configured hadoop successfully, but after some days its configuration >> file from datanode( hadoop-site.xml) went off , and datanode was not coming >> up ,so we again did the same configuration, its showing one datanode and >> its >> name as localhost rather than expected as either name of respected datanode >> m/c or ip address of actual datanode in ui interfece of hadoop. >> >> But capacity as 80.0gb ,(we have one namenode (40 gb) and datanode(40 >> gb))means capacity is updated ,we can browse the filesystem , it is showing >> whatever directories we are creating in namenode . >> >> but when we try to access the same through the datanode machine >> means doing ssh and executing series of commands its not able to connect to >> the server. >> saying retrying connect to the server >> >> 09/03/26 11:25:11 INFO ipc.Client: Retrying connect to server: / >> 172.16.6.102:21011. Already tried 0 time(s). >> >> 09/03/26 11:25:11 INFO ipc.Client: Retrying connect to server: / >> 172.16.6.102:21011. Already tried 1 time(s) >> >> >> moreover we added one datanode into it and formatted namenode ,but that >> datanode is not getting added. we are not understanding whats the problem. >> >> Can configuration files in case of datanode automatcally lost after some >> days?? >> >> I have again one doubt , according to my understanding namenode doesnt >> store >> any data , it stores metadata of all the data , so when i execute mkdir in >> namenode machine and copying some files into it, it means that data is >> getting stored in datanode provided to it, please correct me if i am wrong >> , >> i am very new to hadoop. >> So if i am able to view the data through inteface means its properly >> storing >> data into respected datanode, So >> why its showing localhost as datanode name rather than respected datanode >> name. >> >> can you please help. >> >> >> Regards, >> Snehal Nagmote >> IIIT hyderabad >> > > > > -- > 致 > 礼! > > > 王明阳 > -- M. Raşit ÖZDAŞ
Re: Hardware - please sanity check?
Thanks Miles, Thus far most of my work has been on EC2 large instances and *mostly* my code is not memory intensive (I sometimes do joins against polygons and hold Geospatial indexes in memory, but am aware of keeping things within the -Xmx for this). I am mostly looking to move routine data processing and transformation (lots of distinct, count and group by operations) off a chunky mysql DB (200million rows and growing) which gets all locked up. We have gigabit switches. Cheers Tim On Thu, Apr 2, 2009 at 4:15 PM, Miles Osborne wrote: > make sure you also have a fast switch, since you will be transmitting > data across your network and this will come to bite you otherwise > > (roughly, you need one core per hadoop-related job, each mapper, task > tracker etc; the per-core memory may be too small if you are doing > anything memory-intensive. we have 8-core boxes with 50 -- 33 GB RAM > and 8 x 1 TB disks on each one; one box however just has 16 GB of RAM > and it routinely falls over when we run jobs on it) > > Miles > > 2009/4/2 tim robertson : >> Hi all, >> >> I am not a hardware guy but about to set up a 10 node cluster for some >> processing of (mostly) tab files, generating various indexes and >> researching HBase, Mahout, pig, hive etc. >> >> Could someone please sanity check that these specs look sensible? >> [I know 4 drives would be better but price is a factor (second hand >> not an option, hosting is not either as there is very good bandwidth >> provided)] >> >> Something along the lines of: >> >> Dell R200 (8GB is max memory) >> Quad Core Intel® Xeon® X3360, 2.83GHz, 2x6MB Cache, 1333MHz FSB >> 8GB Memory, DDR2, 800MHz (4x2GB Dual Ranked DIMMs) >> 2x 500GB 7.200 rpm 3.5-inch SATA Hard Drive >> >> >> Dell R300 (can be expanded to 24GB RAM) >> Quad Core Intel® Xeon® X3363, 2.83GHz, 2x6M Cache, 1333MHz FS >> 8GB Memory, DDR2, 667MHz (2x4GB Dual Ranked DIMMs) >> 2x 500GB 7.200 rpm 3.5-inch SATA Hard Drive >> >> >> If there is a major flaw please can you let me know. >> >> Thanks, >> >> Tim >> (not a hardware guy ;o) >> > > > > -- > The University of Edinburgh is a charitable body, registered in > Scotland, with registration number SC005336. >
Re: HELP: I wanna store the output value into a list not write to the disk
Andy, I didn't try this feature. But I know that Yahoo had a performance record with this file format. I came across a file system included in hadoop code (probably that one) when searching the source code. Luckily I found it: org.apache.hadoop.fs.InMemoryFileSystem But if you have a lot of big files, this approach won't be suitable I think. Maybe someone can give further info. 2009/4/2 andy2005cst : > > thanks for your reply. Let me explain more clearly, since Map Reduce is just > one step of my program, I need to use the output of reduce for furture > computation, so i do not need to want to wirte the output into disk, but > wanna to get the collection or list of the output in RAM. if it directly > wirtes into disk, I have to read it back into RAM again. > you have mentioned a special file format, will you please show me what is > it? and give some example if possible. > > thank you so much. > > > Rasit OZDAS wrote: >> >> Hi, hadoop is normally designed to write to disk. There are a special file >> format, which writes output to RAM instead of disk. >> But I don't have an idea if it's what you're looking for. >> If what you said exists, there should be a mechanism which sends output as >> objects rather than file content across computers, as far as I know there >> is >> no such feature yet. >> >> Good luck. >> >> 2009/4/2 andy2005cst >> >>> >>> I need to use the output of the reduce, but I don't know how to do. >>> use the wordcount program as an example if i want to collect the >>> wordcount >>> into a hashtable for further use, how can i do? >>> the example just show how to let the result onto disk. >>> myemail is : andy2005...@gmail.com >>> looking forward your help. thanks a lot. >>> -- >>> View this message in context: >>> http://www.nabble.com/HELP%3A-I-wanna-store-the-output-value-into-a-list-not-write-to-the-disk-tp22844277p22844277.html >>> Sent from the Hadoop core-user mailing list archive at Nabble.com. >>> >>> >> >> >> -- >> M. Raşit ÖZDAŞ >> >> > > -- > View this message in context: > http://www.nabble.com/HELP%3A-I-wanna-store-the-output-value-into-a-list-not-write-to-the-disk-tp22844277p22848070.html > Sent from the Hadoop core-user mailing list archive at Nabble.com. > > -- M. Raşit ÖZDAŞ
Re: Identify the input file for a failed mapper/reducer
Two quotes for this problem: "Streaming map tasks should have a "map_input_file" environment variable like the following: map_input_file=hdfs://HOST/path/to/file" "the value for map.input.file gives you the exact information you need." (didn't try) Rasit 2009/3/26 Jason Fennell : > Is there a way to identify the input file a mapper was running on when > it failed? When a large job fails because of bad input lines I have > to resort to rerunning the entire job to isolate a single bad line > (since the log doesn't contain information on the file that that > mapper was running on). > > Basically, I would like to be able to do one of the following: > 1. Find the file that a mapper was running on when it failed > 2. Find the block that a mapper was running on when it failed (and be > able to find file names from block ids) > > I haven't been able to find any documentation on facilities to > accomplish either (1) or (2), so I'm hoping someone on this list will > have a suggestion. > > I am using the Hadoop streaming API on hadoop 0.18.2. > > -Jason > -- M. Raşit ÖZDAŞ
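For a Java (non-streaming) mapper the same information is available from the JobConf; a minimal, untested sketch (class name and the expected tab-separated record format are made up) that remembers map.input.file in configure() and reports it when a bad record shows up:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class InputAwareMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {
    private String inputFile;

    public void configure(JobConf conf) {
        // Same value the streaming map_input_file environment variable comes from.
        inputFile = conf.get("map.input.file");
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        String[] fields = value.toString().split("\t");
        if (fields.length < 2) {
            // Surface the offending file and byte offset instead of failing silently.
            reporter.setStatus("bad record in " + inputFile + " at offset " + key);
            return;
        }
        output.collect(new Text(fields[0]), new Text(fields[1]));
    }
}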
Re: Amazon Elastic MapReduce
You should check out the new pricing. On Apr 2, 2009, at 1:13 AM, zhang jianfeng wrote: seems like I should pay for additional money, so why not configure a hadoop cluster in EC2 by myself. This already have been automatic using script. On Thu, Apr 2, 2009 at 4:09 PM, Miles Osborne wrote: ... and only in the US Miles 2009/4/2 zhang jianfeng : Does it support pig ? On Thu, Apr 2, 2009 at 3:47 PM, Chris K Wensel wrote: FYI Amazons new Hadoop offering: http://aws.amazon.com/elasticmapreduce/ And Cascading 1.0 supports it: http://www.cascading.org/2009/04/amazon-elastic-mapreduce.html cheers, ckw -- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/ -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. -- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/
Re: Join Variation
Probably be available in a week or so, as draft one isn't quite finished :) On Thu, Apr 2, 2009 at 1:45 AM, Stefan Podkowinski wrote: > .. and is not yet available as an alpha book chapter. Any chance uploading > it? > > On Thu, Apr 2, 2009 at 4:21 AM, jason hadoop > wrote: > > Just for fun, chapter 9 in my book is a work through of solving this > class > > of problem. > > > > > > On Thu, Mar 26, 2009 at 7:07 AM, jason hadoop >wrote: > > > >> For the classic map/reduce job, you have 3 requirements. > >> > >> 1) a comparator that provide the keys in ip address order, such that all > >> keys in one of your ranges, would be contiguous, when sorted with the > >> comparator > >> 2) a partitioner that ensures that all keys that should be together end > up > >> in the same partition > >> 3) and output value grouping comparator that considered all keys in a > >> specified range equal. > >> > >> The comparator only sorts by the first part of the key, the search file > has > >> a 2 part key begin/end the input data has just a 1 part key. > >> > >> A partitioner that new ahead of time the group sets in your search set, > in > >> the way that the tera sort example works would be ideal: > >> ie: it builds an index of ranges from your seen set so that the ranges > get > >> rougly evenly split between your reduces. > >> This requires a pass over the search file to write out a summary file, > >> which is then loaded by the partitioner. > >> > >> The output value grouping comparator, will get the keys in order of the > >> first token, and will define the start of a group by the presence of a 2 > >> part key, and consider the group ended when either another 2 part key > >> appears, or when the key value is larger than the second part of the > >> starting key. - This does require that the grouping comparator maintain > >> state. > >> > >> At this point, your reduce will be called with the first key in the key > >> equivalence group of (3), with the values of all of the keys > >> > >> In your map, any address that is not in a range of interest is not > passed > >> to output.collect. > >> > >> For the map side join code, you have to define a comparator on the key > type > >> that defines your definition of equivalence and ordering, and call > >> WritableComparator.define( Key.class, comparator.class ), to force the > join > >> code to use your comparator. > >> > >> For tables with duplicates, per the key comparator, in map side join, > your > >> map fuction will receive a row for every permutation of the duplicate > keys: > >> if you have one table a, 1; a, 2; and another table with a, 3; a, 4; > your > >> map will receive4 rows, a, 1, 3; a, 1, 4; a, 2, 3; a, 2, 4; > >> > >> > >> > >> On Wed, Mar 25, 2009 at 11:19 PM, Tamir Kamara >wrote: > >> > >>> Thanks for all who replies. > >>> > >>> Stefan - > >>> I'm unable to see how converting IP ranges to network masks would help > >>> because different ranges can have the same network mask and with that I > >>> still have to do a comparison of two fields: the searched IP with > >>> from-IP&mask. > >>> > >>> Pig - I'm familier with pig and use it many times, but I can't think of > a > >>> way to write a pig script that will do this type of "join". I'll ask > the > >>> pig > >>> users group. > >>> > >>> The search file is indeed large in terms of the amount records. > However, I > >>> don't see this as an issue yet, because I'm still puzzeled with how to > >>> write > >>> the job in plain MR. 
The join code is looking for an exact match in the > >>> keys > >>> and that is not what I need. Would a custom comperator which will look > for > >>> a > >>> match in between the ranges, be the right choice to do this ? > >>> > >>> Thanks, > >>> Tamir > >>> > >>> On Wed, Mar 25, 2009 at 5:23 PM, jason hadoop >>> >wrote: > >>> > >>> > If the search file data set is large, the issue becomes ensuring that > >>> only > >>> > the required portion of search file is actually read, and that those > >>> reads > >>> > are ordered, in search file's key order. > >>> > > >>> > If the data set is small, most any of the common patterns will work. > >>> > > >>> > I haven't looked at pig for a while, does pig now use indexes in map > >>> files, > >>> > and take into account that a data set is sorted? > >>> > Out of the box, the map side join code, org.apache.hadoop.mapred.join > >>> will > >>> > do a decent job of this, but the entire search file set will be read. > >>> > To stop reading the entire search file, a record reader or join type, > >>> would > >>> > need to be put together to: > >>> > a) skip to the first key of interest, using the index if available > >>> > b) finish when the last possible key of interest has been delivered. > >>> > > >>> > On Wed, Mar 25, 2009 at 6:05 AM, John Lee > >>> wrote: > >>> > > >>> > > In addition to other suggestions, you could also take a look at > >>> > > building a Cascading job with a custom Joiner class. > >>> > > > >>> > > - John > >>> > > > >>> > > On Tue, M
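As a much-simplified illustration of the grouping idea (not the stateful range comparator Jason describes), a value-grouping comparator can compare Text keys on their first tab-separated token only, so a range key and the lookup keys that share that token arrive in a single reduce call; untested sketch, class name made up, registered in the driver with conf.setOutputValueGroupingComparator(FirstTokenGroupingComparator.class):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class FirstTokenGroupingComparator extends WritableComparator {
    public FirstTokenGroupingComparator() {
        super(Text.class, true);   // true = instantiate keys for deserialization
    }

    public int compare(WritableComparable a, WritableComparable b) {
        // Group solely on the first tab-separated token of the key.
        return firstToken(((Text) a).toString()).compareTo(firstToken(((Text) b).toString()));
    }

    private static String firstToken(String key) {
        int tab = key.indexOf('\t');
        return tab < 0 ? key : key.substring(0, tab);
    }
}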
Re: Hardware - please sanity check?
make sure you also have a fast switch, since you will be transmitting data across your network and this will come to bite you otherwise (roughly, you need one core per hadoop-related job, each mapper, task tracker etc; the per-core memory may be too small if you are doing anything memory-intensive. we have 8-core boxes with 50 -- 33 GB RAM and 8 x 1 TB disks on each one; one box however just has 16 GB of RAM and it routinely falls over when we run jobs on it) Miles 2009/4/2 tim robertson : > Hi all, > > I am not a hardware guy but about to set up a 10 node cluster for some > processing of (mostly) tab files, generating various indexes and > researching HBase, Mahout, pig, hive etc. > > Could someone please sanity check that these specs look sensible? > [I know 4 drives would be better but price is a factor (second hand > not an option, hosting is not either as there is very good bandwidth > provided)] > > Something along the lines of: > > Dell R200 (8GB is max memory) > Quad Core Intel® Xeon® X3360, 2.83GHz, 2x6MB Cache, 1333MHz FSB > 8GB Memory, DDR2, 800MHz (4x2GB Dual Ranked DIMMs) > 2x 500GB 7.200 rpm 3.5-inch SATA Hard Drive > > > Dell R300 (can be expanded to 24GB RAM) > Quad Core Intel® Xeon® X3363, 2.83GHz, 2x6M Cache, 1333MHz FS > 8GB Memory, DDR2, 667MHz (2x4GB Dual Ranked DIMMs) > 2x 500GB 7.200 rpm 3.5-inch SATA Hard Drive > > > If there is a major flaw please can you let me know. > > Thanks, > > Tim > (not a hardware guy ;o) > -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Re: hdfs-doubt
It seems that either NameNode or DataNode is not started. You can take a look at log files, and paste related lines here. 2009/3/29 deepya : > > Thanks, > > I have another doubt.I just want to run the examples and see how it works.I > am trying to copy the file from local file system to hdfs using the command > > bin/hadoop fs -put conf input > > It is giving the following error. > 09/03/29 05:50:54 INFO hdfs.DFSClient: Exception in createBlockOutputStream > java.net.NoRouteToHostException: No route to host > 09/03/29 05:50:54 INFO hdfs.DFSClient: Abandoning block > blk_-5733385806393158149_1053 > > I have only one datanode in my cluster and my replication factor is also > 1(as configured in the conf file in hadoop-site.xml).Can you please provide > the solution for this. > > > Thanks in advance > > SreeDeepya > > > sree deepya wrote: >> >> Hi sir/madam, >> >> I am SreeDeepya,doing Mtech in IIIT.I am working on a project named cost >> effective and scalable storage server.Our main goal of the project is to >> be >> able to store images in a server and the data can be upto petabytes.For >> that >> we are using HDFS.I am new to hadoop and am just learning about it. >> Can you please clarify some of the doubts I have. >> >> >> >> At present we configured one datanode and one namenode.Jobtracker is >> running >> on namenode and tasktracker on datanode.Now namenode also acts as >> client.Like we are writing programs in the namenode to store or retrieve >> images.My doubts are >> >> 1.Can we put the client and namenode in two separate systems? >> >> 2.Can we access the images from the datanode of hadoop cluster from a >> machine in which hdfs is not there? >> >> 3.At present we may not have data upto petabytes but will be in >> gigabytes.Is >> hadoop still efficient in storing mega and giga bytes of data >> >> >> Thanking you, >> >> Yours sincerely, >> SreeDeepya >> >> > > -- > View this message in context: > http://www.nabble.com/hdfs-doubt-tp22764502p22765332.html > Sent from the Hadoop core-user mailing list archive at Nabble.com. > > -- M. Raşit ÖZDAŞ
Re: a doubt regarding an appropriate file system
I'm not sure if I understood you correctly, but if so, there is a previous thread that helps explain what Hadoop is intended to be and what disadvantages it has: http://www.nabble.com/Using-HDFS-to-serve-www-requests-td22725659.html 2009/4/2 Rasit OZDAS > > If performance is important to you, Look at the quote from a previous thread: > > "HDFS is a file system for distributed storage typically for distributed > computing scenerio over hadoop. For office purpose you will require a SAN > (Storage Area Network) - an architecture to attach remote computer storage > devices to servers in such a way that, to the operating system, the devices > appear as locally attached. Or you can even go for AmazonS3, if the data is > really authentic. For opensource solution related to SAN, you can go with > any of the linux server distributions (eg. RHEL, SuSE) or Solaris (ZFS + > zones) or perhaps best plug-n-play solution (non-open-source) would be a Mac > Server + XSan." > > --nitesh > > Besides, I wouldn't use HDFS for this purpose. > > Rasit
Hardware - please sanity check?
Hi all, I am not a hardware guy but about to set up a 10 node cluster for some processing of (mostly) tab files, generating various indexes and researching HBase, Mahout, pig, hive etc. Could someone please sanity check that these specs look sensible? [I know 4 drives would be better but price is a factor (second hand not an option, hosting is not either as there is very good bandwidth provided)] Something along the lines of: Dell R200 (8GB is max memory) Quad Core Intel® Xeon® X3360, 2.83GHz, 2x6MB Cache, 1333MHz FSB 8GB Memory, DDR2, 800MHz (4x2GB Dual Ranked DIMMs) 2x 500GB 7.200 rpm 3.5-inch SATA Hard Drive Dell R300 (can be expanded to 24GB RAM) Quad Core Intel® Xeon® X3363, 2.83GHz, 2x6M Cache, 1333MHz FS 8GB Memory, DDR2, 667MHz (2x4GB Dual Ranked DIMMs) 2x 500GB 7.200 rpm 3.5-inch SATA Hard Drive If there is a major flaw please can you let me know. Thanks, Tim (not a hardware guy ;o)
Re: a doubt regarding an appropriate file system
If performance is important to you, Look at the quote from a previous thread: "HDFS is a file system for distributed storage typically for distributed computing scenerio over hadoop. For office purpose you will require a SAN (Storage Area Network) - an architecture to attach remote computer storage devices to servers in such a way that, to the operating system, the devices appear as locally attached. Or you can even go for AmazonS3, if the data is really authentic. For opensource solution related to SAN, you can go with any of the linux server distributions (eg. RHEL, SuSE) or Solaris (ZFS + zones) or perhaps best plug-n-play solution (non-open-source) would be a Mac Server + XSan." --nitesh Besides, I wouldn't use HDFS for this purpose. Rasit
Re: A bizarre problem in reduce method
Hi, Husain, 1. You can use a boolean control in your code. boolean hasAlreadyOned = false; int iCount = 0; String sValue; while (values.hasNext()) { sValue = values.next().toString(); iCount++; if (sValue.equals("1")) hasAlreadyOned = true; if (!hasAlreadyOned) sValues += "\t" + sValue; } ... 2. You're actually controlling for 3 elements, not 2. You should use if (iCount == 1) 2009/4/1 Farhan Husain > Hello All, > > I am facing some problems with a reduce method I have written which I > cannot > understand. Here is the method: > >@Override >public void reduce(Text key, Iterator values, > OutputCollector output, Reporter reporter) >throws IOException { >String sValues = ""; >int iCount = 0; >String sValue; >while (values.hasNext()) { >sValue = values.next().toString(); >iCount++; >sValues += "\t" + sValue; > >} >sValues += "\t" + iCount; >//if (iCount == 2) >output.collect(key, new Text(sValues)); >} > > The output of the code is like the following: > > D0U0:GraduateStudent0lehigh:GraduateStudent111 > D0U0:GraduateStudent1lehigh:GraduateStudent111 > D0U0:GraduateStudent10lehigh:GraduateStudent111 > D0U0:GraduateStudent100lehigh:GraduateStudent11 > 1 > D0U0:GraduateStudent101lehigh:GraduateStudent1 > D0U0:GraduateCourse0121 > D0U0:GraduateStudent102lehigh:GraduateStudent11 > 1 > D0U0:GraduateStudent103lehigh:GraduateStudent11 > 1 > D0U0:GraduateStudent104lehigh:GraduateStudent11 > 1 > D0U0:GraduateStudent105lehigh:GraduateStudent11 > 1 > > The problem is there cannot be so many 1's in the output value. The output > which I expect should be like this: > > D0U0:GraduateStudent0lehigh:GraduateStudent1 > D0U0:GraduateStudent1lehigh:GraduateStudent1 > D0U0:GraduateStudent10lehigh:GraduateStudent1 > D0U0:GraduateStudent100lehigh:GraduateStudent1 > D0U0:GraduateStudent101lehigh:GraduateStudent > D0U0:GraduateCourse02 > D0U0:GraduateStudent102lehigh:GraduateStudent1 > D0U0:GraduateStudent103lehigh:GraduateStudent1 > D0U0:GraduateStudent104lehigh:GraduateStudent1 > D0U0:GraduateStudent105lehigh:GraduateStudent1 > > If I do not append the iCount variable to sValues string, I get the > following output: > > D0U0:GraduateStudent0lehigh:GraduateStudent > D0U0:GraduateStudent1lehigh:GraduateStudent > D0U0:GraduateStudent10lehigh:GraduateStudent > D0U0:GraduateStudent100lehigh:GraduateStudent > D0U0:GraduateStudent101lehigh:GraduateStudent > D0U0:GraduateCourse0 > D0U0:GraduateStudent102lehigh:GraduateStudent > D0U0:GraduateStudent103lehigh:GraduateStudent > D0U0:GraduateStudent104lehigh:GraduateStudent > D0U0:GraduateStudent105lehigh:GraduateStudent > > This confirms that there is no 1's after each of those values (which I > already know from the intput data). I do not know why the output is > distorted like that when I append the iCount to sValues (like the given > code). Can anyone help in this regard? > > Now comes the second problem which is equally perplexing. Actually, the > reduce method which I want to run is like the following: > >@Override >public void reduce(Text key, Iterator values, > OutputCollector output, Reporter reporter) >throws IOException { >String sValues = ""; >int iCount = 0; >String sValue; >while (values.hasNext()) { >sValue = values.next().toString(); >iCount++; >sValues += "\t" + sValue; > >} >sValues += "\t" + iCount; >if (iCount == 2) >output.collect(key, new Text(sValues)); >} > > I want to output only if "values" contained only two elements. 
By looking > at > the output above you can see that there is at least one such key values > pair > where values have exactly two elements. But when I run the code I get an > empty output file. Can anyone solve this? > > I have tried many versions of the code (e.g. using StringBuffer instead of > String, using flags instead of integer count) but nothing works. Are these > problems due to bugs in Hadoop? Please let me know any kind of solution you > can think of. > > Thanks, > > -- > Mohammad Farhan Husain > Research Assistant > Department of Computer Science > Erik Jonsson School of Engineering and Computer Science > University of T
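For reference, here is a compilable version of the flag-and-count reduce suggested at the top of this message, with the missing sValues declaration filled in (old mapred API; the class name is illustrative). It keeps the iCount == 2 condition from the question, i.e. only keys with exactly two values are emitted; whether this resolves the stray 1's depends on where they come from, which the thread does not settle:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class FlagReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    StringBuilder sValues = new StringBuilder();   // reset for every key
    boolean hasAlreadyOned = false;
    int iCount = 0;
    while (values.hasNext()) {
      String sValue = values.next().toString();
      iCount++;
      if (sValue.equals("1")) {
        hasAlreadyOned = true;                     // remember we saw a "1", but do not append it
      } else if (!hasAlreadyOned) {
        sValues.append("\t").append(sValue);       // append only values seen before any "1"
      }
    }
    if (iCount == 2) {                             // emit only keys that had exactly two values
      output.collect(key, new Text(sValues.toString()));
    }
  }
}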
Re: what change to be done in OutputCollector to print custom writable object
There is also a good alternative: we use ObjectInputFormat and ObjectRecordReader. With them you can easily do File <-> Object translations. I can send a code sample to your mail if you want.
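For the original question, the part that usually matters is the value class itself rather than the OutputCollector: TextOutputFormat prints keys and values with toString(), and the framework needs write()/readFields() to move the object between map and reduce. A minimal sketch (MyRecord is an illustrative name, not the ObjectInputFormat/ObjectRecordReader classes mentioned above):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

public class MyRecord implements Writable {   // illustrative custom value type
  private long id;
  private String label = "";

  public void write(DataOutput out) throws IOException {
    out.writeLong(id);                        // serialization used between map and reduce
    out.writeUTF(label);
  }

  public void readFields(DataInput in) throws IOException {
    id = in.readLong();
    label = in.readUTF();
  }

  @Override
  public String toString() {                  // this is what TextOutputFormat writes to part-0000*
    return id + "\t" + label;
  }
}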
Re: Cannot resolve Datanode address in slave file
you should append id_dsa.pub to ~/.ssh/authorized_keys on the other computers from the cluster. if your home directory is shared by all of them (e.g., you're mounting /home/$user using NFS), "cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys" might work. however, if it isn't shared, you might use 'ssh-copy-id' to all your nodes (or append id_dsa.pub manually). 2009/4/2 Puri, Aseem > Hi Rasit, > > Now I got a different problem when I start my Hadoop server the slave > datanode do not accept password. It gives message permission denied. > >I have also use the commands on all m/c > > $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa > $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys > > But my problem is not solved. Any suggestion? > > > -Original Message- > From: Rasit OZDAS [mailto:rasitoz...@gmail.com] > Sent: Thursday, April 02, 2009 6:49 PM > To: core-user@hadoop.apache.org > Subject: Re: Cannot resolve Datonode address in slave file > > Hi, Sim, > > I've two suggessions, if you haven't done yet: > > 1. Check if your other hosts can ssh to master. > 2. Take a look at logs of other hosts. > > 2009/4/2 Puri, Aseem > > > > > Hi > > > >I have a small Hadoop cluster with 3 machines. One is my > > NameNode/JobTracker + DataNode/TaskTracker and other 2 are > > DataNode/TaskTracker. So I have made all 3 as slave. > > > > > > > > In slave file I have put names of all there machines as: > > > > > > > > master > > > > slave > > > > slave1 > > > > > > > > When I start Hadoop cluster it always start DataNode/TaskTracker on last > > slave in the list and do not start DataNode/TaskTracker on other two > > machines. Also I got the message as: > > > > > > > > slave1: > > > > : no address associated with name > > > > : no address associated with name > > > > slave1: starting datanode, logging to > > /home/HadoopAdmin/hadoop/bin/../logs/hadoo > > > > p-HadoopAdmin-datanode-ie11dtxpficbfise.out > > > > > > > > If I change the order in slave file like this: > > > > > > > > slave > > > > slave1 > > > > master > > > > > > > > then DataNode/TaskTracker on master m/c starts and not on other two. > > > > > > > > Please tell how I should solve this problem. > > > > > > > > Sim > > > > > > > -- > M. Raşit ÖZDAŞ > -- Guilherme msn: guigermog...@hotmail.com homepage: http://germoglio.googlepages.com
Re: HELP: I wanna store the output value into a list not write to the disk
Thanks for your reply. Let me explain more clearly: since MapReduce is just one step of my program, I need to use the output of the reduce for further computation, so I do not want to write the output to disk but to get it as a collection or list in RAM. If it is written directly to disk, I have to read it back into RAM again. You mentioned a special file format; will you please show me what it is, and give an example if possible? Thank you so much. Rasit OZDAS wrote: > > Hi, hadoop is normally designed to write to disk. There are a special file > format, which writes output to RAM instead of disk. > But I don't have an idea if it's what you're looking for. > If what you said exists, there should be a mechanism which sends output as > objects rather than file content across computers, as far as I know there is > no such feature yet. > > Good luck. > > 2009/4/2 andy2005cst > >> >> I need to use the output of the reduce, but I don't know how to do. >> use the wordcount program as an example if i want to collect the >> wordcount >> into a hashtable for further use, how can i do? >> the example just show how to let the result onto disk. >> myemail is : andy2005...@gmail.com >> looking forward your help. thanks a lot. >> -- >> View this message in context: >> http://www.nabble.com/HELP%3A-I-wanna-store-the-output-value-into-a-list-not-write-to-the-disk-tp22844277p22844277.html >> Sent from the Hadoop core-user mailing list archive at Nabble.com. >> >> > > > -- > M. Raşit ÖZDAŞ > > -- View this message in context: http://www.nabble.com/HELP%3A-I-wanna-store-the-output-value-into-a-list-not-write-to-the-disk-tp22844277p22848070.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
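Until such a feature exists, the usual workaround is exactly the read-back the poster wants to avoid: run the job, then load the part files into memory before the next step. A hedged sketch, assuming TextOutputFormat with word<TAB>count lines and a result small enough to fit in RAM; the class and method names are illustrative:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadCounts {
  // Reads every part-* file under the job's output directory into a HashMap.
  public static Map<String, Long> load(Path outputDir) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Map<String, Long> counts = new HashMap<String, Long>();
    FileStatus[] files = fs.listStatus(outputDir);
    if (files == null) return counts;                       // output dir missing
    for (FileStatus status : files) {
      if (!status.getPath().getName().startsWith("part-")) continue;
      BufferedReader reader =
          new BufferedReader(new InputStreamReader(fs.open(status.getPath())));
      String line;
      while ((line = reader.readLine()) != null) {
        String[] fields = line.split("\t");                 // word \t count
        if (fields.length == 2) counts.put(fields[0], Long.parseLong(fields[1]));
      }
      reader.close();
    }
    return counts;
  }
}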
Re: Running MapReduce without setJar
Yes. As additional info, you can use this to just submit the job without waiting for it to finish: JobClient client = new JobClient(conf); RunningJob job = client.submitJob(conf); 2009/4/1 javateck javateck > you can run from java program: > >JobConf conf = new JobConf(MapReduceWork.class); > >// setting your params > >JobClient.runJob(conf); > >
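submitJob() returns a RunningJob handle immediately, so the caller can poll progress and completion itself instead of blocking inside runJob(). A small sketch (MapReduceWork is the job class from the quoted message; the polling interval is arbitrary):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class SubmitAndPoll {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MapReduceWork.class);
    // ... set input/output paths, mapper, reducer, key/value classes ...

    JobClient client = new JobClient(conf);
    RunningJob job = client.submitJob(conf);   // returns immediately, unlike runJob()

    while (!job.isComplete()) {                // poll instead of blocking
      System.out.printf("map %.0f%% reduce %.0f%%%n",
          job.mapProgress() * 100, job.reduceProgress() * 100);
      Thread.sleep(5000);
    }
    System.out.println(job.isSuccessful() ? "job succeeded" : "job failed");
  }
}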
Re: Reducer side output
I think it's about that you have no right to access to the path you define. Did you try it with a path under your user directory? You can change permissions from console. 2009/4/1 Nagaraj K > Hi, > > I am trying to do a side-effect output along with the usual output from the > reducer. > But for the side-effect output attempt, I get the following error. > > org.apache.hadoop.fs.permission.AccessControlException: > org.apache.hadoop.fs.permission.AccessControlException: Permission denied: > user=nagarajk, access=WRITE, inode="":hdfs:hdfs:rwxr-xr-x >at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native > Method) >at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) >at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) >at java.lang.reflect.Constructor.newInstance(Constructor.java:513) >at > org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:90) >at > org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:52) >at > org.apache.hadoop.dfs.DFSClient$DFSOutputStream.(DFSClient.java:2311) >at org.apache.hadoop.dfs.DFSClient.create(DFSClient.java:477) >at > org.apache.hadoop.dfs.DistributedFileSystem.create(DistributedFileSystem.java:178) >at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:503) >at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:484) >at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:391) >at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:383) >at > org.yahoo.delphi.DecisionTree$AttStatReducer.reduce(DecisionTree.java:1310) >at > org.yahoo.delphi.DecisionTree$AttStatReducer.reduce(DecisionTree.java:1275) >at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:319) >at > org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2206) > > My reducer code; > = > conf.set("group_stat", "some_path"); // Set during the configuration of > jobconf object > > public static class ReducerClass extends MapReduceBase implements > Reducer { >FSDataOutputStream part=null; >JobConf conf; > >public void reduce(Text key, Iterator values, > OutputCollector output, > Reporter reporter) throws IOException { >double i_sum = 0.0; >while (values.hasNext()) { >i_sum += ((Double) values.next()).valueOf(); >} >String [] fields = key.toString().split(SEP); >if(fields.length==1) >{ > if(part==null) > { > FileSystem fs = FileSystem.get(conf); >String jobpart = > conf.get("mapred.task.partition"); >part = fs.create(new > Path(conf.get("group_stat"),"/part-000"+jobpart)) ; // Failing here > } > part.writeBytes(fields[0] +"\t" + i_sum +"\n"); > >} >else >output.collect(key, new DoubleWritable(i_sum)); >} > } > > Can you guys let me know what I am doing wrong here!. > > Thanks > Nagaraj K > -- M. Raşit ÖZDAŞ
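Two further notes, both hedged guesses from the stack trace above: new Path(parent, "/part-000" + jobpart) uses an absolute child path, so the file resolves against the root of HDFS (hence inode "", owned by hdfs) rather than under group_stat, which alone would explain the permission error; and side files are usually safer under the task's work directory, which the framework owns and promotes next to the regular part-* files on commit. A sketch of the failing block rewritten that way (old mapred API, assuming FileOutputFormat.getWorkOutputPath is available in your release):

// Drop-in for the block that fails above; also needs
// import org.apache.hadoop.mapred.FileOutputFormat;
if (part == null) {
    FileSystem fs = FileSystem.get(conf);
    Path workDir = FileOutputFormat.getWorkOutputPath(conf);   // task-private dir, promoted on commit
    String jobpart = conf.get("mapred.task.partition");
    part = fs.create(new Path(workDir, "group_stat-part-000" + jobpart));  // note: no leading '/'
}
part.writeBytes(fields[0] + "\t" + i_sum + "\n");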
RE: Cannot resolve Datanode address in slave file
Hi Rasit, Now I got a different problem when I start my Hadoop server the slave datanode do not accept password. It gives message permission denied. I have also use the commands on all m/c $ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa $ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys But my problem is not solved. Any suggestion? -Original Message- From: Rasit OZDAS [mailto:rasitoz...@gmail.com] Sent: Thursday, April 02, 2009 6:49 PM To: core-user@hadoop.apache.org Subject: Re: Cannot resolve Datonode address in slave file Hi, Sim, I've two suggessions, if you haven't done yet: 1. Check if your other hosts can ssh to master. 2. Take a look at logs of other hosts. 2009/4/2 Puri, Aseem > > Hi > >I have a small Hadoop cluster with 3 machines. One is my > NameNode/JobTracker + DataNode/TaskTracker and other 2 are > DataNode/TaskTracker. So I have made all 3 as slave. > > > > In slave file I have put names of all there machines as: > > > > master > > slave > > slave1 > > > > When I start Hadoop cluster it always start DataNode/TaskTracker on last > slave in the list and do not start DataNode/TaskTracker on other two > machines. Also I got the message as: > > > > slave1: > > : no address associated with name > > : no address associated with name > > slave1: starting datanode, logging to > /home/HadoopAdmin/hadoop/bin/../logs/hadoo > > p-HadoopAdmin-datanode-ie11dtxpficbfise.out > > > > If I change the order in slave file like this: > > > > slave > > slave1 > > master > > > > then DataNode/TaskTracker on master m/c starts and not on other two. > > > > Please tell how I should solve this problem. > > > > Sim > > -- M. Raşit ÖZDAŞ
Re: Strange Reduce Behavior
Yes, we've built a local version of a Hadoop process. We needed 500 input files in Hadoop to reach the speed of the local process; total time was 82 seconds on a cluster of 6 machines. And I think that is good performance compared with other distributed processing systems. 2009/4/2 jason hadoop > 3) The framework is designed for working on large clusters of machines > where > there needs to be a little delay between operations to avoid massive > network > loading spikes, and the initial setup of the map task execution environment > on a machine, and the initial setup of the reduce task execution > environment > take a bit of time. > In production jobs, these delays and setup times are lost in the overall > task run time. > In the small test job case the delays and setup times will be the bulk of > the time spent executing the test. > > >
Re: Cannot resolve Datanode address in slave file
Hi, Sim, I've two suggessions, if you haven't done yet: 1. Check if your other hosts can ssh to master. 2. Take a look at logs of other hosts. 2009/4/2 Puri, Aseem > > Hi > >I have a small Hadoop cluster with 3 machines. One is my > NameNode/JobTracker + DataNode/TaskTracker and other 2 are > DataNode/TaskTracker. So I have made all 3 as slave. > > > > In slave file I have put names of all there machines as: > > > > master > > slave > > slave1 > > > > When I start Hadoop cluster it always start DataNode/TaskTracker on last > slave in the list and do not start DataNode/TaskTracker on other two > machines. Also I got the message as: > > > > slave1: > > : no address associated with name > > : no address associated with name > > slave1: starting datanode, logging to > /home/HadoopAdmin/hadoop/bin/../logs/hadoo > > p-HadoopAdmin-datanode-ie11dtxpficbfise.out > > > > If I change the order in slave file like this: > > > > slave > > slave1 > > master > > > > then DataNode/TaskTracker on master m/c starts and not on other two. > > > > Please tell how I should solve this problem. > > > > Sim > > -- M. Raşit ÖZDAŞ
Re: reducer in M-R
Since every file name is different, you have a unique key for each map output. That means, every iterator has only one element. So you won't need to search for a given name. But it's possible that I misunderstood you. 2009/4/2 Vishal Ghawate > Hi , > > I just wanted to know that values parameter passed to the reducer is always > iterator , > > Which is then used to iterate through for particular key > > Now I want to use file name as key and file content as its value > > So how can I set the parameters in the reducer > > > > Can anybody please help me on this. > > > DISCLAIMER > == > This e-mail may contain privileged and confidential information which is > the property of Persistent Systems Ltd. It is intended only for the use of > the individual or entity to which it is addressed. If you are not the > intended recipient, you are not authorized to read, retain, copy, print, > distribute or use this message. If you have received this communication in > error, please notify the sender and delete all copies of this message. > Persistent Systems Ltd. does not accept any liability for virus infected > mails. > -- M. Raşit ÖZDAŞ
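If the map emits one record per file (file name as key, whole content as value, e.g. via a whole-file record reader), each reducer key will indeed have a single value as described above. With the stock TextInputFormat, a sketch like the following emits one (file name, line) pair per line instead, so the reducer sees all lines of a file grouped under one key; "map.input.file" is set by the framework for FileInputFormat splits (old mapred API, illustrative class name):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class FileNameKeyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private Text fileName = new Text();

  @Override
  public void configure(JobConf conf) {
    fileName.set(conf.get("map.input.file"));   // full path of the file this split came from
  }

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // one (file, line) pair per record; the reducer then iterates over
    // all lines of a given file under a single key
    output.collect(fileName, line);
  }
}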
Re: HELP: I wanna store the output value into a list not write to the disk
Hi, hadoop is normally designed to write to disk. There are a special file format, which writes output to RAM instead of disk. But I don't have an idea if it's what you're looking for. If what you said exists, there should be a mechanism which sends output as objects rather than file content across computers, as far as I know there is no such feature yet. Good luck. 2009/4/2 andy2005cst > > I need to use the output of the reduce, but I don't know how to do. > use the wordcount program as an example if i want to collect the wordcount > into a hashtable for further use, how can i do? > the example just show how to let the result onto disk. > myemail is : andy2005...@gmail.com > looking forward your help. thanks a lot. > -- > View this message in context: > http://www.nabble.com/HELP%3A-I-wanna-store-the-output-value-into-a-list-not-write-to-the-disk-tp22844277p22844277.html > Sent from the Hadoop core-user mailing list archive at Nabble.com. > > -- M. Raşit ÖZDAŞ
Re: mapreduce problem
MultipleOutputFormat would be what you want. It supplies multiple files as output. I can paste some code here if you want.. 2009/4/2 Vishal Ghawate > Hi, > > I am new to map-reduce programming model , > > I am writing a MR that will process the log file and results are written > to > different files on hdfs based on some values in the log file > >The program is working fine even if I haven't done any > processing in reducer ,I am not getting how to use reducer for solving my > problem efficiently > > Can anybody please help me on this. > > > > > DISCLAIMER > == > This e-mail may contain privileged and confidential information which is > the property of Persistent Systems Ltd. It is intended only for the use of > the individual or entity to which it is addressed. If you are not the > intended recipient, you are not authorized to read, retain, copy, print, > distribute or use this message. If you have received this communication in > error, please notify the sender and delete all copies of this message. > Persistent Systems Ltd. does not accept any liability for virus infected > mails. > -- M. Raşit ÖZDAŞ
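In case it helps before any code gets pasted: with the old mapred API, the usual way to send records to different files based on a value in the record is to subclass MultipleTextOutputFormat and override generateFileNameForKeyValue. A hedged sketch; the tab-separated "category" field is only an assumption about the log format:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

public class CategoryOutputFormat extends MultipleTextOutputFormat<Text, Text> {
  @Override
  protected String generateFileNameForKeyValue(Text key, Text value, String name) {
    // e.g. value = "ERROR\t...": this record goes to ERROR-part-00000, etc.
    String category = value.toString().split("\t", 2)[0];
    return category + "-" + name;
  }
}

// In the driver:
//   conf.setOutputFormat(CategoryOutputFormat.class);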
Re: Amazon Elastic MapReduce
On Apr 2, 2009, at 3:13 AM, zhang jianfeng wrote: seems like I should pay for additional money, so why not configure a hadoop cluster in EC2 by myself. This already have been automatic using script. Not everyone has a support team or an operations team or enough time to learn how to do it themselves. You're basically paying for the fact that the only thing you need to know to use Hadoop is: 1) Be able to write the Java classes. 2) Press the "go" button on a webpage somewhere. You could use Hadoop with little-to-zero systems knowledge (and without institutional support), which would always make some researchers happy. Brian On Thu, Apr 2, 2009 at 4:09 PM, Miles Osborne wrote: ... and only in the US Miles 2009/4/2 zhang jianfeng : Does it support pig ? On Thu, Apr 2, 2009 at 3:47 PM, Chris K Wensel wrote: FYI Amazons new Hadoop offering: http://aws.amazon.com/elasticmapreduce/ And Cascading 1.0 supports it: http://www.cascading.org/2009/04/amazon-elastic-mapreduce.html cheers, ckw -- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/ -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
reducer in M-R
Hi , I just wanted to know that values parameter passed to the reducer is always iterator , Which is then used to iterate through for particular key Now I want to use file name as key and file content as its value So how can I set the parameters in the reducer Can anybody please help me on this. DISCLAIMER == This e-mail may contain privileged and confidential information which is the property of Persistent Systems Ltd. It is intended only for the use of the individual or entity to which it is addressed. If you are not the intended recipient, you are not authorized to read, retain, copy, print, distribute or use this message. If you have received this communication in error, please notify the sender and delete all copies of this message. Persistent Systems Ltd. does not accept any liability for virus infected mails.
HELP: I wanna store the output value into a list not write to the disk
I need to use the output of the reduce, but I don't know how to do. use the wordcount program as an example if i want to collect the wordcount into a hashtable for further use, how can i do? the example just show how to let the result onto disk. myemail is : andy2005...@gmail.com looking forward your help. thanks a lot. -- View this message in context: http://www.nabble.com/HELP%3A-I-wanna-store-the-output-value-into-a-list-not-write-to-the-disk-tp22844277p22844277.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: hadoop job controller
You can get the job progress and completion status through an instance of org.apache.hadoop.mapred.JobClient . If you really want to use perl I guess you still need to write a small java application that talks to perl and JobClient on the other side. Theres also some support for Thrift in the hadoop contrib package, but I'm not sure if it exposes any job client related methods. On Thu, Apr 2, 2009 at 12:46 AM, Elia Mazzawi wrote: > > I'm writing a perl program to submit jobs to the cluster, > then wait for the jobs to finish, and check that they have completed > successfully. > > I have some questions, > > this shows what is running > ./hadoop job -list > > and this shows the completion > ./hadoop job -status job_200903061521_0045 > > > but i want something that just says pass / fail > cause with these, i have to check that its done then check that its 100% > completed. > > which must exist since the webapp jobtracker.jsp knows what is what. > > also a controller like that must have been written many times already, are > there any around? > > Regards, > Elia >
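A small JobClient wrapper along those lines might look like the sketch below, printing a single PASS/FAIL/RUNNING token that a Perl script can parse from stdout. This assumes a 0.19-era API with JobID.forName and JobClient.getJob(JobID); older releases take the job id as a plain String:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;

public class JobStatusCheck {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf();               // picks up the cluster configuration
    JobClient client = new JobClient(conf);     // connects to the JobTracker
    RunningJob job = client.getJob(JobID.forName(args[0]));  // e.g. job_200903061521_0045

    if (job == null) {
      System.out.println("UNKNOWN");            // no such job id
    } else if (!job.isComplete()) {
      System.out.println("RUNNING");
    } else {
      System.out.println(job.isSuccessful() ? "PASS" : "FAIL");
    }
  }
}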
mapreduce problem
Hi, I am new to map-reduce programming model , I am writing a MR that will process the log file and results are written to different files on hdfs based on some values in the log file The program is working fine even if I haven't done any processing in reducer ,I am not getting how to use reducer for solving my problem efficiently Can anybody please help me on this. DISCLAIMER == This e-mail may contain privileged and confidential information which is the property of Persistent Systems Ltd. It is intended only for the use of the individual or entity to which it is addressed. If you are not the intended recipient, you are not authorized to read, retain, copy, print, distribute or use this message. If you have received this communication in error, please notify the sender and delete all copies of this message. Persistent Systems Ltd. does not accept any liability for virus infected mails.
Re: Join Variation
.. and is not yet available as an alpha book chapter. Any chance uploading it? On Thu, Apr 2, 2009 at 4:21 AM, jason hadoop wrote: > Just for fun, chapter 9 in my book is a work through of solving this class > of problem. > > > On Thu, Mar 26, 2009 at 7:07 AM, jason hadoop wrote: > >> For the classic map/reduce job, you have 3 requirements. >> >> 1) a comparator that provide the keys in ip address order, such that all >> keys in one of your ranges, would be contiguous, when sorted with the >> comparator >> 2) a partitioner that ensures that all keys that should be together end up >> in the same partition >> 3) and output value grouping comparator that considered all keys in a >> specified range equal. >> >> The comparator only sorts by the first part of the key, the search file has >> a 2 part key begin/end the input data has just a 1 part key. >> >> A partitioner that new ahead of time the group sets in your search set, in >> the way that the tera sort example works would be ideal: >> ie: it builds an index of ranges from your seen set so that the ranges get >> rougly evenly split between your reduces. >> This requires a pass over the search file to write out a summary file, >> which is then loaded by the partitioner. >> >> The output value grouping comparator, will get the keys in order of the >> first token, and will define the start of a group by the presence of a 2 >> part key, and consider the group ended when either another 2 part key >> appears, or when the key value is larger than the second part of the >> starting key. - This does require that the grouping comparator maintain >> state. >> >> At this point, your reduce will be called with the first key in the key >> equivalence group of (3), with the values of all of the keys >> >> In your map, any address that is not in a range of interest is not passed >> to output.collect. >> >> For the map side join code, you have to define a comparator on the key type >> that defines your definition of equivalence and ordering, and call >> WritableComparator.define( Key.class, comparator.class ), to force the join >> code to use your comparator. >> >> For tables with duplicates, per the key comparator, in map side join, your >> map fuction will receive a row for every permutation of the duplicate keys: >> if you have one table a, 1; a, 2; and another table with a, 3; a, 4; your >> map will receive4 rows, a, 1, 3; a, 1, 4; a, 2, 3; a, 2, 4; >> >> >> >> On Wed, Mar 25, 2009 at 11:19 PM, Tamir Kamara wrote: >> >>> Thanks for all who replies. >>> >>> Stefan - >>> I'm unable to see how converting IP ranges to network masks would help >>> because different ranges can have the same network mask and with that I >>> still have to do a comparison of two fields: the searched IP with >>> from-IP&mask. >>> >>> Pig - I'm familier with pig and use it many times, but I can't think of a >>> way to write a pig script that will do this type of "join". I'll ask the >>> pig >>> users group. >>> >>> The search file is indeed large in terms of the amount records. However, I >>> don't see this as an issue yet, because I'm still puzzeled with how to >>> write >>> the job in plain MR. The join code is looking for an exact match in the >>> keys >>> and that is not what I need. Would a custom comperator which will look for >>> a >>> match in between the ranges, be the right choice to do this ? 
>>> >>> Thanks, >>> Tamir >>> >>> On Wed, Mar 25, 2009 at 5:23 PM, jason hadoop >> >wrote: >>> >>> > If the search file data set is large, the issue becomes ensuring that >>> only >>> > the required portion of search file is actually read, and that those >>> reads >>> > are ordered, in search file's key order. >>> > >>> > If the data set is small, most any of the common patterns will work. >>> > >>> > I haven't looked at pig for a while, does pig now use indexes in map >>> files, >>> > and take into account that a data set is sorted? >>> > Out of the box, the map side join code, org.apache.hadoop.mapred.join >>> will >>> > do a decent job of this, but the entire search file set will be read. >>> > To stop reading the entire search file, a record reader or join type, >>> would >>> > need to be put together to: >>> > a) skip to the first key of interest, using the index if available >>> > b) finish when the last possible key of interest has been delivered. >>> > >>> > On Wed, Mar 25, 2009 at 6:05 AM, John Lee >>> wrote: >>> > >>> > > In addition to other suggestions, you could also take a look at >>> > > building a Cascading job with a custom Joiner class. >>> > > >>> > > - John >>> > > >>> > > On Tue, Mar 24, 2009 at 7:33 AM, Tamir Kamara >>> > > wrote: >>> > > > Hi, >>> > > > >>> > > > We need to implement a Join with a between operator instead of an >>> > equal. >>> > > > What we are trying to do is search a file for a key where the key >>> falls >>> > > > between two fields in the search file like this: >>> > > > >>> > > > main file (ip, a, b): >>> > > > (80, zz, yy) >>> > > > (125, vv, bb)
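A sketch of requirement (1) and the WritableComparator.define call from the message above, assuming the IPs are carried as longs in an illustrative IpKey class (range rows hold begin/end, lookup rows hold ip/-1); note that define() takes a comparator instance, not a class. The partitioner (2) and the stateful grouping comparator (3) are left out:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class IpKey implements WritableComparable<IpKey> {
  public long begin;        // the IP, or the start of a range
  public long end = -1;     // range end, or -1 for a plain lookup key

  public void write(DataOutput out) throws IOException {
    out.writeLong(begin);
    out.writeLong(end);
  }
  public void readFields(DataInput in) throws IOException {
    begin = in.readLong();
    end = in.readLong();
  }
  public int compareTo(IpKey other) {               // order by the first part only
    return begin < other.begin ? -1 : (begin == other.begin ? 0 : 1);
  }

  // Comparator that looks only at the first field, as point (1) asks.
  public static class FirstFieldComparator extends WritableComparator {
    public FirstFieldComparator() { super(IpKey.class, true); }
    @Override
    public int compare(WritableComparable a, WritableComparable b) {
      return ((IpKey) a).compareTo((IpKey) b);
    }
  }

  static {
    // Force the sort/join machinery to use our comparator for IpKey.
    WritableComparator.define(IpKey.class, new FirstFieldComparator());
  }
}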
Announcing Amazon Elastic MapReduce
Dear Hadoop community, We are excited today to introduce the public beta of Amazon Elastic MapReduce, a web service that enables developers to easily and cost-effectively process vast amounts of data. It utilizes a hosted Hadoop (0.18.3) running on the web-scale infrastructure of Amazon Elastic Compute Cloud (Amazon EC2) and Amazon Simple Storage Service (Amazon S3). Using Amazon Elastic MapReduce, you can instantly provision as much or as little capacity as you like to perform data-intensive tasks for applications such as web indexing, data mining, log file analysis, machine learning, financial analysis, scientific simulation, and bioinformatics research. Amazon Elastic MapReduce lets you focus on crunching or analyzing your data without having to worry about time-consuming set-up, management or tuning of Hadoop clusters or the compute capacity upon which they sit. Working with the service is easy: Develop your processing application using our samples or by building your own, upload your data to Amazon S3, use the AWS Management Console or APIs to specify the number and type of instances you want, and click "Create Job Flow." We do the rest, running Hadoop over the number of specified instances, providing progress monitoring, and delivering the output to Amazon S3. We will be posting several patches to Hadoop today and are hoping to become a part of this exciting community moving forward. We hope this new service will prove a powerful tool for your data processing needs and becomes a great development platform to build sophisticated data processing applications. You can sign up and start using the service today at http://aws.amazon.com/elasticmapreduce. Our forums are available to ask any questions or suggest features: http://developer.amazonwebservices.com/connect/forum.jspa?forumID=52 Sincerely, The Amazon Web Services Team
Re: Amazon Elastic MapReduce
seems like I should pay for additional money, so why not configure a hadoop cluster in EC2 by myself. This already have been automatic using script. On Thu, Apr 2, 2009 at 4:09 PM, Miles Osborne wrote: > ... and only in the US > > Miles > > 2009/4/2 zhang jianfeng : > > Does it support pig ? > > > > > > On Thu, Apr 2, 2009 at 3:47 PM, Chris K Wensel wrote: > > > >> > >> FYI > >> > >> Amazons new Hadoop offering: > >> http://aws.amazon.com/elasticmapreduce/ > >> > >> And Cascading 1.0 supports it: > >> http://www.cascading.org/2009/04/amazon-elastic-mapreduce.html > >> > >> cheers, > >> ckw > >> > >> -- > >> Chris K Wensel > >> ch...@wensel.net > >> http://www.cascading.org/ > >> http://www.scaleunlimited.com/ > >> > >> > > > > > > -- > The University of Edinburgh is a charitable body, registered in > Scotland, with registration number SC005336. >
Re: Amazon Elastic MapReduce
... and only in the US Miles 2009/4/2 zhang jianfeng : > Does it support pig ? > > > On Thu, Apr 2, 2009 at 3:47 PM, Chris K Wensel wrote: > >> >> FYI >> >> Amazons new Hadoop offering: >> http://aws.amazon.com/elasticmapreduce/ >> >> And Cascading 1.0 supports it: >> http://www.cascading.org/2009/04/amazon-elastic-mapreduce.html >> >> cheers, >> ckw >> >> -- >> Chris K Wensel >> ch...@wensel.net >> http://www.cascading.org/ >> http://www.scaleunlimited.com/ >> >> > -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
Re: Amazon Elastic MapReduce
Does it support pig ? On Thu, Apr 2, 2009 at 3:47 PM, Chris K Wensel wrote: > > FYI > > Amazons new Hadoop offering: > http://aws.amazon.com/elasticmapreduce/ > > And Cascading 1.0 supports it: > http://www.cascading.org/2009/04/amazon-elastic-mapreduce.html > > cheers, > ckw > > -- > Chris K Wensel > ch...@wensel.net > http://www.cascading.org/ > http://www.scaleunlimited.com/ > >
Amazon Elastic MapReduce
FYI Amazons new Hadoop offering: http://aws.amazon.com/elasticmapreduce/ And Cascading 1.0 supports it: http://www.cascading.org/2009/04/amazon-elastic-mapreduce.html cheers, ckw -- Chris K Wensel ch...@wensel.net http://www.cascading.org/ http://www.scaleunlimited.com/
Cannot resolve Datanode address in slave file
Hi I have a small Hadoop cluster with 3 machines. One is my NameNode/JobTracker + DataNode/TaskTracker and other 2 are DataNode/TaskTracker. So I have made all 3 as slave. In slave file I have put names of all there machines as: master slave slave1 When I start Hadoop cluster it always start DataNode/TaskTracker on last slave in the list and do not start DataNode/TaskTracker on other two machines. Also I got the message as: slave1: : no address associated with name : no address associated with name slave1: starting datanode, logging to /home/HadoopAdmin/hadoop/bin/../logs/hadoo p-HadoopAdmin-datanode-ie11dtxpficbfise.out If I change the order in slave file like this: slave slave1 master then DataNode/TaskTracker on master m/c starts and not on other two. Please tell how I should solve this problem. Sim
Re: Strange Reduce Behavior
1) when running in pseudo-distributed mode, only 2 values for the reduce count are accepted, 0 and 1. All other positive values are mapped to 1. 2) The single reduce task spawned has several steps, and each of these steps account for about 1/3 of it's overall progress. The 1st third, is collecting all of the map outputs, each of which in your example has 0 records. The 2nd third, is to produce a single sorted set from all of the map outputs. The 3rd third, is to reduce the sorted set. So, you get progress reports, 3) The framework is designed for working on large clusters of machines where there needs to be a little delay between operations to avoid massive network loading spikes, and the initial setup of the map task execution environment on a machine, and the initial setup of the reduce task execution environment take a bit of time. In production jobs, these delays and setup times are lost in the overall task run time. In the small test job case the delays and setup times will be the bulk of the time spent executing the test. On Wed, Apr 1, 2009 at 10:31 PM, Sriram Krishnan wrote: > Hi all, > > I am new to this list, and relatively new to Hadoop itself. So if this > question has been answered before, please point me to the right thread. > > We are investigating the use of Hadoop for processing of geo-spatial data. > In its most basic form, out data is laid out in files, where every row has > the format - > {index, x, y, z, } > > I am writing some basic Hadoop programs for selecting data based on x and y > values, and everything appears to work correctly. I have Hadoop 0.19.1 > running in pseudo-distributed on a Linux box. However, as a academic > exercise, I began writing some code that simply reads every single line of > my input file, and does nothing else - I hoped to gain an understanding on > how long it would take for Hadoop/HDFS to read the entire data set. My Map > and Reduce functions are as follows: > >public void map(LongWritable key, Text value, >OutputCollector output, >Reporter reporter) throws IOException { > >// do nothing >return; >} > >public void reduce(Text key, Iterator values, > OutputCollector output, > Reporter reporter) throws IOException { >// do nothing >return; >} > > My understanding is that the above map function will produce no > intermediate key/value pairs - and hence, the reduce function should take no > time at all. However, when I run this code, Hadoop seems to spend an > inordinate amount of time in the reduce phase. Here is the Hadoop output - > > 09/04/01 20:11:12 INFO mapred.JobClient: Running job: job_200904011958_0005 > 09/04/01 20:11:13 INFO mapred.JobClient: map 0% reduce 0% > 09/04/01 20:11:21 INFO mapred.JobClient: map 3% reduce 0% > 09/04/01 20:11:25 INFO mapred.JobClient: map 7% reduce 0% > > 09/04/01 20:13:17 INFO mapred.JobClient: map 96% reduce 0% > 09/04/01 20:13:20 INFO mapred.JobClient: map 100% reduce 0% > 09/04/01 20:13:30 INFO mapred.JobClient: map 100% reduce 4% > 09/04/01 20:13:35 INFO mapred.JobClient: map 100% reduce 7% > ... 
> 09/04/01 20:14:05 INFO mapred.JobClient: map 100% reduce 25% > 09/04/01 20:14:10 INFO mapred.JobClient: map 100% reduce 29% > 09/04/01 20:14:15 INFO mapred.JobClient: Job complete: > job_200904011958_0005 > 09/04/01 20:14:15 INFO mapred.JobClient: Counters: 15 > 09/04/01 20:14:15 INFO mapred.JobClient: File Systems > 09/04/01 20:14:15 INFO mapred.JobClient: HDFS bytes read=1787707732 > 09/04/01 20:14:15 INFO mapred.JobClient: Local bytes read=10 > 09/04/01 20:14:15 INFO mapred.JobClient: Local bytes written=932 > 09/04/01 20:14:15 INFO mapred.JobClient: Job Counters > 09/04/01 20:14:15 INFO mapred.JobClient: Launched reduce tasks=1 > 09/04/01 20:14:15 INFO mapred.JobClient: Launched map tasks=27 > 09/04/01 20:14:15 INFO mapred.JobClient: Data-local map tasks=27 > 09/04/01 20:14:15 INFO mapred.JobClient: Map-Reduce Framework > 09/04/01 20:14:15 INFO mapred.JobClient: Reduce input groups=1 > 09/04/01 20:14:15 INFO mapred.JobClient: Combine output records=0 > 09/04/01 20:14:15 INFO mapred.JobClient: Map input records=44967808 > 09/04/01 20:14:15 INFO mapred.JobClient: Reduce output records=0 > 09/04/01 20:14:15 INFO mapred.JobClient: Map output bytes=2 > 09/04/01 20:14:15 INFO mapred.JobClient: Map input bytes=1787601210 > 09/04/01 20:14:15 INFO mapred.JobClient: Combine input records=0 > 09/04/01 20:14:15 INFO mapred.JobClient: Map output records=1 > 09/04/01 20:14:15 INFO mapred.JobClient: Reduce input records=0 > > As you can see, the reduce phase takes a little more than a minute - which > is about a third of the execution time. However, the number of reduce tasks > spawned is 1, and reduce input records is 0. Why does it spend so long on > the reduce ph
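One practical note for a measurement job like this: if the reduce genuinely does nothing, it can be skipped entirely by setting the number of reduce tasks to zero, in which case map output is written straight to the output directory and there is no shuffle or sort phase to wait for. A sketch of such a driver (old mapred API; ScanOnly and the use of IdentityMapper are illustrative):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class ScanOnly {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ScanOnly.class);
    conf.setJobName("scan-only");
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    conf.setMapperClass(IdentityMapper.class);   // or the do-nothing mapper quoted above
    conf.setOutputKeyClass(LongWritable.class);  // TextInputFormat's key/value types
    conf.setOutputValueClass(Text.class);
    conf.setNumReduceTasks(0);                   // no shuffle, no sort, no reduce
    JobClient.runJob(conf);
  }
}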