Re: Multiple NIC Cards
On Tue, Jun 9, 2009 at 11:59 AM, Steve Loughran <ste...@apache.org> wrote: John Martyniak wrote: When I run either of those on either of the two machines, it is trying to resolve against the DNS servers configured for the external addresses for the box. Here is the result: Server: xxx.xxx.xxx.69 Address: xxx.xxx.xxx.69#53 OK. In an ideal world, each NIC has a different hostname. Now, that confuses code that assumes a host has exactly one hostname, not zero or two, and I'm not sure how well Hadoop handles the 2+ situation (I know it doesn't like 0, but hey, it's a distributed application). With separate hostnames, you set Hadoop up to work on the inner addresses and give out the inner hostnames of the jobtracker and namenode. As a result, all traffic to the master nodes should be routed on the internal network. A subtle issue that I also run into is full versus partial host names. Even though all my configuration files reference fully qualified host names like server1.domain.com, the name node web interface will redirect people to http://server1, probably because that is the system hostname. In my case it is not a big deal, but it is something to consider during setup.
Re: Multiple NIC Cards
Also, if you are using a topology rack map, make sure your script responds correctly to every possible hostname or IP address as well. On Tue, Jun 9, 2009 at 1:19 PM, John Martyniak <j...@avum.com> wrote: It seems that this is the issue, as there are several posts related to the same topic but with no resolution. I guess the thing of it is that it shouldn't use the hostname of the machine at all. If I tell it the master is x and it has an IP address of x.x.x.102, that should be good enough. And if that isn't the case, then I should be able to specify which network adapter to use for the address that it is going to look up, whether by DNS or by /etc/hosts. Because I suspect the problem is that I have named the machine duey..com but have told Hadoop that the machine is called duey-direct. Is there a workaround in 0.19.1? I am using this with Nutch, so I don't have the option to upgrade at this time. -John
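A quick way to see which names the JVM, and therefore Hadoop, will pick for a box is to ask it directly. A small diagnostic sketch; "eth1" is a placeholder for your internal interface, and org.apache.hadoop.net.DNS is the helper the datanode uses when dfs.datanode.dns.interface is set:

    import java.net.InetAddress;
    import org.apache.hadoop.net.DNS;

    public class HostnameCheck {
        public static void main(String[] args) throws Exception {
            InetAddress local = InetAddress.getLocalHost();
            System.out.println("hostname:  " + local.getHostName());
            System.out.println("canonical: " + local.getCanonicalHostName());
            System.out.println("address:   " + local.getHostAddress());
            // Resolve a name for a specific NIC, the way the datanode does.
            System.out.println("per-NIC:   " + DNS.getDefaultHost("eth1", "default"));
        }
    }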
Re: Monitoring hadoop?
On Fri, Jun 5, 2009 at 10:10 AM, Brian Bockelman <bbock...@cse.unl.edu> wrote: Hey Anthony, Look into hooking your Hadoop system into Ganglia; this produces about 20 real-time statistics per node. Hadoop also does JMX, which hooks into more enterprise-y monitoring systems. Brian On Jun 5, 2009, at 8:55 AM, Anthony McCulley wrote: Hey all, I'm currently tasked to come up with a web/Flex-based visualization/monitoring system for a cloud system using Hadoop as part of a university research project. I was wondering if I could elicit some feedback from all of you with regards to: - If you were an engineer of a cloud system running Hadoop, what information would you be interested in capturing, viewing, monitoring, etc.? - Is there any sort of real-time stats or monitoring currently available for Hadoop? If so, is it in a web-friendly format? Thanks in advance, - Anthony I have an ever-growing set of Cacti templates for Hadoop that use JMX as the back end. http://www.jointhegrid.com/hadoop/
Re: Blocks amount is stuck in statistics
On Mon, May 25, 2009 at 6:34 AM, Stas Oskin stas.os...@gmail.com wrote: Hi. Ok, was too eager to report :). It got sorted out after some time. Regards. 2009/5/25 Stas Oskin stas.os...@gmail.com: Hi. I just did an erase of a large test folder with about 20,000 blocks, and created a new one. I copied about 128 blocks, and fsck reflects it correctly, but the NN statistics still show the old number. It does show the currently used space correctly. Any idea if this is a known issue, and whether it was fixed? Also, does it have any influence on operations? I'm using Hadoop 0.18.3. Thanks. This is something I am dealing with in my Cacti graphs. Shameless plug: http://www.jointhegrid.com/hadoop Some Hadoop JMX attributes are like traditional SNMP gauges: a JMX request directly reads a counter variable and returns it. FSDatasetStatus Remaining is implemented like that. Some are implemented like SNMP counters: DataNodeStatistics BytesRead is like that; the number keeps increasing. Other values are sampled. Since I monitor with 5-minute intervals, I am set up like this:
dfs.class=org.apache.hadoop.metrics.spi.NullContextWithUpdateThread
dfs.period=300
DataNodeStatistics BlocksWritten is a gauge that is sampled. So if your sample period is 5 minutes, it will be 5 minutes before the stats show up, and in another 5 minutes they are replaced. Using NullContextWithUpdateThread with the combination of sampled and non-sampled variables is not exactly what I want, as some data is 5 minutes behind the real-time data, but NullContextWithUpdateThread is working well for me.
Re: ssh issues
Pankil, I used to be very confused by Hadoop and SSH keys. SSH is NOT required. Each component can be started by hand. This gem of knowledge is hidden away in the hundreds of DIGG-style articles entitled 'HOW TO RUN A HADOOP MULTI-MASTER CLUSTER!' The SSH keys are only required by the shell scripts that ship with Hadoop, like start-all.sh. They are wrappers that kick off other scripts on a list of nodes. I PERSONALLY dislike using SSH keys as a software component and believe they should only be used by administrators. We chose the Cloudera distribution. http://www.cloudera.com/distribution. A big factor behind this was the simple init.d scripts they provide. Each Hadoop component has its own start script: hadoop-namenode, hadoop-datanode, etc. My suggestion is taking a look at the Cloudera startup scripts. Even if you decide not to use the distribution, you can take a look at their startup scripts and fit them to your needs. On Fri, May 22, 2009 at 10:34 AM, hmar...@umbc.edu wrote: Steve, Security through obscurity is always a good practice from a development standpoint and one of the reasons why tricking you out is an easy task. Please, keep hiding relevant details from people in order to keep everyone smiling. Hal Pankil Doshi wrote: Well, I made SSH with passphrases, as the systems on which I need to log in require SSH with passphrases, and those systems have to be part of my cluster. And so I need a way where I can specify -i /path/to/key/ and the passphrase to Hadoop beforehand. Pankil Well, you are trying to manage a system whose security policy is incompatible with Hadoop's current shell scripts. If you push out the configs and manage the lifecycle using other tools, this becomes a non-issue. Don't raise the topic of HDFS security to your ops team, though, as they will probably be unhappy about what is currently on offer. -steve
Re: Optimal Filesystem (and Settings) for HDFS
Do not forget 'tune2fs -m 2'. By default the reserved block percentage gets set at 5%. With 1 TB disks we got 33 GB more usable space. Talk about instant savings! On Mon, May 18, 2009 at 1:31 PM, Alex Loddengaard a...@cloudera.com wrote: I believe Yahoo! uses ext3, though I know other people have said that XFS has performed better in various benchmarks. We use ext3, though we haven't done any benchmarks to prove its worth. This question has come up a lot, so I think it'd be worth doing a benchmark and writing up the results. I haven't been able to find a detailed analysis / benchmark writeup comparing various filesystems, unfortunately. Hope this helps, Alex On Mon, May 18, 2009 at 8:54 AM, Bob Schulze b.schu...@ecircle.com wrote: We are currently rebuilding our cluster - does anybody have recommendations on the underlying file system? Just standard ext3? I could imagine that the block size could be larger than its default... Thx for any tips, Bob
Re: Linking against Hive in Hadoop development tree
On Fri, May 15, 2009 at 5:05 PM, Aaron Kimball aa...@cloudera.com wrote: Hi all, For the database import tool I'm writing (Sqoop; HADOOP-5815), in addition to uploading data into HDFS and using MapReduce to load/transform the data, I'd like to integrate more closely with Hive. Specifically, to run the CREATE TABLE statements needed to automatically inject table definitions into Hive's metastore for the data files that Sqoop loads into HDFS. Doing this requires linking against Hive in some way (either directly by using one of their API libraries, or loosely by piping commands into a Hive instance). In either case, there's a dependency there. I was hoping someone on this list with more Ivy experience than I knows the best way to make this happen. Hive isn't in the Maven2 repository that Hadoop pulls most of its dependencies from. It might be necessary for Sqoop to have access to a full build of Hive. It doesn't seem like a good idea to check that binary distribution into Hadoop svn, but I'm not sure what's the most expedient alternative. Is it acceptable to just require that developers who wish to compile/test/run Sqoop have a separate standalone Hive deployment and a proper HIVE_HOME variable? This would keep our source repo clean. The downside here is that it makes it difficult to test Hive-specific integration functionality with Hudson, and requires extra leg-work of developers. Thanks, - Aaron Kimball Aaron, I have a similar situation. I am using the GPL GeoIP library as a Hive UDF. Due to Apache/GPL licensing issues the code would not be compatible. Currently my build process references all of the Hive lib/*.jar files. It does not really need all of them, but not being exactly sure what I need, I reference them all. I was thinking one option is to run a Git system; this way I can integrate my patch into my forked Hive. I see your problem, though: you have a few Hive entry points: 1) JDBC, 2) the Hive Thrift server, 3) scripting, 4) the Java API. The JDBC and Thrift entry points should be the lightest, in that a few jar files would make up the entry point rather than the entire Hive distribution. Although, now that Hive has had two releases, maybe Hive should be in Maven. With that, Hive could be an optional or a mandatory ant target for Sqoop.
Re: sub 60 second performance
On Mon, May 11, 2009 at 12:08 PM, Todd Lipcon t...@cloudera.com wrote: In addition to Jason's suggestion, you could also see about setting some of Hadoop's directories to subdirs of /dev/shm. If the dataset is really small, it should be easy to re-load it onto the cluster if it's lost, so even putting dfs.data.dir in /dev/shm might be worth trying. You'll probably also want mapred.local.dir in /dev/shm. Note that if in fact you don't have enough RAM to do this, you'll start swapping and your performance will suck like crazy :) That said, you may find that even with all storage in RAM your jobs are still too slow. Hadoop isn't optimized for this kind of small-job performance quite yet. You may find that task setup time dominates the job. I think it's entirely reasonable to shoot for sub-60-second jobs down the road, and I'd find it interesting to hear what the results are now. Hope you report back! -Todd On Sun, May 10, 2009 at 2:30 PM, Matt Bowyer <mattbowy...@googlemail.com> wrote: Hi, I am trying to do 'on demand map reduce' - something which will return in reasonable time (a few seconds). My dataset is relatively small and can fit into my datanode's memory. Is it possible to keep a block in the datanode's memory so on the next job the response will be much quicker? The majority of the time spent during the job run appears to be during the 'HDFS_BYTES_READ' part of the job. I have tried using setNumTasksToExecutePerJvm, but the block still seems to be cleared from memory after the job. thanks! Also, if your data set is small, you can reduce overhead (and parallelism) by lowering the number of mappers and reducers: -Dmapred.map.tasks=11 -Dmapred.reduce.tasks=3 Or maybe even go as low as: -Dmapred.map.tasks=1 -Dmapred.reduce.tasks=1 I use this tactic on jobs with small data sets where the processing time is much less than the overhead of starting multiple mappers/reducers and shuffling data.
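For jobs configured in code rather than on the command line, the equivalent of those -D flags is a sketch like this (the helper class is hypothetical; note that the map task count is only a hint, since the InputFormat makes the final split decision):

    import org.apache.hadoop.mapred.JobConf;

    public class SmallJob {
        // Shrink a job to minimal parallelism for tiny inputs.
        public static JobConf shrink(JobConf conf) {
            conf.setNumMapTasks(1);    // hint only, same as -Dmapred.map.tasks=1
            conf.setNumReduceTasks(1); // exact, same as -Dmapred.reduce.tasks=1
            return conf;
        }
    }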
Re: how to improve the Hadoop's capability of dealing with small files
2009/5/7 Jeff Hammerbacher ham...@cloudera.com: Hey, You can read more about why small files are difficult for HDFS at http://www.cloudera.com/blog/2009/02/02/the-small-files-problem. Regards, Jeff 2009/5/7 Piotr Praczyk piotr.prac...@gmail.com: If you want to use many small files, they probably have the same purpose and structure? Why not use HBase instead of raw HDFS? Many small files would be packed together and the problem would disappear. cheers Piotr 2009/5/7 Jonathan Cao jonath...@rockyou.com: There are at least two design choices in Hadoop that have implications for your scenario. 1. All the HDFS metadata is stored in namenode memory -- the memory size is one limitation on how many small files you can have. 2. The efficiency of the map/reduce paradigm dictates that each mapper/reducer job has enough work to offset the overhead of spawning the job. It relies on each task reading a contiguous chunk of data (typically 64MB); your small-file situation will change those efficient sequential reads into a larger number of inefficient random reads. Of course, small is a relative term. Jonathan 2009/5/6 陈桂芬 chenguifen...@163.com: Hi: In my application, there are many small files. But Hadoop is designed to deal with many large files. I want to know why Hadoop doesn't support small files very well and where the bottleneck is. And what can I do to improve Hadoop's capability of dealing with small files? Thanks. When the small file problem comes up, most of the talk centers around the inode table being in memory. The Cloudera blog points out something more: Furthermore, HDFS is not geared up to efficiently accessing small files: it is primarily designed for streaming access of large files. Reading through small files normally causes lots of seeks and lots of hopping from datanode to datanode to retrieve each small file, all of which is an inefficient data access pattern. My application attempted to load 9,000 6 KB files using a single-threaded application and FSDataOutputStream objects to write directly to Hadoop files. My plan was to have Hadoop merge these files in the next step. I had to abandon this plan because the process was taking hours. I knew HDFS had a small file problem, but I never realized that I could not do this the 'old fashioned way'. I merged the files locally instead, and uploading a few larger files gave great throughput. Small files are not just a permanent-storage issue; avoiding them is a serious optimization.
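One workaround for the upload case described above is to pack the small files into a single container file on the client and write that to HDFS in one stream. A minimal sketch using a SequenceFile keyed by file name (the Text/BytesWritable layout and paths are illustrative, not a standard format):

    import java.io.File;
    import java.io.FileInputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class PackSmallFiles {
        // Usage: PackSmallFiles /user/me/packed.seq file1 file2 ...
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, new Path(args[0]), Text.class, BytesWritable.class);
            try {
                for (int i = 1; i < args.length; i++) {
                    File f = new File(args[i]);
                    byte[] buf = new byte[(int) f.length()];
                    FileInputStream in = new FileInputStream(f);
                    try {
                        int off = 0;
                        while (off < buf.length) { // read() may return short counts
                            off += in.read(buf, off, buf.length - off);
                        }
                    } finally {
                        in.close();
                    }
                    writer.append(new Text(f.getName()), new BytesWritable(buf));
                }
            } finally {
                writer.close();
            }
        }
    }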
Cacti Templates for Hadoop
For those of you who would like to graph the Hadoop JMX variables with Cacti, I have created Cacti templates and data input scripts. Currently the package gathers and graphs the following information from the NameNode:
- Blocks Total
- Files Total
- Capacity Used/Capacity Free
- Live Data Nodes/Dead Data Nodes
Obligatory screenshot: http://www.jointhegrid.com/svn/hadoop-cacti-jtg/trunk/doc/cacti-hadoop.png Setup instructions here: http://www.jointhegrid.com/svn/hadoop-cacti-jtg/trunk/doc/INSTALL.txt Project trunk: http://www.jointhegrid.com/svn/hadoop-cacti-jtg/trunk/ My next steps are creating graphs for other Hadoop components (DataNode, RPC stats). If you want to hurry this process along, drop me a line for extra motivation :) Edward
Re: What do we call Hadoop+HBase+Lucene+Zookeeper+etc....
'Cloud computing' is a hot term. According to the definition provided by Wikipedia (http://en.wikipedia.org/wiki/Cloud_computing), Hadoop+HBase+Lucene+Zookeeper fits some of the criteria, but not well. Hadoop is scalable, and with HOD it is dynamically scalable. I do not think (Hadoop+HBase+Lucene+Zookeeper) can be used for 'utility computing', as managing the stack and getting started is quite a complex process. Also, this stack runs best on a LAN with high-speed interlinks, while historically the Cloud is composed of WAN links. An implication of cloud computing is that different services would be running in different geographical locations, which is not how Hadoop is normally deployed. I believe 'Apache Grid Stack' would be a more fitting name. http://en.wikipedia.org/wiki/Grid_computing Grid computing (or the use of computational grids) is the application of several computers to a single problem at the same time — usually to a scientific or technical problem that requires a great number of computer processing cycles or access to large amounts of data. Grid computing via the Wikipedia definition describes exactly what Hadoop does. Without Amazon S3 and EC2, Hadoop does not fit well into 'cloud computing', IMHO.
Re: Getting free and used space
You can also pull these variables from the namenode and datanode with JMX. I am doing this to graph them with Cacti. Both the JMX READ/WRITE and READ users can access this variable. On Tue, Apr 28, 2009 at 8:29 AM, Stas Oskin stas.os...@gmail.com wrote: Hi. Any idea if the getDiskStatus() function requires superuser rights? Or can it work for any user? Thanks. 2009/4/9 Aaron Kimball aa...@cloudera.com: You can insert this property into the jobconf, or specify it on the command line, e.g.: -D hadoop.job.ugi=username,group,group,group. - Aaron On Wed, Apr 8, 2009 at 7:04 AM, Brian Bockelman bbock...@cse.unl.edu wrote: Hey Stas, What we do locally is apply the latest patch for this issue: https://issues.apache.org/jira/browse/HADOOP-4368 This makes getUsed (actually, it switches to FileSystem.getStatus) not a privileged action. As far as specifying the user ... gee, I can't think of it off the top of my head. It's a variable you can insert into the JobConf, but I'd have to poke around Google or the code to remember which one (I try to not override it if possible). Brian On Apr 8, 2009, at 8:51 AM, Stas Oskin wrote: Hi. Thanks for the explanation. Now for the easier part - how do I specify the user when connecting? :) Is it a config-file-level or run-time-level setting? Regards. 2009/4/8 Brian Bockelman bbock...@cse.unl.edu: Hey Stas, Did you try this as a privileged user? There might be some permission errors... in most of the released versions, getUsed() is only available to the Hadoop superuser. It may be that the exception isn't propagating correctly. Brian On Apr 8, 2009, at 3:13 AM, Stas Oskin wrote: Hi. I'm trying to use the API to get the overall used and free space. I tried the function getUsed(), but it always returns 0. Any idea? Thanks.
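With that patch applied (or on a release that includes it), reading the numbers from Java is short. A sketch; the hadoop.job.ugi line shows the property Aaron mentions and is optional:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FsStatus;

    public class DfsSpace {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("hadoop.job.ugi", "username,group"); // act as this user/group
            FileSystem fs = FileSystem.get(conf);
            FsStatus status = fs.getStatus(); // unprivileged after HADOOP-4368
            System.out.println("capacity:  " + status.getCapacity());
            System.out.println("used:      " + status.getUsed());
            System.out.println("remaining: " + status.getRemaining());
        }
    }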
Re: Hadoop / MySQL
On Wed, Apr 29, 2009 at 10:19 AM, Stefan Podkowinski spo...@gmail.com wrote: If you have trouble loading your data into MySQL using INSERTs or LOAD DATA, consider that MySQL supports CSV directly using the CSV storage engine. The only thing you have to do is to copy your Hadoop-produced CSV file into the MySQL data directory and issue a 'flush tables' command to have MySQL flush its caches and pick up the new file. It's very simple, and you have the full set of SQL commands available just as with InnoDB or MyISAM. What you don't get with the CSV engine are indexes and foreign keys. Can't have it all, can you? Stefan On Tue, Apr 28, 2009 at 9:23 PM, Bill Habermaas b...@habermaas.us wrote: Excellent discussion. Thank you Todd. You're forgiven for being off topic (at least by me). :) Bill -Original Message- From: Todd Lipcon [mailto:t...@cloudera.com] Sent: Tuesday, April 28, 2009 2:29 PM To: core-user Subject: Re: Hadoop / MySQL Warning: derailing a bit into MySQL discussion below, but I think enough people have similar use cases that it's worth discussing this even though it's gotten off-topic. 2009/4/28 tim robertson timrobertson...@gmail.com: So we ended up with 2 DBs - DB1 we insert to, prepare and do batch processing - DB2 serving the read-only web app This is a pretty reasonable and common architecture. Depending on your specific setup, instead of flip-flopping between DB1 and DB2, you could actually pull snapshots of MyISAM tables off DB1 and load them onto other machines. As long as you've flushed the tables with a read lock, MyISAM tables are transferable between machines (e.g. via rsync). Obviously this can get a bit hairy, but it's a nice trick to consider for this kind of workflow. Why did we end up with this? Because of locking on writes that kill reads, as you say... basically you can't insert when a read is happening on MyISAM, as it locks the whole table. This is only true if you have binary logging enabled. Otherwise, MyISAM supports concurrent inserts with reads. That said, binary logging is necessary if you have any slaves. If you're loading bulk data from the result of a MapReduce job, you might be better off not using replication and simply loading the bulk data to each of the serving replicas individually. Turning off the binary logging will also double your write speed (LOAD DATA writes the entirety of the data to the binary log as well as to the table). InnoDB has row-level locking to get around this, but in our experience (at the time we had 130 million records) it just didn't work either. You're quite likely to be hitting the InnoDB autoincrement lock if you have an autoincrement primary key here. There are fixes for this in MySQL 5.1. The best solution is to avoid autoincrement primary keys and use LOAD DATA for these kinds of bulk loads, as others have suggested. We spent €10,000 for the supposed European expert on MySQL from their professional services and were unfortunately very disappointed. Seems such large tables are just problematic with MySQL. We are now very much looking into Lucene technologies for search and Hadoop for reporting and datamining type operations. SOLR does a lot of what our DB does for us. Yep - oftentimes MySQL is not the correct solution, but other times it can be just what you need. If you already have competencies with MySQL and a good access layer from your serving tier, it's often easier to stick with MySQL than add a new technology into the mix. So with MyISAM...
here is what we learnt: Only the very latest MySQL versions (beta still, I think) support more than 4G of memory for indexes (you really, really need the index in memory, and where possible the FK for joins in the index too). As far as I know, any 64-bit MySQL instance will use more than 4G without trouble. MySQL has differing join strategies between InnoDB and MyISAM, so be aware. I don't think this is true. Joining happens at the MySQL execution layer, which is above the storage engine API. The same join strategies are available for both. For a particular query, InnoDB and MyISAM tables may end up providing a different query plan based on the statistics that are collected, but given properly analyzed tables, the strategies will be the same. This is how MySQL allows inter-storage-engine joins. If one engine is providing a better query plan, you can use query hints to enforce that plan (see STRAIGHT_JOIN and FORCE INDEX, for example). An undocumented feature of MyISAM is that you can create memory buffers for single indexes. In the my.cnf:
taxon_concept_cache.key_buffer_size=3990M
(for some reason you have to drop a little under 4G) then in the DB run:
cache index taxon_concept in taxon_concept_cache;
load index into cache taxon_concept;
This allows for making sure an index gets into memory for sure. But for most use cases and a properly configured machine you're
Re: Hadoop / MySQL
On Wed, Apr 29, 2009 at 2:48 PM, Todd Lipcon t...@cloudera.com wrote: On Wed, Apr 29, 2009 at 7:19 AM, Stefan Podkowinski <spo...@gmail.com> wrote: If you have trouble loading your data into MySQL using INSERTs or LOAD DATA, consider that MySQL supports CSV directly using the CSV storage engine. The only thing you have to do is to copy your Hadoop-produced CSV file into the MySQL data directory and issue a 'flush tables' command to have MySQL flush its caches and pick up the new file. It's very simple, and you have the full set of SQL commands available just as with InnoDB or MyISAM. What you don't get with the CSV engine are indexes and foreign keys. Can't have it all, can you? The CSV storage engine is definitely an interesting option, but it has a couple of downsides: - Like you mentioned, you don't get indexes. This seems like a huge deal to me - the reason you want to load data into MySQL instead of just keeping it in Hadoop is so you can service real-time queries. Not having any indexing kind of defeats the purpose there. This is especially true since MySQL only supports nested-loop joins, and there's no way of attaching metadata to a CSV table to say hey look, this table is already in sorted order so you can use a merge join. - Since CSV is a text-based format, it's likely to be a lot less compact than a proper table. For example, a Unix timestamp is likely to be ~10 characters vs 4 bytes in a packed table. - I'm not aware of many people actually using CSV for anything except tutorials and training. Since it's not in heavy use by big MySQL users, I wouldn't build a production system around it. Here's a wacky idea that I might be interested in hacking up if anyone's interested: what if there were a MyISAMTableOutputFormat in Hadoop? You could use this as a reducer output and have it actually output .frm and .myd files onto HDFS, then simply hdfs -get them onto DB servers for realtime serving. Sounds like a fun hack I might be interested in if people would find it useful. Building the .myi indexes in Hadoop would be pretty killer as well, but potentially more difficult. -Todd The .frm and .myd files are binary and platform-dependent. You cannot even move them from 32-bit to 64-bit. Generating them without native tools would be difficult. Moving them around with HDFS might have merit, although rsync could accomplish the same thing. Derby might be a better candidate for something like this, since the underlying DB is cross-platform.
Re: max value for a dataset
On Mon, Apr 20, 2009 at 7:24 PM, Brian Bockelman <bbock...@cse.unl.edu> wrote: Hey Jason, Wouldn't this be avoided if you used a combiner to also perform the max() operation? A minimal amount of data would be written over the network. I can't remember if the map output gets written to disk first, then the combine applied, or if the combine is applied and then the data is written to disk. I suspect the latter, but it'd be a big difference. However, the original poster mentioned he was using hbase/pig -- certainly, there's some better way to perform max() in hbase/pig? This list probably isn't the right place to ask if you are using those technologies; I'd suspect they do something more clever (certainly, you're performing a SQL-like operation in MapReduce; not always the best way to approach this type of problem). Brian On Apr 20, 2009, at 8:25 PM, jason hadoop wrote: The Hadoop Framework requires that a Map Phase be run before the Reduce Phase. By doing the initial 'reduce' in the map, a much smaller volume of data has to flow across the network to the reduce tasks. But yes, this could simply be done by using an IdentityMapper and then have all of the work done in the reduce. On Mon, Apr 20, 2009 at 4:26 AM, Shevek had...@anarres.org wrote: On Sat, 2009-04-18 at 09:57 -0700, jason hadoop wrote: The traditional approach would be a Mapper class that maintained a member variable in which you kept the max-value record, and in the close method of your mapper you output a single record containing that value. Perhaps you can forgive the question from a heathen, but why is this first mapper not also a reducer? It seems to me that it is performing a reduce operation, and that maps should (philosophically speaking) not maintain data from one input to the next, since the order (and location) of inputs is not well defined. The program to compute a maximum should then be a tree of reduction operations, with no maps at all. Of course in this instance, what you propose works, but it does seem puzzling. Perhaps the answer is simple architectural limitation? S. The map method of course compares the current record against the max and stores current in max when current is larger than max. Then each map output is a single record, and the reduce behaves very similarly, in that the close method outputs the final max record. A single reduce would be the simplest. On your question: a Mapper and a Reducer define 3 entry points: configure, called once at task start; map/reduce, called once for each record; and close, called once after the last call to map/reduce. At least through 0.19, the close is not provided with the output collector or the reporter, so you need to save them in the map/reduce method. On Sat, Apr 18, 2009 at 9:28 AM, Farhan Husain russ...@gmail.com wrote: How do you identify that the map task is ending within the map method? Is it possible to know which is the last call to the map method? On Sat, Apr 18, 2009 at 10:59 AM, Edward Capriolo edlinuxg...@gmail.com wrote: I jumped into Hadoop at the 'deep end'. I know Pig, Hive, and HBase support the ability to max(). I am writing my own max() over a simple one-column dataset. The best solution I came up with was using MapRunner. With MapRunner I can store the highest value in a private member variable. I can read through the entire data set and only have to emit one value per mapper upon completion of the map data. Then I can specify one reducer and carry out the same operation. Does anyone have a better tactic? I thought a counter could do this, but are they atomic?
-- Alpha chapters of my book on Hadoop are available: http://www.apress.com/book/view/9781430219422 I took a look at the description of the book (http://www.apress.com/book/view/9781430219422). Hopefully it and other endeavors like it can fill a need I have and see quite often: I am quite interested in practical Hadoop algorithms, and most of my searching finds repeated WordCount examples and depictions of the shuffle-sort. The most practical lessons I took from my programming with Fortran were how to sum(), min(), max(), and average() a data set. If Hadoop had a cookbook of sorts for algorithm design, I think many people would benefit.
Re: max value for a dataset
Yes, I considered Shevek's tactic as well, but as Jason pointed out, emitting the entire data set just to find the maximum value would be wasteful: you do not want to sort the dataset, you just want to break it into parts and find the max value of each part, then bring those into one part and perform the operation again. The way I look at it, the 'best' Hadoop algorithms are the ones that emit the fewest key/value pairs. What Jason suggested and the MapRunner concept I was looking at would both emit about the same number of key/value pairs. I am curious to see if the MapRunner implementation would run faster due to fewer calls to the map function. After all, MapRunner is only iterating over the data set.
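For reference, a minimal sketch of the member-variable approach discussed in this thread, against the old mapred API and assuming a one-column dataset of longs. Note that the OutputCollector is saved in map() because, as noted above, close() does not receive one:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class MaxMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, NullWritable, LongWritable> {

        private long max;
        private boolean seen = false;
        private OutputCollector<NullWritable, LongWritable> out;

        public void map(LongWritable key, Text value,
                        OutputCollector<NullWritable, LongWritable> output,
                        Reporter reporter) throws IOException {
            out = output; // saved for close()
            long v = Long.parseLong(value.toString().trim());
            if (!seen || v > max) {
                max = v;
                seen = true;
            }
        }

        public void close() throws IOException {
            if (seen) { // emit exactly one record per map task
                out.collect(NullWritable.get(), new LongWritable(max));
            }
        }
    }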
max value for a dataset
I jumped into Hadoop at the 'deep end'. I know Pig, Hive, and HBase support the ability to max(). I am writing my own max() over a simple one-column dataset. The best solution I came up with was using MapRunner. With MapRunner I can store the highest value in a private member variable, read through the entire data set, and only have to emit one value per mapper upon completion of the map data. Then I can specify one reducer and carry out the same operation. Does anyone have a better tactic? I thought a counter could do this, but are they atomic?
Re: Using HDFS to serve www requests
but does Sun's Lustre follow in the steps of Gluster then? Yes, IMHO. GlusterFS advertises benchmarks vs Lustre. The main difference is that GlusterFS is a FUSE (userspace) filesystem, while Lustre has to be patched into the kernel, or built as a module.
Re: virtualization with hadoop
I use Linux-VServer: http://linux-vserver.org/ The Linux-VServer technology is a soft partitioning concept based on Security Contexts which permits the creation of many independent Virtual Private Servers (VPS) that run simultaneously on a single physical server at full speed, efficiently sharing hardware resources. Usually whenever people talk about virtual machines, I always hear about VMware, Xen, QEMU. For MY purposes Linux-VServer is far superior to all of them, and it is very helpful for the Hadoop work I do. (I only want Linux guests.) No emulation overhead: I installed VMware Server on my laptop and was able to get 3 Linux instances running before the system was unusable, and the instances were not even doing anything. With VServer my system is not wasting cycles emulating devices. VMs securely share a kernel and memory, so you can effectively run many more VMs at once. This leaves the processor for user processes (Hadoop), not emulation overhead. A minimal installation is 50 MB. I do not need a multi-GB Linux install just to test a version of Hadoop. This allows me to recklessly make VMs for whatever I want and not have to worry about GB chunks of my hard drive going with each VM. I can tar up a VM and use it as a template to install another VM; thus I can deploy a new system in under 30 seconds. The HTTP RPM install takes about 2 minutes. The guest is chroot'ed, so I can easily copy files into the guest using copy commands. Think ant deploy -DTARGETDIR=/path/to/guest. But it is horribly slow if you do not have enough RAM and multiple disks, since all I/O operations go to the same disk. VServer will not solve this problem, but at least you won't be losing I/O to 'emulation'. If you are working with Hadoop and you need to be able to have multiple versions running, with different configurations, take a look at VServer.
Re: Using HDFS to serve www requests
It is a little more natural to connect to HDFS from Apache Tomcat. This will allow you to skip the FUSE mounts and just use the HDFS API. I have modified this code to run inside Tomcat: http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample I will not testify to how well this setup will perform under internet traffic, but it does work. GlusterFS is more like a traditional POSIX filesystem: it supports locking and appends, and you can do things like put the MySQL data directory on it. GlusterFS is geared for storing data to be accessed with low latency; nodes (bricks) are normally connected via gig-E or InfiniBand, and the GlusterFS volume is mounted directly on a Unix system. HDFS is a userspace file system with higher latency; nodes are connected by gig-E, and it is closely coupled with MapReduce. You can use the API or the FUSE module to mount HDFS, but that is not a direct goal of Hadoop. Hope that helps.
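For completeness, the HDFS-API approach from a servlet boils down to something like this sketch, modeled on the wiki example above (the class, method, and namenode URI are placeholders):

    import java.io.InputStream;
    import java.io.OutputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsFetch {
        // Stream an HDFS file to any OutputStream, e.g. a servlet response.
        public static void fetch(String file, OutputStream out) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://namenode:9000"); // placeholder URI
            FileSystem fs = FileSystem.get(conf);
            InputStream in = fs.open(new Path(file));
            try {
                byte[] buf = new byte[4096];
                int n;
                while ((n = in.read(buf)) > 0) {
                    out.write(buf, 0, n);
                }
            } finally {
                in.close();
            }
        }
    }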
Re: Using Hadoop for near real-time processing of log data
On Wed, Feb 25, 2009 at 1:13 PM, Mikhail Yakshin greycat.na@gmail.com wrote: Hi, Is anyone using Hadoop as more of a near/almost real-time processor of log data for their systems, to aggregate stats, etc.? We do, although near realtime is a pretty relative subject and your mileage may vary. For example, startups/shutdowns of Hadoop jobs are pretty expensive: it could take anything from 5-10 seconds up to several minutes to get a job started, and almost the same goes for job finalization. Generally, if your near realtime would tolerate a 3-5 minute lag, it's possible to use Hadoop. -- WBR, Mikhail Yakshin I was thinking about this. Assuming your datasets are small, would running a local jobtracker, or even running the MiniMR cluster from the test cases, be an interesting way to run small jobs confined to one CPU?
Re: Using Hadoop for near real-time processing of log data
Yeah, but what's the point of using Hadoop then? i.e., we lost all the parallelism? Some jobs do not need it. For example, I am working with the Hive subproject. If I have a table that is smaller than my block size, having a large number of mappers or reducers is counterproductive: Hadoop will start up mappers that never get any data. Setting the job tracker to 'local', or setting map tasks and reduce tasks to 1, makes the job finish faster - 10 seconds vs 20 seconds. If you have a small data set and a system with 8 cores, the MiniMR cluster can possibly be used as an embedded Hadoop. For some jobs the most efficient parallelism might be 1. A WordCount of "1 2 3 4 5 6" on the MiniMRCluster test case takes less than two seconds. It may not be the common case, but it may be feasible to use Hadoop in that manner.
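Switching a small job to in-process execution is just a configuration change; a sketch with the old mapred API (the helper class is illustrative):

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class LocalRun {
        // Run a pre-configured job in the client JVM; no tasktracker tasks
        // are launched, which removes most of the fixed startup cost.
        public static void runInProcess(JobConf conf) throws Exception {
            conf.set("mapred.job.tracker", "local");
            conf.setNumReduceTasks(1); // the local runner supports at most one reduce
            JobClient.runJob(conf);
        }
    }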
Re: Batching key/value pairs to map
We have a MR program that collects once for each token on a line. What types of applications can benefit from batch mapping?
Hadoop JMX
I am working to graph the Hadoop JMX variables. http://hadoop.apache.org/core/docs/r0.17.0/api/org/apache/hadoop/dfs/namenode/metrics/NameNodeStatistics.html I have two nodes, one running 0.17 and the other running 0.19. The NameNode JMX objects and attributes seem to be working well. I am graphing Capacity, NumberOfBlocks, and NumberOfFiles, as well as the operations numLiveDataNodes() and numDeadDataNodes(). It seems like the DataNode JMX objects are mostly 0 or -1. I do not have heavy load on these systems, so telling whether a counter is implemented is tricky. My questions:
1) If a JMX attribute is added, is it generally added as a placeholder to be implemented later, or is it added already implemented?
2) Is there a target version for having all these attributes implemented, or are they all being handled via separate Jira issues?
3) Can I set TaskTrackers to be monitored as I can for DataNodes and NameNodes?
4) Tips, tricks, gotchas? Thank you
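To see which attributes a given build actually implements, it can be easier to dump everything the MBean server exposes than to guess object names. A standalone sketch using only standard javax.management; it assumes the daemon was started with the JMX remote port enabled (e.g. -Dcom.sun.management.jmxremote.port=10001):

    import javax.management.MBeanAttributeInfo;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class JmxDump {
        // Usage: JmxDump host:port
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://" + args[0] + "/jmxrmi");
            JMXConnector jmxc = JMXConnectorFactory.connect(url, null);
            try {
                MBeanServerConnection mbsc = jmxc.getMBeanServerConnection();
                for (ObjectName name : mbsc.queryNames(null, null)) {
                    System.out.println(name);
                    for (MBeanAttributeInfo attr : mbsc.getMBeanInfo(name).getAttributes()) {
                        Object v;
                        try {
                            v = mbsc.getAttribute(name, attr.getName());
                        } catch (Exception e) {
                            v = "<unreadable>"; // some attributes throw on read
                        }
                        System.out.println("  " + attr.getName() + " = " + v);
                    }
                }
            } finally {
                jmxc.close();
            }
        }
    }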
Re: Pluggable JDBC schemas [Was: How to use DBInputFormat?]
One thing to mention is that 'limit' is not standard SQL. Microsoft SQL Server uses SELECT TOP 100 * FROM table, and some RDBMS may not support any such syntax. To be more SQL-compliant you should use some column like an auto-increment ID or a DATE column as an offset. It is tricky to write anything truly database-agnostic, though. On Fri, Feb 13, 2009 at 8:18 AM, Fredrik Hedberg fred...@avafan.com wrote: Hi, Please let us know how this works out. Also, it would be nice if people with experience with other RDBMS than MySQL and Oracle could comment on the syntax and performance of their respective RDBMS with regard to Hadoop. Even if the syntax of the current SQL queries is valid for other systems than MySQL, some users would surely benefit performance-wise from having pluggable schemas in the JDBC interface for Hadoop. Fredrik On Feb 12, 2009, at 6:05 PM, Brian MacKay wrote: Amandeep, I spoke with one of our Oracle DBAs and he suggested changing the query statement as follows. MySQL statement: select * from TABLE limit splitlength offset splitstart --- Oracle statement: select * from (select a.*, rownum rno from (your_query_here, must contain order by) a where rownum <= splitstart + splitlength) where rno >= splitstart; This can be put into a function, but would require a type as well. - If you edit org.apache.hadoop.mapred.lib.db.DBInputFormat, getSelectQuery, it should work in Oracle: protected String getSelectQuery() { ... edit to include check for driver and create Oracle Stmt return query.toString(); } Brian == On Feb 5, 2009, at 11:37 AM, Stefan Podkowinski wrote: The 0.19 DBInputFormat class implementation is IMHO only suitable for very simple queries working on only a few datasets. That's due to the fact that it tries to create splits from the query by 1) getting a count of all rows using the specified count query (huge performance impact on large tables) and 2) creating splits by issuing an individual query for each split with a limit and offset parameter appended to the input SQL query. Effectively your input query select * from orders would become select * from orders limit splitlength offset splitstart and be executed until the count has been reached. I guess this is not valid SQL syntax for Oracle. Stefan 2009/2/4 Amandeep Khurana ama...@gmail.com: Adding a semicolon gives me the error ORA-00911: Invalid character. Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Wed, Feb 4, 2009 at 6:46 AM, Rasit OZDAS rasitoz...@gmail.com wrote: Amandeep, "SQL command not properly ended" - I get this error whenever I forget the semicolon at the end. I know, it doesn't make sense, but I recommend giving it a try. Rasit 2009/2/4 Amandeep Khurana ama...@gmail.com: The same query is working if I write a simple JDBC client and query the database. So I'm probably doing something wrong in the connection settings. But the error looks to be on the query side more than the connection side. Amandeep Amandeep Khurana Computer Science Graduate Student University of California, Santa Cruz On Tue, Feb 3, 2009 at 7:25 PM, Amandeep Khurana ama...@gmail.com wrote: Thanks Kevin. I couldn't get it to work.
Here's the error I get:

bin/hadoop jar ~/dbload.jar LoadTable1
09/02/03 19:21:17 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
09/02/03 19:21:20 INFO mapred.JobClient: Running job: job_local_0001
09/02/03 19:21:21 INFO mapred.JobClient: map 0% reduce 0%
09/02/03 19:21:22 INFO mapred.MapTask: numReduceTasks: 0
09/02/03 19:21:24 WARN mapred.LocalJobRunner: job_local_0001
java.io.IOException: ORA-00933: SQL command not properly ended
 at org.apache.hadoop.mapred.lib.db.DBInputFormat.getRecordReader(DBInputFormat.java:289)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321)
 at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
 at LoadTable1.run(LoadTable1.java:130)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
 at LoadTable1.main(LoadTable1.java:107)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
 at java.lang.reflect.Method.invoke(Unknown Source)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
 at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at
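For anyone wiring this up from scratch, the 0.19 plumbing looks roughly like this sketch (driver, URL, table, and column names are placeholders; the orderBy argument matters because the generated LIMIT/OFFSET splits are only stable against an ordered result set):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.db.DBConfiguration;
    import org.apache.hadoop.mapred.lib.db.DBInputFormat;
    import org.apache.hadoop.mapred.lib.db.DBWritable;

    public class OrderRecord implements Writable, DBWritable {
        long id;

        public void readFields(DataInput in) throws IOException { id = in.readLong(); }
        public void write(DataOutput out) throws IOException { out.writeLong(id); }
        public void readFields(ResultSet rs) throws SQLException { id = rs.getLong(1); }
        public void write(PreparedStatement ps) throws SQLException { ps.setLong(1, id); }

        // Placeholder table "orders" with a single "id" column.
        public static void configure(JobConf job) {
            DBConfiguration.configureDB(job, "com.mysql.jdbc.Driver",
                "jdbc:mysql://dbhost/mydb", "user", "password");
            DBInputFormat.setInput(job, OrderRecord.class,
                "orders", null /* conditions */, "id" /* orderBy */, "id");
        }
    }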
Using HOD to manage a production cluster
I am looking at using HOD (Hadoop On Demand) to manage a production cluster. After reading the documentation, it seems that HOD is missing some things that would need to be carefully set in a production cluster.
Rack locality: HOD uses the -N 5 option and starts a cluster of N nodes. There seems to be no way to pass specific options to nodes individually. How can I make sure the set of servers selected will end up in different racks?
Datanode blacklist/whitelist: these are listed in a file; can that file be generated?
hadoop-env: can I set these options from HOD, or do I have to build them into the Hadoop tar?
JMX settings: can I set these options from HOD, or do I have to build them into the Hadoop tar?
Upgrades with non-symmetric configurations: old servers have /mnt/drive1 /mnt/drive2; new servers have /mnt/drive1 /mnt/drive2 /mnt/drive3. Can HOD ship out different configuration files to different nodes? As new nodes join the cluster for an upgrade, they may have different configurations than the old ones.
From reading the docs it seems like HOD is great for building on-demand clusters, but may not be ideal for managing a single permanent, long-term cluster.
Accidental cluster destruction: sounds silly, but might the wrong command take out a cluster in one swipe? Possibly block this feature.
Any thoughts?
Re: Zeroconf for hadoop
Zeroconf is more focused on simplicity than security. One of the original problems, which may have been fixed, is that any program can announce any service; e.g., my laptop can announce that it is the DNS server for google.com. I want to mention a related topic to the list: people are approaching auto-discovery in a number of ways (see the related Jira issues). There are a few ways I can think of to discover Hadoop. A very simple way might be to publish the configuration over a web interface. I use a network storage system called GlusterFS; Gluster can be configured so the server holds the configuration for each client. If the Hadoop namenode held the entire configuration for all the nodes, each node would only need to be aware of the namenode and could retrieve its configuration from it. Having a central configuration management or discovery system would be very useful. HOD is, I think, the closest thing, though it is more of a top-down deployment system.
Re: Netbeans/Eclipse plugin
On Sun, Jan 25, 2009 at 10:57 AM, vinayak katkar vinaykat...@gmail.com wrote: Does anyone know of a NetBeans or Eclipse plugin for Hadoop map-reduce jobs? I want to make a plugin for NetBeans. http://vinayakkatkar.wordpress.com -- Vinayak Katkar Sun Campus Ambassador Sun Microsystems, India COEP There is an Eclipse plugin: http://www.alphaworks.ibm.com/tech/mapreducetools It seems like some work is being done on NetBeans: https://nbhadoop.dev.java.net/ The world needs more NetBeans love.
Re: Why does Hadoop need ssh access to master and slaves?
I am looking to create some resource agent (RA) scripts and experiment with starting Hadoop via the Linux-HA cluster manager. Linux-HA would handle restarting downed nodes and eliminate the SSH key dependency.
Re: When I system.out.println() in a map or reduce, where does it go?
Also be careful when you do this. If you are running map/reduce over a large file, the map and reduce operations will be called many times, and you can end up with a lot of output. Use log4j instead.
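A sketch of the log4j route via commons-logging, which is what Hadoop's own classes use (the mapper itself is illustrative):

    import java.io.IOException;
    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class LoggingMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable> {

        private static final Log LOG = LogFactory.getLog(LoggingMapper.class);

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, LongWritable> output,
                        Reporter reporter) throws IOException {
            // Level-guarded, so per-record logging can be switched off in
            // log4j.properties without recompiling.
            if (LOG.isDebugEnabled()) {
                LOG.debug("record at offset " + key);
            }
            output.collect(value, new LongWritable(1));
        }
    }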
Re: File loss at Nebraska
Also, it might be useful to word hadoop-default.xml strongly, as many people might not know a downside exists to using 2 rather than 3 as the replication factor. Before reading this thread I would have thought 2 to be sufficient.
JDBC input/output format
Is anyone working on a JDBC RecordReader/InputFormat? I was thinking this would be very useful for sending data into mappers. Writing data to a relational database might be more application-dependent, but it is still possible.
Hadoop IP Tables configuration
All, I always run iptables on my systems, but most of the Hadoop setup guides I have found skip iptables/firewall configuration. My namenode and tasktracker are the same node. My current configuration is not working: I submit jobs from the namenode, jobs are kicked off on the slave nodes, but they fail. I suspect I do not have the right ports open, but it might be more complex, such as an RPC issue. I uploaded my iptables file to my pastebin; this way I do not have to inline it. http://paste.jointhegrid.com/10 I want to avoid writing a catch-all rule or shutting the firewall down entirely. I would like to end up with 1-3 configurations for each Hadoop component. If anyone helps me out I will contribute this to the Hadoop wiki.
Re: What do you do with task logs?
We just set up a log4j server. This takes the logs off the cluster, plus you get all the benefits of log4j. http://timarcher.com/?q=node/10
Re: nagios to monitor hadoop datanodes!
All I have to say is wow! I never tried jconsole before. I have hadoop trunk checked out, and the JMX output has all kinds of great information. I am going to look at how I can get JMX, Cacti, and Hadoop working together. Just as an FYI, there are separate env variables for each daemon now; if you override HADOOP_OPTS you get a port conflict. It should be like this:
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.port=10001"
Thanks Brian.
Re: LHadoop Server simple Hadoop input and output
I came up with my line of thinking after reading this article: http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data As a guy who was intrigued by the Java coffee cup in '95 and now lives as a data center/NOC jock/Unix guy, let's say I look at a log management process from a data center perspective. I know: syslog is a familiar model (human-readable UDP text); inetd/xinetd is a familiar model (programs that do amazing things with stdin/stdout); and there is a variety of hardware and software - I may be supporting an older Solaris 8, Windows, or FreeBSD 5, for example. I want to be able to pipe an Apache custom log at HDFS, or forward syslog to it. That is where LHadoop (or something like it) would come into play. I am thinking to even accept raw streams and have the server side use source host/regex to determine which file the data should go to. I want to stay light on the client side: an application that tails log files and transmits new data is another component to develop and manage. Has anyone had experience with moving this type of data?
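To make the idea concrete, here is a rough sketch of such a raw-stream receiver, one HDFS file per source host (the port and path layout are invented; note that HDFS of this era has no append, so each connection writes a fresh file, and a real server would handle connections in threads):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.ServerSocket;
    import java.net.Socket;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class StreamSink {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            ServerSocket server = new ServerSocket(5140); // invented port
            while (true) { // handles one connection at a time
                Socket sock = server.accept();
                Path out = new Path("/logs/" + sock.getInetAddress().getHostName()
                    + "." + System.currentTimeMillis() + ".log");
                FSDataOutputStream os = fs.create(out);
                BufferedReader in = new BufferedReader(
                    new InputStreamReader(sock.getInputStream()));
                String line;
                while ((line = in.readLine()) != null) {
                    os.write((line + "\n").getBytes());
                }
                os.close();
                sock.close();
            }
        }
    }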
Re: LHadoop Server simple Hadoop input and output
I downloaded Thrift and ran the example applications after the Hive meetup. It is very cool stuff. The thriftfs interface is more elegant than what I was trying to do, and that implementation is more complete. Still, someone might be interested in what I did if they want a super-light API :) I will link to http://wiki.apache.org/hadoop/HDFS-APIs from my page so people know the options.
Hive Web-UI
I was checking out this slide show: http://www.slideshare.net/jhammerb/2008-ur-tech-talk-zshao-presentation/ In the diagram a Web UI exists. This was the first I had heard of it. Is this part of, or planned to be a part of, contrib/hive? I think a web interface for showing table schemas and executing jobs would be very interesting. Is anyone working on something like this? If not, I have a few ideas.
Re: nagios to monitor hadoop datanodes!
The simple way would be to use NRPE and check_procs. I have never tested it, but a command like 'ps -ef | grep java | grep NameNode' would be a fairly decent check. That is not very robust, but it should let you know if the process is alive. You could also monitor the web interfaces associated with the different servers remotely: check_tcp!hadoop1:56070 Both of the methods I suggested are quick hacks. I am going to investigate the JMX options as well and work them into Cacti.
Re: nagios to monitor hadoop datanodes!
That all sounds good. By 'quick hack' I meant 'check_tcp' was not good enough, because an open TCP socket does not prove much. However, if the page returns useful attributes that show the cluster is alive, that is great and easy. Come to think of it, you can navigate the dfshealth page and get useful information from it.
Re: Hadoop and security.
You bring up some valid points. This would be a great topic for a white paper. The first line of defense should be to apply inbound and outbound iptables rules: only source IPs that have a direct need to interact with the cluster should be allowed to. The same is true for web access - only a range of source IPs should be allowed to access the web interfaces; you can do this through SSH tunneling. Preventing exec commands can be handled with the security manager and the sandbox. I was thinking of only allowing the execution of signed jars myself, but I never implemented it.
Re: Hive questions about the meta db
I am doing a lot of testing with Hive; I will be sure to add this information to the wiki once I get it going. Thus far I downloaded the same version of Derby that Hive uses, and I have verified that the connection is up and running:

ij version 10.4
ij> connect 'jdbc:derby://nyhadoop1:1527/metastore_db;create=true';
ij> show tables;
TABLE_SCHEM |TABLE_NAME|REMARKS
SYS |SYSALIASES|
SYS |SYSCHECKS |
...

vi hive-default.conf
...
<property>
  <name>hive.metastore.local</name>
  <value>false</value>
  <description>controls whether to connect to remove metastore server or open a new metastore server in Hive Client JVM</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby://nyhadoop1:1527/metastore_db;create=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>org.apache.derby.jdbc.ClientDriver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>
<property>
  <name>hive.metastore.uris</name>
  <value>jdbc:derby://nyhadoop1:1527/metastore_db</value>
  <description>Comma separated list of URIs of metastore servers. The first server that can be connected to will be used.</description>
</property>
...
javax.jdo.PersistenceManagerFactoryClass=org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema=false
org.jpox.validateTables=false
org.jpox.validateColumns=false
org.jpox.validateConstraints=false
org.jpox.storeManagerType=rdbms
org.jpox.autoCreateSchema=true
org.jpox.autoStartMechanismMode=checked
org.jpox.transactionIsolation=read_committed
javax.jdo.option.DetachAllOnCommit=true
javax.jdo.option.NontransactionalRead=true
javax.jdo.option.ConnectionDriverName=org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL=jdbc:derby://nyhadoop1:1527/metastore_db;create=true
javax.jdo.option.ConnectionUserName=
javax.jdo.option.ConnectionPassword=

hive> show tables;
08/10/02 15:17:12 INFO hive.metastore: Trying to connect to metastore with URI jdbc:derby://nyhadoop1:1527/metastore_db
FAILED: Error in semantic analysis: java.lang.NullPointerException
08/10/02 15:17:12 ERROR ql.Driver: FAILED: Error in semantic analysis: java.lang.NullPointerException

I must have a setting wrong. Any ideas?
Re: Hive questions about the meta db
<name>hive.metastore.local</name>
<value>true</value>
Why would I set this property to true? My goal is to store the metadata in an external database. If I set this to true, the metastore database is created in the working directory.
Re: Hive questions about the meta db
I determined the problem once I set the log4j properties to debug: derbyclient.jar and derbytools.jar do not ship with Hive. As a result, when you try to load org.apache.derby.jdbc.ClientDriver you get an InvocationTargetException. The solution was to download Derby and place those files in hive/lib. It is working now. Thanks!
Re: text extraction from html based on uniqueness metric
I have never tried this method; the concept came from a research paper I ran into. The goal was to detect the language of a piece of text by looking at several factors: average length of a word, average length of a sentence, average number of vowels in a word, etc. The author used these to score an article, and it worked well in determining the language of the text. This is a fairly basic program of the kind you might see in artificial intelligence: you can create a score and try to determine what the block of text you are looking at is. The answer is not going to be perfect, and I cannot imagine many out-of-the-box solutions will do exactly what you need (just a guess). The one plus about this is that you can take HTML right out of the equation. I believe the Java HTML tag parsers have some quick 'toText' method that will dump the text of a web page. Also, you would think most online newspapers carry a NewsML XML version or an RSS version of their paper.
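As a concrete illustration of the scoring idea (the target values and weighting below are invented for illustration; a real implementation would calibrate them against sample text):

    public class ProseScore {
        // Score how "prose-like" a block of text is, from average word
        // length and vowel ratio; higher means closer to ordinary English.
        public static double score(String text) {
            String[] words = text.trim().split("\\s+");
            if (words.length == 0 || words[0].isEmpty()) return 0.0;
            long chars = 0, vowels = 0;
            for (String w : words) {
                chars += w.length();
                for (char c : w.toLowerCase().toCharArray()) {
                    if ("aeiou".indexOf(c) >= 0) vowels++;
                }
            }
            double avgWordLen = (double) chars / words.length;
            double vowelRatio = (double) vowels / Math.max(chars, 1);
            // Invented targets: ~5-char words, ~38% vowels for English prose.
            double lenScore = 1.0 - Math.min(Math.abs(avgWordLen - 5.0) / 5.0, 1.0);
            double vowelScore = 1.0 - Math.min(Math.abs(vowelRatio - 0.38) / 0.38, 1.0);
            return (lenScore + vowelScore) / 2.0;
        }
    }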
Re: does anyone have idea on how to run multiple sequential jobs with bash script
wait and sleep are not what you are looking for. You can use 'nohup' to run a job in the background and have its output piped to a file. On Tue, Jun 10, 2008 at 5:48 PM, Meng Mao [EMAIL PROTECTED] wrote: I'm interested in the same thing -- is there a recommended way to batch Hadoop jobs together? On Tue, Jun 10, 2008 at 5:45 PM, Richard Zhang [EMAIL PROTECTED] wrote: Hello folks: I am running several Hadoop applications on HDFS. To save the effort of issuing the set of commands every time, I am trying to use a bash script to run the several applications sequentially. To let each job finish before proceeding to the next job, I am using wait in the script, like below:

sh bin/start-all.sh
wait
echo cluster start
(bin/hadoop jar hadoop-0.17.0-examples.jar randomwriter -D test.randomwrite.bytes_per_map=107374182 rand)
wait
bin/hadoop jar hadoop-0.17.0-examples.jar randomtextwriter -D test.randomtextwrite.total_bytes=107374182 rand-text
bin/stop-all.sh
echo finished hdfs randomwriter experiment

However, it always gives an error like the one below. Does anyone have a better idea of how to run multiple sequential jobs with a bash script?

HadoopScript.sh: line 39: wait: pid 10 is not a child of this shell
org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.mapred.JobTracker$IllegalStateException: Job tracker still initializing
 at org.apache.hadoop.mapred.JobTracker.ensureRunning(JobTracker.java:1722)
 at org.apache.hadoop.mapred.JobTracker.getNewJobId(JobTracker.java:1730)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
 at org.apache.hadoop.ipc.Client.call(Client.java:557)
 at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
 at $Proxy1.getNewJobId(Unknown Source)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
 at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
 at $Proxy1.getNewJobId(Unknown Source)
 at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:696)
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
 at org.apache.hadoop.examples.RandomWriter.run(RandomWriter.java:276)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.hadoop.examples.RandomWriter.main(RandomWriter.java:287)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
 at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
 at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:53)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.util.RunJar.main(RunJar.java:155)
 at org.apache.hadoop.mapred.JobShell.run(JobShell.java:194)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
 at org.apache.hadoop.mapred.JobShell.main(JobShell.java:220)
-- hustlin, hustlin, everyday I'm hustlin
Re: Hadoop Distributed Virtualisation
I once asked a wise man in charge of a rather large multi-datacenter service, "Have you ever considered virtualization?" He replied, "All the CPUs here are pegged at 100%." There may be applications for this type of processing. I have thought about systems like this from time to time, and the thinking goes in circles: Hadoop is designed for storing and processing on different hardware, while virtualization lets you split a system into sub-systems. Virtualization is great for proof of concept. For example, I have deployed this: I installed VMware with two Linux systems on my Windows host and followed a Hadoop multi-system tutorial running on the two VMware nodes. I was able to get the word count application working, and I also confirmed that blocks were indeed being stored on both virtual systems and that processing was being shared via MapReduce. The processing, however, was slow; of course, this is the fault of VMware, which has a very high emulation overhead. Xen has less overhead. Linux-VServer and OpenVZ use software virtualization (they have very little, almost no, overhead). Regardless of how much overhead, overhead is overhead. Personally, I find that VMware falls short of its promises.