Re: HDFS - millions of files in one directory?
System with 1 billion small files: the namenode will need to maintain the data structure for all those files. The system will have at least 1 block per file, and if you have the replication factor set to 3, the system will have 3 billion blocks. Now, if you try to read all these files in a job, you will be making as many as 1 billion socket connections to get these blocks. (Big Brothers, correct me if I'm wrong.) Datanodes routinely check for available disk space and collect block reports, and these operations are directly dependent on the number of blocks on a datanode. Getting all the data into one file avoids all this unnecessary IO and memory occupied by the namenode. The number of maps in a map-reduce job is based on the number of blocks, so with multiple files we will have a large number of map tasks. -Sagar

Mark Kerzner wrote: Carfield, you might be right, and I may be able to combine them in one large file. What would one use for a delimiter, so that it would never be encountered in normal binary files? Performance does matter (rarely it doesn't). What are the differences in performance between using multiple files and one large file? I would guess that one file should in fact give better hardware/OS performance, because it is more predictable and allows buffering. Thank you, Mark

On Sun, Jan 25, 2009 at 9:50 PM, Carfield Yim wrote: Really? I thought any files can be combined as long as you can figure out a delimiter; can you really not find a usable delimiter, like "X"? And in the worst case, or if performance is not really a concern, maybe just encode all the binary to and from ASCII?

On Mon, Jan 26, 2009 at 5:49 AM, Mark Kerzner wrote: Yes, flip suggested such a solution, but his files are text, so he could combine them all into one large text file, with each line representing one of the initial files. My files, however, are binary, so I do not see how I could combine them. However, since my numbers are limited to about 1 billion files total, I should be OK putting them all in a few directories with under, say, 10,000 files each. Maybe a little balanced tree, but 3-4 levels should suffice. Thank you, Mark

On Sun, Jan 25, 2009 at 11:43 AM, Carfield Yim wrote: Is it possible to simply have one large file instead of a lot of small files?

On Sat, Jan 24, 2009 at 7:03 AM, Mark Kerzner wrote: Hi, there is a performance penalty in Windows (pardon the expression) if you put too many files in the same directory. The OS becomes very slow, stops seeing them, and lies about their status to my Java requests. I do not know if this is also a problem in Linux, but in HDFS - do I need to balance a directory tree if I want to store millions of files, or can I put them all in the same directory? Thank you, Mark
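A minimal sketch of the SequenceFile approach discussed in this thread: pack local binary files into one HDFS SequenceFile with the filename as key, which sidesteps the delimiter question entirely because records are length-prefixed. The class name, argument handling, and paths below are illustrative assumptions, not code from the thread.

    import java.io.File;
    import java.io.FileInputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SmallFilePacker {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(args[0]);               // e.g. /user/mark/packed.seq (placeholder)
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, out, Text.class, BytesWritable.class,
            SequenceFile.CompressionType.BLOCK);
        try {
          for (int i = 1; i < args.length; i++) {   // remaining args: local files to pack
            File f = new File(args[i]);
            byte[] buf = new byte[(int) f.length()];
            FileInputStream in = new FileInputStream(f);
            try {
              int off = 0;
              while (off < buf.length) {            // read the whole file into buf
                off += in.read(buf, off, buf.length - off);
              }
            } finally {
              in.close();
            }
            // filename becomes the key, raw bytes the value
            writer.append(new Text(f.getName()), new BytesWritable(buf));
          }
        } finally {
          writer.close();
        }
      }
    }

A job can then read the packed file with SequenceFileInputFormat instead of opening a billion small files.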
Re: decommissioned node showing up as dead node in web-based interface to namenode (dfshealth.jsp)
Once the nodes are listed as dead, if you still have the host names in your conf/exclude file, remove the entries and then run hadoop dfsadmin -refreshNodes. This works for us on our cluster. -paul

On Tue, Jan 27, 2009 at 5:08 PM, Bill Au wrote:
> I was able to decommission a datanode successfully without having to stop my
> cluster. But I noticed that after a node has been decommissioned, it shows up
> as a dead node in the web-based interface to the namenode (i.e. dfshealth.jsp).
> My cluster is relatively small and losing a datanode will have a performance
> impact. So I have a need to monitor the health of my cluster and take steps to
> revive any dead datanode in a timely fashion. So is there any way to altogether
> "get rid of" any decommissioned datanode from the web interface of the namenode?
> Or is there a better way to monitor the health of the cluster?
> Bill
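For reference, the sequence Paul describes looks roughly like this; the hostname is made up, and the exclude file is whatever dfs.hosts.exclude points to in your hadoop-site.xml:

    # decommission: add the host to the exclude file, then tell the namenode
    echo "datanode42.example.com" >> conf/exclude
    bin/hadoop dfsadmin -refreshNodes

    # after decommissioning finishes and the datanode is stopped, remove the
    # host line from conf/exclude again and refresh once more so the node no
    # longer shows up as dead
    bin/hadoop dfsadmin -refreshNodes

    # live/dead datanode counts can also be checked from the command line
    bin/hadoop dfsadmin -report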
decommissioned node showing up as dead node in web-based interface to namenode (dfshealth.jsp)
I was able to decommission a datanode successfully without having to stop my cluster. But I noticed that after a node has been decommissioned, it shows up as a dead node in the web-based interface to the namenode (i.e. dfshealth.jsp). My cluster is relatively small and losing a datanode will have a performance impact. So I have a need to monitor the health of my cluster and take steps to revive any dead datanode in a timely fashion. So is there any way to altogether "get rid of" any decommissioned datanode from the web interface of the namenode? Or is there a better way to monitor the health of the cluster? Bill
Re: DBOutputFormat and auto-generated keys
On Mon, Jan 26, 2009 at 5:40 PM, Vadim Zaliva wrote:
> Is it possible to obtain auto-generated IDs when writing data using DBOutputFormat?
> For example, is it possible to write a Mapper which stores records in the DB and returns the auto-generated IDs of these records? ... which I would like to store in normalized form in two tables. The first table will store keys (string). Each key will have a unique int id auto-generated by MySQL.
> The second table will have (key_id, value) pairs, key_id being a foreign key pointing to the first table.

A mapper has to have one output format, and that output format can't pass any data back into the map, so that approach won't work. DBOutputFormat doesn't provide any way to do it either. If you wanted to add this kind of functionality, you would need to write your own output format that is aware of your foreign keys, and it probably wouldn't look much like DBOutputFormat; it would quickly get very complicated. One possibility that comes to mind is writing a "HibernateOutputFormat" or similar, which would give you a way to express the relationships between tables, leaving your only task to hook up your persistence logic to a Hadoop output format. I had a similar problem writing out reports to be used by a Rails app, and solved it by restructuring things so that I don't need to write to two tables from the same map task.
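If you do go down the custom-output-format road, the JDBC side of the two-table insert is the straightforward part. A hedged sketch, with invented table and column names, of fetching MySQL's auto-generated id and using it as the foreign key; this would live inside your own RecordWriter, it is not part of DBOutputFormat:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.Statement;

    public class TwoTableWriter {
      private final Connection conn;

      public TwoTableWriter(Connection conn) {
        this.conn = conn;
      }

      // Insert the key string, read back the AUTO_INCREMENT id,
      // then insert the (key_id, value) row. Table/column names are hypothetical.
      public void write(String key, String value) throws Exception {
        PreparedStatement insKey = conn.prepareStatement(
            "INSERT INTO keys (name) VALUES (?)", Statement.RETURN_GENERATED_KEYS);
        insKey.setString(1, key);
        insKey.executeUpdate();
        ResultSet rs = insKey.getGeneratedKeys();   // MySQL returns the generated id here
        rs.next();
        long keyId = rs.getLong(1);
        rs.close();
        insKey.close();

        PreparedStatement insVal = conn.prepareStatement(
            "INSERT INTO key_values (key_id, value) VALUES (?, ?)");
        insVal.setLong(1, keyId);
        insVal.setString(2, value);
        insVal.executeUpdate();
        insVal.close();
      }
    }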
Re: Using HDFS for common purpose
You may also want to have a look at this to reach a decision based on your needs: http://www.swaroopch.com/notes/Distributed_Storage_Systems Jim On Tue, Jan 27, 2009 at 1:22 PM, Jim Twensky wrote: > Rasit, > > What kind of data will you be storing on Hbase or directly on HDFS? Do you > aim to use it as a data source to do some key/value lookups for small > strings/numbers or do you want to store larger files labeled with some sort > of a key and retrieve them during a map reduce run? > > Jim > > > On Tue, Jan 27, 2009 at 11:51 AM, Jonathan Gray wrote: > >> Perhaps what you are looking for is HBase? >> >> http://hbase.org >> >> HBase is a column-oriented, distributed store that sits on top of HDFS and >> provides random access. >> >> JG >> >> > -Original Message- >> > From: Rasit OZDAS [mailto:rasitoz...@gmail.com] >> > Sent: Tuesday, January 27, 2009 1:20 AM >> > To: core-user@hadoop.apache.org >> > Cc: arif.yil...@uzay.tubitak.gov.tr; emre.gur...@uzay.tubitak.gov.tr; >> > hilal.tara...@uzay.tubitak.gov.tr; serdar.ars...@uzay.tubitak.gov.tr; >> > hakan.kocaku...@uzay.tubitak.gov.tr; caglar.bi...@uzay.tubitak.gov.tr >> > Subject: Using HDFS for common purpose >> > >> > Hi, >> > I wanted to ask, if HDFS is a good solution just as a distributed db >> > (no >> > running jobs, only get and put commands) >> > A review says that "HDFS is not designed for low latency" and besides, >> > it's >> > implemented in Java. >> > Do these disadvantages prevent us using it? >> > Or could somebody suggest a better (faster) one? >> > >> > Thanks in advance.. >> > Rasit >> >> >
Re: Using HDFS for common purpose
Rasit, What kind of data will you be storing on Hbase or directly on HDFS? Do you aim to use it as a data source to do some key/value lookups for small strings/numbers or do you want to store larger files labeled with some sort of a key and retrieve them during a map reduce run? Jim On Tue, Jan 27, 2009 at 11:51 AM, Jonathan Gray wrote: > Perhaps what you are looking for is HBase? > > http://hbase.org > > HBase is a column-oriented, distributed store that sits on top of HDFS and > provides random access. > > JG > > > -Original Message- > > From: Rasit OZDAS [mailto:rasitoz...@gmail.com] > > Sent: Tuesday, January 27, 2009 1:20 AM > > To: core-user@hadoop.apache.org > > Cc: arif.yil...@uzay.tubitak.gov.tr; emre.gur...@uzay.tubitak.gov.tr; > > hilal.tara...@uzay.tubitak.gov.tr; serdar.ars...@uzay.tubitak.gov.tr; > > hakan.kocaku...@uzay.tubitak.gov.tr; caglar.bi...@uzay.tubitak.gov.tr > > Subject: Using HDFS for common purpose > > > > Hi, > > I wanted to ask, if HDFS is a good solution just as a distributed db > > (no > > running jobs, only get and put commands) > > A review says that "HDFS is not designed for low latency" and besides, > > it's > > implemented in Java. > > Do these disadvantages prevent us using it? > > Or could somebody suggest a better (faster) one? > > > > Thanks in advance.. > > Rasit > >
Re: files are inaccessible after HDFS upgrade from 0.18.1 to 0.19.0
Yes, I did run fsck after the upgrade. No error message. Everything is "OK". yy

Brian Bockelman wrote on 01/27/2009 08:57 AM (to core-user@hadoop.apache.org, subject: Re: files are inaccessible after HDFS upgrade from 0.18.1 to 0.19.0):

Hey YY, At a more basic level -- have you run fsck on that file? What were the results? Brian

On Jan 27, 2009, at 10:54 AM, Bill Au wrote:
> Did you start your namenode with the -upgrade after upgrading from 0.18.1 to 0.19.0?
> Bill
> On Mon, Jan 26, 2009 at 8:18 PM, Yuanyuan Tian wrote:
>> Hi,
>> I just upgraded hadoop from 0.18.1 to 0.19.0 following the instructions on http://wiki.apache.org/hadoop/Hadoop_Upgrade. After the upgrade, I ran fsck and everything seems fine. All the files can be listed in hdfs and the sizes are also correct. But when a mapreduce job tries to read the files as input, the following error messages are returned for some of the files:
>> java.io.IOException: Could not obtain block: blk_-2827537120880440835_1131 file=/user/hmail/NSF/50k_nntp_clean2.nsf.fs.kvp
>>   at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1708)
>>   at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1536)
>>   at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1663)
>>   at java.io.DataInputStream.read(DataInputStream.java:150)
>>   at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2283)
>>   at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2296)
>>   at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2767)
>>   at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:798)
>>   at java.io.ObjectInputStream.<init>(ObjectInputStream.java:298)
>>   at emailanalytics.importer.parallelimport.EmailContentRecordReader.<init>(EmailContentRecordReader.java:32)
>>   at emailanalytics.importer.parallelimport.EmailContentFormat.getRecordReader(EmailContentFormat.java:20)
>>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321)
>>   at org.apache.hadoop.mapred.Child.main(Child.java:155)
>> I also tried to browse these files through the HDFS web interface; java.io.EOFException is returned.
>> Is there any way to recover the files?
>> Thanks very much,
>> YY
Re: files are inaccessible after HDFS upgrade from 0.18.1 to 0.19.0
Yes, I did that. But there was an error message that asked me to rollback first. So I ended up doing a -rollback first and then an -upgrade. yy

Bill Au wrote on 01/27/2009 08:54 AM (to core-user@hadoop.apache.org, subject: Re: files are inaccessible after HDFS upgrade from 0.18.1 to 0.19.0):

Did you start your namenode with the -upgrade after upgrading from 0.18.1 to 0.19.0? Bill

On Mon, Jan 26, 2009 at 8:18 PM, Yuanyuan Tian wrote:
> Hi,
> I just upgraded hadoop from 0.18.1 to 0.19.0 following the instructions on http://wiki.apache.org/hadoop/Hadoop_Upgrade. After the upgrade, I ran fsck and everything seems fine. All the files can be listed in hdfs and the sizes are also correct. But when a mapreduce job tries to read the files as input, the following error messages are returned for some of the files:
> java.io.IOException: Could not obtain block: blk_-2827537120880440835_1131 file=/user/hmail/NSF/50k_nntp_clean2.nsf.fs.kvp
>   at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1708)
>   at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1536)
>   at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1663)
>   at java.io.DataInputStream.read(DataInputStream.java:150)
>   at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2283)
>   at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2296)
>   at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2767)
>   at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:798)
>   at java.io.ObjectInputStream.<init>(ObjectInputStream.java:298)
>   at emailanalytics.importer.parallelimport.EmailContentRecordReader.<init>(EmailContentRecordReader.java:32)
>   at emailanalytics.importer.parallelimport.EmailContentFormat.getRecordReader(EmailContentFormat.java:20)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321)
>   at org.apache.hadoop.mapred.Child.main(Child.java:155)
> I also tried to browse these files through the HDFS web interface; java.io.EOFException is returned.
> Is there any way to recover the files?
> Thanks very much,
> YY
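For reference, the commands involved in that rollback/upgrade dance are roughly the following, per the upgrade wiki page referenced above (run from the Hadoop install directory):

    # start the new version with an on-disk layout upgrade
    bin/start-dfs.sh -upgrade

    # watch progress until the namenode reports the upgrade is complete
    bin/hadoop dfsadmin -upgradeProgress status

    # only after verifying the data: make the upgrade permanent
    bin/hadoop dfsadmin -finalizeUpgrade

    # or, to go back to the previous layout and version instead
    bin/start-dfs.sh -rollback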
RE: Using HDFS for common purpose
Perhaps what you are looking for is HBase? http://hbase.org HBase is a column-oriented, distributed store that sits on top of HDFS and provides random access. JG > -Original Message- > From: Rasit OZDAS [mailto:rasitoz...@gmail.com] > Sent: Tuesday, January 27, 2009 1:20 AM > To: core-user@hadoop.apache.org > Cc: arif.yil...@uzay.tubitak.gov.tr; emre.gur...@uzay.tubitak.gov.tr; > hilal.tara...@uzay.tubitak.gov.tr; serdar.ars...@uzay.tubitak.gov.tr; > hakan.kocaku...@uzay.tubitak.gov.tr; caglar.bi...@uzay.tubitak.gov.tr > Subject: Using HDFS for common purpose > > Hi, > I wanted to ask, if HDFS is a good solution just as a distributed db > (no > running jobs, only get and put commands) > A review says that "HDFS is not designed for low latency" and besides, > it's > implemented in Java. > Do these disadvantages prevent us using it? > Or could somebody suggest a better (faster) one? > > Thanks in advance.. > Rasit
Re: HDFS - millions of files in one directory?
Tossing one more on this king of all threads: Stuart Sierra of AltLaw wrote a nice little tool to serialize tar.bz2 files into SequenceFile, with the filename as key and its contents as a BLOCK-compressed blob. http://stuartsierra.com/2008/04/24/a-million-little-files flip

On Mon, Jan 26, 2009 at 3:20 PM, Mark Kerzner wrote:
> Jason, this is awesome, thank you.
> By the way, is there a book or manual with "best practices?"
> On Mon, Jan 26, 2009 at 3:13 PM, jason hadoop wrote:
> > Sequence files rock, and you can use the bin/hadoop dfs -text FILENAME command line tool to get a toString-level unpacking of the sequence file key,value pairs.
> > If you provide your own key or value classes, you will need to implement a toString method to get some use out of this. Also, your classpath will need to include the jars with your custom key/value classes.
> > HADOOP_CLASSPATH="myjar1;myjar2..." bin/hadoop dfs -text FILENAME
> > On Mon, Jan 26, 2009 at 1:08 PM, Mark Kerzner wrote:
> > > Thank you, Doug, then all is clear in my head.
> > > Mark
> > > On Mon, Jan 26, 2009 at 3:05 PM, Doug Cutting wrote:
> > > > Mark Kerzner wrote:
> > > > > Okay, I am convinced. I only noticed that Doug, the originator, was not happy about it - but in open source one has to give up control sometimes.
> > > > I think perhaps you misunderstood my remarks. My point was that, if you looked to Nutch's Content class for an example, it is, for historical reasons, somewhat more complicated than it needs to be and is thus a less than perfect example. But using SequenceFile to store web content is certainly a best practice and I did not mean to imply otherwise.
> > > > Doug

-- http://www.infochimps.org Connected Open Free Data
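For programmatic access rather than bin/hadoop dfs -text, iterating over such a SequenceFile looks roughly like this. Text keys and BytesWritable values are assumed here; substitute whatever key/value classes your file actually uses.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SeqFileDump {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // args[0] is the SequenceFile path on HDFS, e.g. /user/flip/packed.seq (placeholder)
        SequenceFile.Reader reader =
            new SequenceFile.Reader(fs, new Path(args[0]), conf);
        Text key = new Text();                      // assumed key class
        BytesWritable value = new BytesWritable();  // assumed value class
        try {
          while (reader.next(key, value)) {
            // key is the original filename; value now holds that file's bytes
            System.out.println(key);
          }
        } finally {
          reader.close();
        }
      }
    }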
Re: files are inaccessible after HDFS upgrade from 0.18.1 to 0.19.0
Hey YY, At a more basic level -- have you run fsck on that file? What were the results? Brian

On Jan 27, 2009, at 10:54 AM, Bill Au wrote: Did you start your namenode with the -upgrade after upgrading from 0.18.1 to 0.19.0? Bill

On Mon, Jan 26, 2009 at 8:18 PM, Yuanyuan Tian wrote: Hi, I just upgraded hadoop from 0.18.1 to 0.19.0 following the instructions on http://wiki.apache.org/hadoop/Hadoop_Upgrade. After the upgrade, I ran fsck and everything seems fine. All the files can be listed in hdfs and the sizes are also correct. But when a mapreduce job tries to read the files as input, the following error messages are returned for some of the files:

java.io.IOException: Could not obtain block: blk_-2827537120880440835_1131 file=/user/hmail/NSF/50k_nntp_clean2.nsf.fs.kvp
  at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1708)
  at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1536)
  at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1663)
  at java.io.DataInputStream.read(DataInputStream.java:150)
  at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2283)
  at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2296)
  at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2767)
  at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:798)
  at java.io.ObjectInputStream.<init>(ObjectInputStream.java:298)
  at emailanalytics.importer.parallelimport.EmailContentRecordReader.<init>(EmailContentRecordReader.java:32)
  at emailanalytics.importer.parallelimport.EmailContentFormat.getRecordReader(EmailContentFormat.java:20)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321)
  at org.apache.hadoop.mapred.Child.main(Child.java:155)

I also tried to browse these files through the HDFS web interface; java.io.EOFException is returned. Is there any way to recover the files? Thanks very much, YY
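For a single file, fsck can also list the file's blocks and the datanodes expected to hold them, which is usually more telling than a whole-filesystem check; roughly (using the path from the stack trace above):

    bin/hadoop fsck /user/hmail/NSF/50k_nntp_clean2.nsf.fs.kvp -files -blocks -locations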
Re: files are inaccessible after HDFS upgrade from 0.18.1 to 0.19.0
Did you start your namenode with the -upgrade after upgrading from 0.18.1 to 0.19.0? Bill

On Mon, Jan 26, 2009 at 8:18 PM, Yuanyuan Tian wrote:
> Hi,
> I just upgraded hadoop from 0.18.1 to 0.19.0 following the instructions on http://wiki.apache.org/hadoop/Hadoop_Upgrade. After the upgrade, I ran fsck and everything seems fine. All the files can be listed in hdfs and the sizes are also correct. But when a mapreduce job tries to read the files as input, the following error messages are returned for some of the files:
> java.io.IOException: Could not obtain block: blk_-2827537120880440835_1131 file=/user/hmail/NSF/50k_nntp_clean2.nsf.fs.kvp
>   at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1708)
>   at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1536)
>   at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read(DFSClient.java:1663)
>   at java.io.DataInputStream.read(DataInputStream.java:150)
>   at java.io.ObjectInputStream$PeekInputStream.read(ObjectInputStream.java:2283)
>   at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2296)
>   at java.io.ObjectInputStream$BlockDataInputStream.readShort(ObjectInputStream.java:2767)
>   at java.io.ObjectInputStream.readStreamHeader(ObjectInputStream.java:798)
>   at java.io.ObjectInputStream.<init>(ObjectInputStream.java:298)
>   at emailanalytics.importer.parallelimport.EmailContentRecordReader.<init>(EmailContentRecordReader.java:32)
>   at emailanalytics.importer.parallelimport.EmailContentFormat.getRecordReader(EmailContentFormat.java:20)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321)
>   at org.apache.hadoop.mapred.Child.main(Child.java:155)
> I also tried to browse these files through the HDFS web interface; java.io.EOFException is returned.
> Is there any way to recover the files?
> Thanks very much,
> YY
Funding opportunities for Academic Research on/near Hadoop
This is a little note to advise universities working on Hadoop-related projects that they may be able to get some money and cluster time for some fun things: http://www.hpl.hp.com/open_innovation/irp/

"The HP Labs Innovation Research Program is designed to create opportunities -- at colleges, universities and research institutes around the world -- for breakthrough collaborative research with HP. HP Labs is proud to announce the 2009 Innovation Research Program (IRP). Through this open Call for Proposals, we are soliciting your best ideas on a range of topics with the goal of establishing new research collaborations. Proposals will be invited against targeted IRP Research Topics, and will be accepted via an online submission tool. They will be reviewed by HP Labs scientists for alignment with the selected research topic and impact of the proposed research. Awards under the 2009 HP Labs Innovation Research Program are primarily intended to provide financial support for a graduate student to assist the Principal Investigator conducting a collaborative research project with HP Labs. Consequently, awards will provide cash support for one year in the range of USD $50,000 to $75,000, including any overhead."

If you look at the research topics there is a PDF file listing topics of interest, of which three general categories may be relevant: http://www.hpl.hp.com/open_innovation/irp/topics_2009.html

1. "Intelligent Infrastructure" - very large scale storage systems, management, etc.
2. Sustainability - especially sustainable datacentres: how to measure application power consumption and improve it; how to include knowledge of the physical infrastructure in computation.
3. "Cloud" - Large-scale computing frameworks, Data management and security, Federation of heterogeneous cloud sites, Programming tools and mash-ups, Complex event processing and management, Massive-Scale Data Analytics, Cloud monitoring and management.

If you look at that Cloud topic, not only does Hadoop-related work seem to fit in, the call for proposals is fairly explicit in mentioning the ecosystem's suitability as a platform for your work. Which makes sense, as it is the only very-large-scale data-centric computing platform out there for which the source code is freely available. Yet also, because it is open source, it is a place where, university permitting, your research can be contributed back to the community and used by grateful users the world over.

What is also interesting is that little line at the bottom: "We encourage investigators to utilize the capabilities in the Open Cirrus testbed as well as to share their experience, data, and algorithms with other researchers using the testbed." Which implies that cluster time on the new cross-company, cross-university homogeneous datacentre test bed should be available to test your ideas.

If you are at university, have a look at the proposals and see if you can come up with a proposal for innovative work in this area. The timescales are fairly aggressive - that is to ensure that proposers will know early on whether or not they were successful, and the money will be in their university's hands for the next academic year.

-Steve (for followup queries, follow the links on the site or email me direct; I am vaguely involved in some of this)
[ANNOUNCE] Registration for ApacheCon Europe 2009 is now open!
All, I'm broadcasting this to all of the Hadoop dev and users lists; however, in the future I'll only send cross-subproject announcements to gene...@hadoop.apache.org. Please subscribe over there too! It is very low traffic.

Anyways, ApacheCon Europe is coming up in March. There is a range of Hadoop talks being given:

Introduction to Hadoop by Owen O'Malley
Hadoop Map/Reduce: Tuning and Debugging by Arun Murthy
Pig - Making Hadoop Easy by Olga Natkovich
Running Hadoop in the Cloud by Tom White
Architectures for the Cloud by Steve Loughran
Configuring Hadoop for Grid Services by Allen Wittenauer
Dynamic Hadoop Clusters by Steve Loughran
HBasics: An Introduction to Hadoop's Big Data Database by Michael Stack
Hadoop Tools and Tricks for Data Pipelines by Christophe Bisciglia

-- Owen

Begin forwarded message:
From: Shane Curcuru
Date: January 27, 2009 6:15:25 AM PST
Subject: [ANN] Registration for ApacheCon Europe 2009 is now open!

PMC moderators - please forward the below to any appropriate dev@ or users@ lists so your larger community can hear about ApacheCon Europe. Remember, ACEU09 has scheduled sessions spanning the breadth of the ASF's projects, subprojects, and podlings, including at least: ActiveMQ, ServiceMix, CXF, Axis2, Hadoop, Felix, Sling, Maven, Struts, Roller, Shindig, Geronimo, Lucene, Solr, BSF, Mina, Directory, Tomcat, httpd, Mahout, Bayeux, CouchDB, AntUnit, Jackrabbit, Archiva, Wicket, POI, Pig, Synapse, Droids, Continuum.

ApacheCon EU 2009 registration is now open! 23-27 March -- Mövenpick Hotel, Amsterdam, Netherlands http://www.eu.apachecon.com/

Registration for ApacheCon Europe 2009 is now open - act before early bird prices expire 6 February. Remember to book a room at the Mövenpick and use the Registration Code: Special package attendees for the conference registration, and get 150 Euros off your full conference registration.

Lower Costs - Thanks to new VAT tax laws, our prices this year are 19% lower than last year in Europe! We've also negotiated a Mövenpick rate of a maximum of 155 Euros per night for attendees in our room block.

Quick Links:
http://xrl.us/aceu09sp See the schedule
http://xrl.us/aceu09hp Get your hotel room
http://xrl.us/aceu09rp Register for the conference

Other important notes:
- Geeks for Geeks is a new mini-track where we can feature advanced technical content from project committers. And our Hackathon on Monday and Tuesday is open to all attendees - be sure to check it off in your registration.
- The Call for Papers for ApacheCon US 2009, held 2-6 November 2009 in Oakland, CA, is open through 28 February, so get your submissions in now. This ApacheCon will feature special events with some of the ASF's original founders in celebration of the 10th anniversary of The Apache Software Foundation. http://www.us.apachecon.com/c/acus2009/
- Interested in sponsoring the ApacheCon conferences? There are plenty of sponsor packages available - please contact Delia Frees at de...@apachecon.com for further information.

== ApacheCon EU 2009: A week of Open Source at its best!
Hackathon - open to all! | Geeks for Geeks | Lunchtime Sessions
In-Depth Trainings | Multi-Track Sessions | BOFs | Business Panel
Lightning Talks | Receptions | Fast Feather Track | Expo... and more!

- Shane Curcuru, on behalf of Noirin Shirley, Conference Lead, and the whole ApacheCon Europe 2009 Team
http://www.eu.apachecon.com/ 23-27 March -- Amsterdam, Netherlands
Number of records in a MapFile
Is there a way to programmatically get the number of records in a MapFile without doing a complete scan?
Re: Where are the meta data on HDFS ?
Hi Tien,

Configuration config = new Configuration(true);
config.addResource(new Path("/etc/hadoop-0.19.0/conf/hadoop-site.xml"));
FileSystem fileSys = FileSystem.get(config);
BlockLocation[] locations = fileSys.getFileBlockLocations(...

I copied some lines of my code; it can also help if you prefer using the API. It has other useful features (methods) as well. http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/fs/FileSystem.html

2009/1/24 tienduc_dinh
> that's what I needed !
> Thank you so much.

-- M. Raşit ÖZDAŞ
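A self-contained version of the same getFileBlockLocations idea, printing which hosts hold each block of a file; the class name and path handling are placeholders, not part of the original message:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockHosts {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(true);
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path(args[0]);           // e.g. /user/foo/data.bin (placeholder)
        FileStatus status = fs.getFileStatus(file);
        // ask for block locations covering the whole file
        BlockLocation[] blocks =
            fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
          System.out.println("offset " + b.getOffset()
              + ", length " + b.getLength()
              + ", hosts: " + java.util.Arrays.toString(b.getHosts()));
        }
      }
    }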
Re: Zeroconf for hadoop
Edward Capriolo wrote: Zeroconf is more focused on simplicity than security. One of the original problems, which may have been fixed, is that any program can announce any service, i.e. my laptop can announce that it is the DNS for google.com, etc.

-1 to zeroconf as it is way too chatty. Every DNS lookup is mcast; in a busy network a lot of CPU time is spent discarding requests. Nor does it handle failure that well. It's OK on a home LAN to find a music player, but not what you want for HA infrastructure in the datacentre. Our LAN discovery tool - Anubis - uses mcast only to do the initial discovery; the nodes then have voting and things to select a nominated server that everyone just unicasts to at that point; failure of that node or a network partition triggers a rebinding. See: http://wiki.smartfrog.org/wiki/display/sf/Anubis ; the paper discusses some of the fun you have, though that paper doesn't also cover the clock drift issues you can encounter when running Xen or VMware-hosted nodes.

I want to mention a related topic to the list. People are approaching auto-discovery in a number of ways in the jira. There are a few ways I can think of to discover hadoop. A very simple way might be to publish the configuration over a web interface. I use a network storage system called gluster-fs. Gluster can be configured so the server holds the configuration for each client. If the hadoop namenode held the entire configuration for all the nodes, each node would only need to be aware of the namenode and could retrieve its configuration from it. Having a central configuration management or a discovery system would be very useful. HOD is, I think, the closest thing; it is more of a top-down deployment system.

Allen is a fan of a well-managed cluster; he pushes out Hadoop as RPMs via PXE and Kickstart and uses LDAP as the central CM tool. I am currently exploring bringing up virtual clusters by
* putting the relevant RPMs out to all nodes; same files/conf for every node,
* having custom configs for the Namenode and Job Tracker; everything else becomes a Datanode with a task tracker bound to the masters.
I will start worrying about discovery afterwards, because without the ability for the Job Tracker or Namenode to fail over to a fallback Job Tracker or Namenode, you don't really need so much in the way of dynamic cluster binding.

-steve
Re: Interrupting JobClient.runJob
Edwin wrote: Hi, I am looking for a way to interrupt a thread that entered JobClient.runJob(). The runJob() method keeps polling the JobTracker until the job is completed. After reading the source code, I know that the InterruptedException is caught in runJob(). Thus, I can't interrupt it using a Thread.interrupt() call. Is there any way I can interrupt a polling thread without terminating the job? If terminating the job is the only way to escape, how can I terminate the current job? Thank you very much. Regards, Edwin

Yes, there is no way to stop the client from polling. If you want to stop the client thread, use ctrl+c or kill the client process itself. You can kill a job using the command: bin/hadoop job -kill <jobid>
-Amareshwari
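One hedged workaround, if blocking inside runJob() is the real problem: submit the job with JobClient.submitJob() and drive the polling loop yourself, so your own thread stays interruptible. A sketch (the sleep interval and what to do on interrupt are choices left to the caller, not behaviour of runJob() itself):

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class InterruptiblePoll {
      public static void poll(JobConf conf) throws Exception {
        JobClient client = new JobClient(conf);
        RunningJob job = client.submitJob(conf);   // returns immediately
        try {
          while (!job.isComplete()) {
            Thread.sleep(5000);                    // InterruptedException surfaces here
          }
          System.out.println("job succeeded: " + job.isSuccessful());
        } catch (InterruptedException e) {
          // Interrupted: stop watching but leave the job running on the cluster.
          // Call job.killJob() here instead if the job should be terminated too.
          System.out.println("stopped polling; job left running");
        }
      }
    }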
Interrupting JobClient.runJob
Hi, I am looking for a way to interrupt a thread that entered JobClient.runJob(). The runJob() method keeps polling the JobTracker until the job is completed. After reading the source code, I know that the InterruptedException is caught in runJob(). Thus, I can't interrupt it using a Thread.interrupt() call. Is there any way I can interrupt a polling thread without terminating the job? If terminating the job is the only way to escape, how can I terminate the current job? Thank you very much. Regards, Edwin
Using HDFS for common purpose
Hi, I wanted to ask if HDFS is a good solution as just a distributed DB (no running jobs, only get and put commands). A review says that "HDFS is not designed for low latency", and besides, it's implemented in Java. Do these disadvantages prevent us from using it? Or could somebody suggest a better (faster) one? Thanks in advance.. Rasit