Re: Where is the metadata on HDFS?
Hello Tien, there is a tool in Hadoop DFS for this: fsck. I hope it serves your purpose well. For example:

$HADOOP_HOME/bin/hadoop fsck <file or directory path> -files -blocks -locations

This will display the blocks of a file and the locations where those blocks are stored, along with other useful information about files and directories. For more information on fsck, refer to this URL: http://hadoop.apache.org/core/docs/r0.19.0/hdfs_user_guide.html#fsck

Thanks,
--- Peeyush

On Fri, 2009-01-23 at 15:24 -0800, tienduc_dinh wrote:

Hi everyone, I have a question, maybe you can help me. How can we get the metadata of a file on HDFS? For example: if I have a 2 GB file on HDFS, it is split into many chunks and these chunks are distributed over many nodes. Is there any trick to know which chunks belong to that file? Any help will be appreciated, thanks a lot. Tien Duc Dinh
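For instance, with a hypothetical file (the path here is made up), the invocation looks like this:

$HADOOP_HOME/bin/hadoop fsck /user/tien/bigfile.dat -files -blocks -locations

fsck then prints one line per block, giving the block ID, its length, the replication count, and the datanode addresses holding each replica, followed by summary statistics (total blocks, average replication, and so on).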
Re: Where is the metadata on HDFS?
That's what I needed! Thank you so much.
Debugging in Hadoop
Hello list, I am trying to add some functionality to Hadoop core and I am having serious trouble debugging it. I have searched the list archive and still have not been able to resolve the issue. Simple question: if I want to insert LOG.info() statements in Hadoop code, is it not as simple as modifying the log4j.properties file to include the class that contains the statements? For example, if I want the LOG.info("I am here!") statements in the MapTask class to be printed, I would add the following line to the log4j.properties file:

# Custom logging levels
...
log4j.logger.org.apache.hadoop.mapred.MapTask=INFO

This approach is clearly not working for me. What am I missing? Thank you, patektek
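For what it's worth, a sketch of the logging pattern Hadoop's own classes use, via commons-logging (this is the shape of the code, not the verbatim MapTask source):

import org.apache.commons.logging.Log;
import org.apache.commons.logging.LogFactory;

// Sketch of the pattern inside org.apache.hadoop.mapred.MapTask; the
// log4j category is the fully-qualified class name, which is what the
// log4j.logger.org.apache.hadoop.mapred.MapTask=INFO line matches.
public class MapTask /* extends Task ... */ {
  private static final Log LOG = LogFactory.getLog(MapTask.class);

  void example() {
    LOG.info("I am here!");
  }
}

Note that Hadoop's stock conf/log4j.properties already sets the root logger to INFO, so the extra line is usually redundant. The more likely issue is where the output lands: map and reduce tasks run in child JVMs, and their log output goes to the per-task files under ${HADOOP_LOG_DIR}/userlogs/<task-id>/ on the tasktracker node, not to the console or the daemon .log files. Also check that the rebuilt hadoop core jar is actually the one deployed on the cluster.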
Hadoop 0.19 over OS X : dfs error
Hi, I am trying to set up Hadoop 0.19 on OS X. The current Java version is:

java version "1.6.0_07"
Java(TM) SE Runtime Environment (build 1.6.0_07-b06-153)
Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_07-b06-57, mixed mode)

When I try to format dfs using the bin/hadoop dfs -format command, I get the following errors:

nMac:hadoop-0.19.0 Aryan$ bin/hadoop dfs -format
Exception in thread "main" java.lang.UnsupportedClassVersionError: Bad version number in .class file
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:675)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
        at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
        at java.net.URLClassLoader.access$100(URLClassLoader.java:56)
        at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:316)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:280)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:374)
[the identical exception and stack trace are printed a second time]

I am not sure why this error occurs, since I have the latest Java version. Can anyone help me out with this?

Thanks
Nitesh

--
Nitesh Bhatia
Dhirubhai Ambani Institute of Information and Communication Technology
Gandhinagar, Gujarat

Life is never perfect. It just depends where you draw the line.

visit:
http://www.awaaaz.com - connecting through music
http://www.volstreet.com - lets volunteer for better tomorrow
http://www.instibuzz.com - Voice opinions, Transact easily, Have fun
Re: Hadoop 0.19 over OS X : dfs error
Hi, I guess that the java on your PATH is different from the setting of your $JAVA_HOME env variable. Try $JAVA_HOME/bin/java -version? Also, there is a program called Java Preferences on each system for changing the default java version used.

Craig

nitesh bhatia wrote: [...]
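If the PATH java does turn out to be 1.5 while $JAVA_HOME points at 1.6, the usual fix on OS X of this era is to pin the JVM in Hadoop's environment file; a sketch (the framework path is the standard Apple location for Java 6, but verify it on your machine):

# conf/hadoop-env.sh -- pin the JVM that the hadoop script launches:
export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home

# sanity checks from the shell:
$JAVA_HOME/bin/java -version   # the JVM hadoop-env.sh points at
java -version                  # the JVM on your PATH

Hadoop 0.19 requires Java 6 (earlier releases still ran on Java 5), so a 1.5 JVM anywhere in the launch path produces exactly this "Bad version number in .class file" error.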
Re: Hadoop 0.19 over OS X : dfs error
Hi, my current default settings are for Java 1.6:

nMac:hadoop-0.19.0 Aryan$ $JAVA_HOME/bin/java -version
java version "1.6.0_07"
Java(TM) SE Runtime Environment (build 1.6.0_07-b06-153)
Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_07-b06-57, mixed mode)

The system is working fine with Hadoop 0.18.2.

--nitesh

On Sun, Jan 25, 2009 at 4:15 AM, Craig Macdonald cra...@dcs.gla.ac.uk wrote: [...]
Re: hadoop balancing data
I did not think about that. Good points. I found a way to keep it from happening: I set dfs.datanode.du.reserved in the config file.

Hairong Kuang hair...@yahoo-inc.com wrote in message news:c59f9164.ed09%hair...@yahoo-inc.com...

%Remaining fluctuates much more than %DFS-used. This is because DFS shares the disks with mapred, and mapred tasks may temporarily use a lot of disk space. So trying to keep the same %free is impossible most of the time. Hairong

On 1/19/09 10:28 PM, Billy Pearson sa...@pearsonwholesale.com wrote:

Why do we not use the Remaining % in place of the Used % when we select a datanode for new data and when running the balancer? From what I can tell we use the %used and do not factor in non-DFS used at all. I see a datanode with only a 60GB hard drive fill up completely (100%) before the other servers that have 130+GB hard drives get half full. Seems like trying to keep the same % free on the drives in the cluster would be more optimal in production. I know this still may not be perfect, but it would be nice if we tried. Billy
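For reference, dfs.datanode.du.reserved takes a byte count of space to leave free for non-DFS use on each volume; a sketch of the hadoop-site.xml entry (the 10 GB figure is just an example):

<property>
  <name>dfs.datanode.du.reserved</name>
  <!-- reserve ~10 GB per volume for non-DFS use; tune to your disks -->
  <value>10737418240</value>
</property>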
Re: HDFS - millions of files in one directory?
Philip, it seems like you went through the same problems as I did, and you confirmed my feeling that this is not a trivial problem. My first idea was to balance the directory tree somehow and to store the remaining metadata elsewhere, but as you say, that has limitations. I could use a solution like your specific one, but I am surprised that this problem does not have a well-known solution, or solutions. Again, how do Google or Yahoo store the files that they have crawled? The MapReduce paper says that they store them all first - that is, a few billion pages. How do they do it?

Raghu, if I write all files only once, is the cost the same in one directory, or do I need to find the optimal directory size and start another bucket when one is full?

Thank you, Mark

On Fri, Jan 23, 2009 at 11:01 PM, Philip (flip) Kromer f...@infochimps.org wrote:

I ran into this problem, hard, and I can vouch that this is not a Windows-only problem. ReiserFS, ext3 and OS X's HFS+ become cripplingly slow with more than a few hundred thousand files in the same directory. (The operation to correct this mistake took a week to run.) That is one of several hard lessons I learned about "don't write your scraper to replicate the path structure of each document as a file on disk." Cascading the directory structure works, but sucks in various other ways, and itself stops scaling after a while. What I eventually realized is that I was using the filesystem as a particularly wrongheaded document database, and that the metadata delivery of a filesystem just doesn't work for this. Since in our application the files are text and are immutable, our ad hoc solution is to encode and serialize each file with all its metadata, one per line, into a flat file. A distributed database is probably the correct answer, but this is working quite well for now and even has some advantages. (No-cost replication from work to home or offline by rsync or thumb drive, for example.) flip

On Fri, Jan 23, 2009 at 5:49 PM, Raghu Angadi rang...@yahoo-inc.com wrote:

Mark Kerzner wrote: But it would seem then that making a balanced directory tree would not help either, because there would be another binary search, correct? I assume either way it would be as fast as can be :)

But the cost of memory copies would be much less with a tree (when you add and delete files). Raghu.

On Fri, Jan 23, 2009 at 5:08 PM, Raghu Angadi rang...@yahoo-inc.com wrote:

If you are adding and deleting files in the directory, you might notice a CPU penalty (for many loads, higher CPU on the NN is not an issue). This is mainly because HDFS does a binary search on the files in a directory each time it inserts a new file. If the directory is relatively idle, then there is no penalty. Raghu.

Mark Kerzner wrote: Hi, there is a performance penalty in Windows (pardon the expression) if you put too many files in the same directory. The OS becomes very slow, stops seeing them, and lies about their status to my Java requests. I do not know if this is also a problem in Linux, but in HDFS - do I need to balance a directory tree if I want to store millions of files, or can I put them all in the same directory? Thank you, Mark

--
http://www.infochimps.org
Connected Open Free Data
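A Hadoop-native variant of flip's flat-file bundling is to pack the documents into a SequenceFile keyed by their original path; a minimal sketch against the 0.18-era API (the paths and contents are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class BundleDocs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/archive/docs.seq");   // hypothetical output file
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
    try {
      // key: original document path; value: document body (plus any metadata)
      writer.append(new Text("/crawl/page-00001.html"),
                    new Text("...document contents..."));
    } finally {
      writer.close();
    }
  }
}

This keeps millions of documents out of the namespace entirely: the NameNode sees one large file, and MapReduce jobs can still split and stream it.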
What happens in HDFS DataNode recovery?
Hi All: I elected to take a node out of one of our grids for service. Naturally, HDFS recognized the loss of the DataNode and did the right stuff, fixing replication and ultimately delivering a clean file system. So now the node I removed is ready to go back into service. When I return it, a bunch of files will suddenly have a replication of 4 instead of 3. My questions: 1. Will HDFS delete a copy of the data to bring replication back to 3? 2. If (1) is yes, will it remove the copy by deleting from other nodes, from the returned node, or both? The motivation for asking is that I have a file system which is extremely unbalanced - we recently doubled the size of the grid, with a few dozen terabytes already stored on the existing nodes. I am wondering if an easy way to restore some sense of balance is to cycle through the old nodes, removing each one from service for several hours and then returning it to service. Thoughts? Thanks in Advance, C G
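For reference, the controlled way to take nodes out is HDFS decommissioning, which re-replicates blocks before the node drops away, and the balancer addresses the imbalance directly; a sketch for a 0.19-era cluster (file paths and the threshold are examples):

# hadoop-site.xml on the namenode:
<property>
  <name>dfs.hosts.exclude</name>
  <value>/path/to/conf/excludes</value>
</property>

# add the hostname of the node to retire to the excludes file, then:
bin/hadoop dfsadmin -refreshNodes

# the balancer moves blocks until every node's utilization is within
# the given percentage of the cluster average:
bin/start-balancer.sh -threshold 10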
Job failed when writing a huge file
Hi everyone, I'm now using Hadoop 0.18.0 with 1 NameNode and 4 data nodes. When writing a file bigger than the maximal free space of a single data node, the job often fails. I've seen that the file is mostly written to only one node (e.g. N1), and if this node doesn't have enough space, Hadoop deletes the old chunks on node N1, tries another node (e.g. N2), and so on. The job fails when the maximal number of retries is reached. (I don't use the start-balancer.sh script or anything like that to balance my cluster in this test.) Sometimes it works, after Hadoop has really spread the file across the data nodes. I think it's not so good that Hadoop writes (and deletes) the whole huge file again and again instead of spreading it. So my question is: how does the write algorithm work, and where can I find such information? Any help is appreciated, thanks a lot. Tien Duc Dinh
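The behavior described is consistent with HDFS's block placement policy: when the client runs on a datanode, the first replica of every block is written to that local node, so a single writer can fill one machine while the rest of the cluster stays half empty. One workaround sketch (the paths are made up) is to run the upload from a machine that is not a datanode, so first replicas are scattered across the cluster, and then to inspect the result with fsck:

# run from a box that is not a datanode, e.g. a client or namenode-only host:
bin/hadoop dfs -put /local/huge.dat /user/tien/huge.dat

# then check where the blocks actually landed:
bin/hadoop fsck /user/tien/huge.dat -files -blocks -locations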
Re: HDFS - millions of files in one directory?
I think that Google developed BigTable (http://en.wikipedia.org/wiki/BigTable) to solve this; Hadoop's HBase, or any of the myriad other distributed/document databases should work, depending on need: http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/ http://www.mail-archive.com/core-user@hadoop.apache.org/msg07011.html

Heritrix (http://en.wikipedia.org/wiki/Heritrix), Nutch (http://en.wikipedia.org/wiki/Nutch), and others use the ARC file format (http://www.archive.org/web/researcher/ArcFileFormat.php, http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml). These of course are industrial strength tools (and many of their authors are here in the room with us :) The only question with those tools is whether their might exceeds your needs. There's some oddball project out there that does peer-to-peer something something scraping, but I can't find it anywhere in my bookmarks. I don't recall whether they're file-backed or DB-backed.

If you, like us, want something more modest and targeted, there is the recently-released Python toolkit at http://lucasmanual.com/mywiki/DataHub - I haven't looked at it to see if they've used it at scale. We infochimps are working right now to clean up and organize for initial release our own Infinite Monkeywrench, a homely but effective toolkit for gathering and munging datasets. (Those stupid little one-off scripts you write and throw away? A Ruby toolkit to make them slightly less annoying.) We frequently use it for directed scraping of APIs and websites. If you're willing to deal with pre-release code that's never strayed far from the machines of the guys what wrote it, I can point you to what we have.

I think I was probably too tough on bundling into files. If things are immutable, and only treated in bulk, and are easily and reversibly serialized, bundling many documents into a file is probably good. As I said, our toolkit uses flat text files, with the advantage of simplicity and the downside of ad hoc-ness. Storing into the ARC format lets you use the tools in the Big Scraper ecosystem, but obvs. you'd need to convert out to use with other things, possibly returning you to this same question. If you need to grab arbitrary subsets of the data, or one set of locality tradeoffs is better than the other, or you need better metadata management than bundling-into-files gives you - that, I think, is why those distributed/document-type databases got invented.

flip

On Sat, Jan 24, 2009 at 7:21 PM, Mark Kerzner markkerz...@gmail.com wrote: [...]
Re: What happens in HDFS DataNode recovery?
The blocks will be invalidated on the datanode returned to service. If you want to save your namenode and network a lot of work, wipe the HDFS block storage directory before returning the DataNode to service. dfs.data.dir will be the directory; most likely the value is ${hadoop.tmp.dir}/dfs/data.

Jason - Ex Attributor

On Sat, Jan 24, 2009 at 6:19 PM, C G parallel...@yahoo.com wrote: [...]
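A sketch of that cleanup on the retired node, assuming the default layout Jason mentions (verify dfs.data.dir in your own config before deleting anything):

# on the returned node, with the datanode process stopped:
rm -rf /tmp/hadoop-${USER}/dfs/data   # i.e. the default ${hadoop.tmp.dir}/dfs/data

After the node rejoins, the namenode simply treats it as an empty datanode and fills it through normal writes and re-replication.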