Re: Where are the meta data on HDFS ?

2009-01-24 Thread Peeyush Bishnoi
Hello Tien,

There is a tool in Hadoop DFS called fsck. I hope this will help you and
serve your purpose well.

For example:
$HADOOP_HOME/bin/hadoop fsck <file/directory path> -files -blocks -locations

The above command will display the blocks of a file and the locations
(datanodes) where those blocks are stored. It will also display other
useful information about files and directories.

For more information on fsck, refer to this URL:
http://hadoop.apache.org/core/docs/r0.19.0/hdfs_user_guide.html#fsck 
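
If you need the same information from your own code rather than the command
line, the FileSystem API can report block locations too. A rough sketch (the
path below is just a placeholder):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ShowBlockLocations {
        public static void main(String[] args) throws Exception {
            // Picks up fs.default.name etc. from the Hadoop config on the classpath
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            Path file = new Path("/user/tien/somefile");   // placeholder path
            FileStatus status = fs.getFileStatus(file);

            // One BlockLocation per block, listing the datanodes holding its replicas
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.println("offset=" + b.getOffset()
                        + " length=" + b.getLength()
                        + " hosts=" + java.util.Arrays.toString(b.getHosts()));
            }
        }
    }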


Thanks,
---
Peeyush

On Fri, 2009-01-23 at 15:24 -0800, tienduc_dinh wrote:

 Hi everyone,
 
 I have a question; maybe you can help me.
 
 - how can we get the metadata of a file on HDFS?
 
 For example: if I have a 2 GB file on HDFS, it is split into many chunks
 and these chunks are distributed across many nodes. Is there any trick to
 know which chunks belong to that file?
 
 Any help will be appreciated, thanks a lot.
 
 Tien Duc Dinh


Re: Where are the meta data on HDFS ?

2009-01-24 Thread tienduc_dinh

That's what I needed!

Thank you so much.



Debugging in Hadoop

2009-01-24 Thread patektek
Hello list, I am trying to add some functionality to Hadoop core and I am
having serious trouble debugging it. I have searched the list archive and
still have not been able to resolve the issue.

Simple question:
If I want to insert LOG.info() statements in Hadoop code, isn't it as simple
as modifying the log4j.properties file to include the class that contains the
statements? For example, if I want to print out LOG.info("I am here!")
statements in MapTask.class, I would add the following line to the
log4j.properties file:


# Custom Logging levels
.
.
.
log4j.logger.org.apache.hadoop.mapred.MapTask=INFO

This approach is clearly not working for me.
What am I missing?
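
(For reference, the statements I am inserting follow Hadoop's usual
commons-logging pattern; the class below is only an illustrative stand-in,
not the real MapTask:)

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;

    public class MyMapTaskStandIn {
        // Hadoop classes obtain their logger like this; the logger name is the
        // fully qualified class name, which is what the
        // log4j.logger.<package>.<Class>=INFO line has to match.
        private static final Log LOG = LogFactory.getLog(MyMapTaskStandIn.class);

        public void run() {
            LOG.info("I am here!");
        }
    }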

Thank you,
patektek


Hadoop 0.19 over OS X : dfs error

2009-01-24 Thread nitesh bhatia
Hi
I am trying to set up Hadoop 0.19 on OS X. The current Java version is:

java version 1.6.0_07
Java(TM) SE Runtime Environment (build 1.6.0_07-b06-153)
Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_07-b06-57, mixed mode)

When I try to format the DFS using the bin/hadoop dfs -format command, I get
the following errors:

nMac:hadoop-0.19.0 Aryan$ bin/hadoop dfs -format
Exception in thread main java.lang.UnsupportedClassVersionError: Bad
version number in .class file
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:675)
at
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
at java.net.URLClassLoader.access$100(URLClassLoader.java:56)
at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:316)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:280)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:374)
Exception in thread main java.lang.UnsupportedClassVersionError: Bad
version number in .class file
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:675)
at
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
at java.net.URLClassLoader.access$100(URLClassLoader.java:56)
at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:316)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:280)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:374)


I am not sure why this error occurs; I have the latest Java version. Can
anyone help me out with this?

Thanks
Nitesh

-- 
Nitesh Bhatia
Dhirubhai Ambani Institute of Information & Communication Technology
Gandhinagar
Gujarat

Life is never perfect. It just depends where you draw the line.

visit:
http://www.awaaaz.com - connecting through music
http://www.volstreet.com - lets volunteer for better tomorrow
http://www.instibuzz.com - Voice opinions, Transact easily, Have fun


Re: Hadoop 0.19 over OS X : dfs error

2009-01-24 Thread Craig Macdonald

Hi,

I guess that the java on your PATH is different from the setting of your 
$JAVA_HOME env variable.

Try $JAVA_HOME/bin/java -version?

Also, there is a program called Java Preferences on each system for 
changing the default java version used.
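
If it helps, a tiny stand-alone Java class (just a sketch) will print the
version and home directory of whichever JVM actually runs it:

    // Compile and run this with the same JAVA_HOME that conf/hadoop-env.sh
    // points at, to confirm which runtime is really being picked up.
    public class WhichJava {
        public static void main(String[] args) {
            System.out.println("java.version = " + System.getProperty("java.version"));
            System.out.println("java.home    = " + System.getProperty("java.home"));
        }
    }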


Craig

nitesh bhatia wrote:

Hi
I am trying to setup Hadoop 0.19 on OS X. Current Java Version is

java version 1.6.0_07
Java(TM) SE Runtime Environment (build 1.6.0_07-b06-153)
Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_07-b06-57, mixed mode)

When I am trying to format dfs  using bin/hadoop dfs -format command. I am
getting following errors:

nMac:hadoop-0.19.0 Aryan$ bin/hadoop dfs -format
Exception in thread main java.lang.UnsupportedClassVersionError: Bad
version number in .class file
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:675)
at
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
at java.net.URLClassLoader.access$100(URLClassLoader.java:56)
at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:316)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:280)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:374)
Exception in thread main java.lang.UnsupportedClassVersionError: Bad
version number in .class file
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:675)
at
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
at java.net.URLClassLoader.access$100(URLClassLoader.java:56)
at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:316)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:280)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:374)


I am not sure why this error is coming. I am having latest Java version. Can
anyone help me out with this?

Thanks
Nitesh

  




Re: Hadoop 0.19 over OS X : dfs error

2009-01-24 Thread nitesh bhatia
Hi
My current default setting is Java 1.6:

nMac:hadoop-0.19.0 Aryan$ $JAVA_HOME/bin/java -version
java version 1.6.0_07
Java(TM) SE Runtime Environment (build 1.6.0_07-b06-153)
Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_07-b06-57, mixed mode)


The system is working fine with Hadoop 0.18.2.

--nitesh

On Sun, Jan 25, 2009 at 4:15 AM, Craig Macdonald cra...@dcs.gla.ac.uk wrote:

 Hi,

 I guess that the java on your PATH is different from the setting of your
 $JAVA_HOME env variable.
 Try $JAVA_HOME/bin/java -version?

 Also, there is a program called Java Preferences on each system for
 changing the default java version used.

 Craig


 nitesh bhatia wrote:

 Hi
 I am trying to setup Hadoop 0.19 on OS X. Current Java Version is

 java version 1.6.0_07
 Java(TM) SE Runtime Environment (build 1.6.0_07-b06-153)
 Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_07-b06-57, mixed mode)

 When I am trying to format dfs  using bin/hadoop dfs -format command. I
 am
 getting following errors:

 nMac:hadoop-0.19.0 Aryan$ bin/hadoop dfs -format
 Exception in thread main java.lang.UnsupportedClassVersionError: Bad
 version number in .class file
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:675)
at
 java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
at java.net.URLClassLoader.access$100(URLClassLoader.java:56)
at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:316)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:280)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:374)
 Exception in thread main java.lang.UnsupportedClassVersionError: Bad
 version number in .class file
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:675)
at
 java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
at java.net.URLClassLoader.access$100(URLClassLoader.java:56)
at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:316)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:280)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:374)


 I am not sure why this error is coming. I am having latest Java version.
 Can
 anyone help me out with this?

 Thanks
 Nitesh







-- 
Nitesh Bhatia
Dhirubhai Ambani Institute of Information & Communication Technology
Gandhinagar
Gujarat

Life is never perfect. It just depends where you draw the line.

visit:
http://www.awaaaz.com - connecting through music
http://www.volstreet.com - lets volunteer for better tomorrow
http://www.instibuzz.com - Voice opinions, Transact easily, Have fun


Re: hadoop balancing data

2009-01-24 Thread Billy Pearson

I did not think about that; good points.
I found a way to keep it from happening:
I set dfs.datanode.du.reserved in the config file.
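
For anyone finding this later, the entry in hadoop-site.xml looks roughly
like this (the reserved amount is only an example):

    <property>
      <name>dfs.datanode.du.reserved</name>
      <!-- bytes per volume to leave free for non-DFS use; 10 GB shown only as an example -->
      <value>10737418240</value>
    </property>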


Hairong Kuang hair...@yahoo-inc.com wrote in 
message news:c59f9164.ed09%hair...@yahoo-inc.com...
%Remaining fluctuates much more than %DFS Used. This is because DFS shares
the disks with MapReduce, and MapReduce tasks may use a lot of disk space
temporarily. So trying to keep the same %free is impossible most of the time.

Hairong


On 1/19/09 10:28 PM, Billy Pearson 
sa...@pearsonwholesale.com wrote:



Why do we not use the Remaining % in place of the Used % when we are
selecting a datanode for new data and when running the balancer?
From what I can tell we are using the % used and we do not factor in
non-DFS Used at all.
I see a datanode with only a 60GB hard drive fill up completely to 100%
before the other servers that have 130+GB hard drives get half full.
It seems like trying to keep the same % free on the drives in the cluster
would be more optimal in production.
I know this still may not be perfect, but it would be nice if we tried.

Billy










Re: HDFS - millions of files in one directory?

2009-01-24 Thread Mark Kerzner
Philip,

it seems like you went through the same problems as I did, and confirmed my
feeling that this is not a trivial problem. My first idea was to balance the
directory tree somehow and to store the remaining metadata elsewhere, but as
you say, it has limitations. I could use a solution like your specific one,
but I am just surprised that this problem does not have a well-known
solution, or solutions. Again, how do Google or Yahoo store the files that
they have crawled? The MapReduce paper says that they store them all first,
that is, a few billion pages. How do they do it?

Raghu,

if I write all files only once, is the cost the same in one directory, or do
I need to find the optimal directory size and, when it is full, start another
bucket?

Thank you,
Mark

On Fri, Jan 23, 2009 at 11:01 PM, Philip (flip) Kromer
f...@infochimps.org wrote:

 I ran in this problem, hard, and I can vouch that this is not a
 windows-only
 problem. ReiserFS, ext3 and OSX's HFS+ become cripplingly slow with more
 than a few hundred thousand files in the same directory. (The operation to
 correct this mistake took a week to run.)  That is one of several hard
 lessons I learned about don't write your scraper to replicate the path
 structure of each document as a file on disk.

 Cascading the directory structure works, but sucks in various other ways,
 and itself stops scaling after a while.  What I eventually realized is that
 I was using the filesystem as a particularly wrongheaded document database,
 and that the metadata delivery of a filesystem just doesn't work for this.

 Since in our application the files are text and are immutable, our adhoc
 solution is to encode and serialize each file with all its metadata, one
 per
 line, into a flat file.

 A distributed database is probably the correct answer, but this is working
 quite well for now and even has some advantages. (No-cost replication from
 work to home or offline by rsync or thumb drive, for example.)

 flip

 On Fri, Jan 23, 2009 at 5:49 PM, Raghu Angadi rang...@yahoo-inc.com
 wrote:

  Mark Kerzner wrote:
 
  But it would seem then that making a balanced directory tree would not
  help
  either - because there would be another binary search, correct? I
 assume,
  either way it would be as fast as can be :)
 
 
  But the cost of memory copies would be much less with a tree (when you
 add
  and delete files).
 
  Raghu.
 
 
 
 
  On Fri, Jan 23, 2009 at 5:08 PM, Raghu Angadi rang...@yahoo-inc.com
  wrote:
 
   If you are adding and deleting files in the directory, you might notice
  CPU
  penalty (for many loads, higher CPU on NN is not an issue). This is
  mainly
  because HDFS does a binary search on files in a directory each time it
  inserts a new file.
 
  If the directory is relatively idle, then there is no penalty.
 
  Raghu.
 
 
  Mark Kerzner wrote:
 
   Hi,
 
  there is a performance penalty in Windows (pardon the expression) if
 you
  put
  too many files in the same directory. The OS becomes very slow, stops
  seeing
  them, and lies about their status to my Java requests. I do not know
 if
  this
  is also a problem in Linux, but in HDFS - do I need to balance a
  directory
  tree if I want to store millions of files, or can I put them all in
 the
  same
  directory?
 
  Thank you,
  Mark
 
 
 
 
 


 --
 http://www.infochimps.org
 Connected Open Free Data



What happens in HDFS DataNode recovery?

2009-01-24 Thread C G
Hi All:

I elected to take a node out of one of our grids for service.  Naturally HDFS 
recognized the loss of the DataNode and did the right stuff, fixing replication 
issues and ultimately delivering a clean file system.

So now the node I removed is ready to go back in service.  When I return it to 
service a bunch of files will suddenly have a replication of 4 instead of 3.  
My questions:

1.  Will HDFS delete a copy of the data to bring replication back to 3?
2.  If (1) above is  yes, will it remove the copy by deleting from other nodes, 
or will it remove files from the returned node, or both?

The motivation for asking these questions is that I have a file system which is 
extremely unbalanced - we recently doubled the size of the grid with a few 
dozen terabytes already stored on the existing nodes.  I am wondering if an 
easy way to restore some sense of balance is to cycle through the old nodes, 
removing each one from service for several hours and then returning it to service.

Thoughts?

Thanks in Advance, 
C G




  



Job failed when writing a huge file

2009-01-24 Thread tienduc_dinh

Hi everyone,

I'm now using Hadoop 0.18.0 with 1 NameNode and 4 DataNodes. When writing a
file bigger than the maximum free space of any single DataNode, the job often
fails.

I've seen that the file is mostly written to only one node (e.g. N1), and if
this node doesn't have enough space, Hadoop deletes the old chunks on node
N1, tries another node (e.g. N2), and so on. The job fails when the maximum
number of retries is reached.

(I don't use the script start-balancer.sh or something like that for
balancing my cluster in this test.)

Sometimes it works, once Hadoop has actually spread the file across the data
nodes.

I think it's not so good that Hadoop writes (and deletes) the whole huge
file again and again instead of spreading it. 

So my question is: how does the write algorithm work, and where can I find
such information?

Any help is appreciated, thanks a lot.

Tien Duc Dinh





Re: HDFS - millions of files in one directory?

2009-01-24 Thread Philip (flip) Kromer
I think that Google developed BigTable (http://en.wikipedia.org/wiki/BigTable)
to solve this; Hadoop's HBase, or any of the myriad other distributed/document
databases, should work depending on need:
http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores/
  http://www.mail-archive.com/core-user@hadoop.apache.org/msg07011.html

Heritrix (http://en.wikipedia.org/wiki/Heritrix), Nutch
(http://en.wikipedia.org/wiki/Nutch), and others use the ARC file format:
  http://www.archive.org/web/researcher/ArcFileFormat.php
  http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml
These of course are industrial strength tools (and many of their authors are
here in the room with us :) The only question with those tools is whether
their might exceeds your needs.

There's some oddball project out there that does peer-to-peer something
something scraping but I can't find it anywhere in my bookmarks. I don't
recall whether they're file-backed or DB-backed.

If you, like us, want something more modest and targeted, there is the
recently released Python toolkit:
  http://lucasmanual.com/mywiki/DataHub
I haven't looked at it to see if they've used it at scale.

We infochimps are working right now to clean up and organize for initial
release our own Infinite Monkeywrench, a homely but effective toolkit for
gathering and munging datasets.  (Those stupid little one-off scripts you
write and throw away? A Ruby toolkit to make them slightly less annoying.)
We frequently use it for directed scraping of APIs and websites.  If you're
willing to deal with pre-release code that's never strayed far from the
machines of the guys what wrote it I can point you to what we have.

I think I was probably too tough on bundling into files. If things are
immutable, and only treated in bulk, and are easily and reversibly
serialized, bundling many documents into a file is probably good. As I said,
our toolkit uses flat text files, with the advantages of simplicity and the
downside of ad hoc-ness. Storing into the ARC format lets you use the tools
in the Big Scraper ecosystem, but obvs. you'd need to convert out to use
with other things, possibly returning you to this same question.

If you need to grab arbitrary subsets of the data, and the one set of
locality tradeoffs is better than the other set of locality tradeoffs, or
you need better metadata management than bundled-into-file gives you then I
think that's why those distributed/document-type databases got invented.
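
As one concrete (and purely illustrative) way to bundle many small documents
into a single HDFS file, a Hadoop SequenceFile keyed by document ID works much
like the flat-file approach above. A rough sketch with placeholder paths and
data, not our actual toolkit code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class BundleDocs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // One big container file instead of millions of tiny ones (placeholder path)
            Path out = new Path("/data/docs.seq");
            SequenceFile.Writer writer =
                SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
            try {
                // In practice you would loop over your scraped documents:
                // key = document id or URL, value = serialized document plus metadata
                writer.append(new Text("doc-00001"), new Text("...document body..."));
                writer.append(new Text("doc-00002"), new Text("...document body..."));
            } finally {
                writer.close();
            }
        }
    }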

flip

On Sat, Jan 24, 2009 at 7:21 PM, Mark Kerzner markkerz...@gmail.com wrote:

 Philip,

 it seems like you went through the same problems as I did, and confirmed my
 feeling that this is not a trivial problem. My first idea was to balance
 the
 directory tree somehow and to store the remaining metadata elsewhere, but
 as
 you say, it has limitations. I could use some solution like your specific
 one, but I am only surprised that this problem does not have a well-known
 solution, or solutions. Again, how does Google or Yahoo store the files
 that
 they have crawled? MapReduce paper says that they store them all first,
 that
 is a few billion pages. How do they do it?

 Raghu,

 if I write all files only one, is the cost the same in one directory or do
 I
 need to find the optimal directory size and when full start another
 bucket?

 Thank you,
 Mark

 On Fri, Jan 23, 2009 at 11:01 PM, Philip (flip) Kromer
 f...@infochimps.orgwrote:

  I ran in this problem, hard, and I can vouch that this is not a
  windows-only
  problem. ReiserFS, ext3 and OSX's HFS+ become cripplingly slow with more
  than a few hundred thousand files in the same directory. (The operation
 to
  correct this mistake took a week to run.)  That is one of several hard
  lessons I learned about don't write your scraper to replicate the path
  structure of each document as a file on disk.
 
  Cascading the directory structure works, but sucks in various other ways,
  and itself stops scaling after a while.  What I eventually realized is
 that
  I was using the filesystem as a particularly wrongheaded document
 database,
  and that the metadata delivery of a filesystem just doesn't work for
 this.
 
  Since in our application the files are text and are immutable, our adhoc
  solution is to encode and serialize each file with all its metadata, one
  per
  line, into a flat file.
 
  A distributed database is probably the correct answer, but this is
 working
  quite well for now and even has some advantages. (No-cost replication
 from
  work to home or offline by rsync or thumb drive, for example.)
 
  flip
 
  On Fri, Jan 23, 2009 at 5:49 PM, Raghu Angadi rang...@yahoo-inc.com
  wrote:
 
   Mark Kerzner wrote:
  
   But it would seem then that making a balanced directory tree would not
   help
   either - because there would be another binary search, correct? I
  assume,
   either way it would be as fast as can be :)
  
  
   But the cost of memory copies would be much less with a tree (when you
  

Re: What happens in HDFS DataNode recovery?

2009-01-24 Thread jason hadoop
The blocks will be invalidated on the returned-to-service datanode.
If you want to save your namenode and network a lot of work, wipe the HDFS
block storage directory before returning the DataNode to service.
dfs.data.dir will be the directory; most likely the value is
${hadoop.tmp.dir}/dfs/data.
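
If you would rather keep block storage somewhere explicit instead of under
hadoop.tmp.dir, it can be set in hadoop-site.xml, roughly like this (the
paths are only examples):

    <property>
      <name>dfs.data.dir</name>
      <!-- comma-separated list of local directories used for block storage; example paths -->
      <value>/disk1/dfs/data,/disk2/dfs/data</value>
    </property>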

Jason - Ex Attributor

On Sat, Jan 24, 2009 at 6:19 PM, C G parallel...@yahoo.com wrote:

 Hi All:

 I elected to take a node out of one of our grids for service.  Naturally
 HDFS recognized the loss of the DataNode and did the right stuff, fixing
 replication issues and ultimately delivering a clean file system.

 So now the node I removed is ready to go back in service.  When I return it
 to service a bunch of files will suddenly have a replication of 4 instead of
 3.  My questions:

 1.  Will HDFS delete a copy of the data to bring replication back to 3?
 2.  If (1) above is  yes, will it remove the copy by deleting from other
 nodes, or will it remove files from the returned node, or both?

 The motivation for asking the questions are that I have a file system which
 is extremely unbalanced - we recently doubled the size of the grid when a
 few dozen terabytes already stored on the existing nodes.  I am wondering if
 an easy way to restore some sense of balance is to cycle through the old
 nodes, removing each one from service for several hours and then return it
 to service.

 Thoughts?

 Thanks in Advance,
 C G