Re: Debugging in Hadoop

2009-01-26 Thread Amareshwari Sriramadasu
patektek wrote: Hello list, I am trying to add some functionality to Hadoop-core and I am having serious issues debugging it. I have searched the list archive and still have not been able to resolve the issues. Simple question: If I want to insert "LOG.info()" statements in Hadoop code is not
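For reference, a minimal sketch of the Commons Logging pattern that 0.19-era Hadoop-core classes typically use; the class and method names below are hypothetical, and the messages only appear if conf/log4j.properties enables INFO for this logger:

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;

    public class MyTaskComponent {
      // Hadoop convention: one static Log per class, keyed by class name.
      private static final Log LOG = LogFactory.getLog(MyTaskComponent.class);

      void doWork() {
        LOG.info("doWork() entered");   // visible at INFO level and below
        if (LOG.isDebugEnabled()) {
          LOG.debug("detail that is cheap to skip when DEBUG is off");
        }
      }
    }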

Re: Netbeans/Eclipse plugin

2009-01-26 Thread Amit k. Saha
On Tue, Jan 27, 2009 at 2:52 AM, Aaron Kimball wrote: > The Eclipse plugin (which, btw, is now part of Hadoop core in src/contrib/) > currently is inoperable. The DFS viewer works, but the job submission code > is broken. I have started conversation with 3 other community members to work on the N

Re: Zeroconf for hadoop

2009-01-26 Thread Vadim Zaliva
On Mon, Jan 26, 2009 at 11:22, Edward Capriolo wrote: > Zeroconf is more focused on simplicity than security. One of the > original problems that may have been fixed is that any program can > announce any service, e.g. my laptop can announce that it is the DNS for > google.com etc. I see two distin

DBOutputFormat and auto-generated keys

2009-01-26 Thread Vadim Zaliva
Is it possible to obtain auto-generated IDs when writing data using DBOutputFormat? For example, is it possible to write a Mapper which stores records in a DB and returns the auto-generated IDs of those records? Let me explain what I am trying to achieve: I have data like this which I would like to st
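DBOutputFormat (as of 0.19) writes through batched PreparedStatements and exposes no hook for reading generated keys back, so one workaround is to talk JDBC directly from the map task. A sketch under that assumption; the table, column, and class names here are hypothetical:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class InsertAndGetId {
      // Inserts one row and returns the database-assigned key.
      public static long insertRecord(Connection conn, String name)
          throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
            "INSERT INTO records (name) VALUES (?)",
            Statement.RETURN_GENERATED_KEYS);  // ask the driver for the new key
        try {
          ps.setString(1, name);
          ps.executeUpdate();
          ResultSet keys = ps.getGeneratedKeys();
          keys.next();
          return keys.getLong(1);              // the auto-generated ID
        } finally {
          ps.close();
        }
      }
    }

The mapper could then emit that returned ID as its output value.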

files are inaccessible after HDFS upgrade from 0.18.1 to 0.19.0

2009-01-26 Thread Yuanyuan Tian
Hi, I just upgraded Hadoop from 0.18.1 to 0.19.0 following the instructions on http://wiki.apache.org/hadoop/Hadoop_Upgrade. After the upgrade, I ran fsck and everything seems fine. All the files can be listed in HDFS and the sizes are also correct. But when a MapReduce job tries to read the files as i

Re: Mapred job parallelism

2009-01-26 Thread Aaron Kimball
Indeed, you will need to enable the Fair Scheduler or Capacity Scheduler (which are both in 0.19) to do this. mapred.map.tasks is more a hint than anything else -- if you have more files to map than you set this value to, it will use more tasks than you configured the job to use. The newer schedulers w

Re: Mapred job parallelism

2009-01-26 Thread jason hadoop
I believe that the scheduler code in 0.19.0 has a framework for this, but I haven't dug into it in detail yet. http://hadoop.apache.org/core/docs/r0.19.0/capacity_scheduler.html From what I gather, you would set up 2 queues, each with guaranteed access to 1/2 of the cluster. Then you submit your jo
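A sketch of the submission side under that setup; the queue names "q1"/"q2" are hypothetical and must match whatever queues the cluster's capacity-scheduler configuration declares:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class SubmitToQueue {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitToQueue.class);
        conf.setJobName("job-for-q1");
        conf.set("mapred.job.queue.name", "q1"); // the second job would say "q2"
        // ... set mapper, reducer, and input/output paths here ...
        JobClient jc = new JobClient(conf);
        RunningJob job = jc.submitJob(conf);     // non-blocking, so both jobs
        System.out.println("Submitted " + job.getID()); // can run side by side
      }
    }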

Mapred job parallelism

2009-01-26 Thread Sagar Naik
Hi Guys, I was trying to set up a cluster so that two jobs can run simultaneously. The conf: number of nodes: 4 (say); mapred.tasktracker.map.tasks.maximum=2; and in the JobClient, mapred.map.tasks=4 (# of nodes). I also have a condition that each job should have only one map-task per node

Re: Netbeans/Eclipse plugin

2009-01-26 Thread Aaron Kimball
The Eclipse plugin (which, btw, is now part of Hadoop core in src/contrib/) currently is inoperable. The DFS viewer works, but the job submission code is broken. - Aaron On Sun, Jan 25, 2009 at 9:07 PM, Amit k. Saha wrote: > On Sun, Jan 25, 2009 at 9:32 PM, Edward Capriolo > wrote: > > On Sun,

Re: HDFS - millions of files in one directory?

2009-01-26 Thread Mark Kerzner
Jason, this is awesome, thank you. By the way, is there a book or manual with "best practices"? On Mon, Jan 26, 2009 at 3:13 PM, jason hadoop wrote: > Sequence files rock, and you can use the > bin/hadoop dfs -text FILENAME command-line tool to get a toString-level > unpacking of the sequenc

Re: What happens in HDFS DataNode recovery?

2009-01-26 Thread Aaron Kimball
Also, see the balancer tool that comes with Hadoop. This background process should be run periodically (every week or so?) to make sure that data's evenly distributed. http://hadoop.apache.org/core/docs/r0.19.0/hdfs_user_guide.html#Rebalancer - Aaron On Sat, Jan 24, 2009 at 7:40 PM, jason hadoop

Re: HDFS - millions of files in one directory?

2009-01-26 Thread jason hadoop
Sequence files rock, and you can use the bin/hadoop dfs -text FILENAME command-line tool to get a toString-level unpacking of the sequence file key/value pairs. If you provide your own key or value classes, you will need to implement a toString method to get some use out of this. Also, your cla
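For instance, a custom value class along these lines (the fields are hypothetical) is what makes the -text output readable:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Writable;

    public class DocRecord implements Writable {
      private long id;
      private String body;

      public void write(DataOutput out) throws IOException {
        out.writeLong(id);
        out.writeUTF(body);
      }

      public void readFields(DataInput in) throws IOException {
        id = in.readLong();
        body = in.readUTF();
      }

      // This is the string that bin/hadoop dfs -text prints for each value.
      public String toString() {
        return id + "\t" + body;
      }
    }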

Re: HDFS - millions of files in one directory?

2009-01-26 Thread Mark Kerzner
Thank you, Doug, then all is clear in my head. Mark On Mon, Jan 26, 2009 at 3:05 PM, Doug Cutting wrote: > Mark Kerzner wrote: > >> Okay, I am convinced. I only noticed that Doug, the originator, was not >> happy about it - but in open source one has to give up control sometimes. >> > > I think

Re: HDFS - millions of files in one directory?

2009-01-26 Thread Doug Cutting
Mark Kerzner wrote: Okay, I am convinced. I only noticed that Doug, the originator, was not happy about it - but in open source one has to give up control sometimes. I think perhaps you misunderstood my remarks. My point was that, if you looked to Nutch's Content class for an example, it is,

Re: HDFS - millions of files in one directory?

2009-01-26 Thread Mark Kerzner
Okay, I am convinced. I only noticed that Doug, the originator, was not happy about it - but in open source one has to give up control sometimes. Thank you, Mark On Mon, Jan 26, 2009 at 2:36 PM, Andy Liu wrote: > SequenceFile supports transparent block-level compression out of the box, > so > yo

Re: HDFS - millions of files in one directory?

2009-01-26 Thread Andy Liu
SequenceFile supports transparent block-level compression out of the box, so you don't have to compress data in your code. Most of the time, compression not only saves disk space but improves performance, because there's less data to write. Andy On Mon, Jan 26, 2009 at 12:35 PM, Mark Kerzner wrote:
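A sketch of the writing side using the 0.19-era createWriter overload; the path and key/value types are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class WriteCompressed {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path("/data/files.seq"),
            Text.class, BytesWritable.class,
            SequenceFile.CompressionType.BLOCK); // transparent block compression
        try {
          byte[] contents = "file contents here".getBytes();
          writer.append(new Text("doc-0001"), new BytesWritable(contents));
        } finally {
          writer.close();
        }
      }
    }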

Re: Hadoop 0.19 over OS X : dfs error

2009-01-26 Thread nitesh bhatia
Well, it's strange: although I changed the default Java environment to 64-bit Java 6, my /Library/Java/Home was still pointing to Java 5. So in conf/hadoop-env.sh I changed JAVA_HOME to the actual path, i.e. /System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home. It's working now. O

Re: Hadoop 0.19 over OS X : dfs error

2009-01-26 Thread Raghu Angadi
nitesh bhatia wrote: Thanks. It worked. :) In hadoop-env.sh it's required to write the exact path to the Java framework. I changed it to export JAVA_HOME=/System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home and it started. In Hadoop 0.18.2, export JAVA_HOME=/Library/Java/Home is working fine.

Re: Zeroconf for hadoop

2009-01-26 Thread Raghu Angadi
nitesh bhatia wrote: Hi, Apple provides an open-source discovery service called Bonjour (Zeroconf). Is it possible to integrate Zeroconf with Hadoop so that discovery of nodes becomes automatic? Presently, for setting up a multi-node cluster we need to add IPs manually. Integrating it with Bonjour can ma

Re: Zeroconf for hadoop

2009-01-26 Thread Edward Capriolo
Zeroconf is more focused on simplicity than security. One of the original problems that may have been fixed is that any program can announce any service, e.g. my laptop can announce that it is the DNS for google.com etc. I want to mention a related topic to the list. People are approaching the auto-

Re: Zeroconf for hadoop

2009-01-26 Thread nitesh bhatia
For a closed, uniform system (Yahoo, Google), this can work best. It can provide a plug-and-play type of system; through this we can change clusters into dynamic grids. But I am not sure of the outcome so far; I am reading the documentation. --nitesh On Mon, Jan 26, 2009 at 1:59 PM, Allen Wittenauer wro

Re: Zeroconf for hadoop

2009-01-26 Thread Raghu Angadi
Nitay wrote: Why not use the distributed coordination service ZooKeeper? When nodes come up they write some ephemeral file in a known ZooKeeper directory, and anyone who's interested, e.g. the NameNode, can put a watch on the directory and get notified when new children come up. NameNode does not do

Re: Zeroconf for hadoop

2009-01-26 Thread Nitay
Why not use the distributed coordination service ZooKeeper? When nodes come up they write some ephemeral file in a known ZooKeeper directory, and anyone who's interested, e.g. the NameNode, can put a watch on the directory and get notified when new children come up. On Mon, Jan 26, 2009 at 10:59 AM, Al
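A sketch of that pattern with the ZooKeeper Java client; the connect string and the path /hadoop/nodes are hypothetical, and the parent znode must already exist as a persistent node:

    import java.util.List;
    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    public class NodeDiscovery implements Watcher {
      private final ZooKeeper zk;

      public NodeDiscovery(String connectString) throws Exception {
        zk = new ZooKeeper(connectString, 30000, this);
      }

      // A starting node announces itself; the znode vanishes automatically
      // when the node's ZooKeeper session dies.
      public void announce(String hostname) throws Exception {
        zk.create("/hadoop/nodes/" + hostname, new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
      }

      // An interested party lists the live nodes and leaves a watch behind.
      public List<String> liveNodes() throws Exception {
        return zk.getChildren("/hadoop/nodes", true);
      }

      // Watch callback; fires (among other events) when the children of
      // /hadoop/nodes change, at which point we would re-read and re-watch.
      public void process(WatchedEvent event) {
        System.out.println("Membership changed: " + event.getPath());
      }
    }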

Re: HDFS - millions of files in one directory?

2009-01-26 Thread Raghu Angadi
Mark Kerzner wrote: Raghu, if I write all files only once, is the cost the same in one directory, or do I need to find the optimal directory size and, when full, start another "bucket"? If you write only once, then writing won't be much of an issue. You can write them in lexical order to help wit

Re: Zeroconf for hadoop

2009-01-26 Thread Allen Wittenauer
On 1/25/09 8:45 AM, "nitesh bhatia" wrote: > Apple provides an open-source discovery service called Bonjour (Zeroconf). Is it > possible to integrate Zeroconf with Hadoop so that discovery of nodes becomes > automatic? Presently for setting up a multi-node cluster we need to add IPs > manually. Integ

Re: HDFS - millions of files in one directory?

2009-01-26 Thread jason hadoop
We like compression if the data is readily compressible and large, as it saves on I/O time. On Mon, Jan 26, 2009 at 9:35 AM, Mark Kerzner wrote: > Doug, > SequenceFile looks like a perfect candidate to use in my project, but are > you saying that I'd better use uncompressed data if I am not interes

setNumTasksToExecutePerJvm and Configure

2009-01-26 Thread Saptarshi Guha
Hello, Suppose I set setNumTasksToExecutePerJvm to -1. Then, the same JVM may run several tasks consecutively. 1) Will the configure method (if present) be run for every task, or only for the first task that the JVM runs? 2) Similarly, will the close method (if present) be run for the las
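One way to answer this empirically is a probe mapper; a sketch against the old 0.19 API, with made-up class and field names. Static state survives JVM reuse, so if configure() runs once per task the static counter will climb past 1 inside a reused JVM:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class ReuseProbe extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, LongWritable> {

      private static int configureCalls = 0; // static: survives JVM reuse

      public void configure(JobConf job) {
        configureCalls++; // counts configure() calls within this one JVM
      }

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, LongWritable> out, Reporter reporter)
          throws IOException {
        // Emitting the counter makes the per-JVM call count visible in output.
        out.collect(value, new LongWritable(configureCalls));
      }

      public void close() throws IOException {
        // A log line here would likewise reveal how often close() runs.
      }
    }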

Re: HDFS - millions of files in one directory?

2009-01-26 Thread Mark Kerzner
Doug, SequenceFile looks like a perfect candidate to use in my project, but are you saying that I'd better use uncompressed data if I am not interested in saving disk space? Thank you, Mark On Mon, Jan 26, 2009 at 11:30 AM, Doug Cutting wrote: > Philip (flip) Kromer wrote: > >> Heritrix

Re: HDFS - millions of files in one directory?

2009-01-26 Thread Doug Cutting
Philip (flip) Kromer wrote: Heritrix, Nutch, and others use the ARC file format http://www.archive.org/web/researcher/ArcFileFormat.php http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml Nutch does not use A

Re: HDFS - millions of files in one directory?

2009-01-26 Thread Steve Loughran
Philip (flip) Kromer wrote: I ran into this problem, hard, and I can vouch that it is not a Windows-only problem. ReiserFS, ext3, and OS X's HFS+ become cripplingly slow with more than a few hundred thousand files in the same directory. (The operation to correct this mistake took a week to run.) T