Re: Creating Lucene index in Hadoop

2009-03-16 Thread Doug Cutting
Ning Li wrote: With http://issues.apache.org/jira/browse/HADOOP-4801, however, it may become feasible to search on HDFS directly. I don't think HADOOP-4801 is required. It would help, certainly, but it's so fraught with security and other issues that I doubt it will be committed anytime

Re: Not a host:port pair when running balancer

2009-03-11 Thread Doug Cutting
Konstantin Shvachko wrote: The port was not specified at all in the original configuration. Since 0.18, the port is optional. If no port is specified, then 8020 is used. 8020 is the default port for namenodes. https://issues.apache.org/jira/browse/HADOOP-3317 Doug
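For illustration, a minimal sketch of the two equivalent forms (the hostname is made up):

    Configuration conf = new Configuration();
    // Port omitted: clients assume the namenode default, 8020.
    conf.set("fs.default.name", "hdfs://namenode.example.com");
    // Equivalent, with the port given explicitly:
    conf.set("fs.default.name", "hdfs://namenode.example.com:8020");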

Re: Recommend JSON Library? net.sf.json has memory leak

2009-03-05 Thread Doug Cutting
Ian Swett wrote: We've used Jackson(http://jackson.codehaus.org/), which we've found to be easy to use and faster than any other option. I also use Jackson and recommend it. Doug
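For example, a minimal sketch against the Jackson 1.x API (method names may differ in the very early releases current at the time):

    import java.util.Map;
    import org.codehaus.jackson.map.ObjectMapper;

    ObjectMapper mapper = new ObjectMapper();
    Map<?, ?> parsed = mapper.readValue("{\"count\": 42}", Map.class);
    System.out.println(parsed.get("count"));  // prints 42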

Re: How does NVidia GPU compare to Hadoop/MapReduce

2009-02-27 Thread Doug Cutting
I think they're complementary. Hadoop's MapReduce lets you run computations on up to thousands of computers potentially processing petabytes of data. It gets data from the grid to your computation, reliably stores output back to the grid, and supports grid-global computations (e.g.,

Re: FileInputFormat directory traversal

2009-02-03 Thread Doug Cutting
Hi, Ian. One reason is that a MapFile is represented by a directory containing two files named index and data. SequenceFileInputFormat also handles MapFiles: if an input file is a directory containing a data file, it uses that file. Another reason is that's what reduces generate. Neither

Re: Question about HDFS capacity and remaining

2009-01-30 Thread Doug Cutting
Bryan Duxbury wrote: Hm, very interesting. Didn't know about that. What's the purpose of the reservation? Just to give root preference or leave wiggle room? I think it's so that, when the disk is full, root processes don't fail, only user processes. So you don't lose, e.g., syslog. With

Re: Question about HDFS capacity and remaining

2009-01-29 Thread Doug Cutting
Ext2 by default reserves 5% of the drive for use by root only. That'd be about 45GB of your 907GB capacity, which would account for most of the discrepancy. You can adjust this with tune2fs. Doug Bryan Duxbury wrote: There are no non-dfs files on the partitions in question. df -h indicates that

Re: HDFS - millions of files in one directory?

2009-01-26 Thread Doug Cutting
Philip (flip) Kromer wrote: Heritrix http://en.wikipedia.org/wiki/Heritrix, Nutch http://en.wikipedia.org/wiki/Nutch, and others use the ARC file format http://www.archive.org/web/researcher/ArcFileFormat.php http://www.digitalpreservation.gov/formats/fdd/fdd000236.shtml Nutch does not use ARC

Re: HDFS - millions of files in one directory?

2009-01-26 Thread Doug Cutting
Mark Kerzner wrote: Okay, I am convinced. I only noticed that Doug, the originator, was not happy about it - but in open source one has to give up control sometimes. I think perhaps you misunderstood my remarks. My point was that, if you looked to Nutch's Content class for an example, it is,

Re: using distcp for http source files

2009-01-23 Thread Doug Cutting
for permissions. See code and description here: http://www.hadoop.iponweb.net/Home/hdfs-over-webdav Hope it is useful, Regards, Boris, IPonWeb On Thu, Jan 22, 2009 at 2:30 PM, Doug Cutting cutt...@apache.org wrote: Aaron Kimball wrote: Is anyone aware of an OSS web dav library that could

Re: using distcp for http source files

2009-01-22 Thread Doug Cutting
Aaron Kimball wrote: Doesn't the WebDAV protocol use http for file transfer, and support reads / writes / listings / etc? Yes. Getting a WebDAV-based FileSystem in Hadoop has long been a goal. It could replace libhdfs, since there is already a WebDAV-based FUSE filesystem for Linux (wdfs,

Re: using distcp for http source files

2009-01-21 Thread Doug Cutting
Derek Young wrote: Reading http://issues.apache.org/jira/browse/HADOOP-341 it sounds like this should be supported, but the http URLs are not working for me. Are http source URLs still supported? No. They used to be supported, but when distcp was converted to accept any Path this stopped

Re: Auditing and accounting with Hadoop

2009-01-07 Thread Doug Cutting
The notion of a client/task ID, independent of IP or username, seems useful for log analysis. DFS's client ID is probably currently your best bet, but we might improve its implementation, and make the notion more generic. It is currently implemented as: String taskId =
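From memory, the DFSClient code in question looked roughly like this (field and variable names may differ):

    String taskId = conf.get("mapred.task.id");
    if (taskId != null) {
      this.clientName = "DFSClient_" + taskId;
    } else {
      this.clientName = "DFSClient_" + r.nextInt();  // r is a java.util.Random
    }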

Re: ssh problem

2009-01-05 Thread Doug Cutting
Ubuntu does not include the ssh server in client installations, so you need to install it yourself. sudo apt-get install openssh-server Doug vinayak katkar wrote: Hey When I tried to install hadoop in ubuntu 8.04 I got an error ssh connection refused to localhost at port 22. Please any one

Re: Map input records(on JobTracker website) increasing and decreasing

2009-01-05 Thread Doug Cutting
Values can drop if tasks die and must be re-run. Doug Aaron Kimball wrote: The actual number of input records is most likely steadily increasing. The counters on the web site are inaccurate until the job is complete; their values will fluctuate wildly. I'm not sure why this is. - Aaron On

Re: Hadoop corrupting files if file block size is 4GB and file size is 2GB

2008-12-22 Thread Doug Cutting
Why are you using such a big block size? I suspect this problem will go away if you decrease your blocksize to less than 2GB. This sounds like a bug, probably related to integer overflow: some part of Hadoop is using an 'int' where it should be using a 'long'. Please file an issue in Jira,

Re: [video] visualization of the hadoop code history

2008-12-16 Thread Doug Cutting
Owen O'Malley wrote: It is interesting, but it would be more interesting to track the authors of the patch rather than the committer. The two are rarely the same. Indeed. There was a period of over a year where I wrote hardly anything but committed almost everything. So I am vastly

Re: File loss at Nebraska

2008-12-09 Thread Doug Cutting
Steve Loughran wrote: Alternatively, why we should be exploring the configuration space more widely Are you volunteering? Doug

Re: File loss at Nebraska

2008-12-08 Thread Doug Cutting
Brian Bockelman wrote: To some extent, this whole issue is caused because we only have enough space for 2 replicas; I'd imagine that at 3 replicas, the issue would be much harder to trigger. The unfortunate reality is that if you run a configuration that's different than most you'll likely

Re: ${user.name}, ${user.host}?

2008-12-03 Thread Doug Cutting
Variables in configuration files may be Java system properties or other configuration parameters. The list of pre-defined Java system properties is at: http://java.sun.com/javase/6/docs/api/java/lang/System.html#getProperties() Unfortunately the host name is not in that list. You could
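A minimal sketch of the expansion itself (the property name is made up):

    Configuration conf = new Configuration();
    conf.set("my.scratch.dir", "/tmp/hadoop-${user.name}");
    // ${user.name} is resolved from the Java system property when the value is read:
    System.out.println(conf.get("my.scratch.dir"));  // e.g. /tmp/hadoop-doug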

Re: Namenode BlocksMap on Disk

2008-12-01 Thread Doug Cutting
Billy Pearson wrote: We are also looking for a way to support smaller clusters that might overrun their heap size, causing the cluster to crash. Support for namespaces larger than RAM would indeed be a good feature to have. Implementing this without impacting large-cluster in-memory

Amazon Web Services (AWS) Hosted Public Data Sets

2008-12-01 Thread Doug Cutting
This looks like it could be a great feature for EC2-based Hadoop users: http://aws.amazon.com/publicdatasets/ Has anyone tried it yet? Any datasets to share? Doug

Re: Which replica?

2008-12-01 Thread Doug Cutting
A task may read from more than one block. For example, in line-oriented input, lines frequently cross block boundaries. And a block may be read from more than one host. For example, if a datanode dies midway through providing a block, the client will switch to using a different datanode.

Re: Namenode BlocksMap on Disk

2008-11-26 Thread Doug Cutting
Dennis Kubes wrote: 2) Besides possible slight degradation in performance, is there a reason why the BlocksMap shouldn't or couldn't be stored on disk? I think the assumption is that it would be considerably more than slight degradation. I've seen the namenode benchmarked at over 50,000

Re: Namenode BlocksMap on Disk

2008-11-26 Thread Doug Cutting
Brian Bockelman wrote: Do you have any graphs you can share showing 50k opens / second (could be publicly or privately)? The more external benchmarking data I have, the more I can encourage adoption amongst my university... The 50k opens/second is from some internal benchmarks run at Y!

Re: Lookup HashMap available within the Map

2008-11-25 Thread Doug Cutting
tim robertson wrote: Thanks Alex - this will allow me to share the shapefile, but I need to read it, parse it, and store the objects in the index only once per job per JVM. Is Mapper.configure() the best place to do this? E.g. will it only be called once per job? In 0.19, with
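A sketch of that pattern under 0.19's JVM reuse (ShapeMapper, ShapeIndex, and its load() are hypothetical names; imports from org.apache.hadoop.io and org.apache.hadoop.mapred are assumed):

    public class ShapeMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      // Static, so it survives across tasks when JVMs are reused.
      private static ShapeIndex index;

      public void configure(JobConf job) {
        // configure() runs once per task; the null check makes the expensive
        // parse happen only once per JVM, hence once per job when reuse is on.
        synchronized (ShapeMapper.class) {
          if (index == null) {
            index = ShapeIndex.load(job);  // hypothetical: parse the shapefile
          }
        }
      }

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        // ... look up against the shared index ...
      }
    }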

Re: SecondaryNameNode on separate machine

2008-10-31 Thread Doug Cutting
Otis Gospodnetic wrote: Konstantin & Co, please correct me if I'm wrong, but looking at hadoop-default.xml makes me think that dfs.http.address is only the URL for the NN *Web UI*. In other words, this is where people go to look at the NN. The secondary NN must then be using only the Primary

Re: Understanding file splits

2008-10-28 Thread Doug Cutting
This is hard to diagnose without knowing your InputFormat. Each split returned by your #getSplits() implementation is passed to your #getRecordReader() implementation. If your RecordReader is not stopping when you expect it to, then that's a problem in your RecordReader, no? Have you written

Re: Using hadoop as storage cluster?

2008-10-27 Thread Doug Cutting
David C. Kerber wrote: There would be quite a few files in the 100kB to 2MB range, which are received and processed daily, with smaller numbers ranging up to ~600MB or so which are summarizations of many of the daily data files, and maybe a handful in the 1GB - 6GB range (disk images and

Re: Distributed cache Design

2008-10-16 Thread Doug Cutting
Bhupesh Bansal wrote: Minor correction: the graph size is about 6G, not 8G. Ah, that's better. With the jvm reuse feature in 0.19 you should be able to load it once per job into a static, since all tasks of that job can share a JVM. Things will get tight if you try to run two such jobs at

Re: Hadoop chokes on file names with : in them

2008-10-10 Thread Doug Cutting
The safest thing is to restrict your Hadoop file names to a common-denominator set of characters that are well supported by Unix, Windows, and URIs. Colon is a special character on both Windows and in URIs. Quoting is in theory possible, but it's hard to get it right everywhere in practice.

Re: LZO and native hadoop libraries

2008-09-30 Thread Doug Cutting
Arun C Murthy wrote: You need to add libhadoop.so to your java.library.path. libhadoop.so is available in the corresponding release in the lib/native directory. I think he needs to first build libhadoop.so, since he appears to be running on OS X and we only provide Linux builds of this in

Re: Monthly Hadoop User Group Meeting (Bay Area)

2008-09-09 Thread Doug Cutting
Chris K Wensel wrote: doh, conveniently collides with the GridGain and GridDynamics presentations: http://web.meetup.com/66/calendar/8561664/ Bay Area Hadoop User Group meetings are held on the third Wednesday every month. This has been on the calendar for quite a while. Doug

Re: Hadoop + Elastic Block Stores

2008-09-08 Thread Doug Cutting
Ryan LeCompte wrote: I'd really love to one day see some scripts under src/contrib/ec2/bin that can setup/mount the EBS volumes automatically. :-) The fastest way might be to write and contribute such scripts! Doug

Re: About the name of second-namenode!!

2008-09-08 Thread Doug Cutting
Changing it will unfortunately cause confusion too. Sigh. This is why we should take time to name things well the first time. Doug 叶双明 wrote: Because the name second-namenode causes so much confusion, would the Hadoop team consider changing it?

Re: JVM Spawning

2008-09-05 Thread Doug Cutting
LocalJobRunner allows you to test your code with everything running in a single JVM. Just set mapred.job.tracker=local. Doug Ryan LeCompte wrote: I see... so there really isn't a way for me to test a map/reduce program using a single node without incurring the overhead of upping/downing
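For example (MyJob is a hypothetical job class):

    JobConf conf = new JobConf(MyJob.class);
    conf.set("mapred.job.tracker", "local");  // run map and reduce in this JVM
    conf.set("fs.default.name", "file:///");  // use the local filesystem too
    JobClient.runJob(conf);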

Re: Timeouts at reduce stage

2008-09-04 Thread Doug Cutting
Jason Venner wrote: We have modified the /main/ that launches the children of the task tracker to explicitly exit, in its finally block. That helps substantially. Have you submitted this as a patch? Doug

Re: Are lines broken in dfs and/or in InputSplit

2008-08-07 Thread Doug Cutting
Kevin wrote: Yes, I have looked at the block files and it matches what you said. I am just wondering if there is some property or flag that would turn this feature on, if it exists. No. If you required this then you'd need to pad your data, but I'm not sure why you'd ever require it.

Re: Confusing NameNodeFailover page in Hadoop Wiki

2008-08-06 Thread Doug Cutting
Konstantin Shvachko wrote: Imho we either need to correct it or remove. +1 Doug

Re: map reduce map reduce?

2008-07-30 Thread Doug Cutting
Elia Mazzawi wrote: Is it possible to run a map, then reduce, then a map, then a reduce? It's really 2 jobs, but I don't want to store the intermediate results. So can a hadoop job do more than one map/reduce? This has been discussed several times before. The problem is that temporary data is

Re: Namenode Exceptions with S3

2008-07-17 Thread Doug Cutting
Tom White wrote: You can use S3 as the default FS; it's just that then you can't run HDFS at all. You would only do this if you don't want to use HDFS at all, for example, if you were running a MapReduce job which read from S3 and wrote to S3. Can't one work around this by using

Re: Getting stats of running job from within job

2008-07-03 Thread Doug Cutting
Nathan Marz wrote: Is there a way to get stats of the currently running job programmatically? This should probably be an FAQ. In your Mapper or Reducer's configure implementation, you can get a handle on the running job with: RunningJob running = new
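A sketch of how that likely continues, against the mapred API of the time (exact signatures may differ by release):

    public void configure(JobConf job) {
      try {
        RunningJob running = new JobClient(job).getJob(job.get("mapred.job.id"));
        // then e.g. running.mapProgress(), running.reduceProgress(), ...
      } catch (IOException e) {
        throw new RuntimeException(e);
      }
    }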

Re: When is Hadoop 0.18 release scheduled ?

2008-06-27 Thread Doug Cutting
Tarandeep Singh wrote: When is Hadoop 0.18 release scheduled ? This link has a date of 6 June :-/ http://issues.apache.org/jira/browse/HADOOP/fixforversion/12312972 The release date is initially set to the feature freeze date. It's updated when all of the blockers are fixed and an actual

Re: How Mappers function and solultion for my input file problem?

2008-06-26 Thread Doug Cutting
Ted Dunning wrote: The map task is not multi-threaded [ ... ] Unless you specify a multi-threaded MapRunnable... http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/lib/MultithreadedMapRunner.html Doug
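For example, a minimal sketch of turning it on (the thread count is an arbitrary choice):

    conf.setMapRunnerClass(MultithreadedMapRunner.class);
    // Threads per map task; the runner's default is 10:
    conf.setInt("mapred.map.multithreadedrunner.threads", 4);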

Re: client connect as different username?

2008-06-12 Thread Doug Cutting
Chris Collins wrote: For instance, all it requires for, say, a mac user with a login of bob to access things under /bob is for me to go in as the super user and do something like: hadoop dfs -mkdir /bob hadoop dfs -chown bob /bob where bob literally doesn't

Re: client connect as different username?

2008-06-11 Thread Doug Cutting
Chris Collins wrote: You are referring to creating a directory in hdfs? Because if I am user chris and hdfs only has user foo, then I can't create a directory because I don't have perms; in fact I can't even connect. Today, users and groups are declared by the client. The namenode only

Re: Stackoverflow

2008-06-04 Thread Doug Cutting
Andreas Kostyrka wrote:
java.lang.StackOverflowError
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:494)
        at org.apache.hadoop.util.QuickSort.fix(QuickSort.java:29)
        at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:58)
        at

Re: Remote Job Submission

2008-05-23 Thread Doug Cutting
Ted Dunning wrote: - in order to submit the job, I think you only need to see the job-tracker. Somebody should correct me if I am wrong. No, you also need to be able to write the job.xml, job.jar, and job.split into HDFS. Someday perhaps we'll pass these via RPC to the jobtracker and have

Re: Block re-balancing speed/slowness

2008-05-12 Thread Doug Cutting
Otis Gospodnetic wrote: 10 GB in 3 hours: doesn't that seem slow? Have you played with dfs.balance.bandwidthPerSec? It defaults to 1MB/sec per datanode. That would be about 10GB in 3 hours. Doug
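Checking the arithmetic at that default: 1 MB/sec × 10,800 seconds (3 hours) is roughly 10.5 GB per datanode, which matches what was observed.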

Re: newbie how to get url paths of files in HDFS

2008-05-08 Thread Doug Cutting
Ted Dunning wrote: Take the fully qualified HDFS path that looks like this: hdfs://namenode-host-name:port/file-path And transform it into this: hdfs://namenode-host-name:web-interface-port/data/file-path The web-interface-port is 50070 by default. This will allow you to read HDFS

Re: where is the documentation for MiniDFSCluster

2008-05-05 Thread Doug Cutting
Maneesha Jain wrote: I'm looking for any documentation or javadoc for MiniDFSCluster and have not been able to find it anywhere. Can someone please point me to it. http://svn.apache.org/repos/asf/hadoop/core/trunk/src/test/org/apache/hadoop/dfs/MiniDFSCluster.java This is part of the test
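A minimal usage sketch, assuming the constructor of that era (two datanodes, freshly formatted, default rack placement):

    Configuration conf = new Configuration();
    MiniDFSCluster cluster = new MiniDFSCluster(conf, 2, true, null);
    try {
      FileSystem fs = cluster.getFileSystem();
      fs.mkdirs(new Path("/test"));  // exercise the in-process HDFS
    } finally {
      cluster.shutdown();
    }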

Re: Master Heap Size and Master Startup Time vs. Number of Blocks

2008-05-02 Thread Doug Cutting
Cagdas Gerede wrote: In the system I am working on, we have 6 million blocks total; the namenode heap size is about 600 MB and it takes about 5 minutes for the namenode to leave safemode. How big are your files? Are they several blocks on average? Hadoop is not designed for small files,

Re: HDFS: Good practices for Number of Blocks per Datanode

2008-05-02 Thread Doug Cutting
Cagdas Gerede wrote: For a system with 60 million blocks, we can have 3 datanodes with 20 million blocks each, or we can have 60 datanodes with 1 million blocks each. In either case, would there be performance implications or would they behave the same way? If you're using mapreduce, then you

Re: Master Heap Size and Master Startup Time vs. Number of Blocks

2008-05-02 Thread Doug Cutting
Cagdas Gerede wrote: We will have 5 million files, each having 20 blocks of 2MB. With the minimum replication of 3, we would have 300 million blocks. 300 million 2MB blocks would store 600TB. At ~10TB/node, this means a 60-node system. Do you think these numbers are suitable for Hadoop DFS? Why

Re: Best practices for handling many small files

2008-04-28 Thread Doug Cutting
Joydeep Sen Sarma wrote: There seem to be two problems with small files: 1. namenode overhead (3307 seems like _a_ solution); 2. map-reduce processing overhead and locality. It's not clear from the 3307 description how the archives interface with map-reduce. How are the splits done? Will they

Re: hadoop and deprecation

2008-04-24 Thread Doug Cutting
Karl Wettin wrote: When are deprecated methods removed from the API? At every new minor release? http://wiki.apache.org/hadoop/Roadmap Note the remark: Prior to 1.0, minor releases follow the rules for major releases, except they are still made every few months. So, since we're still pre-1.0, we

Re: Using ArrayWritable of type IntWritable

2008-04-21 Thread Doug Cutting
CloudyEye wrote: What else do I have to override in ArrayWritable to get the IntWritable values written to the output files by the reducers? public String toString(); Doug
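A sketch of such a subclass (the name IntArrayWritable is made up; imports from org.apache.hadoop.io are assumed):

    public class IntArrayWritable extends ArrayWritable {
      public IntArrayWritable() {
        super(IntWritable.class);  // the required no-arg constructor
      }
      public String toString() {
        StringBuilder buf = new StringBuilder();
        for (Writable w : get()) {  // get() returns the wrapped Writable[]
          if (buf.length() > 0) buf.append(' ');
          buf.append(w);
        }
        return buf.toString();
      }
    }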

Re: jar files on NFS instead of DistributedCache

2008-04-18 Thread Doug Cutting
Mikhail Bautin wrote: Specifically, I just need a way to alter the child JVM's classpath via JobConf, without having the framework copy anything in and out of HDFS, because all my files are already accessible from all nodes. I see how to do that by adding a couple of lines to TaskRunner's run()

Re: Performance / cluster scaling question

2008-03-28 Thread Doug Cutting
Doug Cutting wrote: Seems like we should force things onto the same availablity zone by default, now that this is available. Patch, anyone? It's already there! I just hadn't noticed. https://issues.apache.org/jira/browse/HADOOP-2410 Sorry for missing this, Chris! Doug

Re: setMapOutputValueClass doesn't work

2008-03-24 Thread Doug Cutting
Chang Hu wrote: Code below, also attached. I put this together from the word count example. The problem is with your combiner. When a combiner is specified, it generates the final map output, since combination is a map-side operation. Your combiner takes Text,IntWritable generated by

Re: MapFile and MapFileOutputFormat

2008-03-20 Thread Doug Cutting
Rong-en Fan wrote: I have two questions regarding the mapfile in hadoop/hdfs. First, when using MapFileOutputFormat as the reducer's output, is there any way to change the index interval (i.e., to call setIndexInterval() on the output MapFile)? Not at present. It would probably be good to

Re: Multiple Output Value Classes

2008-03-17 Thread Doug Cutting
Stu Hood wrote: But I'm trying to _output_ multiple different value classes from a Mapper, and not having any luck. You can wrap things in ObjectWritable. When writing, this records the class name with each instance, then, when reading, constructs an appropriate instance and reads it. It
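A sketch of the idea (the value types here are arbitrary):

    conf.setMapOutputValueClass(ObjectWritable.class);

    // Map side: wrap whichever value type applies.
    output.collect(key, new ObjectWritable(new IntWritable(1)));
    output.collect(key, new ObjectWritable(new Text("hello")));

    // Reduce side: unwrap and dispatch on the actual class.
    Object value = values.next().get();
    if (value instanceof IntWritable) { /* ... */ }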

Re: Separate data-nodes from worker-nodes

2008-03-14 Thread Doug Cutting
Andrey Pankov wrote: It's a little bit expensive to have a big cluster running for a long period, especially if you use EC2. So, as a possible solution, we can start additional nodes and include them in the cluster before running a job, and then, after finishing, kill the unused nodes. As Ted has

Re: What's the best way to get to a single key?

2008-03-03 Thread Doug Cutting
Use MapFileOutputFormat to write your data, then call: http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/MapFileOutputFormat.html#getEntry(org.apache.hadoop.io.MapFile.Reader[],%20org.apache.hadoop.mapred.Partitioner,%20K,%20V) The documentation is pretty sparse, but the
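Filling that in a bit, a sketch of a lookup (the output path and key are made up; the partitioner must match the one the job used, and HashPartitioner is the default):

    FileSystem fs = FileSystem.get(conf);
    MapFile.Reader[] readers =
        MapFileOutputFormat.getReaders(fs, new Path("outdir"), conf);
    Partitioner<Text, Text> partitioner = new HashPartitioner<Text, Text>();
    Text value = new Text();
    MapFileOutputFormat.getEntry(readers, partitioner, new Text("mykey"), value);
    // value now holds the entry, if the key was found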

Re: Sorting output data on value

2008-02-22 Thread Doug Cutting
Tarandeep Singh wrote: but isn't the output of reduce step sorted ? No, the input of reduce is sorted by key. The output of reduce is generally produced as the input arrives, so is generally also sorted by key, but reducers can output whatever they like. Doug

Re: define backwards compatibility

2008-02-21 Thread Doug Cutting
Joydeep Sen Sarma wrote: I find the confusion over what backwards compatibility means scary, and I am really hoping that the outcome of this thread is a clear definition from the committers/hadoop-board of what to reasonably expect (or not!) going forward. The goal is clear: code that

Re: Question on DFS block placement and 'what is a rack' wrt DFS block placement

2008-02-12 Thread Doug Cutting
Jason Venner wrote: Is disk arm contention (seek) a problem in a 6 disk configuration, as most likely all of the disks would be serving /local/ and /dfs/? It should not be. MapReduce i/o is sequential, in chunks large enough that seeks should not dominate. Doug

Re: Question on DFS block placement and 'what is a rack' wrt DFS block placement

2008-02-12 Thread Doug Cutting
Jason Venner wrote: We have 3 types of machines we can get: 2 disk, 6 disk, and 16 disk machines. They all have 4 dual-core cpus. The 2 disk machines have about 1 TB, the 6 disk about 3TB, and the 16 disk about 8TB. The 16 disk machines have about 25% slower CPUs than the 2/6 disk

Re: Hadoop upgrade wiki page

2008-02-05 Thread Doug Cutting
Marc Harris wrote: The hadoop upgrade wiki page contains a small typo http://wiki.apache.org/hadoop/Hadoop_Upgrade . [ ... ] I don't have access to modify it, but someone else might like to. Anyone can create themselves an account on the wiki and modify any page. Doug

Re: Hadoop future?

2008-02-01 Thread Doug Cutting
Lukas Vlcek wrote: I think you have already heard the rumours that Microsoft could buy Yahoo. Does anybody have any idea how this could specifically impact Hadoop's future? First, Hadoop is an Apache project. Y! contributes to it, along with others. Apache projects are designed to be able to