Re: HDFS - millions of files in one directory?

2009-01-23 Thread Philip (flip) Kromer
I ran into this problem, hard, and I can vouch that this is not a Windows-only problem. ReiserFS, ext3 and OSX's HFS+ become cripplingly slow with more than a few hundred thousand files in the same directory. (The operation to correct this mistake took a week to run.) That is one of several hard les
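
A common workaround for this class of problem is to spread files over a nested directory tree keyed on a hash of the name, so no single directory holds more than a few thousand entries. A minimal Java sketch of the idea (the helper below is hypothetical, not code from this thread):

// Hypothetical helper: maps a file name to a two-level hashed subpath,
// e.g. "report.pdf" -> "5d/a3/report.pdf". With 256 x 256 = 65,536
// buckets, a million files averages about 15 entries per directory.
public class ShardedPath {
  public static String shard(String name) {
    int h = name.hashCode();
    return String.format("%02x/%02x/%s", (h >>> 8) & 0xff, h & 0xff, name);
  }
}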

Re: HDFS - millions of files in one directory?

2009-01-23 Thread Raghu Angadi
Mark Kerzner wrote: But it would seem then that making a balanced directory tree would not help either - because there would be another binary search, correct? I assume, either way it would be as fast as can be :) But the cost of memory copies would be much less with a tree (when you add and d

Re: HDFS - millions of files in one directory?

2009-01-23 Thread Mark Kerzner
But it would seem then that making a balanced directory tree would not help either - because there would be another binary search, correct? I assume, either way it would be as fast as can be :) On Fri, Jan 23, 2009 at 5:08 PM, Raghu Angadi wrote: > > If you are adding and deleting files in the

Where is the metadata on HDFS?

2009-01-23 Thread tienduc_dinh
Hi everyone, I have a question, maybe you can help me: how can we get the metadata of a file on HDFS? For example: if I have a file of, e.g., 2 GB on HDFS, this file is split into many chunks and these chunks are distributed across many nodes. Is there any trick to know which chunks belong to
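
For a quick answer from the command line, 'bin/hadoop fsck /path -files -blocks -locations' prints each block of a file and the datanodes holding it. Programmatically, the FileSystem client API exposes the same information; a minimal sketch, assuming a configured client (the class name ShowBlocks is made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus stat = fs.getFileStatus(new Path(args[0]));
    // One BlockLocation per block: offset, length, and the hosts
    // currently holding a replica.
    BlockLocation[] blocks = fs.getFileBlockLocations(stat, 0, stat.getLen());
    for (BlockLocation b : blocks) {
      System.out.println("offset=" + b.getOffset()
          + " length=" + b.getLength()
          + " hosts=" + java.util.Arrays.toString(b.getHosts()));
    }
  }
}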

Re: hadoop balanceing data

2009-01-23 Thread Hairong Kuang
%Remaining fluctuates much more than %dfs used. This is because dfs shares the disks with mapred, and mapred tasks may use a lot of disk temporarily. So trying to keep the same %free is impossible most of the time. Hairong On 1/19/09 10:28 PM, "Billy Pearson" wrote: > Why do we not use the Re

Re: HDFS - millions of files in one directory?

2009-01-23 Thread Raghu Angadi
Raghu Angadi wrote: If you are adding and deleting files in the directory, you might notice a CPU penalty (for many loads, higher CPU on NN is not an issue). This is mainly because HDFS does a binary search on files in a directory each time it inserts a new file. I should add that equal or ev

Re: HDFS - millions of files in one directory?

2009-01-23 Thread Mark V
On Sat, Jan 24, 2009 at 10:03 AM, Mark Kerzner wrote: > Hi, > > there is a performance penalty in Windows (pardon the expression) if you put > too many files in the same directory. The OS becomes very slow, stops seeing > them, and lies about their status to my Java requests. I do not know if this

Re: HDFS - millions of files in one directory?

2009-01-23 Thread Raghu Angadi
If you are adding and deleting files in the directory, you might notice a CPU penalty (for many loads, higher CPU on NN is not an issue). This is mainly because HDFS does a binary search on files in a directory each time it inserts a new file. If the directory is relatively idle, then there is
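
To make the cost concrete, a toy Java illustration of the pattern Raghu describes (this is not NameNode code): the lookup itself is O(log n), but keeping the child list sorted means each insert shifts every later entry, which is the memory-copy cost a tree layout avoids.

import java.util.Collections;
import java.util.List;

public class SortedInsert {
  // Insert 'name' into an already-sorted child list, as a flat sorted
  // array requires. binarySearch returns (-(insertion point) - 1)
  // when the name is absent.
  public static void insert(List<String> children, String name) {
    int pos = Collections.binarySearch(children, name);
    if (pos < 0) {
      children.add(-pos - 1, name); // shifts all entries after the slot
    }
  }
}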

HDFS - millions of files in one directory?

2009-01-23 Thread Mark Kerzner
Hi, there is a performance penalty in Windows (pardon the expression) if you put too many files in the same directory. The OS becomes very slow, stops seeing them, and lies about their status to my Java requests. I do not know if this is also a problem in Linux, but in HDFS - do I need to balance

Re: hadoop consulting?

2009-01-23 Thread Christophe Bisciglia
Thanks Mark. I'll be getting in touch early next week. Others, I see replies default straight to the list. Please feel free to email just me (christo...@cloudera.com), unless, well, you're in the mood to share your bio with everyone :-) Cheers, Christophe On Fri, Jan 23, 2009 at 2:31 PM, Mark Kerzn

Re: How-to in MapReduce

2009-01-23 Thread Mark Kerzner
Tim, I looked there, but it is a setup manual. I read the MapReduce, Sawzall, and MS papers on these, but I need "best practices." Thank you, Mark On Fri, Jan 23, 2009 at 3:22 PM, tim robertson wrote: > Hi, > > Sounds like you might want to look at the Nutch project architecture > and then s

Re: hadoop consulting?

2009-01-23 Thread Mark Kerzner - SHMSoft
Christophe, I am writing my first Hadoop project now, and I have 20 years of consulting, and I am in Houston. Here is my resume, http://markkerzner.googlepages.com. I have used EC2. Sincerely, Mark On Fri, Jan 23, 2009 at 4:04 PM, Christophe Bisciglia < christo...@cloudera.com> wrote: > Hey al

hadoop consulting?

2009-01-23 Thread Christophe Bisciglia
Hey all, I wanted to reach out to the user / development community to start identifying those of you who are interested in consulting / contract work for new Hadoop deployments. A number of our larger customers are asking for more extensive on-site help than would normally happen under a support c

RE: Problem running hdfs_test

2009-01-23 Thread Arifa Nisar
Thanks a lot for your help. I solved that problem by removing LDFLAGS (containing libjvm.so) from hdfs_test compilation. I added that flag to compile correctly using Makefile but that was the real problem. Only after removing it I was able to run with ant. Thanks, Arifa -Original Message-

Re: How-to in MapReduce

2009-01-23 Thread tim robertson
Hi, Sounds like you might want to look at the Nutch project architecture and then see the Nutch on Hadoop tutorial - http://wiki.apache.org/nutch/NutchHadoopTutorial It does web crawling, and indexing using Lucene. It would be a good place to start anyway for ideas, even if it doesn't end up mee

How-to in MapReduce

2009-01-23 Thread Mark Kerzner
Hi, esteemed group, how would I form Maps in MapReduce to recursively look at every file in a directory, and do something to each file, such as produce a PDF or compute its hash? For that matter, Google builds its index using MapReduce, or so the papers say. First the crawlers store all the files.
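
One common pattern for "do something to every file": list the file paths into a plain text file (one HDFS path per line), use that listing as the job input, and have each map call open and hash the file it names. A minimal sketch against the 0.19-era mapred API; the class name and wiring are made up, not from this thread:

import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class FileHashMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private JobConf conf;

  @Override
  public void configure(JobConf job) { this.conf = job; }

  // Input value: one HDFS path per line. Output: (path, MD5 hex digest).
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    Path path = new Path(value.toString().trim());
    FileSystem fs = path.getFileSystem(conf);
    MessageDigest md5;
    try {
      md5 = MessageDigest.getInstance("MD5");
    } catch (NoSuchAlgorithmException e) {
      throw new IOException(e.toString());
    }
    InputStream in = fs.open(path);
    try {
      byte[] buf = new byte[64 * 1024];
      int n;
      while ((n = in.read(buf)) > 0) {
        md5.update(buf, 0, n);
        reporter.progress(); // keep the task from timing out on big files
      }
    } finally {
      in.close();
    }
    StringBuilder hex = new StringBuilder();
    for (byte b : md5.digest()) {
      hex.append(String.format("%02x", b));
    }
    out.collect(new Text(path.toString()), new Text(hex.toString()));
  }
}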

Re: Why does Hadoop need ssh access to master and slaves?

2009-01-23 Thread Edward Capriolo
I am looking to create some RA scripts and experiment with starting hadoop via linux-ha cluster manager. Linux HA would handle restarting downed nodes and eliminate the ssh key dependency.

Re: using distcp for http source files

2009-01-23 Thread Doug Cutting
Can you please attach your latest version of this to https://issues.apache.org/jira/browse/HADOOP-496? Thanks, Doug Boris Musykantski wrote: we have fixed some patches in JIRA for support of webdav server on top of HDFS, updated to work with newer version (0.18.0 IIRC) and added support for

Re: HDFS losing blocks or connection error

2009-01-23 Thread Raghu Angadi
> It seems hdfs isn't so robust or reliable as the website says and/or I > have a configuration issue. quite possible. How robust does the website say it is? I agree debugging failures like the following is pretty hard for casual users. You need to look at the logs for block, or run 'bin/hadoop

AlreadyBeingCreatedExceptions after upgrade to 0.19.0

2009-01-23 Thread Stefan Will
Hi, Since I've upgraded to 0.19.0, I've been getting the following exceptions when restarting jobs, or even when a failed reducer is being restarted by the job tracker. It appears that stale file locks in the namenode don't get properly released sometimes: org.apache.hadoop.ipc.RemoteException: o

Re: HDFS losing blocks or connection error

2009-01-23 Thread Konstantin Shvachko
Yes guys. We observed such problems. They will be common for 0.18.2 and 0.19.0 exactly as you described it when data-nodes become unstable. There were several issues, please take a look HADOOP-4997 workaround for tmp file handling on DataNodes HADOOP-4663 - links to other related HADOOP-4810 Data

Re: HDFS losing blocks or connection error

2009-01-23 Thread Jean-Daniel Cryans
Yes, you may overload your machines that way because of the small number. One thing to do would be to look in the logs for any signs of IOExceptions and report them back here. Another thing you can do is to change some configs. Increase *dfs.datanode.max.xcievers* to 512 and set the *dfs.datanode.so
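
For reference, the xcievers setting lives in hadoop-site.xml on each datanode and takes effect after a datanode restart; a sketch of the entry (note the property name really is spelled "xcievers" in this era):

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>512</value>
</property>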

RE: HDFS losing blocks or connection error

2009-01-23 Thread Zak, Richard [USA]
It happens right after the MR job (though once or twice it's happened during). I am not using EBS, just HDFS between the machines. As for tasks, there are 4 mappers and 0 reducers. Richard J. Zak -Original Message- From: jdcry...@gmail.com [mailto:jdcry...@gmail.com] On Behalf Of Jean-D

Re: HDFS losing blocks or connection error

2009-01-23 Thread Jean-Daniel Cryans
xlarge is good. Is it normally happening during a MR job? If so, how many tasks do you have running at the same moment overall? Also, is your data stored on EBS? Thx, J-D On Fri, Jan 23, 2009 at 12:55 PM, Zak, Richard [USA] wrote: > 4 slaves, 1 master, all are the m1.xlarge instance type. > > >

RE: HDFS losing blocks or connection error

2009-01-23 Thread Zak, Richard [USA]
4 slaves, 1 master, all are the m1.xlarge instance type. Richard J. Zak -Original Message- From: jdcry...@gmail.com [mailto:jdcry...@gmail.com] On Behalf Of Jean-Daniel Cryans Sent: Friday, January 23, 2009 12:34 To: core-user@hadoop.apache.org Subject: Re: HDFS losing blocks or connec

Re: HDFS losing blocks or connection error

2009-01-23 Thread Jean-Daniel Cryans
Richard, This happens when the datanodes are too slow and eventually all replicas for a single block are tagged as "bad". What kind of instances are you using? How many of them? J-D On Fri, Jan 23, 2009 at 12:13 PM, Zak, Richard [USA] wrote: > Might there be a reason for why this seems to rou

HDFS losing blocks or connection error

2009-01-23 Thread Zak, Richard [USA]
Might there be a reason why this seems to routinely happen to me when using Hadoop 0.19.0 on Amazon EC2? 09/01/23 11:45:52 INFO hdfs.DFSClient: Could not obtain block blk_-1757733438820764312_6736 from any node: java.io.IOException: No live nodes contain current block 09/01/23 11:45:55 INFO

Re: Why does Hadoop need ssh access to master and slaves?

2009-01-23 Thread Matthias Scherer
Hi Tom, Thanks for your reply. That's what I wanted to know. And it's good to know that it would not be a show-stopper if our ops department would like to use their own system to control daemons. Regards Matthias > -Original Message- > From: Tom White [mailto:t...@cloudera.com]

Re: Problem running hdfs_test

2009-01-23 Thread Rasit OZDAS
Hi, Arifa I had to add "LD_LIBRARY_PATH" env. var. to correctly run my example. But I have no idea if it helps, because my error wasn't a segmentation fault. I would try it anyway. LD_LIBRARY_PATH:/usr/JRE/jre1.6.0_11/jre1.6.0_11/lib:/usr/JRE/jre1.6.0_11/jre1.6.0_11/lib/amd64/server (server dire

Re: Distributed cache testing in local mode

2009-01-23 Thread Tom White
It would be nice to make this more uniform. There's an outstanding Jira on this if anyone is interested in looking at it: https://issues.apache.org/jira/browse/HADOOP-2914 Tom On Fri, Jan 23, 2009 at 12:14 AM, Aaron Kimball wrote: > Hi Bhupesh, > > I've noticed the same problem -- LocalJobRunner

_temporary directory getting deleted mid-job?

2009-01-23 Thread Aaron Kimball
I saw some puzzling behavior tonight when running a MapReduce program I wrote. It would perform the mapping just fine, and would begin to shuffle. It got to 33% complete reduce (end of shuffle) and then the task failed, claiming that /_temporary was deleted. I didn't touch HDFS while this was goin