Number of directories problem in MapReduce operations

2009-01-28 Thread Guillaume Smet
Hi, For a few weeks now, we have been experiencing a rather annoying problem with a Nutch/Hadoop installation. It's a very simple setup: the Hadoop configuration is the default from Nutch. The version of Hadoop is the hadoop-0.17.1 jar provided by Nutch. During the injection operation, we now have the

Re: Number of records in a MapFile

2009-01-28 Thread Rasit OZDAS
Do you mean, without scanning all the files line by line? I know little about the implementation of Hadoop, but as a programmer, I can presume that it's not possible without a complete scan. But I can suggest a workaround: - compute the number of records manually before putting a file to HDFS. - Append
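
A minimal sketch of that workaround (not from the thread): count the records of a SequenceFile locally before copying it to HDFS. A MapFile is a directory whose "data" part is a SequenceFile, so pointing this at the data file gives the record count. The class name and command-line argument are illustrative only.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.util.ReflectionUtils;

    public class CountRecords {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path(args[0]);                 // e.g. the MapFile's "data" file
        FileSystem fs = path.getFileSystem(conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        Writable key = (Writable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
        Writable value = (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
        long count = 0;
        while (reader.next(key, value)) {
          count++;
        }
        reader.close();
        System.out.println(path + ": " + count + " records");
      }
    }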

Re: Netbeans/Eclipse plugin

2009-01-28 Thread Rasit OZDAS
Both the DFS viewer and job submission work on Eclipse v. 3.3.2. I've given up using Ganymede, unfortunately. 2009/1/26 Aaron Kimball aa...@cloudera.com The Eclipse plugin (which, btw, is now part of Hadoop core in src/contrib/) currently is inoperable. The DFS viewer works, but the job

Re: Using HDFS for common purpose

2009-01-28 Thread Rasit OZDAS
Thanks for the responses. Sorry, I made a mistake: it's actually not a DB that I wanted. We need simple storage for files. Only get and put commands are enough (no queries needed). We don't even need append, chmod, etc. Probably from a thread on this list, I came across a link to a KFS-HDFS

Hadoop 0.19, Cascading 1.0 and MultipleOutputs problem

2009-01-28 Thread Mikhail Yakshin
Hi, We have a system based on Hadoop 0.18 / Cascading 0.8.1 and now I'm trying to port it to Hadoop 0.19 / Cascading 1.0. The first serious problem I've run into is that we're extensively using MultipleOutputs in our jobs dealing with sequence files that store Cascading's Tuples. Since Cascading

Re: sudden instability in 0.18.2

2009-01-28 Thread Sagar Naik
Please check which nodes have these failures. I guess the new tasktrackers/machines are not configured correctly. As a result, the map tasks will die and the remaining map tasks will be pulled onto these machines. -Sagar David J. O'Dell wrote: We've been running 0.18.2 for over a month on an 8

Hadoop+s3 fuse-dfs

2009-01-28 Thread Roopa Sudheendra
I am experimenting with Hadoop backed by the Amazon S3 filesystem as one of our backup storage solutions. Just the Hadoop and S3 (block-based, since it overcomes the 5 GB limit) setup so far seems to be fine. My problem is that I want to mount this filesystem using fuse-dfs (since I don't have to worry

Re: Hadoop+s3 fuse-dfs

2009-01-28 Thread Craig Macdonald
Hi Roopa, I can't comment on the S3 specifics. However, fuse-dfs is based on a C interface called libhdfs which allows C programs (such as fuse-dfs) to connect to the Hadoop file system Java API. This being the case, fuse-dfs should (theoretically) be able to connect to any file system that

Re: sudden instability in 0.18.2

2009-01-28 Thread Aaron Kimball
Hi David, If your tasks are failing on only the new nodes, it's likely that you're missing a library or something on those machines. See this Hadoop tutorial http://public.yahoo.com/gogate/hadoop-tutorial/html/module5.html about distributing debug scripts. These will allow you to capture

Re: Hadoop+s3 fuse-dfs

2009-01-28 Thread Roopa Sudheendra
Thanks for the response, Craig. I looked at the fuse-dfs C code and it looks like it does not accept anything other than dfs://, so given that Hadoop can connect to the S3 file system, allowing the s3 scheme should solve my problem? Roopa On Jan 28, 2009, at 1:03 PM, Craig Macdonald wrote: Hi

Re: Hadoop+s3 fuse-dfs

2009-01-28 Thread Craig Macdonald
In theory, yes. On inspection of libhdfs, which underlies fuse-dfs, I note that: * libhdfs takes a host and port number as input when connecting, but not a scheme (hdfs, etc.). The easiest option would be to set S3 as your default file system in your hadoop-site.xml, then use the host of
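
A rough sketch of what Craig describes, shown programmatically rather than in hadoop-site.xml; the bucket name and credentials are placeholders, and in practice the same three properties would simply go into the site file that fuse-dfs/libhdfs picks up.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3DefaultFsCheck {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // These would normally live in hadoop-site.xml; the values here are placeholders.
        conf.set("fs.default.name", "s3://my-backup-bucket");
        conf.set("fs.s3.awsAccessKeyId", "YOUR_ACCESS_KEY");
        conf.set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_KEY");
        FileSystem fs = FileSystem.get(conf);          // resolves to the S3 block filesystem
        System.out.println("Default filesystem: " + fs.getUri());
        for (FileStatus status : fs.listStatus(new Path("/"))) {
          System.out.println(status.getPath());
        }
      }
    }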

Re: sudden instability in 0.18.2

2009-01-28 Thread David J. O'Dell
It was failing on all the nodes, both new and old. The problem was there were too many subdirectories under $HADOOP_HOME/logs/userlogs. The fix was just to delete the subdirs and change mapred.userlog.retain.hours from 24 hours (the default) to 2 hours. Would have been nice if there was
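
For reference, a small Java sketch that approximates the manual cleanup described above, removing userlog subdirectories older than two hours; the path and cutoff are assumptions, and on a real cluster the mapred.userlog.retain.hours setting in hadoop-site.xml is what keeps this from recurring.

    import java.io.File;

    public class PruneUserlogs {
      public static void main(String[] args) {
        // Assumes HADOOP_HOME is set and userlogs live in the default location.
        File userlogs = new File(System.getenv("HADOOP_HOME"), "logs/userlogs");
        long cutoff = System.currentTimeMillis() - 2L * 60 * 60 * 1000; // mirror a 2-hour retention
        File[] dirs = userlogs.listFiles();
        if (dirs == null) {
          return;
        }
        for (File dir : dirs) {
          if (dir.isDirectory() && dir.lastModified() < cutoff) {
            deleteRecursively(dir);
          }
        }
      }

      private static void deleteRecursively(File f) {
        File[] children = f.listFiles();
        if (children != null) {
          for (File child : children) {
            deleteRecursively(child);
          }
        }
        f.delete();
      }
    }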

tools for scrubbing HDFS data nodes?

2009-01-28 Thread Sriram Rao
Hi, Is there a tool that one could run on a datanode to scrub all the blocks on that node? Sriram

Re: Hadoop+s3 fuse-dfs

2009-01-28 Thread Roopa Sudheendra
Hey Craig, I tried the way you suggested, but I get this "transport endpoint not connected" error. Can I see the logs anywhere? I don't see anything in /var/log/messages either. It looks like it tries to create the file system in hdfs.c but I'm not sure where it fails. I have the Hadoop home set so I

Re: Hadoop+s3 fuse-dfs

2009-01-28 Thread Craig Macdonald
Hi Roopa, Firstly, can you get fuse-dfs working for an HDFS instance? There is also a debug mode for fuse: enable this by adding -d on the command line. C Roopa Sudheendra wrote: Hey Craig, I tried the way you suggested, but I get this transport endpoint not connected. Can I see the

Re: Hadoop+s3 fuse-dfs

2009-01-28 Thread Roopa Sudheendra
Thanks, Yes, a setup with fuse-dfs and HDFS works fine. I think the mount point was bad for whatever reason and was failing with that error. I created another mount point for mounting, which resolved the transport endpoint error. Also, I had the -d option on my command. :) Roopa On Jan 28,

Re: Hadoop+s3 fuse-dfs

2009-01-28 Thread Craig Macdonald
Hi Roopa, Glad it worked :-) Please file JIRA issues against the fuse-dfs / libhdfs components for anything that would have made it easier to mount the S3 filesystem. Craig Roopa Sudheendra wrote: Thanks, Yes, a setup with fuse-dfs and HDFS works fine. I think the mount point was bad for whatever reason

Re: sudden instability in 0.18.2

2009-01-28 Thread Aaron Kimball
Wow. How many subdirectories were there? How many jobs do you run a day? - Aaron On Wed, Jan 28, 2009 at 12:13 PM, David J. O'Dell dod...@videoegg.com wrote: It was failing on all the nodes, both new and old. The problem was there were too many subdirectories under $HADOOP_HOME/logs/userlogs

Re: tools for scrubbing HDFS data nodes?

2009-01-28 Thread Aaron Kimball
By scrub do you mean delete the blocks from the node? Read your conf/hadoop-site.xml file to determine where dfs.data.dir points, then for each directory in that list, just rm the directory. If you want to ensure that your data is preserved with appropriate replication levels on the rest of your
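
A small sketch of the first step Aaron describes: rather than parsing the XML by hand, let Hadoop's own Configuration class resolve dfs.data.dir (from hadoop-default.xml and hadoop-site.xml on the classpath) and print each block directory. The class name is illustrative.

    import org.apache.hadoop.conf.Configuration;

    public class PrintDataDirs {
      public static void main(String[] args) {
        // Loads hadoop-default.xml and hadoop-site.xml from the classpath.
        Configuration conf = new Configuration();
        String dataDirs = conf.get("dfs.data.dir");
        if (dataDirs == null) {
          System.out.println("dfs.data.dir is not set; blocks are under the default ${hadoop.tmp.dir}/dfs/data");
        } else {
          for (String dir : dataDirs.split(",")) {
            System.out.println(dir.trim());
          }
        }
      }
    }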

Re: tools for scrubbing HDFS data nodes?

2009-01-28 Thread Sriram Rao
By scrub I mean a tool that reads every block on a given datanode. That way, I'd be able to find corrupted blocks proactively rather than having an app read the file and find them. Sriram On Wed, Jan 28, 2009 at 5:57 PM, Aaron Kimball aa...@cloudera.com wrote: By scrub do you mean delete

Re: tools for scrubbing HDFS data nodes?

2009-01-28 Thread Sagar Naik
Check out fsck: bin/hadoop fsck <path> -files -locations -blocks Sriram Rao wrote: By scrub I mean a tool that reads every block on a given data node. That way, I'd be able to find corrupted blocks proactively rather than having an app read the file and find it. Sriram On Wed, Jan 28,

Re: [ANNOUNCE] Registration for ApacheCon Europe 2009 is now open!

2009-01-28 Thread Christophe Bisciglia
I wanted to provide two additional notes about my talk on this list. First, you're really coming to see Aaron Kimball and Tom White - I'm working on getting that fixed on the conference pages. Second, my talk is actually a full day of intermediate/advanced Hadoop training on Monday. It will be

Re: tools for scrubbing HDFS data nodes?

2009-01-28 Thread Sriram Rao
Does this read every block of every file from all replicas and verify that the checksums are good? Sriram On Wed, Jan 28, 2009 at 6:20 PM, Sagar Naik sn...@attributor.com wrote: Check out fsck bin/hadoop fsck path -files -location -blocks Sriram Rao wrote: By scrub I mean, have a tool

Re: tools for scrubbing HDFS data nodes?

2009-01-28 Thread Sagar Naik
In addition to the datanode itself finding corrupted blocks (as Owen mentioned), if the client finds a corrupted block, it will go to another replica. What's your replication factor? -Sagar Sriram Rao wrote: Does this read every block of every file from all replicas and verify that the checksums are

Cannot run program chmod: error=12, Not enough space

2009-01-28 Thread Andy Liu
I'm running Hadoop 0.19.0 on Solaris (SunOS 5.10 on x86) and many jobs are failing with this exception: Error initializing attempt_200901281655_0004_m_25_0: java.io.IOException: Cannot run program chmod: error=12, Not enough space at

Re: tools for scrubbing HDFS data nodes?

2009-01-28 Thread Sriram Rao
The failover is fine; we are more interested in finding corrupt blocks sooner rather than later. Since there is the thread in the datanode, that is good. The replication factor is 3. Sriram On Wed, Jan 28, 2009 at 6:45 PM, Sagar Naik sn...@attributor.com wrote: In addition to datanode itself
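
One client-side way to force earlier detection, sketched here as an assumption rather than anything proposed in the thread: walk the filesystem and read every file to completion, which makes the DFS client verify checksums (and report bad replicas) as it goes. This only exercises whichever replica the client happens to read, so it is not a per-datanode scrub like the datanode's own block scanner.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReadAllFiles {
      public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        walk(fs, new Path(args.length > 0 ? args[0] : "/"));
      }

      private static void walk(FileSystem fs, Path p) throws IOException {
        for (FileStatus status : fs.listStatus(p)) {
          if (status.isDir()) {
            walk(fs, status.getPath());
          } else {
            byte[] buf = new byte[64 * 1024];
            FSDataInputStream in = fs.open(status.getPath());
            try {
              while (in.read(buf) != -1) {
                // Reading the data is enough; checksums are verified on the way through.
              }
            } catch (IOException e) {
              System.err.println("Possible corruption in " + status.getPath() + ": " + e);
            } finally {
              in.close();
            }
          }
        }
      }
    }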

Is Hadoop Suitable for me?

2009-01-28 Thread Simon
Hi Hadoop Users, I am trying to build a storage system for an office of about 20-30 users which will store everything, from normal everyday documents to computer configuration files to big files (600 MB) which are generated every hour. Is Hadoop suitable for this kind of environment?

RE: Is Hadoop Suitable for me?

2009-01-28 Thread Dmitry Pushkarev
Definitely not. You should be looking at expandable Ethernet storage that can be extended by connecting additional SAS arrays (like Dell PowerVault and similar offerings from other companies). 600 MB is just about 6 seconds over a gigabit network... --- Dmitry Pushkarev -Original Message- From:

RE: Is Hadoop Suitable for me?

2009-01-28 Thread Simon
But we are looking for an open source solution. If I do decide to implement this for the office storage, what problems will I run into? -Original Message- From: Dmitry Pushkarev [mailto:u...@stanford.edu] Sent: Thursday, 29 January 2009 5:15 PM To: core-user@hadoop.apache.org Cc:

Re: tools for scrubbing HDFS data nodes?

2009-01-28 Thread Raghu Angadi
Owen O'Malley wrote: On Jan 28, 2009, at 6:16 PM, Sriram Rao wrote: By scrub I mean, have a tool that reads every block on a given data node. That way, I'd be able to find corrupted blocks proactively rather than having an app read the file and find it. The datanode already has a thread