Re: Restricting number of records from map output

2011-01-12 Thread Anthony Urso
Either use an instance variable or a Combiner. The latter is correct if you want the top-n per key from the mapper. On Wed, Jan 12, 2011 at 10:03 AM, Rakesh Davanum wrote: > Hi, > > I have a sort job consisting of only the Mapper (no Reducer) task. I want my > results to contain only the top n r
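A rough sketch of the instance-variable approach, written against the newer org.apache.hadoop.mapreduce API: each map task buffers candidates in an instance field and emits only the N largest in cleanup(). The tab-separated numeric first field and N = 10 are illustrative assumptions, not details from the thread.

import java.io.IOException;
import java.util.TreeMap;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical map-only top-N: keep candidates in an instance variable
// and emit only the N largest when the map task finishes.
public class TopNMapper extends Mapper<LongWritable, Text, LongWritable, Text> {

  private static final int N = 10; // records to keep per map task (illustrative)
  private final TreeMap<Long, Text> topN = new TreeMap<Long, Text>();

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    // Assumes the first tab-separated field of each line is the numeric sort key;
    // duplicate keys simply overwrite each other in this simple sketch.
    long weight = Long.parseLong(value.toString().split("\t")[0]);
    topN.put(weight, new Text(value));
    if (topN.size() > N) {
      topN.remove(topN.firstKey()); // drop the current smallest
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    for (Long weight : topN.descendingKeySet()) {
      context.write(new LongWritable(weight), topN.get(weight));
    }
  }
}

With a Combiner instead, the same pruning runs over the map output grouped by key before it is written out, which is the right fit when the top n is wanted per key rather than per task.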

distcp from an s3 url with slashes.

2010-05-27 Thread Anthony Urso
My S3 secret key has a slash in it. After replacing the / with %2F I can use it as a filesystem URL in something like: $ hadoop fs -fs s3n://$KEY:$sec...@$bucket -ls / Found 1 items drwxrwxrwx - 0 1969-12-31 16:00 /remote But when I try a distcp, it crashes with: $ had
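For reference, a small Java sketch of that encoding step; java.net.URLEncoder percent-encodes the slash so the secret can be embedded in the s3n:// URL. The access key, secret, and bucket below are placeholders.

import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Illustrative only: percent-encode an AWS secret key containing a slash
// before embedding it in an s3n:// filesystem URL.
public class S3UrlBuilder {
  public static void main(String[] args) throws UnsupportedEncodingException {
    String accessKey = "AKIAEXAMPLE";   // placeholder
    String secretKey = "abc/def+ghi";   // placeholder; note the slash
    String bucket = "my-bucket";        // placeholder

    String encodedSecret = URLEncoder.encode(secretKey, "UTF-8"); // "/" -> "%2F"
    System.out.println("s3n://" + accessKey + ":" + encodedSecret + "@" + bucket);
  }
}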

Re: lost+found files prevent DataNode formatting

2009-09-28 Thread Anthony Urso
Those are created by fsck and will come back. They belong to the filesystem and you shouldn't delete them. Instead, create subdirectories on those mount points and use them as DFS directories, e.g. create and use /mnt/dfs instead of /mnt. Cheers, Anthony On Mon, Sep 28, 2009 at 6:33 PM, Stas Os
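A minimal hdfs-site.xml sketch of that layout (hadoop-site.xml on older releases); the /mnt/dfs paths follow the example above and should match directories you actually create.

<!-- Illustrative only: point DFS storage at a subdirectory of the mount,
     not at the mount point itself, so lost+found stays out of the way. -->
<property>
  <name>dfs.name.dir</name>
  <value>/mnt/dfs/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/mnt/dfs/data</value>
</property>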

Re: access hdfs & post job from client application

2009-09-26 Thread Anthony Urso
First set up your cluster and the client machine as per the getting started guide. Synchronize your configuration files everywhere. The FileSystem class has the API for interacting with HDFS: http://hadoop.apache.org/common/docs/r0.20.1/api/org/apache/hadoop/fs/FileSystem.html The JobConf a
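A rough client-side sketch against the 0.20 API, assuming the client's synchronized configuration files already point at the cluster; the paths and job settings below are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// The Configuration picks up the synchronized hadoop-site.xml, so
// FileSystem.get() talks to the cluster's HDFS and JobClient submits
// to its JobTracker. All paths here are placeholders.
public class ClientExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // HDFS access through the FileSystem API
    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path("/tmp/local.txt"), new Path("/user/me/input/local.txt"));

    // Job submission through JobConf / JobClient (identity map/reduce by default)
    JobConf job = new JobConf(conf, ClientExample.class);
    job.setJobName("client-submitted-job");
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(job, new Path("/user/me/input"));
    FileOutputFormat.setOutputPath(job, new Path("/user/me/output"));
    JobClient.runJob(job);
  }
}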

Re: Can not stop hadoop cluster ?

2009-09-21 Thread Anthony Urso
rt later. > > On Mon, Sep 21, 2009 at 12:15 PM, Jeff Zhang wrote: > >> My cluster has running for several months. >> >> Is this a bug of hadoop? I think hadoop is supposed to run for long time. >> >> And will I lose data if I manually kill the process ? &g

Re: Can not stop hadoop cluster ?

2009-09-20 Thread Anthony Urso
How long has the cluster been running? I have run into this problem when tmpwatch deleted all of the pid files from /tmp because they were more than n days old. If that is the case, you will have to manually kill all of the processes yourself. Cheers, Anthony On Sun, Sep 20, 2009 at 10:19 PM, Je

Re: missing libraries when starting hadoop daemon

2009-09-17 Thread Anthony Urso
start-all.sh is a shell script; it can't be run via Hadoop. Just run it directly from your shell like this: $ bin/start-all.sh Cheers, Anthony On Thu, Sep 17, 2009 at 4:30 PM, Simon Chu wrote: > had...@zoe:/opt/hadoop-0.18.3> hadoop start-all.sh > Exception in thread "main" java.lang.NoClassDefF

Re: building hdfs-fuse

2009-09-11 Thread Anthony Urso
fuse.h should come with the FUSE software, not Hadoop. It should be somewhere like /usr/include/fuse.h on a Linux machine, or possibly /usr/local/include/fuse.h. Did you install FUSE from source? If not, you probably need something like Debian's libfuse-dev package installed by your operating syste

Re: Thrift HDFS interface problems

2009-09-11 Thread Anthony Urso
For the Thrift server bug, the best way to get it fixed is to file a bug report at http://issues.apache.org/jira HBase 0.20 is out, download here: http://hadoop.apache.org/hbase/releases.html There is an HBase mailing list, hbase-u...@hadoop.apache.org. And yes, I believe you do still need to ke

Re: Cluster size for Linux file system

2009-09-08 Thread Anthony Urso
There is nothing really preventing you from filling your HDFS with a lot of very small files*, so it would depend on your use case; however, typical usage of Hadoop would prescribe as large a block size as is available, in order to stream very large files off the disk efficiently. * Except name
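For what it's worth, the HDFS block size is just a configuration setting; an illustrative hdfs-site.xml entry raising it from the then-default 64 MB to 128 MB would look like this (the value is an example, not a recommendation).

<property>
  <name>dfs.block.size</name>
  <value>134217728</value> <!-- 128 MB, in bytes; illustrative only -->
</property>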

Re: copy data (distcp) from local cluster to the EC2 cluster

2009-09-08 Thread Anthony Urso
Yes, just run something along the lines of: hadoop distcp hdfs://local-namenode/path hdfs://ec2-namenode/path on the job tracker of a MapReduce cluster. Make sure that your EC2 security group setup allows HDFS access from the local HDFS cluster and wherever you run the MapReduce job from. Also, I b