Re: Hadoop cluster build, machine specs

2008-04-04 Thread Bradford Stephens
Greetings! It really depends on your budget. What are you looking to spend? $5k? $20k? Hadoop is about bringing the calculations to your data, so the more machines you can have, the better. In general, I'd recommend dual-core Opterons and 2-4 GB of RAM with a SATA hard drive. My company just ord

Hadoop cluster build, machine specs

2008-04-04 Thread Ted Dziuba
Hi all, I'm looking to build a small, 5-10 node cluster to run mostly CPU-bound Hadoop jobs. I'm shying away from the 8-core behemoth-type machines for cost reasons. But what about dual-core machines? 32-bit or 64-bit? I'm still in the planning stages, so any advice would be greatly apprecia

Re: Hadoop: Multiple map reduce or some better way

2008-04-04 Thread Aayush Garg
Please give your inputs on my problem. Thanks, On Sat, Apr 5, 2008 at 1:10 AM, Robert Dempsey <[EMAIL PROTECTED]> wrote: > Ted, > > It appears that Nutch hasn't been updated in a while (in Internet time at > least). Do you know if it works with the latest versions of Hadoop? Thanks. > > - Robe

Re: Hadoop: Multiple map reduce or some better way

2008-04-04 Thread Robert Dempsey
Ted, It appears that Nutch hasn't been updated in a while (in Internet time at least). Do you know if it works with the latest versions of Hadoop? Thanks. - Robert Dempsey (new to the list) On Apr 4, 2008, at 5:36 PM, Ted Dunning wrote: See Nutch. See Nutch run. http://en.wikipedia.or

RE: secondary namenode web interface

2008-04-04 Thread dhruba Borthakur
Your configuration is good. The secondary Namenode does not publish a web interface. The "null pointer" message in the secondary Namenode log is a harmless bug but should be fixed. It would be nice if you could open a JIRA for it. Thanks, Dhruba -Original Message- From: Yuri Pradkin [mailt

Re: secondary namenode web interface

2008-04-04 Thread Yuri Pradkin
I'm re-posting this in the hope that someone can help. Thanks! On Wednesday 02 April 2008 01:29:45 pm Yuri Pradkin wrote: > Hi, > > I'm running Hadoop (latest snapshot) on several machines and in our setup > namenode and secondarynamenode are on different systems. I see from the > logs that second

Re: Streaming + custom input format

2008-04-04 Thread Yuri Pradkin
It does work for me. I have to BOTH ship the extra jar using -file AND include it in the classpath on the local system (by setting HADOOP_CLASSPATH). I'm not sure what "nothing happened" means. BTW, I'm using the 0.16.2 release. On Friday 04 April 2008 10:19:54 am Francesco Tamberi wrote: > I already tr
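
For reference, a minimal sketch of the combination Yuri describes, against the 0.16.2 streaming jar. The jar path varies by layout, and the example jar and input format class name are placeholders, not details from the thread:

    # make the custom input format visible to the client JVM that submits the job
    export HADOOP_CLASSPATH=/path/to/myformat.jar

    # ship the same jar inside the job jar so the task JVMs can load it too
    bin/hadoop jar contrib/streaming/hadoop-0.16.2-streaming.jar \
        -input in -output out \
        -mapper /bin/cat -reducer /bin/wc \
        -inputformat org.example.MyInputFormat \
        -file /path/to/myformat.jar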

Re: Hadoop: Multiple map reduce or some better way

2008-04-04 Thread Ted Dunning
See Nutch. See Nutch run. http://en.wikipedia.org/wiki/Nutch http://lucene.apache.org/nutch/ On 4/4/08 1:22 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote: > Hi, > > I have not used lucene index ever before. I do not get how we build it with > hadoop Map reduce. Basically what I was looking f

Re: on number of input files and split size

2008-04-04 Thread Prasan Ary
So it seems best for my application if I can somehow consolidate smaller files into a couple of large files. All of my files reside on S3, and I am using the 'distcp' command to copy them to HDFS on EC2 before running a MR job. I was thinking it would be nice if I could modify distcp such that
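
Short of patching distcp, one workaround in the direction Prasan describes is to merge the small files after copying them. A sketch with placeholder bucket, credentials, and paths (note that -getmerge stages the merged data on local disk, so it must fit there):

    # copy the many small files from S3 into HDFS
    bin/hadoop distcp s3://ID:SECRET@bucket/input hdfs://namenode:9000/tmp/small-files

    # merge them into one local file, then push that back as a single HDFS file
    bin/hadoop fs -getmerge /tmp/small-files /tmp/merged.txt
    bin/hadoop fs -put /tmp/merged.txt /input/merged.txt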

S3 Exception

2008-04-04 Thread Craig Blake
Is there any additional configuration needed to run against S3 besides these instructions? http://wiki.apache.org/hadoop/AmazonS3 Following the instructions on that page, when I try to run "start-dfs.sh" I see the following exception in the logs: 2008-04-04 17:03:31,345 ERROR org.apache.ha
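
For comparison, the wiki page's setup amounts to something like the following in hadoop-site.xml, with the bucket and keys as placeholders; a typo in any of these is a common cause of startup exceptions. Worth noting too: if S3 is made the default filesystem, there is no HDFS for start-dfs.sh to start, which may itself be the source of the error.

    <property>
      <name>fs.default.name</name>
      <value>s3://BUCKET</value>
    </property>
    <property>
      <name>fs.s3.awsAccessKeyId</name>
      <value>ID</value>
    </property>
    <property>
      <name>fs.s3.awsSecretAccessKey</name>
      <value>SECRET</value>
    </property>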

Re: Hadoop: Multiple map reduce or some better way

2008-04-04 Thread Aayush Garg
Hi, I have never used a Lucene index before. I do not get how we build it with Hadoop MapReduce. Basically, what I was looking for is how to implement multilevel map/reduce for my mentioned problem. On Fri, Apr 4, 2008 at 7:23 PM, Ning Li <[EMAIL PROTECTED]> wrote: > You can build Lucene ind

Re: more noob questions--how/when is data 'distributed' across a cluster?

2008-04-04 Thread Ted Dunning
I should add that systems like Pig and JAQL aim to satisfy your needs very nicely. They may or may not be ready for your needs, but they aren't terribly far away. Also, you should consider whether it is better for you to have a system that is considered "industry standard" (aka fully relational)

Re: more noob questions--how/when is data 'distributed' across a cluster?

2008-04-04 Thread Ted Dunning
On 4/4/08 11:48 AM, "Paul Danese" <[EMAIL PROTECTED]> wrote: > [ ... Extract and report on 25,000 out of 10^6 records ...] > > So...at my naive level, this seems like a decent job for hadoop. > ***QUESTION 1: Is this an accurate belief?*** Sounds just right. On 10 loser machines, it is feasib

Re: on number of input files and split size

2008-04-04 Thread Ted Dunning
The split will depend entirely on the input format that you use and the files that you have. In your case, you have lots of very small files so the limiting factor will almost certainly be the number of files. Thus, you will have 1000 splits (one per file). Your performance, btw, will likely be

Re: distcp fails when copying from s3 to hdfs

2008-04-04 Thread s29752-hadoopuser
Your distcp command looks correct. distcp may have created some log files (e.g. inside /_distcp_logs_5vzva5 from your previous email.) Could you check the logs, see whether there are error messages? If you could send me the distcp output and the logs, I may be able to find out the problem. (

on number of input files and split size

2008-04-04 Thread Prasan Ary
I have a question on how input files are split before they are given out to Map functions. Say I have an input directory containing 1000 files whose total size is 100 MB, and I have 10 machines in my cluster and I have configured 10 mapred.map.tasks in hadoop-site.xml. 1. With this conf

Re: Error msg: Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/conf/Configuration

2008-04-04 Thread Anisio Mendes Lacerda
Thanks a lot for the help. Just for the record, I would like to post which environment variables I set: export JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun-1.5.0.13 export OS_NAME=linux export OS_ARCH=i386 export LIBHDFS_BUILD_DIR=/mnt/hd1/hadoop/hadoop-0.14.4/libhdfs export SHLIB_VERSION=1 export HADOOP_HOME=/

Re: distcp fails when copying from s3 to hdfs

2008-04-04 Thread Siddhartha Reddy
I am sorry, that was a typo in my mail. The second command was (please note the / at the end): bin/hadoop fs -fs s3://id:[EMAIL PROTECTED] -ls / I guess you are right, Nicholas. The s3://id:[EMAIL PROTECTED]/file.txt indeed does not seem to be there. But the earlier distcp command to copy the
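
With placeholder credentials, the distinction between the two forms is just the path argument; only the second names something to list:

    bin/hadoop fs -fs s3://ID:SECRET@bucket -ls     # no path argument
    bin/hadoop fs -fs s3://ID:SECRET@bucket -ls /   # lists the bucket root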

more noob questions--how/when is data 'distributed' across a cluster?

2008-04-04 Thread Paul Danese
Hi, Currently I have a large (for me) amount of data stored in a relational database (3 tables, each with 2-10 million related records. This is an oversimplification, but for clarity it's close enough). There is a relatively simple Object-relational Mapping (ORM) to my database: Specifically,

Re: distcp fails when copying from s3 to hdfs

2008-04-04 Thread s29752-hadoopuser
>To check that the file actually exists on S3, I tried the following commands: > >bin/hadoop fs -fs s3://id:[EMAIL PROTECTED] -ls >bin/hadoop fs -fs s3://id:[EMAIL PROTECTED] -ls > >The first returned nothing, while the second returned the following: > >Found 1 items >/_distcp_logs_5vzva5

Re: Streaming + custom input format

2008-04-04 Thread Ted Dunning
On 4/4/08 10:18 AM, "Francesco Tamberi" <[EMAIL PROTECTED]> wrote: > Thanks for your fast reply! > > Ted Dunning wrote: >> Take a look at the way that the text input format moves to the next line >> after a split point. >> >> > I'm not sure I understand... is my way correct or are you

Re: Hadoop: Multiple map reduce or some better way

2008-04-04 Thread Ning Li
You can build Lucene indexes using Hadoop Map/Reduce. See the index contrib package in the trunk. Or is it still not something you are looking for? Regards, Ning On 4/4/08, Aayush Garg <[EMAIL PROTECTED]> wrote: > No, currently my requirement is to solve this problem by apache hadoop. I am > tryi

Re: Streaming + custom input format

2008-04-04 Thread Francesco Tamberi
I already tried that... nothing happened... Thank you, -- Francesco Ted Dunning wrote: I saw that, but I don't know if it will put a jar into the classpath at the other end. On 4/4/08 9:56 AM, "Yuri Pradkin" <[EMAIL PROTECTED]> wrote: There is a -file option to streaming that -file

Re: Streaming + custom input format

2008-04-04 Thread Francesco Tamberi
Thanks for your fast reply! Ted Dunning wrote: Take a look at the way that the text input format moves to the next line after a split point. I'm not sure I understand... is my way correct or are you suggesting another one? There are a couple of possible problems with your input format

Re: Hadoop: Multiple map reduce or some better way

2008-04-04 Thread Aayush Garg
No, currently my requirement is to solve this problem with Apache Hadoop. I am trying to build up this type of inverted index and then measure performance criteria with respect to others. Thanks, On Fri, Apr 4, 2008 at 5:54 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: > > Are you implementing this

Re: Streaming + custom input format

2008-04-04 Thread Ted Dunning
I saw that, but I don't know if it will put a jar into the classpath at the other end. On 4/4/08 9:56 AM, "Yuri Pradkin" <[EMAIL PROTECTED]> wrote: > There is a -file option to streaming that > -file File/dir to be shipped in the Job jar file > > On Friday 04 April 2008 09:24:59 am Te

Re: Streaming + custom input format

2008-04-04 Thread Yuri Pradkin
There is a -file option to streaming that -file File/dir to be shipped in the Job jar file On Friday 04 April 2008 09:24:59 am Ted Dunning wrote: > At one point, it > was necessary to unpack the streaming.jar file and put your own classes and > jars into that. Last time I looked

Re: Is it possible in Hadoop to overwrite or update a file?

2008-04-04 Thread Ted Dunning
My suggestion actually is similar to what Bigtable and HBase do. That is to keep some recent updates in memory, burping them to disk at relatively frequent intervals. Then when a number of burps are available, they can be merged to a larger burp. This pyramid can be extended as needed. Searche

Re: Streaming + custom input format

2008-04-04 Thread Ted Dunning
Take a look at the way that the text input format moves to the next line after a split point. There are a couple of possible causes of your input-format-not-found problem. First, is your input format in a package? If so, you need to provide the complete name for the class. Secondly, you have to gi

Re: Newbie and basic questions

2008-04-04 Thread Travis Brady
Hi Alberto, Here's my take as someone from the traditional RDBMS world who has been experimenting with Hadoop for a month, so don't take my comments to be definitive. On Fri, Apr 4, 2008 at 7:57 AM, Alberto Mesas <[EMAIL PROTECTED]> wrote: > We have been reading some doc, and playing with the ba

Re: If I wanna read a config file before map task, which class I should choose?

2008-04-04 Thread Ted Dunning
Just write a parser and put it into the configure method. On 4/3/08 8:31 PM, "Jeremy Chow" <[EMAIL PROTECTED]> wrote: > thanks, the config file format looks like the following, > > @tag_name0 name0 {value00, value01, value02} > @tag_name1 name1 {value10, value11, value12} > > and reading it from H

Re: Hadoop: Multiple map reduce or some better way

2008-04-04 Thread Ted Dunning
Are you implementing this for instruction or production? If production, why not use Lucene? On 4/3/08 6:45 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote: > Hi Amar, Theodore, Arun, > > Thanks for your reply. Actually I am new to Hadoop so can't figure out much. > I have written the following code

Newbie and basic questions

2008-04-04 Thread Alberto Mesas
We have been reading some docs and playing with the basic samples that come with Hadoop 0.16.2. So let's see if we have understood everything :) We plan to use HCore for processing our logs, but would it be possible to use it for a case like this one? A MySQL table with a few thousand new rows

Re: Is it possible in Hadoop to overwrite or update a file?

2008-04-04 Thread Andrzej Bialecki
Ted Dunning wrote: This factor of 1500 in speed seems pretty significant and is the motivation for not supporting random read/write. This doesn't mean that random access update should never be done, but it does mean that scaling a design based around random access will be more difficult than sc

Re: distcp fails when copying from s3 to hdfs

2008-04-04 Thread Siddhartha Reddy
Thanks for the quick response, Tom. I have just switched to Hadoop 0.16.2 and tried this again. Now I am getting the following error: Copy failed: org.apache.hadoop.mapred.InvalidInputException: Input source s3://id:[EMAIL PROTECTED]/file.txt does not exist. I copied the file to S3 using the fol
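
A sketch of the check-then-retry sequence discussed in this thread, with placeholder credentials and namenode host:

    # confirm the object is actually visible through the S3 filesystem
    bin/hadoop fs -fs s3://ID:SECRET@bucket -ls /

    # then retry the copy back into HDFS
    bin/hadoop distcp s3://ID:SECRET@bucket/file.txt hdfs://namenode:9000/user/root/file.txt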

Streaming + custom input format

2008-04-04 Thread Francesco Tamberi
Hi All, I have a streaming tool chain written in C++/Python that performs some operations on really big text files (on the order of gigabytes); the chain reads files and writes its result to standard output. The chain needs to read well-structured files, and so I need to control how Hadoop splits files: i

Re: distcp fails when copying from s3 to hdfs

2008-04-04 Thread Tom White
Hi Siddhartha, This is a problem in 0.16.1 (https://issues.apache.org/jira/browse/HADOOP-3027) that is fixed in 0.16.2, which was released yesterday. Tom On 04/04/2008, Siddhartha Reddy <[EMAIL PROTECTED]> wrote: > I am trying to run a Hadoop cluster on Amazon EC2 and backup all the data on > A

Re: distcp fails :Input source not found

2008-04-04 Thread Tom White
> However, when I try it on 0.15.3, it doesn't allow a folder copy. I have 100+ files in my S3 bucket, and I had to run "distcp" on each one of them to get them on HDFS on EC2. Not a nice experience! This sounds like a bug - could you log a Jira issue for this please? Thanks, Tom

distcp fails when copying from s3 to hdfs

2008-04-04 Thread Siddhartha Reddy
I am trying to run a Hadoop cluster on Amazon EC2 and backup all the data on Amazon S3 between the runs. I am using Hadoop 0.16.1 on a cluster made up of CentOS 5 images (ami-08f41161). I am able to copy from hdfs to S3 using the following command: bin/hadoop distcp file.txt s3://id:[EMAIL PROTE