Greetings,
It really depends on your budget. What are you looking to spend? $5k?
$20k? Hadoop is about bringing the calculations to your data, so the
more machines you can have, the better.
In general, I'd recommend Dual-Core Opterons and 2-4 GB of RAM with a
SATA hard drive. My company just ord
Hi all,
I'm looking to build a small, 5-10 node cluster to run mostly CPU-bound
Hadoop jobs. I'm shying away from the 8-core, behemoth-type machines for
cost reasons. But what about dual-core machines? 32 or 64 bits?
I'm still in the planning stages, so any advice would be greatly
appreciated.
Please share your input on my problem.
Thanks,
On Sat, Apr 5, 2008 at 1:10 AM, Robert Dempsey <[EMAIL PROTECTED]> wrote:
> Ted,
>
> It appears that Nutch hasn't been updated in a while (in Internet time at
> least). Do you know if it works with the latest versions of Hadoop? Thanks.
>
> - Robert Dempsey
Ted,
It appears that Nutch hasn't been updated in a while (in Internet time
at least). Do you know if it works with the latest versions of Hadoop?
Thanks.
- Robert Dempsey (new to the list)
On Apr 4, 2008, at 5:36 PM, Ted Dunning wrote:
See Nutch. See Nutch run.
http://en.wikipedia.org/wiki/Nutch
Your configuration is good. The secondary Namenode does not publish a
web interface. The "null pointer" message in the secondary Namenode log
is a harmless bug, but it should be fixed. It would be nice if you could
open a JIRA for it.
Thanks,
Dhruba
-Original Message-
From: Yuri Pradkin [mailt
I'm re-posting this in the hope that someone can help. Thanks!
On Wednesday 02 April 2008 01:29:45 pm Yuri Pradkin wrote:
> Hi,
>
> I'm running Hadoop (latest snapshot) on several machines and in our setup
> namenode and secondarynamenode are on different systems. I see from the
> logs that second
It does work for me. I have to BOTH ship the extra jar using -file AND
include it in the classpath on the local system (by setting HADOOP_CLASSPATH).
I'm not sure what "nothing happened" means. BTW, I'm using the 0.16.2
release.
On Friday 04 April 2008 10:19:54 am Francesco Tamberi wrote:
> I already tr
See Nutch. See Nutch run.
http://en.wikipedia.org/wiki/Nutch
http://lucene.apache.org/nutch/
On 4/4/08 1:22 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I have never used a Lucene index before. I do not get how to build it with
> Hadoop MapReduce. Basically, what I was looking for is how to implement
> multilevel map/reduce for the problem I mentioned.
So it seems best for my application if I can somehow consolidate smaller files
into a couple of large files.
All of my files reside on S3, and I am using the 'distcp' command to copy them to
HDFS on EC2 before running an MR job. I was thinking it would be nice if I could
modify distcp such that
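For illustration only (this is not a change to distcp, and the class name and
paths below are hypothetical), a small pre-processing step can get a similar
effect by packing the small files into a single SequenceFile on HDFS before
the MR job runs:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Hypothetical helper: pack a directory of small files into one SequenceFile
// (key = original file name, value = raw file contents) so a MapReduce job
// sees a few large inputs instead of thousands of tiny ones.
public class SmallFilePacker {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path(args[0]);  // directory of small files
    Path packed = new Path(args[1]);    // single output SequenceFile

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, packed, Text.class, BytesWritable.class);
    try {
      for (FileStatus stat : fs.listStatus(inputDir)) {
        if (stat.isDir()) {
          continue;  // skip subdirectories in this simple sketch
        }
        byte[] buf = new byte[(int) stat.getLen()];
        FSDataInputStream in = fs.open(stat.getPath());
        try {
          in.readFully(buf);
        } finally {
          in.close();
        }
        writer.append(new Text(stat.getPath().getName()),
                      new BytesWritable(buf));
      }
    } finally {
      writer.close();
    }
  }
}

A job reading the packed file would then use SequenceFileInputFormat and get
the original file name back as the key and the original bytes as the value.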
Is there any additional configuration needed to run against S3 besides
these instructions?
http://wiki.apache.org/hadoop/AmazonS3
Following the instructions on that page, when I try to run "start-dfs.sh"
I see the following exception in the logs:
2008-04-04 17:03:31,345 ERROR org.apache.ha
Hi,
I have never used a Lucene index before. I do not get how to build it with
Hadoop MapReduce. Basically, what I was looking for is how to implement
multilevel map/reduce for the problem I mentioned.
On Fri, Apr 4, 2008 at 7:23 PM, Ning Li <[EMAIL PROTECTED]> wrote:
> You can build Lucene ind
I should add that systems like Pig and JAQL aim to satisfy your needs very
nicely. They may or may not be ready for you yet, but they aren't
terribly far away.
Also, you should consider whether it is better for you to have a system that
is considered "industry standard" (aka fully relational)
On 4/4/08 11:48 AM, "Paul Danese" <[EMAIL PROTECTED]> wrote:
> [ ... Extract and report on 25,000 out of 10^6 records ...]
>
> So...at my naive level, this seems like a decent job for hadoop.
> ***QUESTION 1: Is this an accurate belief?***
Sounds just right.
On 10 loser machines, it is feasib
The split will depend entirely on the input format that you use and the
files that you have. In your case, you have lots of very small files so the
limiting factor will almost certainly be the number of files. Thus, you
will have 1000 splits (one per file).
Your performance, btw, will likely be
Your distcp command looks correct. distcp may have created some log files
(e.g., inside /_distcp_logs_5vzva5 from your previous email). Could you check
the logs and see whether there are error messages?
If you could send me the distcp output and the logs, I may be able to find out
the problem. (
I have a question on how input files are split before they are given out to Map
functions.
Say I have an input directory containing 1000 files whose total size is 100
MB, I have 10 machines in my cluster, and I have set mapred.map.tasks to 10
in hadoop-site.xml.
1. With this conf
Thanks a lot for the help. Just for the record, I would like to post which
environment variables I set:
export JAVA_HOME=/usr/lib/jvm/java-1.5.0-sun-1.5.0.13
export OS_NAME=linux
export OS_ARCH=i386
export LIBHDFS_BUILD_DIR=/mnt/hd1/hadoop/hadoop-0.14.4/libhdfs
export SHLIB_VERSION=1
export HADOOP_HOME=/
I am sorry, that was a typo in my mail. The second command was (please
note the / at the end):
bin/hadoop fs -fs s3://id:[EMAIL PROTECTED] -ls /
I guess you are right, Nicholas. The
s3://id:[EMAIL PROTECTED]/file.txt indeed does not seem to be there.
But the earlier distcp command to copy the
Hi,
Currently I have a large (for me) amount of data stored in a relational
database (3 tables, each with 2-10 million related records; this is an
oversimplification, but for clarity it's close enough).
There is a relatively simple object-relational mapping (ORM) to my
database: specifically,
>To check that the file actually exists on S3, I tried the following commands:
>
>bin/hadoop fs -fs s3://id:[EMAIL PROTECTED] -ls
>bin/hadoop fs -fs s3://id:[EMAIL PROTECTED] -ls
>
>The first returned nothing, while the second returned the following:
>
>Found 1 items
>/_distcp_logs_5vzva5
On 4/4/08 10:18 AM, "Francesco Tamberi" <[EMAIL PROTECTED]> wrote:
> Thanks for your fast reply!
>
> Ted Dunning wrote:
>> Take a look at the way that the text input format moves to the next line
>> after a split point.
>>
>>
> I'm not sure I understand... is my way correct, or are you suggesting
> another one?
You can build Lucene indexes using Hadoop Map/Reduce. See the index
contrib package in the trunk. Or is it still not something you are
looking for?
Regards,
Ning
On 4/4/08, Aayush Garg <[EMAIL PROTECTED]> wrote:
> No, currently my requirement is to solve this problem with Apache Hadoop. I am
> trying
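If you do want to roll something by hand rather than use the contrib package,
here is a very rough, hypothetical sketch of the per-reduce-task idea, written
against the Lucene 2.x-era and old org.apache.hadoop.mapred APIs current at the
time of this thread; the class, field names, and local path are made up, and
copying the finished index into HDFS is only noted in a comment:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

// Hypothetical sketch: each reduce task builds a Lucene index on local disk.
public class IndexReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private IndexWriter writer;

  public void configure(JobConf job) {
    try {
      // One local index directory per reduce task (path is made up).
      String dir = "/tmp/index-" + job.get("mapred.task.id", "local");
      writer = new IndexWriter(dir, new StandardAnalyzer(), true);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    while (values.hasNext()) {
      Document doc = new Document();
      doc.add(new Field("id", key.toString(),
                        Field.Store.YES, Field.Index.UN_TOKENIZED));
      doc.add(new Field("body", values.next().toString(),
                        Field.Store.NO, Field.Index.TOKENIZED));
      writer.addDocument(doc);
    }
  }

  public void close() throws IOException {
    writer.optimize();
    writer.close();
    // A real job would now copy the local index directory into HDFS (or S3),
    // e.g. with FileSystem.copyFromLocalFile(); omitted here.
  }
}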
I already tried that... nothing happened...
Thank you,
-- Francesco
Ted Dunning wrote:
I saw that, but I don't know if it will put a jar into the classpath at the
other end.
On 4/4/08 9:56 AM, "Yuri Pradkin" <[EMAIL PROTECTED]> wrote:
There is a -file option to streaming that
-file
Thanks for your fast reply!
Ted Dunning wrote:
Take a look at the way that the text input format moves to the next line
after a split point.
I'm not sure I understand... is my way correct, or are you suggesting
another one?
There are a couple of possible problems with your input format
No, currently my requirement is to solve this problem with Apache Hadoop. I am
trying to build up this type of inverted index and then measure its performance
against other approaches.
Thanks,
On Fri, Apr 4, 2008 at 5:54 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
> Are you implementing this
I saw that, but I don't know if it will put a jar into the classpath at the
other end.
On 4/4/08 9:56 AM, "Yuri Pradkin" <[EMAIL PROTECTED]> wrote:
> There is a -file option to streaming that
> -file File/dir to be shipped in the Job jar file
>
> On Friday 04 April 2008 09:24:59 am Te
There is a -file option to streaming:
-file File/dir to be shipped in the Job jar file
On Friday 04 April 2008 09:24:59 am Ted Dunning wrote:
> At one point, it
> was necessary to unpack the streaming.jar file and put your own classes and
> jars into that. Last time I looked
My suggestion actually is similar to what Bigtable and HBase do.
That is, keep some recent updates in memory, burping them to disk at
relatively frequent intervals. Then, when a number of burps are available,
they can be merged into a larger burp. This pyramid can be extended as
needed.
Searche
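Purely as an illustration of that pattern (this is not how HBase or Bigtable
actually implement it, and every name below is made up), a toy Java sketch of
"buffer in memory, spill sorted burps, merge burps" might look like:

import java.io.*;
import java.util.*;

// Toy sketch: buffer recent updates in memory, spill ("burp") them to a
// sorted file when the buffer fills up, and merge spills once enough exist.
public class BurpStore {
  private final TreeMap<String, String> memory = new TreeMap<String, String>();
  private final List<File> spills = new ArrayList<File>();
  private final int maxInMemory;

  public BurpStore(int maxInMemory) {
    this.maxInMemory = maxInMemory;
  }

  public void put(String key, String value) throws IOException {
    memory.put(key, value);
    if (memory.size() >= maxInMemory) {
      spill();
    }
  }

  // Write the in-memory map (already sorted by TreeMap) to a new spill file.
  private void spill() throws IOException {
    File f = File.createTempFile("burp", ".txt");
    PrintWriter out = new PrintWriter(new FileWriter(f));
    for (Map.Entry<String, String> e : memory.entrySet()) {
      out.println(e.getKey() + "\t" + e.getValue());
    }
    out.close();
    memory.clear();
    spills.add(f);
    if (spills.size() >= 4) {
      merge();
    }
  }

  // Merge all spill files into one larger, still-sorted file. A real system
  // would stream-merge the sorted files; this sketch just reloads them.
  private void merge() throws IOException {
    TreeMap<String, String> merged = new TreeMap<String, String>();
    for (File f : spills) {
      BufferedReader in = new BufferedReader(new FileReader(f));
      String line;
      while ((line = in.readLine()) != null) {
        int tab = line.indexOf('\t');
        merged.put(line.substring(0, tab), line.substring(tab + 1));
      }
      in.close();
      f.delete();
    }
    spills.clear();
    File big = File.createTempFile("burp-merged", ".txt");
    PrintWriter out = new PrintWriter(new FileWriter(big));
    for (Map.Entry<String, String> e : merged.entrySet()) {
      out.println(e.getKey() + "\t" + e.getValue());
    }
    out.close();
    spills.add(big);  // the larger burp joins the pyramid for later merges
  }
}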
Take a look at the way that the text input format moves to the next line
after a split point.
There are a couple of possible causes of your "input format not found"
problem.
First, is your input format class in a package? If so, you need to provide its
fully qualified name.
Secondly, you have to gi
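To make the first point concrete, here is a small, self-contained sketch of
the technique being described; this is not Hadoop's actual LineRecordReader,
and the class name is invented. Each reader is given a byte range [start, end),
skips the partial first line unless it starts at offset 0, and reads whole
lines until it passes its end offset, so every line is consumed by exactly one
split:

import java.io.IOException;
import java.io.RandomAccessFile;

// Hypothetical illustration of how a line-oriented reader handles a split:
// skip the (possibly partial) first line unless the split starts at byte 0,
// then read whole lines until the current position passes the split's end.
public class SplitLineReader {
  private final RandomAccessFile file;
  private final long end;

  public SplitLineReader(String path, long start, long end) throws IOException {
    this.file = new RandomAccessFile(path, "r");
    this.end = end;
    file.seek(start);
    if (start != 0) {
      // The previous split owns the line that straddles 'start';
      // discard the partial line and begin at the next full record.
      file.readLine();
    }
  }

  /** Returns the next line in this split, or null when the split is done. */
  public String next() throws IOException {
    if (file.getFilePointer() > end) {
      return null;  // the last line we read already ran past 'end'; stop here
    }
    // A line that starts at or before 'end' belongs to this split, even if it
    // runs past 'end'; the next split will skip its remainder.
    return file.readLine();
  }

  public void close() throws IOException {
    file.close();
  }
}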
Hi Alberto,
Here's my take as someone from the traditional RDBMS world who has been
experimenting with Hadoop for a month, so don't take my comments to be
definitive.
On Fri, Apr 4, 2008 at 7:57 AM, Alberto Mesas <[EMAIL PROTECTED]>
wrote:
> We have been reading some doc, and playing with the ba
Just write a parser and put it into the configure method.
On 4/3/08 8:31 PM, "Jeremy Chow" <[EMAIL PROTECTED]> wrote:
> thanks, the configuration file format looks like below,
>
> @tag_name0 name0 {value00, value01, value02}
> @tag_name1 name1 {value10, value11, value12}
>
> and reading it from H
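As a rough illustration of that suggestion (not code from this thread; the
class and key names are made up), a parser for lines in the quoted
"@tag name {v1, v2, v3}" shape could look like this:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical parser for lines like: @tag_name0 name0 {value00, value01, value02}
public class TagConfigParser {
  // group 1 = tag, group 2 = name, group 3 = comma-separated values
  private static final Pattern LINE =
      Pattern.compile("@(\\S+)\\s+(\\S+)\\s*\\{([^}]*)\\}");

  public static Map<String, List<String>> parse(String path) throws IOException {
    Map<String, List<String>> result = new HashMap<String, List<String>>();
    BufferedReader in = new BufferedReader(new FileReader(path));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
          continue;  // skip blank or malformed lines
        }
        String key = m.group(1) + ":" + m.group(2);  // e.g. "tag_name0:name0"
        result.put(key, Arrays.asList(m.group(3).split("\\s*,\\s*")));
      }
    } finally {
      in.close();
    }
    return result;
  }
}

A mapper could then call something like TagConfigParser.parse(...) from its
configure(JobConf) method, with the file either present on each node's local
filesystem or shipped to the tasks via the DistributedCache.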
Are you implementing this for instruction or production?
If production, why not use Lucene?
On 4/3/08 6:45 PM, "Aayush Garg" <[EMAIL PROTECTED]> wrote:
> Hi Amar, Theodore, Arun,
>
> Thanks for your reply. Actually I am new to Hadoop so I can't figure out much.
> I have written the following code
We have been reading some docs and playing with the basic samples that come
with Hadoop 0.16.2. So let's see if we have understood everything :)
We plan to use HCore for processing our logs, but would it be possible to
use it for a case like this one?
MySQL table with a few thousand new rows
Ted Dunning wrote:
This factor of 1500 in speed seems pretty significant and is the motivation
for not supporting random read/write.
This doesn't mean that random access update should never be done, but it
does mean that scaling a design based around random access will be more
difficult than sc
Thanks for the quick response, Tom.
I have just switched to Hadoop 0.16.2 and tried this again. Now I am getting
the following error:
Copy failed: org.apache.hadoop.mapred.InvalidInputException: Input source
s3://id:[EMAIL PROTECTED]/file.txt does not exist.
I copied the file to S3 using the fol
Hi All,
I have a streaming tool chain written in C++/Python that performs some
operations on really big text files (on the order of gigabytes); the chain reads
files and writes its result to standard output.
The chain needs to read well-structured files, so I need to control how
Hadoop splits files: i
Hi Siddhartha,
This is a problem in 0.16.1
(https://issues.apache.org/jira/browse/HADOOP-3027) that is fixed in
0.16.2, which was released yesterday.
Tom
On 04/04/2008, Siddhartha Reddy <[EMAIL PROTECTED]> wrote:
> I am trying to run a Hadoop cluster on Amazon EC2 and backup all the data on
> A
> However, when I try it on 0.15.3, it doesn't allow a folder copy.
I have 100+ files in my S3 bucket, and I had to run "distcp" on each
one of them to get them onto HDFS on EC2. Not a nice experience!
This sounds like a bug - could you log a Jira issue for this please?
Thanks,
Tom
I am trying to run a Hadoop cluster on Amazon EC2 and back up all the data to
Amazon S3 between runs. I am using Hadoop 0.16.1 on a cluster made up of
CentOS 5 images (ami-08f41161).
I am able to copy from HDFS to S3 using the following command:
bin/hadoop distcp file.txt s3://id:[EMAIL PROTE