OK, so the problem is that you have a machine with 16 CPUs and a
computation job that uses at most 3 of them. Is that right?
Which Mahout task is it? Do you know whether it performs and scales well
across multiple Hadoop nodes? What matters is whether the Partitioner for
the Mahout code can split the computation into independent pieces.
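If the job never spreads past 3 CPUs, one thing to check is whether the partitioning funnels most keys into a few tasks. As a rough illustration only (not Mahout's actual code; the Text key type and modulo hashing are my assumptions), a Partitioner that spreads keys evenly looks like this:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical sketch: spread Text keys evenly over all reduce tasks.
public class EvenSpreadPartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text key, Text value, int numPartitions) {
    // Mask the sign bit so the partition index is never negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

You would set it with job.setPartitionerClass(EvenSpreadPartitioner.class), and also check that the number of map and reduce slots on the box is actually higher than 3.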
Greetings,
The Seattle Scalability Meetup isn't slacking for the holidays. We've
got an awesome lineup for Wed, December 8 at 7pm:
http://www.meetup.com/Seattle-Hadoop-HBase-NoSQL-Meetup/
-Jake Mannix from Twitter will talk about the Twitter Search
infrastructure (with distributed Lucene)
-Chris
I am building a cluster using Michael G. Noll's instructions found here:
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
I have set up two single-node clusters and they work fine. When I change their
configurations to behave as a single cluster (by changing…
On 11/30/2010 03:51 AM, Steve Loughran wrote:
On 30/11/10 03:59, hadoopman wrote:
You don't need all the files in the cluster in sync, as a lot of them
are intermediate and transient files.
Instead use distcp to copy the source files to the two clusters; this
runs across the machines in the clu…
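For reference, a typical distcp run between two clusters looks roughly like this (the namenode hosts and paths here are made up):

hadoop distcp hdfs://nn-primary:8020/data/source hdfs://nn-dr:8020/data/source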
On Tue, Nov 30, 2010 at 3:21 AM, Harsh J wrote:
> Hey,
>
> On Tue, Nov 30, 2010 at 4:56 AM, Marc Sturlese
> wrote:
>>
>> Hey there,
>> I am doing some tests and wondering what the best practices are to deal
>> with very small files which are continuously being generated (1 MB or even
>> less).
>
Mark,
You might want to try changing your [dfs|mapred|jvm|rpc].servers in
hadoop-metrics.properties to point to your monitoring IP address (
192.168.1.72?) rather than localhost.
If you are relaying each node through its local gmond, then try to use the IP
address to which gmond is bound (netstat -an | grep …
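For example, a hadoop-metrics.properties stanza using the Ganglia context might look like the following; the IP is taken from your message and 8649 is just the default gmond port, so adjust both to your setup:

# Send HDFS metrics to the gmond listening on 192.168.1.72.
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=192.168.1.72:8649

The same pattern applies to the mapred, jvm and rpc sections.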
Here is a "recipe" for how to run multiple datanodes on a single server, posted
to this list on Sept. 15:
http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201009.mbox/%3c8a898c33-dc4e-418c-adc0-5689d434b...@yahoo-inc.com%3e
If you're having trouble getting multiple cores utilized…
The other approach, if the DR cluster is idle or has enough excess capacity,
would be to run all the jobs on the input data in both clusters and compare
checksums of the outputs to ensure everything is consistent. You could also
take advantage of this and distribute ad hoc queries between the 2 clusters.
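As a crude consistency check, assuming the outputs are plain text and written deterministically (both assumptions worth verifying for your jobs), you could run something like this on each cluster and compare the digests; the output path is just a placeholder:

hadoop fs -cat /user/etl/output/part-* | md5sum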
Hi All,
We have a problem at hand which we would like to solve using distributed and
parallel processing.
Brief context: we have a Map (Entity, Associated Value). An entity can
have a parent, which in turn has its own parent, and so on until we reach the
head. I have to traverse this tree and do…
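Without the rest of the message it is hard to say more, but if the structure is just a child-to-parent map, the walk to the head is a simple loop; a minimal sketch with made-up entity names:

import java.util.HashMap;
import java.util.Map;

public class ParentChainDemo {
  public static void main(String[] args) {
    // Hypothetical child -> parent map; a missing key marks the head.
    Map<String, String> parentOf = new HashMap<String, String>();
    parentOf.put("leaf", "mid");
    parentOf.put("mid", "head");

    // Walk from an entity up to the head of its chain.
    String node = "leaf";
    while (parentOf.containsKey(node)) {
      node = parentOf.get(node);
    }
    System.out.println("head = " + node); // prints: head = head
  }
}

The parallel part would then be deciding how to shard the entities so each chain can be walked independently.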
On 30/11/10 10:32, Adarsh Sharma wrote:
Is it possible to run Hadoop in VMs on production clusters, so that we
have 1000s of nodes on 100s of servers, to achieve high performance
through cloud computing?
You don't achieve performance that way. You are better off with 1 VM per
physical host, and…
Physical machines are certainly better than VMs. If you are running 4 VMs on
top of one machine with 128 GB RAM, each gets 32 GB. But the cost of 4 machines
with 32 GB RAM would be less than the cost of one machine with 128 GB, so
there's no point in going to Hadoop that way, right? Plus all the VMs would
compete…
On 30/11/10 03:59, hadoopman wrote:
We have two Hadoop clusters in two separate buildings. Both clusters
are loading the same data from the same sources (the second cluster is
for DR).
We're looking at how we can recover the primary cluster and catch it
back up again as new data will continue to…
Is it possible to run Hadoop in VMs on production clusters, so that we
have 1000s of nodes on 100s of servers, to achieve high performance
through cloud computing?
or
We simply configure Hadoop on 1000s of commodity machines.
Which i…
Try tweaking the mapred-site.xml config parameters; these parameters could
help, if you haven't tried them already (a value of -1 for
mapred.job.reuse.jvm.num.tasks means JVMs are reused for an unlimited number
of tasks):
<property>
  <name>mapred.job.reuse.jvm.num.tasks</name>
  <value>-1</value>
</property>
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>32</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>16</value>
</property>
<property>
  <name>mapred.child.java.opts</name>
  ...
The last option I gave was to run Hadoop in fully distributed mode,
but you can also run Hadoop in pseudo-distributed mode:
http://hadoop-tutorial.blogspot.com/2010/11/running-hadoop-in-pseudo-distributed.html
or
standalone mode:
http://hadoop-tutorial.blogspot.com/2010/11/running-hadoop-in-standalone-mode.
Hi beneo,
If you want to use just one machine, why do you want to use Hadoop? Hadoop's
power lies in distributed computing. That being said, it is possible to use
Hadoop on a single machine by using the pseudo-distributed mode (read
http://hadoop.apache.org/common/docs/current/single_node_setup.ht
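From that single-node setup guide, the pseudo-distributed configuration comes down to a few entries like these (the ports are the conventional defaults from the guide):

<!-- conf/core-site.xml -->
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>

<!-- conf/hdfs-site.xml -->
<property>
  <name>dfs.replication</name>
  <value>1</value>
</property>

<!-- conf/mapred-site.xml -->
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:9001</value>
</property>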
I'm sorry, but are you sure?
At 2010-11-30 15:53:58,"rahul patodi" wrote:
>You can create virtual machines on your single machine:
>for that you have to install Sun VirtualBox (other tools are also available,
>like VMware).
>Now you can create as many virtual machines as you want;
>then you can create on…
Hey,
On Tue, Nov 30, 2010 at 4:56 AM, Marc Sturlese wrote:
>
> Hey there,
> I am doing some tests and wondering what the best practices are to deal
> with very small files which are continuously being generated (1 MB or even
> less).
Have a read: http://www.cloudera.com/blog/2009/02/the-small-fil
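One approach that article covers is packing the small files into a SequenceFile keyed by filename. A minimal sketch of the writing side, with made-up paths and contents (the key/value choice is an example, not a prescription):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Pack each small file as one (filename, contents) record.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path("/user/demo/packed.seq"),
        Text.class, BytesWritable.class);
    try {
      byte[] contents = "tiny file contents".getBytes("UTF-8");
      writer.append(new Text("tiny-file-0001"),
          new BytesWritable(contents));
    } finally {
      writer.close();
    }
  }
}

A separate job (or a periodic process) would then consume the SequenceFile instead of thousands of tiny HDFS files.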
MultipleInputs for the new API is present in Hadoop 0.21 releases. It
should reside in the org.apache.hadoop.mapreduce.* package.
See: https://issues.apache.org/jira/browse/MAPREDUCE-369 for the issue.
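Usage with the new API looks roughly like this; the mapper classes and input paths below are placeholders, not anything from the JIRA:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MultiInputDemo {
  // Hypothetical mapper for one of the input formats.
  static class LogMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(new Text("log"), value);
    }
  }

  // Hypothetical mapper for the other input format.
  static class ClickMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(new Text("click"), value);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "multi-input-demo");
    job.setJarByClass(MultiInputDemo.class);
    // Each input path gets its own InputFormat and Mapper.
    MultipleInputs.addInputPath(job, new Path("/data/logs"),
        TextInputFormat.class, LogMapper.class);
    MultipleInputs.addInputPath(job, new Path("/data/clicks"),
        TextInputFormat.class, ClickMapper.class);
    // Output configuration and job submission omitted in this sketch.
  }
}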
On Mon, Nov 29, 2010 at 10:56 PM, Alan Said wrote:
> Hi all,
> I'm having difficulties figuring…