Ravi,
On Wed, Mar 24, 2010 at 9:32 PM, Ravi Phulari rphul...@yahoo-inc.com wrote:
Hello Abhishek,
Unless you have modified the conf/mapred-site.xml file, MapReduce will use the
configuration values specified in $HADOOP_HOME/src/mapred/mapred-default.xml.
In this configuration file, mapred.map.tasks
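(A minimal sketch of setting that per job with the old org.apache.hadoop.mapred API from
0.20.x; the class name and the numbers are placeholders, and the same property can instead
be put into conf/mapred-site.xml.)

import org.apache.hadoop.mapred.JobConf;

public class MapTaskHint {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Per-job hint, equivalent to setting mapred.map.tasks in mapred-site.xml.
    // The values 10 and 4 are placeholders. Note that mapred.map.tasks is only
    // a hint; the real number of map tasks is driven by the number of input splits.
    conf.setNumMapTasks(10);
    conf.setNumReduceTasks(4);   // mapred.reduce.tasks, honored as given
    System.out.println("map task hint: " + conf.getNumMapTasks());
  }
}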
Hi,
Hadoop entered Debian unstable a few days ago:
http://packages.qa.debian.org/h/hadoop.html
Please use it, test it, and spread the word about it! Having Hadoop in a major
Linux distribution can significantly improve its visibility and lower the
entry barrier.
There are still some TODO
I have finished configuring Hadoop in a cluster environment as
follows:
1. maoh...@maohong-desktop:~/Software/Development/Hadoop/hadoop-0.20.2$ bin/start-all.sh
2. starting namenode, logging to
Hi,
this is an interesting question. My understanding so far is that the number of
map tasks is basically determined by the number of splits. When Hadoop splits the
file, it takes a hint from the user (the number of mappers you set).
At the same time, Hadoop maintains 3 parameters:
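(A sketch of roughly how 0.20's FileInputFormat combines those values; the names below
are illustrative rather than exact source code.)

public class SplitSizeSketch {
  // Roughly what FileInputFormat does in Hadoop 0.20: the user's hint only
  // enters through goalSize, while minSplitSize and blockSize bound the result.
  static long computeSplitSize(long totalSize, int numSplitsHint,
                               long minSplitSize, long blockSize) {
    long goalSize = totalSize / Math.max(numSplitsHint, 1);
    return Math.max(minSplitSize, Math.min(goalSize, blockSize));
  }

  public static void main(String[] args) {
    // Example: a 12 MB file, a hint of 10 mappers, 64 MB blocks.
    long split = computeSplitSize(12L * 1024 * 1024, 10, 1, 64L * 1024 * 1024);
    System.out.println("split size = " + split + " bytes");   // about 1.2 MB, i.e. ~10 splits
  }
}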
You can use the contrib scripts inside the Hadoop distribution to deploy it to EC2;
it should work with any version of Hadoop.
See this overview of different ways of deploying Hadoop to EC2:
http://blog.griddynamics.com/2010/03/apache-hadoop-on-amazon-ec2.html
--
Andrew Klochkov
On 25.03.2010 Thomas Koch wrote:
Hadoop entered Debian unstable a few days ago:
http://packages.qa.debian.org/h/hadoop.html
Congratulations!
Isabel
Call for Presentations Berlin Buzzwords
http://berlinbuzzwords.de
Berlin Buzzwords 2010 - Search, Store, Scale
7/8 June 2010
This is to announce Berlin Buzzwords 2010, the first conference on scalable
and open search, data
I downloaded Hadoop 0.20.0 and used the src/contrib/ec2/bin scripts to
launch a Hadoop cluster on Amazon EC2, after building a new Hadoop
0.20.0 AMI.
I launched an instance with my new Hadoop 0.20.0 AMI, then logged in and
ran the following to launch a new cluster:
root(/vol/hadoop-0.20.0)
Ok, it's 4:00 am local time... silly question...
What's the block size of your HDFS?
And are the file sizes you gave in bytes? So the full file is 12 MB?
-Original Message-
From: absha...@gmail.com [mailto:absha...@gmail.com] On Behalf Of abhishek sharma
Sent: Wednesday, March 24, 2010 9:27
I have been researching ways to handle de-duping data while running a
map/reduce program (so as to not re-calculate/re-aggregate data that
we have seen before, possibly months before).
The data sets we have are littered with repeats of data from mobile
devices which continue to come in over time
Do you want to do this for the input to a mapper, or as the output of either the mapper or the reducer?
Joe,
your approach would work, whether you use files to keep old data, or a
database. However, it feels like a mix of new and old technologies. It just
does not feel right to open a file to do just one comparison, and close it
again. Even if you keep it open and do searches there, and even if you
Any pointers on what might be causing this? Thanks!
java.io.IOException: Spill failed
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1006)
        at java.io.DataOutputStream.write(Unknown Source)
        at org.apache.hadoop.io.Text.write(Text.java:282)
Joe,
You mentioned HBase, but have you thought about actually using it for this?
This is actually a simple thing to do in HBase, because in your map/reduce you
hash your key and then check whether HBase already has it. (You make
your HBase connection in setup() so you don't constantly open/close
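(A rough sketch of that idea against the plain HBase 0.20 client API; the table name
"seen_records", the column family "f", and the MD5 hashing of the record are my own
assumptions for illustration, not anything from the original post.)

import java.io.IOException;
import org.apache.commons.codec.digest.DigestUtils;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class DedupMapper extends Mapper<LongWritable, Text, Text, Text> {
  // Sketch only: table name, column family, and key hashing are illustrative choices.
  private HTable seen;

  @Override
  protected void setup(Context context) throws IOException {
    // Open the HBase connection once per task, as suggested above.
    seen = new HTable("seen_records");
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    byte[] rowKey = DigestUtils.md5(value.toString());   // hash of the record
    if (!seen.exists(new Get(rowKey))) {
      // First time we see this record: remember it and pass it through.
      Put p = new Put(rowKey);
      p.add(Bytes.toBytes("f"), Bytes.toBytes("seen"), Bytes.toBytes(1L));
      seen.put(p);
      context.write(new Text(Bytes.toString(rowKey)), value);
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException {
    seen.close();
  }
}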
The kind of need you specified is quite common in ETL-style processing.
The fastest and most efficient way to do this is when you have all your
historical data in HDFS itself. In that case you can do a LEFT outer join
between the two datasets (assuming the new data is your left relation) in
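(A condensed sketch of the reduce side of such a left outer join, assuming both datasets
have been mapped to (recordHash, tag + record) pairs; the "hist"/"new" tags and class name
are purely illustrative naming on my part.)

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class LeftJoinDedupReducer extends Reducer<Text, Text, Text, Text> {
  // Sketch only: values are assumed to arrive as "hist<TAB>record" or "new<TAB>record",
  // produced by two mappers that key every record by its hash.
  @Override
  protected void reduce(Text key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    boolean seenInHistory = false;
    List<String> fresh = new ArrayList<String>();
    for (Text v : values) {
      String s = v.toString();
      if (s.startsWith("hist\t")) {
        seenInHistory = true;            // this key already exists in the historical data
      } else if (s.startsWith("new\t")) {
        fresh.add(s.substring("new\t".length()));
      }
    }
    if (!seenInHistory) {
      for (String record : fresh) {
        context.write(key, new Text(record));  // only never-before-seen records survive
      }
    }
  }
}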