Re: posted again: how are the splits for map tasks computed?

2010-03-25 Thread abhishek sharma
Ravi, On Wed, Mar 24, 2010 at 9:32 PM, Ravi Phulari rphul...@yahoo-inc.com wrote: Hello Abhishek, unless you have modified the conf/mapred-site.xml file, MapReduce will use the configuration values specified in $HADOOP_HOME/src/mapred/mapred-default.xml. In this configuration file, mapred.map.tasks
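(For reference, a minimal sketch of the programmatic equivalent under the old 0.20-era API; the driver class name is a placeholder, and the map-task count is only a hint to the framework:)

import org.apache.hadoop.mapred.JobConf;

// Hypothetical driver fragment: sets the same properties that mapred-site.xml /
// mapred-default.xml would otherwise supply (mapred.map.tasks, mapred.reduce.tasks).
public class MapTaskHint {
    public static JobConf configure() {
        JobConf conf = new JobConf(MapTaskHint.class);
        conf.setNumMapTasks(10);    // a hint only; the InputFormat decides the real split count
        conf.setNumReduceTasks(2);  // the reduce count, by contrast, is honored exactly
        return conf;
    }
}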

[ANN] Hadoop 0.20.2 in Debian unstable

2010-03-25 Thread Thomas Koch
Hi, Hadoop entered Debian unstable a few days ago: http://packages.qa.debian.org/h/hadoop.html Please use it, test it, and spread the word about it! Having Hadoop in a major Linux distribution can significantly improve its visibility and lower the entry barrier. There are still some TODO

a problem when executing wordcount in hadoop cluster environment

2010-03-25 Thread 毛宏
I have finished configuring Hadoop in a cluster environment as follows: 1. maoh...@maohong-desktop:~/Software/Development/Hadoop/hadoop-0.20.2$ bin/start-all.sh 2. starting namenode, logging to

Re: posted again: how are the splits for map tasks computed?

2010-03-25 Thread Gang Luo
Hi, this is an interesting question. My understanding so far is that the number of map tasks is basically defined by the number of splits. When Hadoop splits the file, it receives a hint from the user (this is the number of mappers you set). At the same time, Hadoop maintains 3 parameters:
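(A sketch of the split-size arithmetic as it appears in 0.20-era FileInputFormat, assuming the usual three parameters; variable names are paraphrased rather than copied from the source:)

// goalSize  = total input bytes / number of maps requested (the user's hint)
// minSize   = mapred.min.split.size
// blockSize = HDFS block size of the input file
public class SplitSizeSketch {
    static long computeSplitSize(long goalSize, long minSize, long blockSize) {
        return Math.max(minSize, Math.min(goalSize, blockSize));
    }

    public static void main(String[] args) {
        // Example: a 12 MB file with 64 MB blocks and 4 requested maps
        // yields 3 MB splits, i.e. 4 map tasks.
        long totalBytes = 12L * 1024 * 1024;
        System.out.println(computeSplitSize(totalBytes / 4, 1, 64L * 1024 * 1024));
    }
}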

Re: Cloudera AMIs

2010-03-25 Thread Andrey Klochkov
You can use the contrib scripts inside the Hadoop distribution to deploy it to EC2; this should work with any version of Hadoop. See this overview of different ways of deploying Hadoop to EC2: http://blog.griddynamics.com/2010/03/apache-hadoop-on-amazon-ec2.html -- Andrew Klochkov

Re: [ANN] Hadoop 0.20.2 in Debian unstable

2010-03-25 Thread Isabel Drost
On 25.03.2010 Thomas Koch wrote: hadoop has entered Debian unstable a few days ago: http://packages.qa.debian.org/h/hadoop.html Congratulations! Isabel

CfP - Berlin Buzzwords

2010-03-25 Thread Isabel Drost
Call for Presentations Berlin Buzzwords http://berlinbuzzwords.de Berlin Buzzwords 2010 - Search, Store, Scale, 7/8 June 2010. This is to announce Berlin Buzzwords 2010, the first conference on scalable and open search, data

JobTracker startup failure when starting hadoop-0.20.0 cluster on Amazon EC2 with contrib/ec2 scripts

2010-03-25 Thread 毛宏
I downloaded Hadoop 0.20.0 and used the src/contrib/ec2/bin scripts to launch a Hadoop cluster on Amazon EC2, after building a new Hadoop 0.20.0 AMI. I launched an instance with my new Hadoop 0.20.0 AMI, then logged in and ran the following to launch a new cluster: root(/vol/hadoop-0.20.0)

RE: posted again: how are the splits for map tasks computed?

2010-03-25 Thread Segel, Mike
Ok, it's 4:00am local time... silly question... What's the block size of your HDFS? And the file sizes you gave are in bytes? So the full file is 12MB? -Original Message- From: absha...@gmail.com [mailto:absha...@gmail.com] On Behalf Of abhishek sharma Sent: Wednesday, March 24, 2010 9:27

DeDuplication Techniques

2010-03-25 Thread Joseph Stein
I have been researching ways to handle de-dupping data while running a map/reduce program (so as not to re-calculate/re-aggregate data that we have seen before [possibly months before]). The data sets we have are littered with repeats of data from mobile devices, which continue to come in over time
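(One common pattern, sketched here with the new 0.20 API and entirely hypothetical class names, is to key each record by its dedup key and let the reducer keep a single representative; this removes duplicates within one run, while the rest of the thread discusses how to also check against historical data:)

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class DedupSketch {

    // Key every record by its dedup key (here, simply the whole line).
    public static class DedupMapper extends Mapper<Object, Text, Text, Text> {
        @Override
        protected void map(Object key, Text line, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(line, line);
        }
    }

    // All duplicates share a key, so emitting once per key de-duplicates the set.
    public static class DedupReducer extends Reducer<Text, Text, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(key, NullWritable.get());
        }
    }
}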

Is there a size limit on a line for a text file?

2010-03-25 Thread Raymond Jennings III
for the input to a mapper, or as the output of either a mapper or a reducer?

Re: DeDuplication Techniques

2010-03-25 Thread Mark Kerzner
Joe, your approach would work, whether you use files or a database to keep the old data. However, it feels like a mix of new and old technologies. It just does not feel right to open a file to do just one comparison and close it again. Even if you keep it open and do searches there, and even if you

java.io.IOException: Spill failed

2010-03-25 Thread Raymond Jennings III
Any pointers on what might be causing this? Thanks!

java.io.IOException: Spill failed
        at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:1006)
        at java.io.DataOutputStream.write(Unknown Source)
        at org.apache.hadoop.io.Text.write(Text.java:282)

RE: DeDuplication Techniques

2010-03-25 Thread Michael Segel
Joe, you know you mentioned HBase, but have you thought about it? This is actually a simple thing to do in HBase, because in your map/reduce you hash your key and then check to see whether HBase already has it. (You make your HBase connection in setup() so you don't constantly open/close
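(A sketch of that pattern; the table, column family and qualifier names are invented for illustration, exact HBase client calls differ between versions, and a real job would use a stronger hash than String.hashCode():)

import java.io.IOException;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SeenKeyMapper extends Mapper<Object, Text, Text, Text> {

    private HTable seen;

    @Override
    protected void setup(Context ctx) throws IOException {
        // Open the connection once per task, as suggested, not once per record.
        seen = new HTable(HBaseConfiguration.create(), "seen_keys");
    }

    @Override
    protected void map(Object key, Text line, Context ctx)
            throws IOException, InterruptedException {
        // Hash the record's key; hashCode() is only a stand-in here.
        byte[] rowKey = Bytes.toBytes(Integer.toHexString(line.toString().hashCode()));
        if (!seen.exists(new Get(rowKey))) {
            // Unseen so far: remember it in HBase and pass it downstream.
            seen.put(new Put(rowKey).add(Bytes.toBytes("d"), Bytes.toBytes("v"),
                                         Bytes.toBytes(line.toString())));
            ctx.write(line, line);
        }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException {
        seen.close();
    }
}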

Re: DeDuplication Techniques

2010-03-25 Thread Ankur C. Goel
The kind of need you specified is quite common in ETL-style processing. The fastest and most efficient way to do this is when you have all your historical data in HDFS itself. In this case you can do a LEFT outer join between the two datasets (assuming the new data is your left relation) in
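(A sketch of the reduce side of that join, assuming upstream mappers tag each (dedupKey, record) pair with OLD or NEW depending on which relation it came from; the tags and class names here are illustrative:)

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class JoinDedupReducer extends Reducer<Text, Text, Text, NullWritable> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        boolean seenInHistory = false;
        String newRecord = null;

        for (Text v : values) {
            String s = v.toString();
            if (s.startsWith("OLD\t")) {
                seenInHistory = true;        // key already present in historical data
            } else if (s.startsWith("NEW\t")) {
                newRecord = s.substring(4);  // keep one copy of the incoming record
            }
        }

        // Left-outer-join with a null-right filter: only keys absent from the
        // historical relation are written out as genuinely new records.
        if (!seenInHistory && newRecord != null) {
            ctx.write(new Text(newRecord), NullWritable.get());
        }
    }
}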