Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

2010-07-08 Thread Ted Yu
I updated http://wiki.apache.org/hadoop/UsingLzoCompression to specifically mention this potential issue so that other people can avoid this problem. Feel free to add more to it. On Thu, Jul 8, 2010 at 8:26 PM, bmdevelopment wrote: > Thanks everyone. > > Yes, using the Google Code version refer

Re: Distributed Cache file issue

2010-07-08 Thread Raja Thiruvathuru
Hi, Create the "Job" after you create the Configuration. For example: Path p = new Path("hdfs://localhost:9100/user/denimLive/denim/DCache/Orders.txt"); DistributedCache.addCacheFile(p.toUri(), conf); Job job = new Job(conf, "Driver"); If you create the "Job" before creating the configuration, for some re
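
A minimal sketch of the corrected ordering, with a mapper-side read added for context (the HDFS path and job name come from this thread; reading the cached file in setup() is an illustrative assumption, not code from the thread):

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CacheDriver {
      public static class CacheMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void setup(Context context) throws IOException {
          // Each task gets a local copy of every registered cache file.
          Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
          if (cached != null && cached.length > 0) {
            BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
            // ... load Orders.txt into an in-memory structure here ...
            in.close();
          }
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Register the cache file BEFORE constructing the Job: the Job
        // constructor copies the Configuration, so anything added to conf
        // afterwards is invisible to the running job.
        Path p = new Path("hdfs://localhost:9100/user/denimLive/denim/DCache/Orders.txt");
        DistributedCache.addCacheFile(p.toUri(), conf);
        Job job = new Job(conf, "Driver");
        job.setMapperClass(CacheMapper.class);
        // ... set input/output formats and paths, then:
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }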

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

2010-07-08 Thread bmdevelopment
Thanks everyone. Yes, using the Google Code version referenced on the wiki: http://wiki.apache.org/hadoop/UsingLzoCompression I will try the latest version and see if that fixes the problem. http://github.com/kevinweil/hadoop-lzo Thanks On Fri, Jul 9, 2010 at 3:22 AM, Todd Lipcon wrote: > On T

Re: SequenceFile as map input

2010-07-08 Thread Alex Kozlov
Hi Alan, Is the content of the original file ASCII text? Then you should be using the Writable types in your signature. By default 'hadoop fs -text ...' will just call toString() on the object. You get the object itself in the map() method and can do whatever you want with it. If Text or BytesWritable does not work for

Re: SequenceFile as map input

2010-07-08 Thread Alan Miller
Hi Alex, I'm not sure what you mean. I already set my mapper's signature to: public class MyMapper extends Mapper { ... public void map(Text key, BytesWritable value, Context context) } } In my map() loop the contents of value are the text from the original file and the value.
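
The archive appears to have stripped the generic type parameters from the snippet above. A minimal sketch with the Writables spelled out in the signature (the Text/Text output types are an assumption, as is interpreting the bytes as UTF-8 text):

    import java.io.IOException;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Input key/value types must match what the SequenceFile was written with.
    public class MyMapper extends Mapper<Text, BytesWritable, Text, Text> {
      @Override
      public void map(Text key, BytesWritable value, Context context)
          throws IOException, InterruptedException {
        // BytesWritable reuses a padded backing array, so copy only the
        // first getLength() bytes rather than using getBytes() directly.
        byte[] raw = new byte[value.getLength()];
        System.arraycopy(value.getBytes(), 0, raw, 0, value.getLength());
        context.write(key, new Text(new String(raw, "UTF-8")));
      }
    }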

Distributed Cache file issue

2010-07-08 Thread Denim Live
Hello all, As a new user of Hadoop, I am having some trouble understanding a few things. I am writing a program that loads a file into the distributed cache and reads this file in each mapper. In my driver program, I have added the file to my distributed cache using: Path p=new

Re: naming the output file of reduce to the partition number

2010-07-08 Thread Denim Live
Yes, I can get the partition number using jobconf.getInt("mapred.task.partition", -1), but how can I give the output file of each reducer a custom name using just this partition number? From: Ted Yu To: mapreduce-user@hadoop.apache.org Sent: Thu, July 8, 2010 6:22:54 PM

Re: SequenceFile as map input

2010-07-08 Thread Alex Loddengaard
Hi Alan, SequenceFiles keep track of the key and value type, so you should be able to use the Writables in the signature. Though it looks like you're using the new API, and I admit I'm not an expert with it. Have you tried using the Writables in the signature? Alex On Thu, Jul 8,

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

2010-07-08 Thread Todd Lipcon
On Thu, Jul 8, 2010 at 10:38 AM, Ted Yu wrote: > Todd fixed a bug where LZO header or block header data may fall on a read > boundary: > > http://github.com/toddlipcon/hadoop-lzo/commit/f3bc3f8d003bb8e24f254b25bca2053f731cdd58 > I am wondering if that is related to the issue you saw. > > I don't t

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

2010-07-08 Thread Ted Yu
Todd fixed a bug where LZO header or block header data may fall on a read boundary: http://github.com/toddlipcon/hadoop-lzo/commit/f3bc3f8d003bb8e24f254b25bca2053f731cdd58 I am wondering if that is related to the issue you saw. On Wed, Jul 7, 2010 at 11:49 PM, bmdevelopment wrote: > A little more

Re: naming the output file of reduce to the partition number

2010-07-08 Thread Ted Yu
Please take a look at the getUniqueName() method in src/mapred/org/apache/hadoop/mapred/FileOutputFormat.java; it retrieves "mapred.task.partition". On Thu, Jul 8, 2010 at 2:13 AM, Denim Live wrote: > Hi Everyone, > I am having a problem with naming the output file of each reduce task > with the pa
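
A minimal sketch of one way to use this with the new API: subclass TextOutputFormat and override getDefaultWorkFile() so each reducer's output file is named after its partition (the class name and naming scheme here are illustrative, not from the thread):

    import java.io.IOException;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class PartitionNamedOutputFormat<K, V> extends TextOutputFormat<K, V> {
      @Override
      public Path getDefaultWorkFile(TaskAttemptContext context, String extension)
          throws IOException {
        // Each reduce task sees its own partition number in the configuration.
        int partition = context.getConfiguration().getInt("mapred.task.partition", 0);
        FileOutputCommitter committer = (FileOutputCommitter) getOutputCommitter(context);
        return new Path(committer.getWorkPath(), partition + extension);
      }
    }

Then set it on the job with job.setOutputFormatClass(PartitionNamedOutputFormat.class).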

SequenceFile as map input

2010-07-08 Thread Some Body
To get around the small-files problem (I have thousands of 2MB log files) I wrote a class to convert all my log files into a single SequenceFile in (Text key, BytesWritable value) format. That works fine. I can run this: hadoop fs -text /my.seq | grep peemt114.log | head -1 10/07/08 15:02:
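
A minimal sketch of the kind of converter described above, packing small local log files into one SequenceFile keyed by filename (the /var/logs source directory and the single-read loop are simplifying assumptions):

    import java.io.File;
    import java.io.FileInputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class LogsToSequenceFile {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Writer writer = SequenceFile.createWriter(
            fs, conf, new Path("/my.seq"), Text.class, BytesWritable.class);
        try {
          // Append each small log file as one (filename, contents) record.
          for (File log : new File("/var/logs").listFiles()) {
            byte[] data = new byte[(int) log.length()];
            FileInputStream in = new FileInputStream(log);
            try {
              in.read(data); // files are only ~2MB, so one read is a fair simplification
            } finally {
              in.close();
            }
            writer.append(new Text(log.getName()), new BytesWritable(data));
          }
        } finally {
          IOUtils.closeStream(writer);
        }
      }
    }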

Re: Is Hadoop suitable for web site visitor analysis?

2010-07-08 Thread Jeff Zhang
You can import the web logs into HDFS, and then use Pig or Hive to do data analysis. See http://hadoop.apache.org/pig/ http://hadoop.apache.org/hive/ On Thu, Jul 8, 2010 at 5:55 PM, Tim Jones wrote: > Hi, > > > I want to be able to discover the 10 most popular routes through our web > site > t

Is Hadoop suitable for web site visitor analysis?

2010-07-08 Thread Tim Jones
Hi, I want to be able to discover the 10 most popular routes through our web site that lead a visitor to register with us. I am already logging page-view data but can't seem to find the best solution for querying it. (Each Visitor has an ID, each Visitor makes multiple Visits, each w

naming the output file of reduce to the partition number

2010-07-08 Thread Denim Live
Hi Everyone, I am having a problem with naming the output file of each reduce task with the partition number. First of all, how can I get the partition number within each reducer? Second, how am I going to name the output file with that partition number? I have looked at the MultipleTextOut

Re: Finding the time it took for a mapreduce job to get executed

2010-07-08 Thread Denim Live
Thanks Alex, it worked. From: Alexandros Konstantinakis - Karmis To: mapreduce-user@hadoop.apache.org Sent: Thu, July 8, 2010 9:10:26 AM Subject: Re: Finding the time it took for a mapreduce job to get executed Through the web GUI: it reports the total time (in

Re: Finding the time it took for a mapreduce job to get executed

2010-07-08 Thread Alexandros Konstantinakis - Karmis
On 07/08/2010 10:51 AM, Denim Live wrote: Hi folks, I want to determine the exact time it took for my mapreduce job to get executed, for some analysis purposes. How can I calculate it? Thanks Through the web GUI: it reports the total time (on the job execution page), but you can also get the m
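
For a programmatic alternative to the web GUI, a minimal sketch (assuming a new-API Job configured in the driver) is to take wall-clock timestamps around waitForCompletion():

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class TimedDriver {
      public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "timed-job");
        // ... configure mapper, reducer, input and output paths here ...
        long start = System.currentTimeMillis();
        boolean success = job.waitForCompletion(true);
        long elapsedMs = System.currentTimeMillis() - start;
        // End-to-end client-side time, including job setup and scheduling.
        System.out.println("Job " + (success ? "succeeded" : "failed")
            + " in " + (elapsedMs / 1000.0) + " seconds");
      }
    }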

Finding the time it took for a mapreduce job to get executed

2010-07-08 Thread Denim Live
Hi folks, I want to determine the exact time it took for my mapreduce job to get executed, for some analysis purposes. How can I calculate it? Thanks