how to find top N values using map-reduce ?

2013-02-01 Thread praveenesh kumar
I am looking for a better solution for this. One way to do this would be to find the top N values from each mapper and then find the top N out of those in a single reducer. I am afraid this won't work effectively if my N is larger than the number of values in my input split (or mapper input). Other way
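
The per-mapper top-N idea above is usually implemented with a bounded min-heap: each mapper keeps only its local top N (typically emitting them from `cleanup()`), and one reducer runs the same routine over all mapper outputs. A minimal pure-JDK sketch of that helper, without Hadoop types (class and method names are illustrative, not from the thread):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

public class TopN {
    // Keep only the N largest values seen so far using a min-heap of size N.
    // The heap's head is the smallest of the current top N, so any value
    // larger than it displaces it; everything else is discarded.
    static List<Long> topN(Iterable<Long> values, int n) {
        PriorityQueue<Long> heap = new PriorityQueue<>(); // min-heap
        for (long v : values) {
            if (heap.size() < n) {
                heap.offer(v);
            } else if (v > heap.peek()) {
                heap.poll();   // drop the current smallest of the top N
                heap.offer(v);
            }
        }
        List<Long> out = new ArrayList<>(heap);
        out.sort(Collections.reverseOrder()); // largest first
        return out;
    }

    public static void main(String[] args) {
        List<Long> data = List.of(5L, 1L, 9L, 3L, 7L, 2L, 8L);
        System.out.println(topN(data, 3)); // [9, 8, 7]
    }
}
```

If a mapper's split holds fewer than N values, it simply emits everything it has; correctness is preserved because the final reducer still sees every candidate.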

Re: how to find top N values using map-reduce ?

2013-02-01 Thread praveenesh kumar
Actually, what I am trying to find is the top n% of the whole data. This n could be very large if my data is large. Assuming I have uniform rows of equal size, and the total data size is 10 GB, then using the above-mentioned approach, to take the top 10% of the whole data set I need 10% of 10GB
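
For a top-p% selection (rather than a fixed small N), a common pattern is two passes: a first job samples the data to estimate the value at the (100−p)th percentile, and a second job keeps every record at or above that threshold. A pure-JDK sketch of the threshold-estimation step, under the assumption of a uniform random sample (class and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class PercentileThreshold {
    // Estimate the value cutting off the top `percent` of the data from a
    // random sample (pass 1); pass 2 would filter records >= this threshold.
    static long estimateThreshold(List<Long> sample, double percent) {
        List<Long> sorted = new ArrayList<>(sample);
        Collections.sort(sorted);
        int idx = (int) Math.floor(sorted.size() * (1.0 - percent / 100.0));
        if (idx >= sorted.size()) idx = sorted.size() - 1;
        return sorted.get(idx);
    }

    public static void main(String[] args) {
        // Values 0..999, shuffled; the top 10% starts at 900 when the
        // "sample" is the full data set.
        List<Long> data = new ArrayList<>();
        for (long i = 0; i < 1000; i++) data.add(i);
        Collections.shuffle(data, new Random(42));
        System.out.println(estimateThreshold(data, 10.0)); // 900
    }
}
```

The sample must fit in one JVM's memory, but the full data never does; the filtering pass is embarrassingly parallel, which avoids pushing 10% of 10 GB through a single reducer.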

Re: how to find top N values using map-reduce ?

2013-02-01 Thread praveenesh kumar
Thanks for that Russell. Unfortunately I can't use Pig; I need to write my own MR job. I was wondering how it's usually done in the best way possible. Regards Praveenesh On Sat, Feb 2, 2013 at 1:00 PM, Russell Jurney russell.jur...@gmail.com wrote: Pig. Datafu. 7 lines of code.

Re: how to find top N values using map-reduce ?

2013-02-01 Thread Eugene Kirpichov
Hi, Can you tell more about: * How big is N * How big is the input dataset * How many mappers you have * Do input splits correlate with the sorting criterion for top N? Depending on the answers, very different strategies will be optimal. On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar

Re: how to find top N values using map-reduce ?

2013-02-01 Thread Russell Jurney
Pig. Datafu. 7 lines of code. https://gist.github.com/4696443 https://github.com/linkedin/datafu On Fri, Feb 1, 2013 at 11:17 PM, praveenesh kumar praveen...@gmail.com wrote: Actually, what I am trying to find is the top n% of the whole data. This n could be very large if my data is large.

SequenceFile.createWriter - throws FileNotFoundException

2013-02-01 Thread Anbarasan Murthy
The following line in KMeansBSP.java throws the FileNotFoundException SequenceFile.Writer centerWriter = SequenceFile.createWriter(fs, conf, center, VectorWritable.class, NullWritable.class, CompressionType.NONE); I am getting the below exception message when I tried executing

Re: Hi,can u please help me how to retrieve the videos from hdfs

2013-02-01 Thread Mohammad Tariq
How are you going to store videos in HDFS? By 'playing video on the browser' I assume it's going to be real-time, so MR is not the correct way to go, I feel. Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Fri, Feb 1, 2013 at 2:00 PM, gopi lokavarapu hbigdata.g...@gmail.com wrote:

SequenceFileOutputFormat - Custom Type Key Value

2013-02-01 Thread Anbarasan Murthy
By default SequenceFileOutputFormat expects Key - LongWritable, Value - Text. I would like to know how to set custom data types for SequenceFileOutputFormat. How can I achieve the below configuration with SequenceFileOutputFormat? My mapper/reducer output will be Key - Text Value -

Hadoop-Yarn-MR reading InputSplits and processing them by the RecordReader, architecture/design question.

2013-02-01 Thread blah blah
Hi (I am using Yarn Hadoop-3.0.0.SNAPSHOT, revision 1437315M). I have a question regarding my assumptions about the Yarn-MR design, especially the InputSplit processing. Can someone confirm my assumptions or point out my mistakes in my MR-Yarn design? These are my assumptions regarding the design. 1.

Fwd: Issue with Reduce Side join using datajoin package

2013-02-01 Thread Vikas Jadhav
-- Forwarded message -- From: Vikas Jadhav vikascjadha...@gmail.com Date: Thu, Jan 31, 2013 at 11:14 PM Subject: Re: Issue with Reduce Side join using datajoin package To: user@hadoop.apache.org *** source: public class MyJoin extends Configured
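
The datajoin package implements a tagged reduce-side join: the map phase tags each record with its source table, the shuffle groups records by join key, and the reducer cross-joins the tag groups for each key. The core idea, independent of that package, can be sketched in plain Java (all names here are hypothetical, not from the forwarded source):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ReduceSideJoinSketch {
    // Simulates what a join reducer sees: for each key present in both
    // inputs, emit the cross product of the left and right record groups.
    static List<String> join(Map<String, List<String>> left,
                             Map<String, List<String>> right) {
        List<String> out = new ArrayList<>();
        Set<String> keys = new TreeSet<>(left.keySet());
        keys.retainAll(right.keySet());          // inner-join keys only
        for (String k : keys) {
            for (String l : left.get(k))         // cross product per key,
                for (String r : right.get(k))    // as the reducer would do
                    out.add(k + "," + l + "," + r);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, List<String>> orders = Map.of("u1", List.of("o1", "o2"));
        Map<String, List<String>> users  = Map.of("u1", List.of("Alice"),
                                                  "u2", List.of("Bob"));
        System.out.println(join(orders, users)); // [u1,o1,Alice, u1,o2,Alice]
    }
}
```

In a real job the two inner loops require buffering one side's group in memory (or using a secondary sort so the smaller side arrives first), which is where many datajoin issues originate.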

Re: eclipse plugin

2013-02-01 Thread Tech Mail
Karmasoftware has a plugin; I downloaded it but couldn't make it work. If time permits, please check. On Jan 29, 2013, at 11:34 PM, Martinus Martinus martinus...@gmail.com wrote: Hi YouPeng, I am also wondering the same thing. Does anybody know about an eclipse-plugin for hadoop? Thanks.

RE: hadoop 1.0.3 equivalent of MultipleTextOutputFormat

2013-02-01 Thread Tony Burton
Thanks for the reply Alejandro. Using a temp output directory was my first guess as well. What's the best way to proceed? I've come across FileSystem.rename but it's consistently returning false for whatever Paths I provide. Specifically, I need to copy the following: s3://path to data/tmp

Re: SequenceFileOutputFormat - Custom Type Key Value

2013-02-01 Thread Mahesh Balija
You do NOT need to do anything special to emit Text, VectorWritable into your SequenceFileOutputFormat. Rather, you just need to emit the key as Text and the value as VectorWritable in your reducer method, and set setOutputKeyClass and setOutputValueClass with these on your JobConf or Job
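
A hedged sketch of the driver settings described above, assuming the new `mapreduce` API and that `VectorWritable` is available on the classpath; this is a fragment that requires the Hadoop jars, not a standalone program:

```java
// Driver fragment (new mapreduce API) -- shown as a sketch only.
Job job = Job.getInstance(conf, "seqfile-output");
job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setOutputKeyClass(Text.class);             // reducer emits Text keys
job.setOutputValueClass(VectorWritable.class); // ...and VectorWritable values
```

SequenceFileOutputFormat simply serializes whatever key and value classes the job declares, which is why no format-specific configuration is needed.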

Advice on post mortem of data loss (v 1.0.3)

2013-02-01 Thread Peter Sheridan
Yesterday, I bounced my DFS cluster. We realized that ulimit -u was, in extreme cases, preventing the name node from creating threads. This had only started occurring within the last day or so. When I brought the name node back up, it had essentially been rolled back by one week, and I lost

Re: Apache Development Snapshot Repository does not work

2013-02-01 Thread Ted Yu
Where did you check out the code from? You can get the latest update in this JIRA: HBASE-7290 Online snapshots. Cheers On Fri, Feb 1, 2013 at 8:53 AM, YouPeng Yang yypvsxf19870...@gmail.com wrote: Hi I have got the latest source from Git. When I perform mvn install -DskipTests, it was

RE: Reg Too many fetch-failures Error

2013-02-01 Thread Vijay Thakorlal
Hi Manoj, As you may be aware this means the reduces are unable to fetch intermediate data from TaskTrackers that ran map tasks - you can try: * increasing tasktracker.http.threads so there are more threads to handle fetch requests from reduces. * decreasing mapreduce.reduce.parallel.copies
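
The two knobs mentioned above would be set in the TaskTracker/job configuration. A sketch of a mapred-site.xml fragment; the values here are hypothetical starting points, and the exact property names vary across Hadoop versions, so check the defaults shipped with your release:

```xml
<!-- Hypothetical starting values; tune per cluster and workload. -->
<property>
  <name>tasktracker.http.threads</name>
  <value>80</value> <!-- more threads serving map output to reducers -->
</property>
<property>
  <name>mapreduce.reduce.parallel.copies</name>
  <value>5</value>  <!-- fewer concurrent fetches per reduce task -->
</property>
```

Raising the server-side thread count while lowering per-reduce fetch parallelism reduces the chance that a busy TaskTracker drops fetch requests, which is what surfaces as "Too many fetch-failures".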

Re: Apache Development Snapshot Repository does not work

2013-02-01 Thread Harsh J
Are you trying to build a specific submodule here, i.e., under what checked-out directory are you running mvn install? I'd recommend running it under the root of the checkout (the most parent pom.xml) first before trying to build a specific sub-module. On Fri, Feb 1, 2013 at 10:23 PM,

Re: s

2013-02-01 Thread Vishal Kumar Gupta
? On Fri, Feb 1, 2013 at 5:56 PM, Eshan Chakravorty eshan@gmail.com wrote: s

Re: Hadoop-Yarn-MR reading InputSplits and processing them by the RecordReader, architecture/design question.

2013-02-01 Thread Vinod Kumar Vavilapalli
You got that mostly right. And it doesn't differ much in Hadoop 1.* either: with the MR AM doing the work that was earlier done by the JobTracker, the JobClient and the task side don't change much. FileInputFormat.getSplits() is called by the client itself, so you should look for logs on the client

Re: SequenceFile.createWriter - throws FileNotFoundException

2013-02-01 Thread Vinod Kumar Vavilapalli
As it clearly says, check the file permissions of your input directory (/hama/input/center/). Also check whether you want the input on the local file-system or DFS. +Vinod On Fri, Feb 1, 2013 at 12:59 AM, Anbarasan Murthy anbarasa...@hcl.com wrote: The following line in KMeansBSP.java throws the

Re: Apache Development Snapshot Repository does not work

2013-02-01 Thread YouPeng Yang
Hi Harsh. I actually prefer to use Eclipse for development, so I checked out the Hadoop sources from git. I did this following the url: http://wiki.apache.org/hadoop/EclipseEnvironment Here are my steps: 1. git clone git://git.apache.org/hadoop-common.git 2. [root@yyp6 hadoop-common]#

Re: Reg Too many fetch-failures Error

2013-02-01 Thread Manoj Babu
Hi Vijay, Thanks for the information. A few jobs were running in the cluster at the time. Cheers! Manoj. On Fri, Feb 1, 2013 at 11:22 PM, Vijay Thakorlal vijayj...@hotmail.com wrote: Hi Manoj, As you may be aware this means the reduces are unable to fetch intermediate data from