I am looking for a better solution for this.
One way to do it would be to find the top N values in each mapper and
then find the top N out of those in one reducer. I am afraid this
won't work well if my N is larger than the number of values in
my input split (i.e. one mapper's input).
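To make the per-mapper idea concrete, here is a plain-Java sketch (java.util only, no Hadoop API; the class, method names, and values are mine, made up for illustration) of keeping a local top N in a bounded min-heap and then merging the mapper outputs in one "reducer":

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Sketch of the two-phase approach: each "mapper" keeps only its local
// top N via a bounded min-heap; a single "reducer" merges those lists.
public class TopN {
    // Keep the N largest values seen in one input split.
    static List<Integer> localTopN(List<Integer> split, int n) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // min-heap
        for (int v : split) {
            heap.offer(v);
            if (heap.size() > n) heap.poll(); // evict the current minimum
        }
        List<Integer> out = new ArrayList<>(heap);
        out.sort(Collections.reverseOrder());
        return out;
    }

    // The lone reducer merges all mapper outputs and takes the global top N.
    static List<Integer> globalTopN(List<List<Integer>> mapperOutputs, int n) {
        List<Integer> all = new ArrayList<>();
        for (List<Integer> part : mapperOutputs) {
            all.addAll(part);
        }
        return localTopN(all, n);
    }

    public static void main(String[] args) {
        List<Integer> splitA = Arrays.asList(5, 1, 9, 3);
        List<Integer> splitB = Arrays.asList(8, 2, 7, 6);
        List<List<Integer>> mapperOutputs =
                Arrays.asList(localTopN(splitA, 3), localTopN(splitB, 3));
        System.out.println(globalTopN(mapperOutputs, 3)); // prints [9, 8, 7]
    }
}
```

Note that if N exceeds a split's size, that mapper just emits everything it has, which is exactly the worry stated above: the single reducer can then end up receiving close to the whole data set.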
Is there another way?
Actually, what I am trying to find is the top n% of the whole data,
and this n could be very large if my data is large.
Assuming uniform rows of equal size, if the total data size is 10 GB
and I have to take the top 10% of the whole data set, then with the
above-mentioned approach I need 10% of 10 GB (1 GB) to pass through
a single reducer.
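One alternative worth noting (not suggested anywhere in this thread, so treat it purely as my own sketch with made-up names): estimate the value cutoff for the top n% from a random sample in a first pass, then let every mapper filter locally against that cutoff, so only roughly n% of the data ever reaches the reducers. The cutoff estimate itself, in plain Java:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: given a random sample of the data, estimate the
// smallest value that still belongs to the top `topFraction` of it.
public class PercentileCutoff {
    static double estimateCutoff(List<Double> sample, double topFraction) {
        List<Double> sorted = new ArrayList<>(sample);
        sorted.sort(Collections.reverseOrder()); // largest first
        // Index of the last element inside the top fraction of the sample.
        int idx = (int) Math.ceil(sorted.size() * topFraction) - 1;
        return sorted.get(Math.max(idx, 0));
    }

    public static void main(String[] args) {
        // Toy sample: with values 1..10, the top 20% cutoff is 9.0.
        List<Double> sample = new ArrayList<>();
        for (int i = 1; i <= 10; i++) sample.add((double) i);
        System.out.println(estimateCutoff(sample, 0.20)); // prints 9.0
    }
}
```

Because the cutoff comes from a sample, the filtered output is only approximately n%; a second, much smaller pass can trim or top up the result exactly.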
Thanks for that, Russell. Unfortunately I can't use Pig; I need to write
my own MR job. I was wondering how it's usually done in the best way
possible.
Regards
Praveenesh
On Sat, Feb 2, 2013 at 1:00 PM, Russell Jurney russell.jur...@gmail.com wrote:
Pig. Datafu. 7 lines of code.
Hi,
Can you tell more about:
* How big is N
* How big is the input dataset
* How many mappers you have
* Do input splits correlate with the sorting criterion for top N?
Depending on the answers, very different strategies will be optimal.
Pig. Datafu. 7 lines of code.
https://gist.github.com/4696443
https://github.com/linkedin/datafu
The following line in KMeansBSP.java throws the FileNotFoundException:

    SequenceFile.Writer centerWriter = SequenceFile.createWriter(fs, conf,
        center, VectorWritable.class, NullWritable.class, CompressionType.NONE);
I am getting the below exception message when I tried executing
How are you going to store videos in HDFS? By 'playing video on the browser' I
assume it's going to be real-time. So MR is not the correct way to go, I feel.
Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com
On Fri, Feb 1, 2013 at 2:00 PM, gopi lokavarapu hbigdata.g...@gmail.com wrote:
By default SequenceFileOutputFormat expects the
Key - LongWritable
Value - Text
I would like to know how to set custom data types for SequenceFileOutputFormat.
How can I achieve the below configuration with SequenceFileOutputFormat?
My mapper/reducer output will be
Key - Text
Value -
Hi
(I am using Yarn Hadoop-3.0.0.SNAPSHOT, revision 1437315M)
I have a question regarding my assumptions about the Yarn-MR design, especially
the InputSplit processing. Can someone confirm, or point out the mistakes in,
my MR-Yarn design assumptions?
These are my assumptions regarding design.
1.
-- Forwarded message --
From: Vikas Jadhav vikascjadha...@gmail.com
Date: Thu, Jan 31, 2013 at 11:14 PM
Subject: Re: Issue with Reduce Side join using datajoin package
To: user@hadoop.apache.org
Source:
public class MyJoin extends Configured
Karmasoftware has a plugin. I downloaded it but couldn't make it work; if time
permits, please check.
On Jan 29, 2013, at 11:34 PM, Martinus Martinus martinus...@gmail.com wrote:
Hi YouPeng,
I am also wondering the same thing. Does anybody know about an Eclipse plugin
for Hadoop?
Thanks.
Thanks for the reply Alejandro. Using a temp output directory was my first
guess as well. What's the best way to proceed? I've come across
FileSystem.rename but it's consistently returning false for whatever Paths I
provide. Specifically, I need to copy the following:
s3://path to data/tmp
You do NOT need to do anything special to emit Text, VectorWritable into
your SequenceFileOutputFormat.
Rather, you just need to emit the key as Text and the value as VectorWritable in
your reducer method,
and set setOutputKeyClass and setOutputValueClass with these on
your JobConf or Job.
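The advice above would look roughly like the following job setup (a fragment only, not runnable on its own: it assumes Hadoop and a VectorWritable implementation such as Mahout's on the classpath, and the exact Job construction differs between Hadoop versions):

```java
// Configuration sketch: SequenceFileOutputFormat with custom key/value types.
Job job = Job.getInstance(conf, "text-vector-seqfile");
job.setOutputFormatClass(SequenceFileOutputFormat.class);
job.setOutputKeyClass(Text.class);             // instead of the default key class
job.setOutputValueClass(VectorWritable.class); // instead of the default value class
// In the reducer, then simply: context.write(textKey, vectorWritableValue);
```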
Yesterday, I bounced my DFS cluster. We realized that ulimit -u was, in
extreme cases, preventing the name node from creating threads. This had only
started occurring within the last day or so. When I brought the name node back
up, it had essentially been rolled back by one week, and I lost
Where did you checkout the code from ?
You can get latest update in this JIRA:
HBASE-7290 Online snapshots
Cheers
On Fri, Feb 1, 2013 at 8:53 AM, YouPeng Yang yypvsxf19870...@gmail.com wrote:
Hi
I have got the latest source from Git.
When I perform mvn install -DskipTests,
it was
Hi Manoj,
As you may be aware this means the reduces are unable to fetch intermediate
data from TaskTrackers that ran map tasks - you can try:
* increasing tasktracker.http.threads so there are more threads to handle
fetch requests from reduces.
* decreasing mapreduce.reduce.parallel.copies
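As a configuration fragment, the two knobs mentioned would look roughly like this (property names as given above; the values are purely illustrative, not recommendations, and this needs a Hadoop classpath to run):

```java
// Sketch only; pick values that make sense for your cluster.
Configuration conf = new Configuration();
conf.setInt("tasktracker.http.threads", 80);        // more threads serving map output
conf.setInt("mapreduce.reduce.parallel.copies", 5); // fewer concurrent fetches per reduce
```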
Are you trying to build a specific submodule here, i.e., under which
checked-out directory are you running the mvn install? I'd
recommend running it under the root of the checkout (the most parent
pom.xml) first before trying to build a specific sub-module.
On Fri, Feb 1, 2013 at 10:23 PM,
On Fri, Feb 1, 2013 at 5:56 PM, Eshan Chakravorty eshan@gmail.com wrote:
You got that mostly right. And it doesn't differ much in Hadoop 1.* either:
with the MR AM doing the work that was earlier done in the JobTracker, the
JobClient and the task side don't change much.
FileInputFormat.getSplits() is called by the client itself, so you should look
for logs on the client.
As it clearly says, check the file permissions of your input directory
(/hama/input/center/). Also check whether you want the input on the local
file system or DFS.
+Vinod
On Fri, Feb 1, 2013 at 12:59 AM, Anbarasan Murthy anbarasa...@hcl.com wrote:
The following line in KMeansBSP.java throws the
Hi Harsh,
I actually prefer to use Eclipse for development, so I checked out the
Hadoop sources from Git.
I did this following the URL:
http://wiki.apache.org/hadoop/EclipseEnvironment
Here are my steps:
1.git clone git://git.apache.org/hadoop-common.git
2.[root@yyp6 hadoop-common]#
Hi Vijay,
Thanks for the information.
Few jobs were running in the cluster at the time.
Cheers!
Manoj.
On Fri, Feb 1, 2013 at 11:22 PM, Vijay Thakorlal vijayj...@hotmail.com wrote:
Hi Manoj,
As you may be aware this means the reduces are unable to fetch
intermediate data from