I would add that it also depends on how thoroughly you have vetted your use
cases. If you have already ironed out how ad-hoc access works, Kerberos vs.
firewall and network segmentation, how code submission works, procedures for
various operational issues, backup of your data, etc. (the list is a
Dale,
Talking solely about Hadoop core, you will only need to run 4 daemons on that
machine: Namenode, Jobtracker, Datanode and Tasktracker. There is no reason to
run multiples of any of them, as the tasktracker will spawn multiple child
JVMs, which is where you will get your task parallelism.
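For reference, the number of child JVMs the tasktracker will spawn is capped
in mapred-site.xml; a minimal sketch using the 0.20-era property names (the
values are illustrative, tune them to your cores):

    <!-- mapred-site.xml: max concurrent map/reduce child JVMs per tasktracker -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>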
invoke it as mapred.reduce.tasks=1 using a job-level conf.
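For instance, a hedged sketch of both routes (driver and jar names are
hypothetical; the -D form assumes the driver runs through ToolRunner so the
generic options get parsed):

    # per-job override from the command line
    hadoop jar myjob.jar MyDriver -D mapred.reduce.tasks=1 input output

    // or set it programmatically in the driver
    job.setNumReduceTasks(1);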
Matt
From: Dale McDiarmid [mailto:d...@ravn.co.uk]
Sent: Thursday, December 15, 2011 3:58 PM
To: common-user@hadoop.apache.org
Cc: GOEKE, MATTHEW [AG/1000]
Subject: Re: Large server recommendations
Thanks Matt,
Assuming therefore I run
Maneesh,
Firstly, I love the comic :)
Secondly, I am inclined to agree with Prashant on this latest point. While one
code path could take us through the user defining command-line overrides (e.g.
hadoop fs -D blah -put foo bar), I think it might confuse a person new to
Hadoop. The most common
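For what it's worth, a concrete rendering of that override path, with
dfs.block.size standing in for "blah" purely as an example:

    # one-off override of a config property for a single fs command
    hadoop fs -D dfs.block.size=134217728 -put foo bar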
Jay,
Did you download stable (0.20.203.X) or 0.23? From what I can tell, after
looking in the tarball for 0.23, it is a different setup than 0.20 (e.g.
hadoop-env.sh doesn't exist anymore and is replaced by yarn-env.sh) and the
documentation you referenced below is for setting up 0.20.
I
The old API is still fully usable in 0.20.204.
Matt
-----Original Message-----
From: Jignesh Patel [mailto:jign...@websoft.com]
Sent: Tuesday, October 11, 2011 12:17 PM
To: common-user@hadoop.apache.org
Subject: Re: updated example
That means the old API is not integrated in 0.20.204.0??
When do
Are you learning for the sake of experimenting or are there functional
requirements driving you to dive into this space?
*If you are learning for the sake of adding new tools to your portfolio: Look
into high-level overviews of each of the projects and review architecture
solutions that use
You could always check the web-ui job history for that particular run, open the
job.xml, and search for what the value of that parameter was at runtime.
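And if you pull that job.xml down locally, a quick grep gets you there too
(the property name here is just an example):

    # show the runtime value recorded in a saved job.xml
    grep -A 2 'mapred.reduce.tasks' job.xml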
Matt
-----Original Message-----
From: patrick sang [mailto:silvianhad...@gmail.com]
Sent: Wednesday, September 28, 2011 4:00 PM
To:
The simplest route I can think of is to ingest the data directly into HDFS
using Sqoop if there is a driver currently made for your database. At that
point it would be relatively simple just to read directly from HDFS in your MR
code.
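A hedged sketch of what that ingest could look like with Sqoop 1 (connect
string, credentials, and paths are all placeholders):

    sqoop import \
      --connect jdbc:mysql://dbhost/mydb \
      --username dbuser -P \
      --table mytable \
      --target-dir /data/mytable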
Matt
-----Original Message-----
From: lessonz
If you are starting from scratch with no prior Hadoop install experience, I
would configure stand-alone, migrate to pseudo-distributed, and then to fully
distributed, verifying functionality at each step by doing a simple word count
run. Also, if you don't mind using the CDH distribution then SCM
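The word count smoke test at each stage can be as small as this (paths are
illustrative; the examples jar name varies slightly by release):

    hadoop fs -put /etc/hosts /smoke/input
    hadoop jar $HADOOP_HOME/hadoop-*examples*.jar wordcount /smoke/input /smoke/output
    hadoop fs -cat /smoke/output/part-*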
I would like to have a robust setup for anything residing on our edge nodes,
which is where these two daemons will be, and I was curious if anyone had any
suggestions around how to replicate / keep an active clone of the metadata for
these components. We already use DRBD and a vip to get around
I was reviewing a video from Hadoop Summit 2011[1] where Arun Murthy mentioned
that MRv2 was moving towards protocol buffers as the wire format but I feel
like this is contrary to an Avro presentation that Doug Cutting did back in
Hadoop World '09[2]. I haven't stayed up to date with the Jira
I would completely agree with Mike's comments with one addition: Hadoop centers
around how to manipulate the flow of data in a way to make the framework work
for your specific problem. There are recipes for common problems but depending
on your domain that might solve only 30-40% of your use
Amusingly this is almost the same question that was asked the other day :)
quote from Owen O'Malley
There isn't currently a way of getting a collated, but unsorted list of
key/value pairs. For most applications, the in memory sort is fairly cheap
relative to the shuffle and other parts of the
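The closest you can get today is a map-only job: with zero reducers the map
output bypasses sort and shuffle entirely and goes straight to the output
format. A minimal new-API driver fragment:

    // no reducers = no sort, no shuffle; map output lands directly in HDFS
    Job job = new Job(conf, "map-only");
    job.setNumReduceTasks(0);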
if there is a way to disable sorting/shuffling.
Thanks,
Wei
-----Original Message-----
From: GOEKE, MATTHEW (AG/1000) [mailto:matthew.go...@monsanto.com]
Sent: Tuesday, September 20, 2011 8:34 AM
To: common-user@hadoop.apache.org
Subject: RE: how to set the number of mappers with 0 reducers?
Amusingly
In order to answer your first question we would need to know what types of data
you plan on storing and latency requirements. If it is
semistructured/unstructured data then HBase *can* be a great fit, but I have
seen very few cases where you will want to scrap your RDBMS completely. Most
Was the command line output really ever intended to be *that* verbose? I can
see how it would be useful but considering how easy it is to interact with the
web-ui I can't see why much effort should be put into enhancing it. Even if you
didn't want to see all of the details the web-ui has to
Would anyone happen to be able to share a good reference for Avro integration
with Hadoop? I can find plenty of material around using Avro by itself but I
have found little to no documentation on how to implement it as both the
protocol and as custom key/value types.
Thanks,
Matt
All,
We were discussing how we would backup our data from the various environments
we will have and I was hoping someone could chime in with previous experience
in this. My primary concern about our cluster is that we would like to be able
to recover anything within the last 60 days so having
It depends on what scope you want your unit tests to operate at. There is a
class you might want to look into called MiniMRCluster if you are dead set on
having tests as deep as possible, but you can still cover quite a bit with
MRUnit and JUnit4/Mockito.
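A hedged MRUnit sketch (WordCountMapper and the expected records are
hypothetical; this uses the new-API driver under
org.apache.hadoop.mrunit.mapreduce):

    // exercise a single map() call with no cluster at all
    new MapDriver<LongWritable, Text, Text, IntWritable>()
        .withMapper(new WordCountMapper())
        .withInput(new LongWritable(0), new Text("foo foo"))
        .withOutput(new Text("foo"), new IntWritable(1))
        .withOutput(new Text("foo"), new IntWritable(1))
        .runTest();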
Matt
-----Original Message-----
If you are asking how to make those classes available at run time, you can
either use the -libjars command for the distributed cache or you can just shade
those classes into your jar using Maven. I have had enough issues in the past
with the classpath being flaky that I prefer the shading method but
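Both routes in sketch form (jar paths and the pom fragment are illustrative):

    # distributed-cache route: ship dependencies alongside the job
    hadoop jar myjob.jar MyDriver -libjars /path/dep1.jar,/path/dep2.jar input output

    <!-- shading route: pom.xml fragment that folds deps into the job jar -->
    <plugin>
      <groupId>org.apache.maven.plugins</groupId>
      <artifactId>maven-shade-plugin</artifactId>
      <executions>
        <execution>
          <phase>package</phase>
          <goals><goal>shade</goal></goals>
        </execution>
      </executions>
    </plugin>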
Is this just for testing purposes or are you planning on going into production
with this? If it is the latter then I would STRONGLY advise you not to give that
a second thought due to how the framework handles I/O. However if you are just
trying to test out distributed daemon setup and get some ops
:04 PM
To: common-user@hadoop.apache.org
Subject: Re: hadoop cluster on VM's
On Mon, Aug 15, 2011 at 7:31 PM, GOEKE, MATTHEW (AG/1000)
matthew.go...@monsanto.com wrote:
Is this just for testing purposes or are you planning on going into
production with this? If it is the latter then I would
Does anyone have any code examples for how they persist join data across
multiple input splits and how they test it? Currently I populate a singleton in
the setup method of my mapper (along with having jvm reuse turned on for this
job) but with no way to have dependency injection into the
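For context, a hedged sketch of the pattern I'm describing (loadJoinData is a
stand-in for however you read the cached file; this only pays off with
mapred.job.reuse.jvm.num.tasks=-1 so tasks share the JVM):

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
      // static, so it survives across tasks when the JVM is reused
      private static volatile Map<String, String> joinData;

      @Override
      protected void setup(Context context) throws IOException {
        if (joinData == null) {
          synchronized (JoinMapper.class) {
            if (joinData == null) {
              joinData = loadJoinData(context.getConfiguration());
            }
          }
        }
      }

      // stand-in: real code would read the distributed cache here
      private static Map<String, String> loadJoinData(
          org.apache.hadoop.conf.Configuration conf) throws IOException {
        return new HashMap<String, String>();
      }
    }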
Sofia, correct me if I am wrong, but Mike, I think this thread was about using
the output of a previous job, in this case already in sequence file format, as
in-memory join data for another job.
Side note: does anyone know what the rule of thumb on file size is when using
the distributed cache
My assumption would be that having a set of 4 RAID 0 disks would actually be
better than having a controller that allowed pure JBOD of 4 disks, due to the
cache on the controller. If anyone has any personal experience with this I
would love to know performance numbers, but our infrastructure guy
If you have the source downloaded (and if you don't, I would suggest you get it)
you can do a search for *InputFormat.java and you will have all the references
you need. Also you might want to check out http://codedemigod.com/blog/?p=120
or take a look at the books Hadoop in Action or Hadoop:
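For instance, from the root of the extracted source tree:

    # list every bundled InputFormat implementation
    find . -name '*InputFormat.java'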
All,
I have a MR program that I feed in a list of IDs and it generates the unique
comparison set as a result. Example: if I have a list {1,2,3,4,5} then the
resulting output would be {2x1, 3x2, 3x1, 4x3, 4x2, 4x1, 5x4, 5x3, 5x2, 5x1} or
(n^2-n)/2 number of comparisons. My code works just fine
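For context, the pairing logic described above is just the strict lower
triangle; a hedged sketch with emit() standing in for writing a pair:

    // all unordered pairs from a list of IDs: (n^2 - n) / 2 of them
    for (int i = 1; i < ids.size(); i++) {
      for (int j = i - 1; j >= 0; j--) {
        emit(ids.get(i), ids.get(j));   // 2x1, 3x2, 3x1, 4x3, ...
      }
    }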
DN and RS and then a
thread per slot, so you end up with 10 slots per node. Of course memory is also
a factor.
Note this is only a starting point. You can always tune up.
Sent from a remote device. Please excuse any typos...
Mike Segel
On Jun 27, 2011, at 11:11 PM, GOEKE, MATTHEW (AG
Saumitra,
Two questions come to mind that could help you narrow down a solution:
1) How quickly do the downstream processes need the transformed data?
Reason: If you can delay the processing for a period of time, enough to
batch the data into a blob that is a multiple of your block
If you are running default configurations then you are only getting 2 mappers
and 1 reducer per node. The rule of thumb I have gone on (and backed up by the
Definitive Guide) is 2 processes per core, so: tasktracker/datanode and 6 slots
left. How you break it up from there is your call but I would
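Worked out for a quad-core node (the exact map/reduce split is your call):

    4 cores x 2 processes per core   = 8 processes
    minus 1 tasktracker, 1 datanode  = 6 task slots
    e.g. 4 map slots + 2 reduce slots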
Did you make sure to define the datanode/tasktracker in the slaves file in your
conf directory and push that to both machines? Also, have you checked the logs
on either to see if there are any errors?
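For reference, the slaves file is just one worker hostname per line (the
hostnames below are examples):

    # conf/slaves on the master, then copy conf/ to every machine
    node1.example.com
    node2.example.com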
Matt
-----Original Message-----
From: Jingwei Lu [mailto:j...@ucsd.edu]
Sent: Monday, June
comments will be greatly appreciated!
Best Regards
Yours Sincerely
Jingwei Lu
On Mon, Jun 27, 2011 at 1:28 PM, GOEKE, MATTHEW (AG/1000)
matthew.go...@monsanto.com wrote:
Did you make sure to define the datanode/tasktracker in the slaves file in
your conf directory and push that to both
but always fails...
Best Regards
Yours Sincerely
Jingwei Lu
On Mon, Jun 27, 2011 at 2:22 PM, GOEKE, MATTHEW (AG/1000)
matthew.go...@monsanto.com wrote:
As a follow-up to what Jeff posted: go ahead and ignore the message you got
on the NN for now.
If you look at the address that the DN log shows
Harsh,
Is it possible for mapred.reduce.slowstart.completed.maps to even play a
significant role in this? The only benefit he would find in tweaking that for
his problem would be to spread network traffic from the shuffle over a longer
period of time at a cost of having the reducer using
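For anyone following along, that knob is a fraction of completed maps (default
0.05 in the 0.20 line); a hedged example of delaying the shuffle:

    <!-- mapred-site.xml: don't launch reducers until 80% of maps finish -->
    <property>
      <name>mapred.reduce.slowstart.completed.maps</name>
      <value>0.80</value>
    </property>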
Is the lookup table constant across each of the tasks? You could try putting it
into memcached:
http://hcil.cs.umd.edu/trs/2009-01/2009-01.pdf
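A hedged fragment of what the lookup could look like from a task's setup(),
using the spymemcached client (host and key are placeholders):

    import java.net.InetSocketAddress;
    import net.spy.memcached.MemcachedClient;

    // one shared cache daemon serves every task's lookups
    MemcachedClient cache =
        new MemcachedClient(new InetSocketAddress("cachehost", 11211));
    Object value = cache.get("lookup:12345");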
Matt
-----Original Message-----
From: Ian Upright [mailto:i...@upright.net]
Sent: Wednesday, June 15, 2011 3:42 PM
To: common-user@hadoop.apache.org