RE: Experience with Hadoop in production

2012-02-24 Thread GOEKE, MATTHEW (AG/1000)
I would add that it also depends on how thoroughly you have vetted your use cases. If you have already ironed out how ad-hoc access works, Kerberos vs Firewall and network segmentation, how code submission works, procedures for various operational issues, backup of your data, etc (the list is a

RE: Large server recommedations

2011-12-15 Thread GOEKE, MATTHEW (AG/1000)
Dale, Talking solely about hadoop core you will only need to run 4 daemons on that machine: Namenode, Jobtracker, Datanode and Tasktracker. There is no reason to run multiple of any of them as the tasktracker will spawn multiple child jvms which is where you will get your task parallelism.

RE: Large server recommedations

2011-12-15 Thread GOEKE, MATTHEW (AG/1000)
invoke it as mapred.reduce.tasks=1 using job level conf. Matt From: Dale McDiarmid [mailto:d...@ravn.co.uk] Sent: Thursday, December 15, 2011 3:58 PM To: common-user@hadoop.apache.org Cc: GOEKE, MATTHEW [AG/1000] Subject: Re: Large server recommedations thanks matt, Assuming therefore i run

RE: HDFS Explained as Comics

2011-11-30 Thread GOEKE, MATTHEW (AG/1000)
Maneesh, Firstly, I love the comic :) Secondly, I am inclined to agree with Prashant on this latest point. While one code path could take us through the user defining command line overrides (e.g. hadoop fs -D blah -put foo bar) I think it might confuse a person new to Hadoop. The most common

RE: No HADOOP COMMON HOME set.

2011-11-17 Thread GOEKE, MATTHEW (AG/1000)
Jay, Did you download stable (0.20.203.X) or 0.23? From what I can tell, after looking in the tarball for 0.23, it is a different setup than 0.20 (e.g. hadoop-env.sh doesn't exist anymore and is replaced by yarn-env.sh) and the documentation you referenced below is for setting up 0.20. I

RE: updated example

2011-10-11 Thread GOEKE, MATTHEW (AG/1000)
The old API is still fully usable in 0.20.204. Matt -Original Message- From: Jignesh Patel [mailto:jign...@websoft.com] Sent: Tuesday, October 11, 2011 12:17 PM To: common-user@hadoop.apache.org Subject: Re: updated example That means old API is not integrated in 0.20.204.0?? When do

RE: Learning curve after MapReduce and HDFS

2011-09-30 Thread GOEKE, MATTHEW (AG/1000)
Are you learning for the sake of experimenting or are there functional requirements driving you to dive into this space? *If you are learning for the sake of adding new tools to your portfolio: Look into high level overviews of each of the projects and review architecture solutions that use

RE: dump configuration

2011-09-28 Thread GOEKE, MATTHEW (AG/1000)
You could always check the web-ui job history for that particular run, open the job.xml, and search for what the value of that parameter was at runtime. Matt -Original Message- From: patrick sang [mailto:silvianhad...@gmail.com] Sent: Wednesday, September 28, 2011 4:00 PM To:

RE: Temporary Files to be sent to DistributedCache

2011-09-27 Thread GOEKE, MATTHEW (AG/1000)
The simplest route I can think of is to ingest the data directly into HDFS using Sqoop if there is a driver currently made for your database. At that point it would be relatively simple just to read directly from HDFS in your MR code. Matt -Original Message- From: lessonz

RE: Environment consideration for a research on scheduling

2011-09-23 Thread GOEKE, MATTHEW (AG/1000)
If you are starting from scratch with no prior Hadoop install experience I would configure stand-alone, migrate to pseudo distributed and then to fully distributed verifying functionality at each step by doing a simple word count run. Also, if you don't mind using the CDH distribution then SCM

Question regarding Oozie and Hive replication / backup

2011-09-22 Thread GOEKE, MATTHEW (AG/1000)
I would like to have a robust setup for anything residing on our edge nodes, which is where these two daemons will be, and I was curious if anyone had any suggestions around how to replicate / keep an active clone of the metadata for these components. We already use DRBD and a vip to get around

Hadoop RPC and general serialization question

2011-09-22 Thread GOEKE, MATTHEW (AG/1000)
I was reviewing a video from Hadoop Summit 2011[1] where Arun Murthy mentioned that MRv2 was moving towards protocol buffers as the wire format but I feel like this is contrary to an Avro presentation that Doug Cutting did back in Hadoop World '09[2]. I haven't stayed up to date with the Jira

RE: risks of using Hadoop

2011-09-21 Thread GOEKE, MATTHEW (AG/1000)
I would completely agree with Mike's comments with one addition: Hadoop centers around how to manipulate the flow of data in a way to make the framework work for your specific problem. There are recipes for common problems but depending on your domain that might solve only 30-40% of your use

RE: how to set the number of mappers with 0 reducers?

2011-09-20 Thread GOEKE, MATTHEW (AG/1000)
Amusingly this is almost the same question that was asked the other day :) quote from Owen O'Malley There isn't currently a way of getting a collated, but unsorted list of key/value pairs. For most applications, the in memory sort is fairly cheap relative to the shuffle and other parts of the

RE: how to set the number of mappers with 0 reducers?

2011-09-20 Thread GOEKE, MATTHEW (AG/1000)
if there is a way to disable sorting/shuffling. Thanks, Wei -Original Message- From: GOEKE, MATTHEW (AG/1000) [mailto:matthew.go...@monsanto.com] Sent: Tuesday, September 20, 2011 8:34 AM To: common-user@hadoop.apache.org Subject: RE: how to set the number of mappers with 0 reducers? Amusingly

RE: Using HBase for real time transaction

2011-09-20 Thread GOEKE, MATTHEW (AG/1000)
In order to answer your first question we would need to know what types of data you plan on storing and latency requirements. If it is semistructured/unstructured data then HBase *can* be a great fit but I have seen very few cases where you will want to scrap your RDBMS completely. Most

RE: phases of Hadoop Jobs

2011-09-19 Thread GOEKE, MATTHEW (AG/1000)
Was the command line output really ever intended to be *that* verbose? I can see how it would be useful but considering how easy it is to interact with the web-ui I can't see why much effort should be put into enhancing it. Even if you didn't want to see all of the details the web-ui has to

Hadoop/CDH + Avro

2011-09-13 Thread GOEKE, MATTHEW (AG/1000)
Would anyone happen to be able to share a good reference for Avro integration with Hadoop? I can find plenty of material around using Avro by itself but I have found little to no documentation on how to implement it as both the protocol and as custom key/value types. Thanks, Matt This e-mail

Hadoop multi tier backup

2011-08-30 Thread GOEKE, MATTHEW (AG/1000)
All, We were discussing how we would back up our data from the various environments we will have and I was hoping someone could chime in with previous experience in this. My primary concern about our cluster is that we would like to be able to recover anything within the last 60 days so having

RE: Hadoop in process?

2011-08-26 Thread GOEKE, MATTHEW (AG/1000)
It depends on what scope you want your unit tests to operate at. There is a class you might want to look into called MiniMRCluster if you are dead set on having as deep of tests as possible but you can still cover quite a bit with MRUnit and Junit4/Mockito. Matt -Original Message-

RE: Making sure I understand HADOOP_CLASSPATH

2011-08-22 Thread GOEKE, MATTHEW (AG/1000)
If you are asking how to make those classes available at run time you can either use the -libjars option for the distributed cache or you can just shade those classes into your jar using Maven. I have had enough issues in the past with classpath being flaky that I prefer the shading method but

RE: hadoop cluster on VM's

2011-08-15 Thread GOEKE, MATTHEW (AG/1000)
Is this just for testing purposes or are you planning on going into production with this? If it is the latter then I would STRONGLY advise to not give that a second thought due to how the framework handles I/O. However if you are just trying to test out distributed daemon setup and get some ops

RE: hadoop cluster on VM's

2011-08-15 Thread GOEKE, MATTHEW (AG/1000)
:04 PM To: common-user@hadoop.apache.org Subject: Re: hadoop cluster on VM's On Mon, Aug 15, 2011 at 7:31 PM, GOEKE, MATTHEW (AG/1000) matthew.go...@monsanto.com wrote: Is this just for testing purposes or are you planning on going into production with this? If it is the latter then I would

Unit testing MR without dependency injection

2011-08-15 Thread GOEKE, MATTHEW (AG/1000)
Does anyone have any code examples for how they persist join data across multiple input splits and how they test it? Currently I populate a singleton in the setup method of my mapper (along with having jvm reuse turned on for this job) but with no way to have dependency injection into the

RE: Hadoop--store a sequence file in distributed cache?

2011-08-12 Thread GOEKE, MATTHEW (AG/1000)
Sofia correct me if I am wrong, but Mike I think this thread was about using the output of a previous job, in this case already in sequence file format, as in memory join data for another job. Side note: does anyone know what the rule of thumb on file size is when using the distributed cache

RE: Question about RAID controllers and hadoop

2011-08-11 Thread GOEKE, MATTHEW (AG/1000)
My assumption would be that having a set of 4 raid 0 disks would actually be better than having a controller that allowed pure JBOD of 4 disks due to the cache on the controller. If anyone has any personal experience with this I would love to know performance numbers but our infrastructure guy

RE: Giving filename as key to mapper ?

2011-07-15 Thread GOEKE, MATTHEW (AG/1000)
If you have the source downloaded (and if you don't I would suggest you get it) you can do a search for *InputFormat.java and you will have all the references you need. Also you might want to check out http://codedemigod.com/blog/?p=120 or take a look at the books Hadoop in action or Hadoop:

Issue with MR code not scaling correctly with data sizes

2011-07-14 Thread GOEKE, MATTHEW (AG/1000)
All, I have a MR program that I feed in a list of IDs and it generates the unique comparison set as a result. Example: if I have a list {1,2,3,4,5} then the resulting output would be {2x1, 3x2, 3x1, 4x3, 4x2, 4x1, 5x4, 5x3, 5x2, 5x1} or (n^2-n)/2 number of comparisons. My code works just fine
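For reference, the comparison set described in this message can be reproduced outside MapReduce with a few lines of Python. This is only a single-machine sketch of the enumeration, not the poster's MR code; the function names `comparison_pairs` and `expected_count` are made up for illustration.

```python
from itertools import combinations


def comparison_pairs(ids):
    """Return every unordered pair of IDs exactly once, larger ID first."""
    ordered = sorted(ids)
    # combinations() yields (smaller, larger); flip to match the "2x1" style.
    return [(b, a) for a, b in combinations(ordered, 2)]


def expected_count(n):
    """Number of unique comparisons for n IDs: (n^2 - n) / 2."""
    return (n * n - n) // 2


ids = [1, 2, 3, 4, 5]
pairs = comparison_pairs(ids)
print(", ".join(f"{a}x{b}" for a, b in pairs))
print(len(pairs), expected_count(len(ids)))
```

Because the output grows quadratically with the input, doubling the ID list roughly quadruples the number of pairs, which is worth keeping in mind when judging how a job like this scales.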

RE: Performance Tunning

2011-06-28 Thread GOEKE, MATTHEW (AG/1000)
DN and RS and then a thread per slot so you end up w 10 slots per node. Of course memory is also a factor. Note this is only a starting point. You can always tune up. Sent from a remote device. Please excuse any typos... Mike Segel On Jun 27, 2011, at 11:11 PM, GOEKE, MATTHEW (AG

RE: Queue support from HDFS

2011-06-27 Thread GOEKE, MATTHEW (AG/1000)
Saumitra, Two questions come to mind that could help you narrow down a solution: 1) How quickly do the downstream processes need the transformed data? Reason: If you can delay the processing for a period of time, enough to batch the data into a blob that is a multiple of your block

RE: Performance Tunning

2011-06-27 Thread GOEKE, MATTHEW (AG/1000)
If you are running default configurations then you are only getting 2 mappers and 1 reducer per node. The rule of thumb I have gone on (and backed up by the definitive guide) is 2 processes per core so: tasktracker/datanode and 6 slots left. How you break it up from there is your call but I would
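The rule of thumb in this message works out as simple arithmetic. The sketch below assumes a 4-core node, which is inferred from the "6 slots left" figure rather than stated in the snippet, and counts the TaskTracker and DataNode as one process each; `task_slots` is a hypothetical helper name.

```python
def task_slots(cores, processes_per_core=2, daemons=2):
    """Slots left for map/reduce tasks after reserving the node daemons.

    daemons=2 stands for the TaskTracker and DataNode on a worker node.
    """
    return cores * processes_per_core - daemons


# Assumed 4-core node: 4 * 2 = 8 processes, minus 2 daemons.
print(task_slots(4))
```

How the remaining slots are split between mappers and reducers is left to the operator, and memory per task still has to fit within the node.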

RE: Why I cannot see live nodes in a LAN-based cluster setup?

2011-06-27 Thread GOEKE, MATTHEW (AG/1000)
Did you make sure to define the datanode/tasktracker in the slaves file in your conf directory and push that to both machines? Also have you checked the logs on either to see if there are any errors? Matt -Original Message- From: Jingwei Lu [mailto:j...@ucsd.edu] Sent: Monday, June

RE: Why I cannot see live nodes in a LAN-based cluster setup?

2011-06-27 Thread GOEKE, MATTHEW (AG/1000)
comments will be greatly appreciated! Best Regards Yours Sincerely Jingwei Lu On Mon, Jun 27, 2011 at 1:28 PM, GOEKE, MATTHEW (AG/1000) matthew.go...@monsanto.com wrote: Did you make sure to define the datanode/tasktracker in the slaves file in your conf directory and push that to both

RE: Why I cannot see live nodes in a LAN-based cluster setup?

2011-06-27 Thread GOEKE, MATTHEW (AG/1000)
but always fails... Best Regards Yours Sincerely Jingwei Lu On Mon, Jun 27, 2011 at 2:22 PM, GOEKE, MATTHEW (AG/1000) matthew.go...@monsanto.com wrote: As a follow-up to what Jeff posted: go ahead and ignore the message you got on the NN for now. If you look at the address that the DN log shows

RE: Poor scalability with map reduce application

2011-06-21 Thread GOEKE, MATTHEW (AG/1000)
Harsh, Is it possible for mapred.reduce.slowstart.completed.maps to even play a significant role in this? The only benefit he would find in tweaking that for his problem would be to spread network traffic from the shuffle over a longer period of time at a cost of having the reducer using

RE: large memory tasks

2011-06-15 Thread GOEKE, MATTHEW (AG/1000)
Is the lookup table constant across each of the tasks? You could try putting it into memcached: http://hcil.cs.umd.edu/trs/2009-01/2009-01.pdf Matt -Original Message- From: Ian Upright [mailto:i...@upright.net] Sent: Wednesday, June 15, 2011 3:42 PM To: common-user@hadoop.apache.org