Recovering the namenode from failure

2013-05-21 Thread David Parks
I'm on CDH4, and trying to recover both the namenode and Cloudera Manager VMs from HDFS after losing the namenode. All of our backup VMs are on HDFS, so for the moment I just want to hack something together, copy the backup VMs off HDFS and get on with properly reconfiguring via CDH Manager.

RE: JobClient: Error reading task output - after instituting a DNS server

2013-05-15 Thread David Parks
PM To: user@hadoop.apache.org Subject: Re: JobClient: Error reading task output - after instituting a DNS server Hi David. Can you explain in a bit more detail what was the issue? Thanks. Shahab On Tue, May 14, 2013 at 2:29 AM, David Parks davidpark...@yahoo.com wrote: I just hate

RE: About configuring cluster setup

2013-05-15 Thread David Parks
We have a box that's a bit overpowered for just running our namenode and jobtracker on a 10-node cluster and we also wanted to make use of the storage and processor resources of that node, like you. What we did is use LXC containers to segregate the different processes. LXC is a very light

RE: JobClient: Error reading task output - after instituting a DNS server

2013-05-14 Thread David Parks
: David Parks [mailto:davidpark...@yahoo.com] Sent: Tuesday, May 14, 2013 1:20 PM To: user@hadoop.apache.org Subject: JobClient: Error reading task output - after instituting a DNS server So we just configured a local DNS server for hostname resolution and stopped using a hosts file and now jobs

600s timeout during copy phase of job

2013-05-13 Thread David Parks
I have a job that's getting 600s task timeouts during the copy phase of the reduce step. I see a lot of copy tasks all moving at about 2.5MB/sec, and it's taking longer than 10 min to do that copy. The process starts copying when the reduce step is 80% complete. This is a very IO bound task as

Using FairScheduler to limit # of tasks

2013-05-13 Thread David Parks
Can I use the FairScheduler to limit the number of map/reduce tasks directly from the job configuration? E.g. I have 1 job that I know should run a more limited # of map/reduce tasks than is set as the default, I want to configure a queue with a limited # of map/reduce tasks, but only apply it to
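
A minimal sketch of the pool-based approach, assuming the Hadoop 1.x FairScheduler; the pool name "throttled" and the slot limits below are made-up examples:

    <?xml version="1.0"?>
    <!-- fair-scheduler allocation file; "throttled" is an invented pool name -->
    <allocations>
      <pool name="throttled">
        <maxMaps>10</maxMaps>
        <maxReduces>5</maxReduces>
      </pool>
    </allocations>

The one job that needs throttling then opts into that pool from its own configuration (depending on the Hadoop version, either via `mapred.fairscheduler.pool` or whatever job property `mapred.fairscheduler.poolnameproperty` names); other jobs stay in their default pools and keep the cluster-wide defaults.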

RE: Access HDFS from OpenCL

2013-05-13 Thread David Parks
Hadoop just runs as a standard java process, you should find something that bridges between OpenCL and java, a quick google search yields: http://www.jocl.org/ I expect that you'll find everything you need to accomplish the handoff from your mapreduce code to OpenCL there. As for HDFS,
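
For the HDFS side, a small sketch of reading a file into a byte array through the standard FileSystem API, which could then be handed to an OpenCL buffer via JOCL; the class name is invented and it assumes the file fits in memory:

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsToBuffer {
        // Read an HDFS file into a byte[] that can be passed to an OpenCL buffer via JOCL.
        // e.g. readAll("hdfs://namenode:8020/data/input.bin") -- a made-up path.
        public static byte[] readAll(String uri) throws IOException {
            Configuration conf = new Configuration();
            Path path = new Path(uri);
            FileSystem fs = path.getFileSystem(conf);
            long len = fs.getFileStatus(path).getLen();
            byte[] data = new byte[(int) len];        // assumes the file fits in memory
            FSDataInputStream in = fs.open(path);
            try {
                in.readFully(0, data);                // positioned read of the whole file
            } finally {
                in.close();
            }
            return data;
        }
    }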

What's the best disk configuration for hadoop? SSD's Raid levels, etc?

2013-05-11 Thread David Parks
We've got a cluster of 10x 8core/24gb nodes, currently with 1 4TB disk (3 disk slots max), they chug away ok currently, only slightly IO bound on average. I'm going to upgrade the disk configuration at some point (we do need more space on HDFS) and I'm thinking about what's best hardware-wise:

RE: Uploading file to HDFS

2013-04-19 Thread David Parks
I think the problem here is that he doesn't have Hadoop installed on this other location so there's no Hadoop DFS client to do the put directly into HDFS with; he would normally copy the file to one of the nodes in the cluster where the client files are installed. I've had the same problem recently.

RE: Uploading file to HDFS

2013-04-19 Thread David Parks
I just realized another trick you might try. The Hadoop dfs client can read input from STDIN, you could use netcat to pipe the stuff across to HDFS without hitting the hard drive, I haven’t tried it, but here’s what I would think might work: On the Hadoop box, open a listening port and feed
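
A sketch of what that pipeline might look like, using the stdin (`-`) form of `hadoop fs -put`; the port, hostnames, and paths are made up, and some netcat builds want `-l -p <port>` rather than `-l <port>`:

    # On a box with the Hadoop client installed: listen and stream straight into HDFS
    nc -l 9999 | hadoop fs -put - /user/hadoop/incoming/bigfile.dat

    # On the remote box that holds the file: push it across the network
    nc hadoop-client-host 9999 < /local/path/bigfile.dat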

Mapreduce jobs to download job input from across the internet

2013-04-16 Thread David Parks
For a set of jobs to run I need to download about 100GB of data from the internet (~1000 files of varying sizes from ~10 different domains). Currently I do this in a simple linux script as it's easy to script FTP, curl, and the like. But it's a mess to maintain a separate server for that

RE: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

2013-03-30 Thread David Parks
? Looking forward to hear from you. Thanks Himanish On Fri, Mar 29, 2013 at 10:34 AM, David Parks davidpark...@yahoo.com wrote: CDH4 can be either 1.x or 2.x hadoop, are you using the 2.x line? I've used it primarily with 1.0.3, which is what AWS uses, so I presume that's what it's

RE: Which hadoop installation should I use on ubuntu server?

2013-03-29 Thread David Parks
Subject: Re: Which hadoop installation should I use on ubuntu server? apache bigtop has builds done for ubuntu you can check them at jenkins mentioned on bigtop.apache.org On Thu, Mar 28, 2013 at 11:37 AM, David Parks davidpark...@yahoo.com wrote: I'm moving off AWS MapReduce to our own

Which hadoop installation should I use on ubuntu server?

2013-03-28 Thread David Parks
I'm moving off AWS MapReduce to our own cluster, I'm installing Hadoop on Ubuntu Server 12.10. I see a .deb installer and installed that, but it seems like files are all over the place `/usr/share/Hadoop`, `/etc/hadoop`, `/usr/bin/hadoop`. And the documentation is a bit harder to follow:

RE: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

2013-03-28 Thread David Parks
. Wouldn't some kind of VPN be needed between the Amazon EMR instance and our on-premises hadoop instance ? Did you mean use the jar from amazon on our local server ? Thanks On Thu, Mar 28, 2013 at 3:56 AM, David Parks davidpark...@yahoo.com wrote: Have you tried using s3distcp from amazon? I used

RE: For a new installation: use the BackupNode or the CheckPointNode?

2013-03-26 Thread David Parks
below link will be useful.. http://hadoop.apache.org/docs/stable/hdfs_user_guide.html On Sat, Mar 23, 2013 at 12:29 PM, David Parks davidpark...@yahoo.com wrote: For a new installation of the current stable build (1.1.2), is there any reason to use the CheckPointNode over the BackupNode

RE:

2013-03-25 Thread David Parks
Can I suggest an answer of "Yes, but you probably don't want to"? As a typical user of Hadoop you would not do this. Hadoop already chooses the best server to do the work based on the location of the data (a server that is available to do work and also has the data locally will generally be

For a new installation: use the BackupNode or the CheckPointNode?

2013-03-23 Thread David Parks
For a new installation of the current stable build (1.1.2), is there any reason to use the CheckPointNode over the BackupNode? It seems that we need to choose one or the other, and from the docs it seems like the BackupNode is more efficient in its processes.

RE: For a new installation: use the BackupNode or the CheckPointNode?

2013-03-23 Thread David Parks
link will be useful.. http://hadoop.apache.org/docs/stable/hdfs_user_guide.html On Sat, Mar 23, 2013 at 12:29 PM, David Parks davidpark...@yahoo.com wrote: For a new installation of the current stable build (1.1.2), is there any reason to use the CheckPointNode over the BackupNode

RE: On a small cluster can we double up namenode/master with tasktrackers?

2013-03-20 Thread David Parks
Good points all, The mapreduce jobs are, well... intensive. We've got a whole variety, but typically I see them use a lot of CPU, a lot of Disk, and upon occasion a whole bunch of Network bandwidth. Duh, right? :) The master node is mostly CPU intensive, right? We're using LXC to segregate

On a small cluster can we double up namenode/master with tasktrackers?

2013-03-18 Thread David Parks
I want 20 servers, I got 7, so I want to make the most of the 7 I have. Each of the 7 servers has 24GB of RAM, 4TB of disk, and 8 cores. Would it be terribly unwise of me to run such a configuration: Server #1: NameNode + Master + TaskTracker (reduced slots); Server

How Alpha is alpha?

2013-03-12 Thread David Parks
From the release page on hadoop's website: This release, like previous releases in hadoop-2.x series is still considered alpha primarily since some of APIs aren't fully-baked and we expect some churn in future. How alpha is the 2.x line? We're moving off AWS (1.0.3) onto our own cluster of

Re: Unexpected Hadoop behavior: map task re-running after reducer has been running

2013-03-11 Thread David Parks
: David Parks davidpark...@yahoo.com To: user@hadoop.apache.org Sent: Monday, March 11, 2013 3:23 PM Subject: Unexpected Hadoop behavior: map task re-running after reducer has been running I can’t explain this behavior, can someone help me here:     Kind  % Complete Num Tasks Pending Running

How do _you_ document your hadoop jobs?

2013-02-25 Thread David Parks
We've taken to documenting our Hadoop jobs in a simple visual manner using PPT (attached example). I wonder how others document their jobs? We often add notes to the text section of the PPT slides as well.

RE: How can I limit reducers to one-per-node?

2013-02-10 Thread David Parks
(roughly). On Sat, Feb 9, 2013 at 9:24 AM, David Parks davidpark...@yahoo.com wrote: I have a cluster of boxes with 3 reducers per node. I want to limit a particular job to only run 1 reducer per node. This job is network IO bound, gathering images from a set of webservers. My job has certain

File does not exist on part-r-00000 file after reducer runs

2013-02-10 Thread David Parks
Are there any rules against writing results to Reducer.Context while in the cleanup() method? I’ve got a reducer that is downloading a few tens of millions of images from a set of URLs fed to it. To be efficient I run many connections in parallel, but limit connections per domain and
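
A sketch of that pattern under the new (mapreduce) API: buffer work during reduce() and emit the results from cleanup(), which runs before the task's output is closed. The class and field names are invented:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Queue work in reduce() and emit the results from cleanup().
    public class DownloadReducer extends Reducer<Text, Text, Text, Text> {
        private final List<Text> pendingUrls = new ArrayList<Text>();

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context) {
            for (Text url : values) {
                pendingUrls.add(new Text(url));   // queue for the parallel downloader
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            // Drain whatever is still pending and write the results here; the task's
            // output is not closed until after cleanup() returns.
            for (Text url : pendingUrls) {
                context.write(url, new Text("DOWNLOADED"));
            }
        }
    }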

RE: How can I limit reducers to one-per-node?

2013-02-10 Thread David Parks
, reduce # of reducers needed?), but it doesn't affect scheduling of a set number of reduce tasks nor does a scheduler care currently if you add that step in or not. On Mon, Feb 11, 2013 at 7:59 AM, David Parks davidpark...@yahoo.com wrote: I guess the FairScheduler is doing multiple assignments per

RE: Question related to Decompressor interface

2013-02-09 Thread David Parks
I can't answer your question about the Decompressor interface, but I have a query for you. Why not just create an EncryptedWritable object? Encrypt/decrypt the bytes on the read/write method, that should be darn near trivial. Then stick with good ol' SequenceFile, which, as you note, is
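
A rough sketch of that suggestion, assuming a symmetric cipher from `javax.crypto`; the hard-coded key and plain AES mode are placeholders for illustration, not a recommendation:

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import javax.crypto.Cipher;
    import javax.crypto.spec.SecretKeySpec;
    import org.apache.hadoop.io.Writable;

    // A Writable that encrypts its payload in write() and decrypts it in readFields(),
    // so it can be stored in an ordinary SequenceFile.
    public class EncryptedWritable implements Writable {
        private static final byte[] KEY = "0123456789abcdef".getBytes(); // 16-byte demo key
        private byte[] plaintext = new byte[0];

        public void set(byte[] value) { this.plaintext = value; }
        public byte[] get()           { return plaintext; }

        public void write(DataOutput out) throws IOException {
            byte[] enc = crypt(Cipher.ENCRYPT_MODE, plaintext);
            out.writeInt(enc.length);
            out.write(enc);
        }

        public void readFields(DataInput in) throws IOException {
            byte[] enc = new byte[in.readInt()];
            in.readFully(enc);
            plaintext = crypt(Cipher.DECRYPT_MODE, enc);
        }

        private static byte[] crypt(int mode, byte[] data) throws IOException {
            try {
                Cipher cipher = Cipher.getInstance("AES");
                cipher.init(mode, new SecretKeySpec(KEY, "AES"));
                return cipher.doFinal(data);
            } catch (Exception e) {
                throw new IOException("encryption failure", e);
            }
        }
    }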

How can I limit reducers to one-per-node?

2013-02-08 Thread David Parks
I have a cluster of boxes with 3 reducers per node. I want to limit a particular job to only run 1 reducer per node. This job is network IO bound, gathering images from a set of webservers. My job has certain parameters set to meet web politeness standards (e.g. limit connects and

RE: How can I limit reducers to one-per-node?

2013-02-08 Thread David Parks
that, you can modify mapred-site.xml to change it from 3 to 1 Best, -- Nan Zhu School of Computer Science, McGill University On Friday, 8 February, 2013 at 11:24 PM, David Parks wrote: Hmm, odd, I’m using AWS Mapreduce, and this property is already set to 1 on my cluster by default
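
For reference, the cluster-wide knob being described would look roughly like this in mapred-site.xml on each tasktracker; note it caps every job on that node, which is why the original question was about a per-job limit:

    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>1</value>
    </property>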

RE: Tricks to upgrading Sequence Files?

2013-01-29 Thread David Parks
- detecting if what they've read is of an older time and decoding it appropriately (while handling newer encoding separately, in the normal fashion). This would be much better than going down the classloader hack paths I think? On Tue, Jan 29, 2013 at 1:11 PM, David Parks davidpark...@yahoo.com wrote
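
A sketch of the version-tagging idea for the record object stored in the sequence file: write a version byte first and branch on it in readFields(). The class and field names are invented, and this only helps once the format carries a marker; records written before any marker existed still need a one-off conversion job.

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.Writable;

    // Version-aware record: new writes are tagged, reads branch on the tag.
    public class ItemRecord implements Writable {
        private static final byte CURRENT_VERSION = 2;

        private String sku = "";
        private String category = "";   // field added in version 2

        public void write(DataOutput out) throws IOException {
            out.writeByte(CURRENT_VERSION);
            Text.writeString(out, sku);
            Text.writeString(out, category);
        }

        public void readFields(DataInput in) throws IOException {
            byte version = in.readByte();
            sku = Text.readString(in);
            if (version >= 2) {
                category = Text.readString(in);  // absent in version-1 records
            } else {
                category = "";
            }
        }
    }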

Symbolic links available in 1.0.3?

2013-01-28 Thread David Parks
Is it possible to use symbolic links in 1.0.3? If yes: can I use symbolic links to create a single, final directory structure of files from many locations; then use DistCp/S3DistCp to copy that final directory structure to another filesystem such as S3? Usecase: I currently launch 4

Tricks to upgrading Sequence Files?

2013-01-28 Thread David Parks
Anyone have any good tricks for upgrading a sequence file? We maintain a sequence file like a flat file DB and the primary object in there changed in recent development. It's trivial to write a job to read in the sequence file, update the object, and write it back out in the new format.

RE: Skipping entire task

2013-01-05 Thread David Parks
Thinking here... if you submitted the task programmatically you should be able to capture the failure of the task and gracefully move past it to your next tasks. To say it in a long-winded way: Let's say you submit a job to Hadoop, a java jar, and your main class implements Tool. That code has
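
A minimal sketch of that driver-side pattern; `FlakyDownloadJob` and `FollowUpJob` are invented Tool implementations standing in for the real jobs, and the usual convention is that a Tool's run() returns non-zero when its job fails:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.ToolRunner;

    // Run the risky job from the driver, check its exit code instead of letting the
    // failure kill the whole pipeline, then carry on with the next stage.
    public class Pipeline {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // FlakyDownloadJob is a placeholder for your own Tool implementation.
            int rc = ToolRunner.run(conf, new FlakyDownloadJob(), args);
            if (rc != 0) {
                System.err.println("Download job failed (rc=" + rc + "); skipping it and moving on.");
            }

            // The next stage runs regardless of the previous result.
            ToolRunner.run(conf, new FollowUpJob(), args);
        }
    }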

RE: Fastest way to transfer files

2012-12-29 Thread David Parks
Here’s an example of running distcp (actually in this case s3distcp, but it’s about the same, just new DistCp()) from java: ToolRunner.run(getConf(), new S3DistCp(), new String[] { "--src", "/src/dir/", "--srcPattern", ".*(itemtable)-r-[0-9]*.*", "--dest",
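
A slightly fuller version of that call, assumed to sit inside a class that implements Tool (hence `getConf()`); the S3DistCp class comes from Amazon's s3distcp jar on the classpath, and the bucket and paths are placeholders:

    // Fragment of a Tool's run() method; S3DistCp is provided by Amazon's s3distcp jar.
    String[] distcpArgs = new String[] {
        "--src",        "hdfs:///src/dir/",
        "--srcPattern", ".*(itemtable)-r-[0-9]*.*",
        "--dest",       "s3n://my-bucket/output/"     // made-up destination bucket
    };
    int rc = ToolRunner.run(getConf(), new S3DistCp(), distcpArgs);
    if (rc != 0) {
        throw new RuntimeException("s3distcp failed with exit code " + rc);
    }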

How to troubleshoot OutOfMemoryError

2012-12-21 Thread David Parks
I'm pretty consistently seeing a few reduce tasks fail with OutOfMemoryError (below). It doesn't kill the job, but it slows it down. In my current case the reducer is pretty darn simple, the algorithm basically does: 1. Do you have 2 values for this key? 2. If so, build a json

How to submit Tool jobs programatically in parallel?

2012-12-13 Thread David Parks
I'm submitting unrelated jobs programmatically (using AWS EMR) so they run in parallel. I'd like to run an s3distcp job in parallel as well, but the interface to that job is a Tool, e.g. ToolRunner.run(...). ToolRunner blocks until the job completes though, so presumably I'd need to create a
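
One way around the blocking call, sketched below: hand ToolRunner.run() to a worker thread and join on the Future only when the result is needed. The class name and pool size are arbitrary:

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Run a blocking Tool (e.g. s3distcp) on an executor thread so other jobs can
    // be submitted and monitored in parallel.
    public class ParallelToolLauncher {
        public static Future<Integer> submit(final Configuration conf,
                                             final Tool tool,
                                             final String[] args,
                                             ExecutorService pool) {
            return pool.submit(new Callable<Integer>() {
                public Integer call() throws Exception {
                    return ToolRunner.run(conf, tool, args);   // blocks only this worker thread
                }
            });
        }

        public static void main(String[] args) throws Exception {
            ExecutorService pool = Executors.newFixedThreadPool(2);
            // Future<Integer> copy = submit(new Configuration(), new S3DistCp(), distcpArgs, pool);
            // ... submit and monitor the other jobs here ...
            // int rc = copy.get();   // wait for the copy only when the result is needed
            pool.shutdown();
        }
    }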

RE: Hadoop 101

2012-12-12 Thread David Parks
want to begin with. Thanks for saving me more fruitless searches. On Dec 11, 2012, at 10:04 PM, David Parks davidpark...@yahoo.com wrote: You use TextInputFormat, you'll get the following key<LongWritable>, value<Text> pairs in your mapper: <file_position, your_input> Example: 0, 0\t[356

Shuffle's getMapOutput() fails with EofException, followed by IllegalStateException

2012-12-12 Thread David Parks
I'm having exactly this problem, and it's causing my job to fail when I try to process a larger amount of data (I'm attempting to process 30GB of compressed CSVs and the entire job fails every time). This issue is open for it: https://issues.apache.org/jira/browse/MAPREDUCE-5 Anyone have any

RE: When reduce function is used as combiner?

2012-12-11 Thread David Parks
The map task may use a combiner 0+ times. Basically that means (as far as I understand), if the map output data is below some internal Hadoop threshold, it'll just send it to the reducer; if it's larger, it'll run it through the combiner first. And at Hadoop's discretion, it may run the

Can we declare some HDFS nodes primary

2012-12-11 Thread David Parks
Assume for a moment that you have a large cluster of 500 AWS spot instance servers running. And you want to keep the bid price low, so at some point it's likely that the whole cluster will get axed until the spot price comes down some. In order to maintain HDFS continuity I'd want say 10

RE: Hadoop 101

2012-12-11 Thread David Parks
You use TextInputFormat, you'll get the following key<LongWritable>, value<Text> pairs in your mapper: <file_position, your_input> Example: 0, 0\t[356:0.3481597,359:0.3481597,358:0.3481597,361:0.3481597,360:0.3481597] 100, 8\t[356:0.34786037,359:0.34786037,358:0.34786037,361:0.34786037,360:0.34786
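
A minimal mapper matching that description: with TextInputFormat the key is the byte offset of the line in the file and the value is the line itself. The tab-split and class name are just illustrative:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // With TextInputFormat: key = byte offset of the line, value = the line.
    public class OffsetLineMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // e.g. offset=0 for the first line, offset=100 if the first line was 100 bytes
            String[] parts = line.toString().split("\t", 2);   // the sample rows are tab-delimited
            if (parts.length == 2) {
                context.write(new Text(parts[0]), new Text(parts[1]));
            }
        }
    }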

RE: Hadoop Deployment usecases

2012-12-11 Thread David Parks
You're likely to find answers to your questions here, but you'll need specific questions and some rudimentary subject matter knowledge. I'd suggest starting off with a good book on Hadoop, you'll probably find a lot of your questions are answered in a casual afternoon of reading. I was pretty

RE: Map output copy failure

2012-12-11 Thread David Parks
I had the same problem yesterday, it sure does look to be dead on that issue. I found another forum discussion on AWS that suggested more memory as a stop-gap way to deal with it, or apply the patch. I checked the code on hadoop 1.0.3 (the version on AWS) and it didn't have the fix, so it looks

Map tasks processing some files multiple times

2012-12-05 Thread David Parks
I've got a job that reads in 167 files from S3, but 2 of the files are being mapped twice and 1 of the files is mapped 3 times. This is the code I use to set up the mapper: Path lsDir = new Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*"); for(FileStatus f :

RE: Map tasks processing some files multiple times

2012-12-05 Thread David Parks
@hadoop.apache.org Subject: Re: Map tasks processing some files multiple times Could it be due to spec-ex? Does it make a difference in the end? Raj _ From: David Parks davidpark...@yahoo.com To: user@hadoop.apache.org Sent: Wednesday, December 5, 2012 10:15 PM Subject: Map tasks

RE: [Bulk] Re: Failed To Start SecondaryNameNode in Secure Mode

2012-12-04 Thread David Parks
I'm curious about profiling, I see some documentation about it (1.0.3 on AWS), but the references to JobConf seem to be for the old api and I've got everything running on the new api. I've got a job to handle processing of about 30GB of compressed CSVs and it's taking over a day with 3

RE: Question on Key Grouping

2012-12-04 Thread David Parks
First rule to be wary of is your use of the combiner. The combiner *might* be run, it *might not* be run, and it *might be run multiple times*. The combiner is only for reducing the amount of data going to the reducer, and it will only be run *if and when* it's deemed likely to be useful by

Moving files

2012-11-24 Thread David Parks
I want to move a file in HDFS after a job using the Java API, I'm trying this command but I always get false (could not rename): Path from = new Path("hdfs://localhost/process-changes/itemtable-r-1"); Path to = new Path("hdfs://localhost/output/itemtable-r-1"); boolean
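
A self-contained version of that snippet for reference; rename() reports failure by returning false rather than throwing, and one plausible cause (an assumption here, not something confirmed in the thread) is that the destination's parent directory doesn't exist yet, so the sketch creates it first:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Move a reducer output file to another HDFS directory; paths are the ones from the post.
    public class MoveOutput {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path from = new Path("hdfs://localhost/process-changes/itemtable-r-1");
            Path to   = new Path("hdfs://localhost/output/itemtable-r-1");

            FileSystem fs = from.getFileSystem(conf);
            fs.mkdirs(to.getParent());                 // rename fails if the parent is absent
            boolean renamed = fs.rename(from, to);
            System.out.println("rename succeeded: " + renamed);
        }
    }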

XMLOutputFormat, anything in the works?

2012-11-19 Thread David Parks
Is there an XMLOutputFormat in existence somewhere? I need to output Solr XML change docs, I'm betting I'm not the first. David

RE: Cluster wide atomic operations

2012-10-29 Thread David Parks
identify which block of IDs to assign each one? Thanks, David From: Ted Dunning [mailto:tdunn...@maprtech.com] Sent: Monday, October 29, 2012 12:58 PM To: user@hadoop.apache.org Subject: Re: Cluster wide atomic operations On Sun, Oct 28, 2012 at 9:15 PM, David Parks davidpark

RE: Cluster wide atomic operations

2012-10-28 Thread David Parks
then you can cut the evilness of global atomicity by a substantial factor. Are you sure you need a global counter? On Fri, Oct 26, 2012 at 11:07 PM, David Parks davidpark...@yahoo.com wrote: How can we manage cluster-wide atomic operations? Such as maintaining an auto-increment counter. Does

Cluster wide atomic operations

2012-10-26 Thread David Parks
How can we manage cluster-wide atomic operations? Such as maintaining an auto-increment counter. Does Hadoop provide native support for these kinds of operations? And in case the ultimate answer involves ZooKeeper, I'd love to work out doing this in AWS/EMR.
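
In case ZooKeeper does turn out to be the answer, one common pattern (a sketch, not something from the thread) is to let a PERSISTENT_SEQUENTIAL znode's sequence number serve as the cluster-wide counter; the connection string and znode path are placeholders:

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    // ZooKeeper assigns a monotonically increasing suffix to sequential znodes,
    // which can stand in for a cluster-wide auto-increment counter.
    public class ZkIdGenerator {
        public static long nextId(ZooKeeper zk) throws Exception {
            String created = zk.create("/ids/id-", new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);
            // The path comes back as e.g. "/ids/id-0000000042"; the suffix is the counter.
            return Long.parseLong(created.substring(created.lastIndexOf('-') + 1));
        }

        public static void main(String[] args) throws Exception {
            ZooKeeper zk = new ZooKeeper("zk-host:2181", 30000, new Watcher() {
                public void process(WatchedEvent event) { /* no-op watcher */ }
            });
            System.out.println("next id: " + nextId(zk));   // assumes the /ids parent already exists
            zk.close();
        }
    }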

MultipleOutputs directed to two different locations

2012-10-25 Thread David Parks
I've got MultipleOutputs configured to generate 2 named outputs. I'd like to send one to s3n:// and one to hdfs:// Is this possible? One is a final summary report, the other is input to the next job. Thanks, David
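
The named-output wiring itself (new API) would look roughly like the sketch below; both named outputs land under the job's single output directory, and whether that base path can live on a different filesystem from the other output is exactly the open question here, so only the setup is shown. Names and paths are invented:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    // Register the two named outputs; both are written under the job's output path.
    public class TwoOutputsJobSetup {
        public static void configure(Job job) {
            FileOutputFormat.setOutputPath(job, new Path("hdfs:///jobs/next-input")); // placeholder
            MultipleOutputs.addNamedOutput(job, "summary", TextOutputFormat.class, Text.class, Text.class);
            MultipleOutputs.addNamedOutput(job, "nextjob", TextOutputFormat.class, Text.class, Text.class);
        }
    }

In the reducer you would then create a MultipleOutputs instance in setup(), write via `mos.write("summary", key, value)`, and close it in cleanup().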

How do map tasks get assigned efficiently?

2012-10-24 Thread David Parks
Even after reading O'Reilly's book on Hadoop I don't feel like I have a clear vision of how the map tasks get assigned. They depend on splits, right? But I have 3 jobs running. And splits will come from various sources: HDFS, S3, and slow HTTP sources. So I've got some concern as to how

RE: How do map tasks get assigned efficiently?

2012-10-24 Thread David Parks
which Data Node gets the job. HTH -Mike On Oct 24, 2012, at 1:10 AM, David Parks davidpark...@yahoo.com wrote: Even after reading O'reillys book on hadoop I don't feel like I have a clear vision of how the map tasks get assigned. They depend on splits right? But I have 3 jobs

RE: Large input files via HTTP

2012-10-23 Thread David Parks
are the constraints that you are working with? On Mon, Oct 22, 2012 at 5:59 PM, David Parks davidpark...@yahoo.com wrote: Would it make sense to write a map job that takes an unsplittable XML file (which defines all of the files I need to download); that one map job then kicks off the downloads

Large input files via HTTP

2012-10-22 Thread David Parks
I want to create a MapReduce job which reads many multi-gigabyte input files from various HTTP sources and processes them nightly. Is there a reasonably flexible way to do this in the Hadoop job itself? I expect the initial downloads to take many hours and I'd hope I can optimize the # of

Large input files via HTTP

2012-10-22 Thread David Parks
I want to create a MapReduce job which reads many multi-gigabyte input files from various HTTP sources and processes them nightly. Is there a reasonably flexible way to acquire the files in the Hadoop job itself? I expect the initial downloads to take many hours and I'd hope I can optimize the #

RE: Large input files via HTTP

2012-10-22 Thread David Parks
a list of files with tuples <host:port, filePath>. Then use a map-only job to pull each file using NLineInputFormat. Another way is to write a HttpInputFormat and HttpRecordReader and stream the data in a map-only job. On Mon, Oct 22, 2012 at 1:54 AM, David Parks davidpark...@yahoo.com wrote: I want
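
A sketch of the map-only variant described there: each input line carries one URL (NLineInputFormat hands the mapper the configured number of lines per split, one by default), and the mapper streams the body straight into HDFS. The URL format and target directory are made up:

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.net.URL;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-only downloader: one URL per input line, streamed into HDFS.
    public class HttpFetchMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable key, Text line, Context context)
                throws IOException, InterruptedException {
            URL url = new URL(line.toString().trim());                  // e.g. http://host:port/path/file.gz
            Path target = new Path("/staging/" + new Path(url.getPath()).getName());

            FileSystem fs = target.getFileSystem(context.getConfiguration());
            InputStream in = url.openStream();
            OutputStream out = fs.create(target);
            IOUtils.copyBytes(in, out, 64 * 1024, true);                // streams the body and closes both ends
            context.write(NullWritable.get(), new Text(url + " -> " + target));
        }
    }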