Recovering the namenode from failure

2013-05-21 Thread David Parks
I'm on CDH4, and trying to recover both the namenode and cloudera manager VMs from HDFS after losing the namenode. All of our backup VMs are on HDFS, so for the moment I just want to hack something together, copy the backup VMs off HDFS, and get on with properly reconfiguring via CDH Manager.

RE: About configuring cluster setup

2013-05-15 Thread David Parks
We have a box that's a bit overpowered for just running our namenode and jobtracker on a 10-node cluster and we also wanted to make use of the storage and processor resources of that node, like you. What we did is use LXC containers to segregate the different processes. LXC is a very light weig

RE: JobClient: Error reading task output - after instituting a DNS server

2013-05-15 Thread David Parks
2013 6:56 PM To: user@hadoop.apache.org Subject: Re: JobClient: Error reading task output - after instituting a DNS server Hi David, can you explain in a bit more detail what the issue was? Thanks. Shahab On Tue, May 14, 2013 at 2:29 AM, David Parks wrote: I just hate it when I fi

RE: JobClient: Error reading task output - after instituting a DNS server

2013-05-13 Thread David Parks
From: David Parks [mailto:davidpark...@yahoo.com] Sent: Tuesday, May 14, 2013 1:20 PM To: user@hadoop.apache.org Subject: JobClient: Error reading task output - after instituting a DNS server So we just configured a local DNS server for hostname resolution and stopped using a hosts file and now

RE: Access HDFS from OpenCL

2013-05-13 Thread David Parks
Hadoop just runs as a standard Java process; you should find something that bridges between OpenCL and Java. A quick Google search yields: http://www.jocl.org/ I expect that you'll find everything you need to accomplish the handoff from your mapreduce code to OpenCL there. As for HDFS, hado

Using FairScheduler to limit # of tasks

2013-05-13 Thread David Parks
Can I use the FairScheduler to limit the number of map/reduce tasks directly from the job configuration? E.g. I have one job that I know should run a more limited # of map/reduce tasks than the default; I want to configure a queue with a limited # of map/reduce tasks, but only apply it to t
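
For reference, a minimal sketch of the pool-per-job idea being asked about here, assuming the cluster runs the FairScheduler with mapred.fairscheduler.poolnameproperty set to pool.name and an allocations file that defines a pool "limited" with maxMaps/maxReduces caps; the property, pool name, and paths are illustrative, not taken from the thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Hedged sketch: route only this job into a capped FairScheduler pool.
    public class CappedPoolJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.set("pool.name", "limited");            // assumes poolnameproperty=pool.name
            Job job = new Job(conf, "capped-pool-job");
            job.setJarByClass(CappedPoolJob.class);
            // identity mapper/reducer; the real mapper/reducer classes would go here
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }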

600s timeout during copy phase of job

2013-05-12 Thread David Parks
I have a job that's getting 600s task timeouts during the copy phase of the reduce step. I see a lot of copy tasks all moving at about 2.5MB/sec, and it's taking longer than 10 min to do that copy. The process starts copying when the reduce step is 80% complete. This is a very IO bound task as

What's the best disk configuration for hadoop? SSD's Raid levels, etc?

2013-05-10 Thread David Parks
We've got a cluster of 10x 8-core/24GB nodes, currently with one 4TB disk (3 disk slots max); they chug away OK, only slightly IO bound on average. I'm going to upgrade the disk configuration at some point (we do need more space on HDFS) and I'm thinking about what's best hardware-wise:

RE: Uploading file to HDFS

2013-04-19 Thread David Parks
I just realized another trick you might try. The Hadoop dfs client can read input from STDIN; you could use netcat to pipe the stuff across to HDFS without hitting the hard drive. I haven’t tried it, but here’s what I would think might work: On the Hadoop box, open a listening port and feed

RE: Uploading file to HDFS

2013-04-19 Thread David Parks
I think the problem here is that he doesn't have Hadoop installed on this other location, so there's no Hadoop DFS client to do the put directly into HDFS with; he would normally copy the file to one of the nodes in the cluster where the client files are installed. I've had the same problem recently.

Mapreduce jobs to download job input from across the internet

2013-04-16 Thread David Parks
For a set of jobs to run I need to download about 100GB of data from the internet (~1000 files of varying sizes from ~10 different domains). Currently I do this in a simple linux script as it's easy to script FTP, curl, and the like. But it's a mess to maintain a separate server for that proces

RE: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

2013-03-30 Thread David Parks
on EC2 and amazon S3? Looking forward to hearing from you. Thanks Himanish On Fri, Mar 29, 2013 at 10:34 AM, David Parks wrote: CDH4 can be either 1.x or 2.x hadoop, are you using the 2.x line? I've used it primarily with 1.0.3, which is what AWS uses, so I presume that's wh

Re: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

2013-03-29 Thread David Parks
ed to get all the jars (else it could cause version mismatch errors for HDFS - NoSuchMethodError etc etc). Appreciate your help regarding this. - Himanish On Fri, Mar 29, 2013 at 1:41 AM, David Parks wrote: None of that complexity, they distrib

RE: Which hadoop installation should I use on ubuntu server?

2013-03-29 Thread David Parks
: Friday, March 29, 2013 3:21 PM To: user Subject: Re: Which hadoop installation should I use on ubuntu server? I recommend cloudera's CDH4 on ubuntu 12.04 LTS On Thu, Mar 28, 2013 at 7:07 AM, David Parks wrote: I’m moving off AWS MapReduce to our own cluster, I’m installing Hadoop on U

RE: Which hadoop installation should I use on ubuntu server?

2013-03-29 Thread David Parks
apache.org Subject: Re: Which hadoop installation should I use on ubuntu server? apache bigtop has builds done for ubuntu you can check them at jenkins mentioned on bigtop.apache.org On Thu, Mar 28, 2013 at 11:37 AM, David Parks wrote: I'm moving off AWS MapReduce to our o

RE: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

2013-03-28 Thread David Parks
r corporate LAN. Could you please provide some details on how I could use the s3distcp from amazon to transfer data from our on-premises hadoop to amazon s3. Wouldn't some kind of VPN be needed between the Amazon EMR instance and our on-premises hadoop instance? Did you mean use the jar from amaz

RE: Hadoop distcp from CDH4 to Amazon S3 - Improve Throughput

2013-03-28 Thread David Parks
Have you tried using s3distcp from amazon? I used it many times to transfer 1.5TB between S3 and Hadoop instances. The process took 45 min, well over the 10-min timeout period you're running into problems with. Dave From: Himanish Kushary [mailto:himan...@gmail.com] Sent: Thursday, March

Which hadoop installation should I use on ubuntu server?

2013-03-27 Thread David Parks
I'm moving off AWS MapReduce to our own cluster, I'm installing Hadoop on Ubuntu Server 12.10. I see a .deb installer and installed that, but it seems like files are all over the place `/usr/share/Hadoop`, `/etc/hadoop`, `/usr/bin/hadoop`. And the documentation is a bit harder to follow: ht

RE: For a new installation: use the BackupNode or the CheckPointNode?

2013-03-26 Thread David Parks
M, varun kumar wrote: Hope below link will be useful: http://hadoop.apache.org/docs/stable/hdfs_user_guide.html On Sat, Mar 23, 2013 at 12:29 PM, David Parks wrote: For a new installation of the current stable build (1.1.2), is there any reason to use the CheckPointNode over the BackupNode?

RE:

2013-03-25 Thread David Parks
Can I suggest an answer of "Yes, but you probably don't want to"? As a "typical user" of Hadoop you would not do this. Hadoop already chooses the best server to do the work based on the location of the data (a server that is available to do work and also has the data locally will generally be ass

RE: For a new installation: use the BackupNode or the CheckPointNode?

2013-03-23 Thread David Parks
http://hadoop.apache.org/docs/stable/hdfs_user_guide.html On Sat, Mar 23, 2013 at 12:29 PM, David Parks wrote: For a new installation of the current stable build (1.1.2), is there any reason to use the CheckPointNode over the BackupNode?

For a new installation: use the BackupNode or the CheckPointNode?

2013-03-23 Thread David Parks
For a new installation of the current stable build (1.1.2), is there any reason to use the CheckPointNode over the BackupNode? It seems that we need to choose one or the other, and from the docs it seems like the BackupNode is more efficient in its processes.

RE: On a small cluster can we double up namenode/master with tasktrackers?

2013-03-20 Thread David Parks
Good points all. The mapreduce jobs are, well... intensive. We've got a whole variety, but typically I see them use a lot of CPU, a lot of disk, and upon occasion a whole bunch of network bandwidth. Duh, right? :) The master node is mostly CPU intensive, right? We're using LXC to segregate (ps

On a small cluster can we double up namenode/master with tasktrackers?

2013-03-18 Thread David Parks
I want 20 servers, I got 7, so I want to make the most of the 7 I have. Each of the 7 servers has: 24GB of RAM, 4TB, and 8 cores. Would it be terribly unwise of me to run such a configuration: Server #1: NameNode + Master + TaskTracker (reduced slots); Server #

How "Alpha" is "alpha"?

2013-03-12 Thread David Parks
From the release page on hadoop's website: "This release, like previous releases in hadoop-2.x series is still considered alpha primarily since some of APIs aren't fully-baked and we expect some churn in future." How "alpha" is the 2.x line? We're moving off AWS (1.0.3) onto our own cluste

Re: Unexpected Hadoop behavior: map task re-running after reducer has been running

2013-03-11 Thread David Parks
-failures Task attempt_201303080219_0002_r_006026_0 failed to report status for 7201 seconds. Killing! attempt_201303080219_0002_r_006026_0: [Fatal Error] :1:1: Premature end of file. Too many fetch-failures Too many fetch-failures Too many fetch-failures From: David

Unexpected Hadoop behavior: map task re-running after reducer has been running

2013-03-11 Thread David Parks
I can't explain this behavior, can someone help me here: Kind % Complete Num Tasks Pending Running Complete Killed Failed/Killed Task Attempts map 100.00%23547 0 123546 0 247 / 0 reduce 62.40%13738 30 6232 0 336

How do _you_ document your hadoop jobs?

2013-02-25 Thread David Parks
We've taken to documenting our Hadoop jobs in a simple visual manner using PPT (attached example). I wonder how others document their jobs? We often add notes to the text section of the PPT slides as well.

RE: How can I limit reducers to one-per-node?

2013-02-10 Thread David Parks
e # of reducers needed?), but it doesn't affect scheduling of a set number of reduce tasks nor does a scheduler care currently if you add that step in or not. On Mon, Feb 11, 2013 at 7:59 AM, David Parks wrote: > I guess the FairScheduler is doing multiple assignments per heartbeat,

File does not exist on part-r-00000 file after reducer runs

2013-02-10 Thread David Parks
Are there any rules against writing results to Reducer.Context while in the cleanup() method? I’ve got a reducer that is downloading a few tens of millions of images from a set of URLs fed to it. To be efficient I run many connections in parallel, but limit connections per domain and frequ
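
Writing from cleanup() is allowed: the Context handed to cleanup() is the same object the reduce() calls get, so a reducer can flush buffered results there. A minimal sketch of that shape (the download logic itself is omitted and the names are made up):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sketch: buffer work during reduce() and emit whatever is left from cleanup().
    public class DownloadReducer extends Reducer<Text, Text, Text, Text> {
        private final List<Text> pending = new ArrayList<Text>();

        @Override
        protected void reduce(Text url, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            pending.add(new Text(url));   // queue the URL (actual download code omitted)
            if (pending.size() >= 100) {
                flush(context);           // emit a completed batch mid-stream
            }
        }

        @Override
        protected void cleanup(Context context) throws IOException, InterruptedException {
            flush(context);               // emit anything still buffered after the last reduce()
        }

        private void flush(Context context) throws IOException, InterruptedException {
            for (Text t : pending) {
                context.write(t, new Text("done"));
            }
            pending.clear();
        }
    }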

RE: Question related to Decompressor interface

2013-02-10 Thread David Parks
In the EncryptedWritableWrapper idea you would create an object that takes any Writable object as its parameter. Your EncryptedWritableWrapper would naturally implement Writable. When write(DataOutput out) is called on your object, create your own DataOutputStream which reads da
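
A rough sketch of that wrapper idea — a Writable that serializes the wrapped object to a buffer, encrypts the buffer on write(), and reverses the process in readFields(). The AES setup and key handling here are illustrative assumptions, not something from the thread:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.DataInput;
    import java.io.DataInputStream;
    import java.io.DataOutput;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import javax.crypto.Cipher;
    import javax.crypto.spec.SecretKeySpec;
    import org.apache.hadoop.io.Writable;

    // Sketch: wraps any Writable and encrypts its serialized bytes.
    public class EncryptedWritableWrapper implements Writable {
        private final Writable wrapped;
        private final SecretKeySpec key;

        public EncryptedWritableWrapper(Writable wrapped, byte[] rawKey) {
            this.wrapped = wrapped;
            this.key = new SecretKeySpec(rawKey, "AES");
        }

        public void write(DataOutput out) throws IOException {
            try {
                ByteArrayOutputStream buf = new ByteArrayOutputStream();
                wrapped.write(new DataOutputStream(buf));        // serialize the wrapped object
                Cipher c = Cipher.getInstance("AES");
                c.init(Cipher.ENCRYPT_MODE, key);
                byte[] enc = c.doFinal(buf.toByteArray());       // encrypt the raw bytes
                out.writeInt(enc.length);
                out.write(enc);
            } catch (Exception e) {
                throw new IOException("encryption failed", e);
            }
        }

        public void readFields(DataInput in) throws IOException {
            try {
                byte[] enc = new byte[in.readInt()];
                in.readFully(enc);
                Cipher c = Cipher.getInstance("AES");
                c.init(Cipher.DECRYPT_MODE, key);
                byte[] plain = c.doFinal(enc);                   // decrypt back to plain bytes
                wrapped.readFields(new DataInputStream(new ByteArrayInputStream(plain)));
            } catch (Exception e) {
                throw new IOException("decryption failed", e);
            }
        }

        public Writable get() { return wrapped; }
    }

One caveat worth noting: Hadoop usually instantiates Writables reflectively, so a production version would also need a no-arg constructor and a way to pick the key up from the Configuration.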

RE: How can I limit reducers to one-per-node?

2013-02-10 Thread David Parks
ughly). On Sat, Feb 9, 2013 at 9:24 AM, David Parks wrote: I have a cluster of boxes with 3 reducers per node. I want to limit a particular job to only run 1 reducer per node. This job is network IO bound, gathering images from a set of webservers. My job has certain parameters set to

RE: Question related to Decompressor interface

2013-02-09 Thread David Parks
I can't answer your question about the Decompressor interface, but I have a query for you. Why not just create an EncryptedWritable object? Encrypt/decrypt the bytes in the read/write methods; that should be darn near trivial. Then stick with good ol' SequenceFile, which, as you note, is splitta

RE: How can I limit reducers to one-per-node?

2013-02-08 Thread David Parks
llows you to do that, you can modify mapred-site.xml to change it from 3 to 1 Best, -- Nan Zhu School of Computer Science, McGill University On Friday, 8 February, 2013 at 11:24 PM, David Parks wrote: Hmm, odd, I’m using AWS Mapreduce, and this property is already set to 1 on my clu

RE: How can I limit reducers to one-per-node?

2013-02-08 Thread David Parks
@hadoop.apache.org Subject: Re: How can I limit reducers to one-per-node? I think set tasktracker.reduce.tasks.maximum to be 1 may meet your requirement Best, -- Nan Zhu School of Computer Science, McGill University On Friday, 8 February, 2013 at 10:54 PM, David Parks wrote: I have

How can I limit reducers to one-per-node?

2013-02-08 Thread David Parks
I have a cluster of boxes with 3 reducers per node. I want to limit a particular job to only run 1 reducer per node. This job is network IO bound, gathering images from a set of webservers. My job has certain parameters set to meet "web politeness" standards (e.g. limit connects and connect

RE: Tricks to upgrading Sequence Files?

2013-01-29 Thread David Parks
ersioning perhaps - detecting if what they've read is of an older time and decoding it appropriately (while handling newer encoding separately, in the normal fashion). This would be much better than going down the classloader hack paths I think? On Tue, Jan 29, 2013 at 1:11 PM, David Parks wr

Tricks to upgrading Sequence Files?

2013-01-28 Thread David Parks
Anyone have any good tricks for upgrading a sequence file. We maintain a sequence file like a flat file DB and the primary object in there changed in recent development. It's trivial to write a job to read in the sequence file, update the object, and write it back out in the new format.

Symbolic links available in 1.0.3?

2013-01-28 Thread David Parks
Is it possible to use symbolic links in 1.0.3? If yes: can I use symbolic links to create a single, final directory structure of files from many locations; then use DistCp/S3DistCp to copy that final directory structure to another filesystem such as S3? Usecase: I currently launch 4 S3Dist

RE: Skipping entire task

2013-01-05 Thread David Parks
Thinking here... if you submitted the task programmatically you should be able to capture the failure of the task and gracefully move past it to your next tasks. To say it in a long-winded way: Let's say you submit a job to Hadoop, a java jar, and your main class implements Tool. That code has th
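
The gist of that, as a sketch: drive the jobs from your own Tool and check the boolean returned by waitForCompletion() instead of letting a failure abort the run. Class, job, and path names below are made up for illustration:

    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    // Sketch: if the optional step fails, note it and keep going with the rest of the pipeline.
    public class SkippableStepDriver extends Configured implements Tool {
        public int run(String[] args) throws Exception {
            Job optional = new Job(getConf(), "optional-step");
            optional.setJarByClass(SkippableStepDriver.class);
            FileInputFormat.addInputPath(optional, new Path(args[0]));
            FileOutputFormat.setOutputPath(optional, new Path(args[1]));
            boolean ok = optional.waitForCompletion(true);   // false means the job failed
            if (!ok) {
                System.err.println("optional-step failed; skipping it and continuing");
            }
            // submit the next job(s) here regardless of 'ok'
            return 0;
        }

        public static void main(String[] args) throws Exception {
            System.exit(ToolRunner.run(new SkippableStepDriver(), args));
        }
    }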

RE: Fastest way to transfer files

2012-12-29 Thread David Parks
Here’s an example of running distcp (actually in this case s3distcp, but it’s about the same, just new DistCp()) from java: ToolRunner.run(getConf(), new S3DistCp(), new String[] { "--src", "/src/dir/", "--srcPattern", ".*(itemtable)-r-[0-9]*.*", "--des
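
For comparison, the same pattern with the stock DistCp that ships with Hadoop 1.x (S3DistCp is Amazon's class and takes the --src/--srcPattern style flags shown above, while plain DistCp just takes source and destination URIs). The URIs below are placeholders, and note that ToolRunner.run() blocks until the copy finishes:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.tools.DistCp;
    import org.apache.hadoop.util.ToolRunner;

    // Sketch: run the built-in DistCp programmatically rather than from the shell.
    public class DistCpLauncher {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            int rc = ToolRunner.run(conf, new DistCp(conf),
                    new String[] { "hdfs:///src/dir/", "s3n://my-bucket/dest/" });  // placeholder URIs
            System.exit(rc);
        }
    }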

RE: What does mapred.map.tasksperslot do?

2012-12-27 Thread David Parks
mapred.map.tasksperslot do? David, Could you please tell what version of Hadoop you are using? I don't see this parameter in the stable (1.x) or current branch. I only see references to it with respect to EMR and with Hadoop 0.18 or so. On Thu, Dec 27, 2012 at 1:51 PM, David Parks

What does mapred.map.tasksperslot do?

2012-12-27 Thread David Parks
I didn't come up with much in a Google search. In particular, what are the side effects of changing this setting? Memory? Sort process? I'm guessing it means that it'll feed 2 map tasks to each map slot; a map task, in turn, is a self-contained JVM which consumes one map slot. Th

How to troubleshoot OutOfMemoryError

2012-12-21 Thread David Parks
I'm pretty consistently seeing a few reduce tasks fail with OutOfMemoryError (below). It doesn't kill the job, but it slows it down. In my current case the reducer is pretty darn simple, the algorithm basically does: 1. Do you have 2 values for this key? 2. If so, build a json str

OutOfMemory in ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory

2012-12-16 Thread David Parks
I've got 15 boxes in a cluster, 7.5GB of ram each on AWS (m1.large), 1 reducer per node. I'm seeing this exception sometimes. It's not stopping the job from completing, it's just failing 3 or 4 reduce tasks and slowing things down: Error: java.lang.OutOfMemoryError: Java heap space

RE: How to submit Tool jobs programatically in parallel?

2012-12-13 Thread David Parks
Client(job); jc.submitJob(job); Cheers! Manoj. On Fri, Dec 14, 2012 at 10:09 AM, David Parks wrote: I'm submitting unrelated jobs programmatically (using AWS EMR) so they run in parallel. I'd like to run an s3distcp job in parallel as well, but the interface to tha

How to submit Tool jobs programatically in parallel?

2012-12-13 Thread David Parks
I'm submitting unrelated jobs programmatically (using AWS EMR) so they run in parallel. I'd like to run an s3distcp job in parallel as well, but the interface to that job is a Tool, e.g. ToolRunner.run(...). ToolRunner blocks until the job completes though, so presumably I'd need to create a thre
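
One way around the blocking, sketched below: run the ToolRunner.run() call on its own thread (an ExecutorService here) so the copy proceeds while the main thread submits the other jobs. DistCp stands in for s3distcp since the latter lives in Amazon's jar; the paths are placeholders:

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.tools.DistCp;
    import org.apache.hadoop.util.ToolRunner;

    // Sketch: a blocking Tool runs on a background thread alongside other job submissions.
    public class ParallelToolRunner {
        public static void main(String[] args) throws Exception {
            final Configuration conf = new Configuration();
            ExecutorService pool = Executors.newSingleThreadExecutor();

            Future<Integer> copyResult = pool.submit(new Callable<Integer>() {
                public Integer call() throws Exception {
                    // blocks this worker thread only, not the main thread
                    return ToolRunner.run(conf, new DistCp(conf),
                            new String[] { "hdfs:///src/", "hdfs:///dest/" });
                }
            });

            // submit the other, unrelated jobs here while the copy runs

            int rc = copyResult.get();   // finally wait for the copy to finish
            pool.shutdown();
            System.exit(rc);
        }
    }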

RE: Shuffle's getMapOutput() fails with EofException, followed by IllegalStateException

2012-12-13 Thread David Parks
h the filesystem directly and was leaving a connection to it open which was hanging the map tasks (without error) that used that code. -Original Message- From: David Parks [mailto:davidpark...@yahoo.com] Sent: Thursday, December 13, 2012 11:23 AM To: user@hadoop.apache.org Subject: Shuf

Shuffle's getMapOutput() fails with EofException, followed by IllegalStateException

2012-12-12 Thread David Parks
I'm having exactly this problem, and it's causing my job to fail when I try to process a larger amount of data (I'm attempting to process 30GB of compressed CSVs and the entire job fails every time). This issue is open for it: https://issues.apache.org/jira/browse/MAPREDUCE-5 Anyone have any ide

RE: Hadoop 101

2012-12-12 Thread David Parks
alls. They aren't quite what I want to begin with. Thanks for saving me more fruitless searches. On Dec 11, 2012, at 10:04 PM, David Parks wrote: You use TextInputFormat, you'll get the following key, value pairs in your mapper: file_position, your_input Example: 0, "0\t[

RE: Map output copy failure

2012-12-11 Thread David Parks
I had the same problem yesterday, it sure does look to be dead on that issue. I found another forum discussion on AWS that suggested more memory as a stop-gap way to deal with it, or apply the patch. I checked the code on hadoop 1.0.3 (the version on AWS) and it didn't have the fix, so it looks lik

RE: Hadoop Deployment usecases

2012-12-11 Thread David Parks
You're likely to find answers to your questions here, but you'll need specific questions and some rudimentary subject matter knowledge. I'd suggest starting off with a good book on Hadoop, you'll probably find a lot of your questions are answered in a casual afternoon of reading. I was pretty happy

RE: Hadoop 101

2012-12-11 Thread David Parks
You use TextInputFormat, you'll get the following key, value pairs in your mapper: file_position, your_input Example: 0, "0\t[356:0.3481597,359:0.3481597,358:0.3481597,361:0.3481597,360:0.3481597]" 100, "8\t[356:0.34786037,359:0.34786037,358:0.34786037,361:0.34786037,360:0.34786 037]" 200, "25\t[
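
In mapper terms, that means the key is just the byte offset of each line (LongWritable) and the value is the line's text. A bare-bones illustration:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // With TextInputFormat: key = byte offset of the line in the file, value = the line itself.
    public class OffsetEchoMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable byteOffset, Text line, Context context)
                throws IOException, InterruptedException {
            // e.g. offset 0 for the first line, 100 for a line starting at byte 100, and so on
            context.write(byteOffset, line);
        }
    }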

Can we declare some HDFS nodes "primary"

2012-12-11 Thread David Parks
Assume for a moment that you have a large cluster of 500 AWS spot instance servers running. And you want to keep the bid price low, so at some point it's likely that the whole cluster will get axed until the spot price comes down some. In order to maintain HDFS continuity I'd want say 10 server

RE: When reduce function is used as combiner?

2012-12-11 Thread David Parks
The map task may use a combiner 0+ times. Basically that means (as far as I understand) that if the map output data is below some internal hadoop threshold, it'll just send it to the reducer; if it's larger, then it'll run it through the combiner first. And at Hadoop's discretion, it may run the combiner

RE: Map tasks processing some files multiple times

2012-12-06 Thread David Parks
n the reason for using MultipleInputs ? On Thu, Dec 6, 2012 at 2:59 PM, David Parks wrote: Figured it out, it is, as usual, with my code. I had wrapped TextInputFormat to replace the LongWritable key with a key representing the file name. It was a bit tricky to do because of changing the generics
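
A sketch of the kind of wrapper being described — a TextInputFormat-style input format whose key is the file name instead of the byte offset. This is illustrative, not the code from the thread (whose bug was in the generics of exactly such a wrapper):

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    // Sketch: like TextInputFormat, but the key is the file name rather than the offset.
    public class FileNameTextInputFormat extends FileInputFormat<Text, Text> {

        @Override
        public RecordReader<Text, Text> createRecordReader(InputSplit split, TaskAttemptContext ctx) {
            return new FileNameRecordReader();
        }

        public static class FileNameRecordReader extends RecordReader<Text, Text> {
            private final LineRecordReader lines = new LineRecordReader();  // delegate does the real work
            private Text fileName;

            @Override
            public void initialize(InputSplit split, TaskAttemptContext ctx)
                    throws IOException, InterruptedException {
                fileName = new Text(((FileSplit) split).getPath().getName());
                lines.initialize(split, ctx);
            }

            @Override
            public boolean nextKeyValue() throws IOException { return lines.nextKeyValue(); }

            @Override
            public Text getCurrentKey() { return fileName; }

            @Override
            public Text getCurrentValue() { return lines.getCurrentValue(); }

            @Override
            public float getProgress() throws IOException { return lines.getProgress(); }

            @Override
            public void close() throws IOException { lines.close(); }
        }
    }

If all that's needed is the file name, a simpler route is to keep plain TextInputFormat and read ((FileSplit) context.getInputSplit()).getPath().getName() inside the mapper.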

RE: Map tasks processing some files multiple times

2012-12-06 Thread David Parks
adding the same input format for all files, do you need MultipleInputs ? Thanks Hemanth On Thu, Dec 6, 2012 at 1:06 PM, David Parks wrote: I believe I just tracked down the problem, maybe you can help confirm if you're familiar with this. I see that FileInputFormat is specifying

RE: Map tasks processing some files multiple times

2012-12-05 Thread David Parks
M To: user@hadoop.apache.org Subject: Re: Map tasks processing some files multiple times Could it be due to spec-ex? Does it make a difference in the end? Raj _ From: David Parks To: user@hadoop.apache.org Sent: Wednesday, December 5, 2012 10:15 PM Subject: Map tasks processing

Map tasks processing some files multiple times

2012-12-05 Thread David Parks
I've got a job that reads in 167 files from S3, but 2 of the files are being mapped twice and 1 of the files is mapped 3 times. This is the code I use to set up the mapper: Path lsDir = new Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*"); for(FileStatus f :

RE: Question on Key Grouping

2012-12-04 Thread David Parks
First rule to be wary of is your use of the combiner. The combiner *might* be run, it *might not* be run, and it *might be run multiple times*. The combiner is only for reducing the amount of data going to the reducer, and it will only be run *if and when* it's deemed likely to be useful by Hadoop.
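
The safe pattern is to make the combiner a reduction whose result doesn't change however many times it's applied to partial groups — a sum being the classic case. A standard word-count wiring, shown only to illustrate where the combiner plugs in (it uses the stock helper classes; the paths come from args):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

    // Sketch: a sum combiner is safe because combining partial sums zero, one,
    // or many times still yields the same totals at the reducer.
    public class WordCountWithCombiner {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "wordcount-with-combiner");
            job.setJarByClass(WordCountWithCombiner.class);
            job.setMapperClass(TokenCounterMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // Hadoop may skip it or run it repeatedly
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }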

RE: [Bulk] Re: Failed To Start SecondaryNameNode in Secure Mode

2012-12-04 Thread David Parks
I'm curious about profiling, I see some documentation about it (1.0.3 on AWS), but the references to JobConf seem to be for the "old api" and I've got everything running on the "new api". I've got a job to handle processing of about 30GB of compressed CSVs and it's taking over a day with 3 m1.m

Moving files

2012-11-24 Thread David Parks
I want to move a file in HDFS after a job using the Java API, I'm trying this command but I always get false (could not rename): Path from = new Path("hdfs://localhost/process-changes/itemtable-r-1"); Path to = new Path("hdfs://localhost/output/itemtable-r-1"); bool
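
For reference, the usual way to do that move with the Java API: get the FileSystem for the path's URI and call rename(). rename() returns false rather than throwing, most often because the destination's parent directory doesn't exist or the source path is wrong; the paths below are the ones from the snippet:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: move (rename) a reducer output file within HDFS.
    public class MoveOutputFile {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // must carry the cluster's fs.default.name
            Path from = new Path("hdfs://localhost/process-changes/itemtable-r-1");
            Path to   = new Path("hdfs://localhost/output/itemtable-r-1");

            FileSystem fs = FileSystem.get(from.toUri(), conf);
            fs.mkdirs(to.getParent());                  // rename() returns false if the parent is missing
            boolean moved = fs.rename(from, to);
            System.out.println("moved = " + moved);
        }
    }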

XMLOutputFormat, anything in the works?

2012-11-19 Thread David Parks
Is there an XMLOutputFormat in existence somewhere? I need to output Solr XML change docs, I'm betting I'm not the first. David

RE: Cluster wide atomic operations

2012-10-28 Thread David Parks
ch that I could identify which block of IDs to assign each one? Thanks, David From: Ted Dunning [mailto:tdunn...@maprtech.com] Sent: Monday, October 29, 2012 12:58 PM To: user@hadoop.apache.org Subject: Re: Cluster wide atomic operations On Sun, Oct 28, 2012 at 9:15 PM, David P

RE: Cluster wide atomic operations

2012-10-28 Thread David Parks
f you can batch these operations up, then you can cut the evilness of global atomicity by a substantial factor. Are you sure you need a global counter? On Fri, Oct 26, 2012 at 11:07 PM, David Parks wrote: How can we manage cluster-wide atomic operations? Such as maintaining an auto-increment co

Cluster wide atomic operations

2012-10-26 Thread David Parks
How can we manage cluster-wide atomic operations? Such as maintaining an auto-increment counter. Does Hadoop provide native support for these kinds of operations? And in case the ultimate answer involves zookeeper, I'd love to work out doing this in AWS/EMR.

MultipleOutputs directed to two different locations

2012-10-25 Thread David Parks
I've got MultipleOutputs configured to generate 2 named outputs. I'd like to send one to s3n:// and one to hdfs:// Is this possible? One is a final summary report, the other is input to the next job. Thanks, David
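
The named-output wiring itself looks roughly like the sketch below (assuming a release that ships the new-API org.apache.hadoop.mapreduce.lib.output.MultipleOutputs); whether one of the outputs can be pointed at a different filesystem such as s3n:// is exactly the open question of this thread. Output names and classes are illustrative:

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    // Sketch: two named outputs, one for the summary report and one for the next job's input.
    public class TwoOutputsExample {

        // Driver-side setup, called while configuring the Job:
        public static void configure(Job job) {
            MultipleOutputs.addNamedOutput(job, "summary", TextOutputFormat.class, Text.class, Text.class);
            MultipleOutputs.addNamedOutput(job, "nextjob", SequenceFileOutputFormat.class, Text.class, Text.class);
        }

        // Reducer-side usage:
        public static class SplitReducer extends Reducer<Text, Text, Text, Text> {
            private MultipleOutputs<Text, Text> mos;

            @Override
            protected void setup(Context context) {
                mos = new MultipleOutputs<Text, Text>(context);
            }

            @Override
            protected void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                for (Text v : values) {
                    mos.write("nextjob", key, v);                      // feeds the next job
                }
                mos.write("summary", key, new Text("summary-line"));   // final report line
            }

            @Override
            protected void cleanup(Context context) throws IOException, InterruptedException {
                mos.close();   // flush the extra outputs
            }
        }
    }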

RE: How do map tasks get assigned efficiently?

2012-10-24 Thread David Parks
ocal so it doesn't matter which Data Node gets the job. HTH -Mike On Oct 24, 2012, at 1:10 AM, David Parks wrote: Even after reading O'reillys book on hadoop I don't feel like I have a clear vision of how the map tasks get assigned. They depend on splits right?

How do map tasks get assigned efficiently?

2012-10-23 Thread David Parks
Even after reading O'Reilly's book on Hadoop I don't feel like I have a clear vision of how the map tasks get assigned. They depend on splits, right? But I have 3 jobs running. And splits will come from various sources: HDFS, S3, and slow HTTP sources. So I've got some concern as to how t

RE: Large input files via HTTP

2012-10-23 Thread David Parks
What are the constraints that you are working with? On Mon, Oct 22, 2012 at 5:59 PM, David Parks wrote: Would it make sense to write a map job that takes an unsplittable XML file (which defines all of the files I need to download); that one map job then kicks off the downloads in multiple thr

RE: Large input files via HTTP

2012-10-22 Thread David Parks
list of files with tuples. Then use a map-only job to pull each file using NLineInputFormat. Another way is to write an HttpInputFormat and HttpRecordReader and stream the data in a map-only job. On Mon, Oct 22, 2012 at 1:54 AM, David Parks wrote: I want to create a MapReduce job which reads many
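
A compressed sketch of the first suggestion above — a map-only job over a text file of URLs, using the new-API NLineInputFormat so each map task gets a small batch of lines to fetch. The /downloads/ layout, batch size, and naming scheme are placeholders:

    import java.io.BufferedInputStream;
    import java.io.IOException;
    import java.net.URL;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.NullOutputFormat;

    // Sketch: each map task receives a few URL lines and streams each one into HDFS.
    public class HttpFetchJob {

        public static class FetchMapper extends Mapper<LongWritable, Text, NullWritable, NullWritable> {
            @Override
            protected void map(LongWritable offset, Text urlLine, Context context)
                    throws IOException, InterruptedException {
                String url = urlLine.toString().trim();
                Path dest = new Path("/downloads/" + url.replaceAll("[^A-Za-z0-9._-]", "_"));
                FileSystem fs = dest.getFileSystem(context.getConfiguration());
                BufferedInputStream in = new BufferedInputStream(new URL(url).openStream());
                FSDataOutputStream out = fs.create(dest, true);
                byte[] buf = new byte[64 * 1024];
                for (int n; (n = in.read(buf)) != -1; ) {
                    out.write(buf, 0, n);
                    context.progress();   // keep the task alive during a long download
                }
                out.close();
                in.close();
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "http-fetch");
            job.setJarByClass(HttpFetchJob.class);
            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.addInputPath(job, new Path(args[0]));   // text file listing the URLs
            NLineInputFormat.setNumLinesPerSplit(job, 10);           // URLs handed to each map task
            job.setMapperClass(FetchMapper.class);
            job.setNumReduceTasks(0);                                // map-only
            job.setOutputFormatClass(NullOutputFormat.class);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }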

Large input files via HTTP

2012-10-22 Thread David Parks
I want to create a MapReduce job which reads many multi-gigabyte input files from various HTTP sources & processes them nightly. Is there a reasonably flexible way to acquire the files in the Hadoop job itself? I expect the initial downloads to take many hours and I'd hope I can optimize the #

Large input files via HTTP

2012-10-22 Thread David Parks
I want to create a MapReduce job which reads many multi-gigabyte input files from various HTTP sources & processes them nightly. Is there a reasonably flexible way to do this in the Hadoop job itself? I expect the initial downloads to take many hours and I'd hope I can optimize the # of connecti