I'm on CDH4, and trying to recover both the namenode and cloudera manager
VMs from HDFS after losing the namenode.
All of our backup VMs are on HDFS, so for the moment I just want to hack
something together, copy the backup VMs off HDFS and get on with properly
reconfiguring via CDH Manager.
To: user@hadoop.apache.org
Subject: Re: JobClient: Error reading task output - after instituting a DNS
server
Hi David, can you explain in a bit more detail what the issue was? Thanks.
Shahab
On Tue, May 14, 2013 at 2:29 AM, David Parks davidpark...@yahoo.com wrote:
I just hate
We have a box that's a bit overpowered for just running our namenode and
jobtracker on a 10-node cluster and we also wanted to make use of the
storage and processor resources of that node, like you.
What we did is use LXC containers to segregate the different processes. LXC
is a very light
From: David Parks [mailto:davidpark...@yahoo.com]
Sent: Tuesday, May 14, 2013 1:20 PM
To: user@hadoop.apache.org
Subject: JobClient: Error reading task output - after instituting a DNS
server
So we just configured a local DNS server for hostname resolution and stopped
using a hosts file and now jobs
I have a job that's getting 600s task timeouts during the copy phase of the
reduce step. I see a lot of copy tasks all moving at about 2.5MB/sec, and
it's taking longer than 10 min to do that copy.
The process starts copying when the reduce step is 80% complete. This is a
very IO bound task as
Can I use the FairScheduler to limit the number of map/reduce tasks directly
from the job configuration? E.g. I have 1 job that I know should run a more
limited # of map/reduce tasks than is set as the default, I want to
configure a queue with a limited # of map/reduce tasks, but only apply it to
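For what it's worth, the way I'd expect this to work with the MR1 FairScheduler (the element and property names below are from the Hadoop 1.x fair scheduler docs as I remember them, so verify against your version) is to define a pool with task caps in the allocation file:

```xml
<!-- fair-scheduler.xml allocation file: cap this pool's concurrent tasks -->
<allocations>
  <pool name="limited">
    <maxMaps>10</maxMaps>
    <maxReduces>5</maxReduces>
  </pool>
</allocations>
```

and then point only the one job at that pool via its configuration, e.g. conf.set("mapred.fairscheduler.pool", "limited"); other jobs keep falling through to their default pool.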
Hadoop just runs as a standard java process, you should find something that
bridges between OpenCL and java, a quick google search yields:
http://www.jocl.org/
I expect that you'll find everything you need to accomplish the handoff from
your mapreduce code to OpenCL there.
As for HDFS,
We've got a cluster of 10x 8core/24gb nodes, currently with 1 4TB disk (3
disk slots max), they chug away ok currently, only slightly IO bound on
average.
I'm going to upgrade the disk configuration at some point (we do need more
space on HDFS) and I'm thinking about what's best hardware-wise:
I think the problem here is that he doesn't have Hadoop installed on this
other location, so there's no Hadoop DFS client to do the put directly into
HDFS; he would normally copy the file to one of the nodes in the cluster
where the client files are installed. I've had the same problem recently.
I just realized another trick you might try. The Hadoop dfs client can
read input from STDIN, so you could use netcat to pipe the data across to
HDFS without hitting the hard drive. I haven't tried it, but here's what I
would think might work:
On the Hadoop box, open a listening port and feed
For a set of jobs to run I need to download about 100GB of data from the
internet (~1000 files of varying sizes from ~10 different domains).
Currently I do this in a simple linux script as it's easy to script FTP,
curl, and the like. But it's a mess to maintain a separate server for that?
Looking forward to hear from you.
Thanks
Himanish
On Fri, Mar 29, 2013 at 10:34 AM, David Parks davidpark...@yahoo.com
wrote:
CDH4 can be either 1.x or 2.x Hadoop; are you using the 2.x line? I've used
it primarily with 1.0.3, which is what AWS uses, so I presume that's what
it's
Subject: Re: Which hadoop installation should I use on ubuntu server?
apache bigtop has builds done for ubuntu
you can check them at jenkins mentioned on bigtop.apache.org
On Thu, Mar 28, 2013 at 11:37 AM, David Parks davidpark...@yahoo.com
wrote:
I'm moving off AWS MapReduce to our own
I'm moving off AWS MapReduce to our own cluster, I'm installing Hadoop on
Ubuntu Server 12.10.
I see a .deb installer and installed that, but it seems like files are all
over the place: `/usr/share/hadoop`, `/etc/hadoop`, `/usr/bin/hadoop`. And
the documentation is a bit harder to follow:
. Wouldn't some
kind of VPN be needed between the Amazon EMR instance and our on-premises
hadoop instance ? Did you mean use the jar from amazon on our local server ?
Thanks
On Thu, Mar 28, 2013 at 3:56 AM, David Parks davidpark...@yahoo.com wrote:
Have you tried using s3distcp from amazon? I used
below link will be useful..
http://hadoop.apache.org/docs/stable/hdfs_user_guide.html
On Sat, Mar 23, 2013 at 12:29 PM, David Parks davidpark...@yahoo.com
wrote:
For a new installation of the current stable build (1.1.2), is there
any reason to use the CheckPointNode over the BackupNode
Can I suggest an answer of Yes, but you probably don't want to?
As a typical user of Hadoop you would not do this. Hadoop already chooses
the best server to do the work based on the location of the data (a server
that is available to do work and also has the data locally will generally be
For a new installation of the current stable build (1.1.2), is there any
reason to use the CheckPointNode over the BackupNode?
It seems that we need to choose one or the other, and from the docs it seems
like the BackupNode is more efficient in its processes.
Good points all,
The mapreduce jobs are, well, intensive. We've got a whole variety, but
typically I see them use a lot of CPU, a lot of disk, and on occasion a
whole bunch of network bandwidth. Duh, right? :)
The master node is mostly CPU intensive right? We're using LXC to segregate
I want 20 servers, I got 7, so I want to make the most of the 7 I have. Each
of the 7 servers have: 24GB of ram, 4TB, and 8 cores.
Would it be terribly unwise of me to run such a configuration:
. Server #1: NameNode + Master + TaskTracker(reduced
slots)
. Server
From the release page on hadoop's website:
This release, like previous releases in hadoop-2.x series is still
considered alpha primarily since some of APIs aren't fully-baked and we
expect some churn in future.
How alpha is the 2.x line? We're moving off AWS (1.0.3) onto our own
cluster of
From: David Parks davidpark...@yahoo.com
To: user@hadoop.apache.org
Sent: Monday, March 11, 2013 3:23 PM
Subject: Unexpected Hadoop behavior: map task re-running after reducer has been
running
I can’t explain this behavior, can someone help me here:
Kind % Complete Num Tasks Pending Running
We've taken to documenting our Hadoop jobs in a simple visual manner using
PPT (attached example). I wonder how others document their jobs?
We often add notes to the text section of the PPT slides as well.
(roughly).
On Sat, Feb 9, 2013 at 9:24 AM, David Parks davidpark...@yahoo.com wrote:
I have a cluster of boxes with 3 reducers per node. I want to limit a
particular job to only run 1 reducer per node.
This job is network IO bound, gathering images from a set of webservers.
My job has certain
Are there any rules against writing results to Reducer.Context while in the
cleanup() method?
I've got a reducer that is downloading a few tens of millions of images
from a set of URLs fed to it.
To be efficient I run many connections in parallel, but limit connections
per domain and
, reduce # of reducers needed?), but it doesn't affect
scheduling of a set number of reduce tasks nor does a scheduler care
currently if you add that step in or not.
On Mon, Feb 11, 2013 at 7:59 AM, David Parks davidpark...@yahoo.com wrote:
I guess the FairScheduler is doing multiple assignments per
I can't answer your question about the Decompressor interface, but I have a
query for you.
Why not just create an EncryptedWritable object? Encrypt/decrypt the bytes
on the read/write method, that should be darn near trivial. Then stick with
good 'ol SequenceFile, which, as you note, is
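To make the suggestion concrete, here's a minimal sketch of the EncryptedWritable idea. In Hadoop it would implement org.apache.hadoop.io.Writable; to keep the sketch self-contained it uses plain java.io streams, and the hard-coded AES key is a placeholder for whatever key management you actually need:

```java
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import java.io.*;

// Sketch of an "EncryptedWritable": encrypt on write, decrypt on read.
// In Hadoop this would implement org.apache.hadoop.io.Writable; plain
// DataInput/DataOutput are used here so the sketch stands alone.
public class EncryptedWritable {
    // Demo key only -- real code would pull this from a key store.
    private static final SecretKeySpec KEY =
            new SecretKeySpec("0123456789abcdef".getBytes(), "AES");

    private String payload;

    public EncryptedWritable() {}
    public EncryptedWritable(String payload) { this.payload = payload; }
    public String getPayload() { return payload; }

    public void write(DataOutput out) throws Exception {
        Cipher c = Cipher.getInstance("AES/ECB/PKCS5Padding");
        c.init(Cipher.ENCRYPT_MODE, KEY);
        byte[] enc = c.doFinal(payload.getBytes("UTF-8"));
        out.writeInt(enc.length);   // length-prefix, like Hadoop writables do
        out.write(enc);
    }

    public void readFields(DataInput in) throws Exception {
        byte[] enc = new byte[in.readInt()];
        in.readFully(enc);
        Cipher c = Cipher.getInstance("AES/ECB/PKCS5Padding");
        c.init(Cipher.DECRYPT_MODE, KEY);
        payload = new String(c.doFinal(enc), "UTF-8");
    }

    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new EncryptedWritable("customer record 42").write(new DataOutputStream(buf));

        EncryptedWritable back = new EncryptedWritable();
        back.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(back.getPayload());
    }
}
```
The SequenceFile machinery never sees plaintext bytes, which is the whole point of doing it at the Writable layer.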
I have a cluster of boxes with 3 reducers per node. I want to limit a
particular job to only run 1 reducer per node.
This job is network IO bound, gathering images from a set of webservers.
My job has certain parameters set to meet web politeness standards (e.g.
limit connects and
that, you can modify mapred-site.xml to change it from 3 to 1
Best,
--
Nan Zhu
School of Computer Science,
McGill University
On Friday, 8 February, 2013 at 11:24 PM, David Parks wrote:
Hmm, odd, I’m using AWS Mapreduce, and this property is already set to 1 on my
cluster by default
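For reference, I believe the property Nan means is the per-TaskTracker reduce slot count, which caps slots for every job on that node (it isn't per-job):

```xml
<!-- mapred-site.xml: reduce slots per TaskTracker (cluster-wide setting) -->
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>1</value>
</property>
```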
- detecting if what they've
read is of an older time and decoding it appropriately (while handling newer
encoding separately, in the normal fashion).
This would be much better than going down the classloader hack paths I
think?
On Tue, Jan 29, 2013 at 1:11 PM, David Parks davidpark...@yahoo.com wrote
Is it possible to use symbolic links in 1.0.3?
If yes: can I use symbolic links to create a single, final directory
structure of files from many locations; then use DistCp/S3DistCp to copy
that final directory structure to another filesystem such as S3?
Usecase:
I currently launch 4
Anyone have any good tricks for upgrading a sequence file.
We maintain a sequence file like a flat file DB and the primary object in
there changed in recent development.
It's trivial to write a job to read in the sequence file, update the object,
and write it back out in the new format.
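One trick that makes the *next* format change painless is to tag each record with a version byte, so a single reader can decode both layouts (the "detect the older encoding" idea from the other thread). A self-contained sketch using plain DataStreams in place of the SequenceFile value bytes; the v1/v2 record layouts are made up for illustration:

```java
import java.io.*;

// Sketch: tag each record with a version byte so one reader can decode
// both the old (v1: name only) and new (v2: name + count) layouts.
// Plain DataStreams stand in for a SequenceFile's value bytes.
public class VersionedRecord {
    String name;
    int count; // new in v2; defaults to 0 when reading a v1 record

    static void writeV1(DataOutput out, String name) throws IOException {
        out.writeByte(1);
        out.writeUTF(name);
    }

    static void writeV2(DataOutput out, String name, int count) throws IOException {
        out.writeByte(2);
        out.writeUTF(name);
        out.writeInt(count);
    }

    static VersionedRecord read(DataInput in) throws IOException {
        VersionedRecord r = new VersionedRecord();
        byte version = in.readByte();
        r.name = in.readUTF();
        if (version >= 2) r.count = in.readInt(); // field only present in v2+
        return r;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        writeV1(out, "old-style");
        writeV2(out, "new-style", 7);

        DataInputStream in = new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));
        VersionedRecord a = read(in), b = read(in);
        System.out.println(a.name + ":" + a.count + " " + b.name + ":" + b.count);
    }
}
```
With that in place the one-off conversion job becomes optional: you can rewrite lazily the next time each record is touched.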
Thinking here... if you submitted the task programmatically you should be
able to capture the failure of the task and gracefully move past it to your
next tasks.
To say it in a long-winded way: Let's say you submit a job to Hadoop, a
java jar, and your main class implements Tool. That code has
Here’s an example of running distcp (actually in this case s3distcp, but it’s
about the same, just new DistCp()) from java:
ToolRunner.run(getConf(), new S3DistCp(), new String[] {
    "--src", "/src/dir/",
    "--srcPattern", ".*(itemtable)-r-[0-9]*.*",
    "--dest",
I'm pretty consistently seeing a few reduce tasks fail with OutOfMemoryError
(below). It doesn't kill the job, but it slows it down.
In my current case the reducer is pretty darn simple, the algorithm
basically does:
1. Do you have 2 values for this key?
2. If so, build a json
I'm submitting unrelated jobs programmatically (using AWS EMR) so they run
in parallel.
I'd like to run an s3distcp job in parallel as well, but the interface to
that job is a Tool, e.g. ToolRunner.run(...).
ToolRunner blocks until the job completes though, so presumably I'd need to
create a
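Since ToolRunner.run(...) just blocks the calling thread, one way to get the parallelism is to wrap each blocking call in a Callable and hand them to an ExecutorService. A self-contained sketch where Thread.sleep stands in for the real ToolRunner.run(conf, new S3DistCp(), args) call:

```java
import java.util.List;
import java.util.concurrent.*;

// Sketch: ToolRunner.run(...) blocks, so wrap each blocking job in a
// Callable and let an ExecutorService run them side by side. The
// sleep() calls below stand in for real ToolRunner.run(...) invocations.
public class ParallelJobs {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(2);

        Callable<Integer> mainJob   = () -> { Thread.sleep(200); return 0; };
        Callable<Integer> distcpJob = () -> { Thread.sleep(200); return 0; };

        // invokeAll submits both and waits until both have finished.
        List<Future<Integer>> results = pool.invokeAll(List.of(mainJob, distcpJob));
        for (Future<Integer> f : results) {
            System.out.println("exit code: " + f.get()); // 0 means success
        }
        pool.shutdown();
    }
}
```
Future.get() also rethrows any exception from inside the job, so failures don't get silently swallowed by the background thread.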
want to begin with.
Thanks for saving me more fruitless searches.
On Dec 11, 2012, at 10:04 PM, David Parks davidpark...@yahoo.com wrote:
If you use TextInputFormat, you'll get the following key (LongWritable),
value (Text) pairs in your mapper:
(file_position, your_input)
Example:
0,
0\t[356
I'm having exactly this problem, and it's causing my job to fail when I try
to process a larger amount of data (I'm attempting to process 30GB of
compressed CSVs and the entire job fails every time).
This issue is open for it:
https://issues.apache.org/jira/browse/MAPREDUCE-5
Anyone have any
The map task may use a combiner 0+ times. Basically that means (as far as I
understand), if the map output data is below some internal hadoop threshold,
it'll just send it to the reducer, if it's larger then it'll run it through
the combiner first. And at hadoops discretion, it may run the
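The practical consequence is that a combiner has to be safe to apply 0, 1, or many times, which is why it's normally an associative, commutative "mini-reduce" such as a sum. A self-contained illustration of why a sum qualifies:

```java
import java.util.*;

// Sketch: a combiner must give the same final answer whether it runs
// 0, 1, or many times. A sum (word-count style) has that property:
// summing partial sums equals summing everything at the reducer in one go.
public class CombinerIdempotence {
    // The "combine/reduce" function: sum all counts for one key.
    static int sum(List<Integer> counts) {
        int total = 0;
        for (int c : counts) total += c;
        return total;
    }

    public static void main(String[] args) {
        List<Integer> mapOutputs = Arrays.asList(1, 1, 1, 1, 1, 1);

        // Combiner never runs: reducer sees all six values.
        int direct = sum(mapOutputs);

        // Combiner runs once per map spill: reducer sees two partial sums.
        int partialA = sum(mapOutputs.subList(0, 3));
        int partialB = sum(mapOutputs.subList(3, 6));
        int combined = sum(Arrays.asList(partialA, partialB));

        System.out.println(direct + " == " + combined);
    }
}
```
Something like an average fails this test (an average of averages is not the overall average), which is why you carry (sum, count) pairs through the combiner instead.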
Assume for a moment that you have a large cluster of 500 AWS spot instance
servers running. And you want to keep the bid price low, so at some point
it's likely that the whole cluster will get axed until the spot price comes
down some.
In order to maintain HDFS continuity I'd want say 10
If you use TextInputFormat, you'll get the following key (LongWritable),
value (Text) pairs in your mapper:
(file_position, your_input)
Example:
0,
0\t[356:0.3481597,359:0.3481597,358:0.3481597,361:0.3481597,360:0.3481597]
100,
8\t[356:0.34786037,359:0.34786037,358:0.34786037,361:0.34786037,360:0.34786
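The thing that trips people up is that the LongWritable key is the *byte offset* of the line in the file (which is why the example keys jump 0, 100, ...), not a line number. A tiny stand-alone demonstration (assumes single-byte characters, so String length equals byte length):

```java
// Sketch: TextInputFormat keys are *byte offsets*, not line numbers.
// Walking a small in-memory "file" shows each line's key is the offset
// of its first byte, so line lengths (plus the '\n') drive the keys.
public class ByteOffsetKeys {
    public static void main(String[] args) {
        String file = "first line\nsecond line\nthird\n";
        long offset = 0;
        for (String line : file.split("\n")) {
            System.out.println(offset + "\t" + line); // (key, value) as a mapper sees it
            offset += line.length() + 1;              // +1 for the newline byte
        }
    }
}
```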
You're likely to find answers to your questions here, but you'll need
specific questions and some rudimentary subject matter knowledge. I'd
suggest starting off with a good book on Hadoop, you'll probably find a lot
of your questions are answered in a casual afternoon of reading. I was
pretty
I had the same problem yesterday, it sure does look to be dead on that
issue. I found another forum discussion on AWS that suggested more memory as
a stop-gap way to deal with it, or apply the patch. I checked the code on
hadoop 1.0.3 (the version on AWS) and it didn't have the fix, so it looks
I've got a job that reads in 167 files from S3, but 2 of the files are being
mapped twice and 1 of the files is mapped 3 times.
This is the code I use to set up the mapper:
Path lsDir = new
Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
for (FileStatus f :
To: user@hadoop.apache.org
Subject: Re: Map tasks processing some files multiple times
Could it be due to spec-ex? Does it make a difference in the end?
Raj
From: David Parks davidpark...@yahoo.com
To: user@hadoop.apache.org
Sent: Wednesday, December 5, 2012 10:15 PM
Subject: Map tasks
I'm curious about profiling, I see some documentation about it (1.0.3 on
AWS), but the references to JobConf seem to be for the old api and I've
got everything running on the new api.
I've got a job to handle processing of about 30GB of compressed CSVs and
it's taking over a day with 3
First rule to be wary of is your use of the combiner. The combiner *might*
be run, it *might not* be run, and it *might be run multiple times*. The
combiner is only for reducing the amount of data going to the reducer, and
it will only be run *if and when* it's deemed likely to be useful by
I want to move a file in HDFS after a job using the Java API, I'm trying
this command but I always get false (could not rename):
Path from = new
Path("hdfs://localhost/process-changes/itemtable-r-1");
Path to = new Path("hdfs://localhost/output/itemtable-r-1");
boolean
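One very common reason for rename returning false is that the destination's parent directory doesn't exist yet, since rename won't create it for you. Here's that pitfall sketched with java.nio.file standing in for HDFS (Files.move throws where Hadoop's FileSystem.rename just returns false); the paths are made up:

```java
import java.io.IOException;
import java.nio.file.*;

// Sketch of the usual culprit: a rename fails when the destination's
// parent directory doesn't exist, because rename will not create it.
// java.nio.file stands in here for HDFS; Hadoop's FileSystem.rename()
// just returns false where Files.move() throws.
public class RenameNeedsParent {
    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("hdfs-sim");
        Path from = Files.createFile(root.resolve("itemtable-r-1"));
        Path to = root.resolve("output/itemtable-r-1"); // "output" doesn't exist yet

        try {
            Files.move(from, to);
            System.out.println("moved without parent?");
        } catch (NoSuchFileException e) {
            System.out.println("move failed: parent dir missing");
        }

        Files.createDirectories(to.getParent()); // create "output" first...
        Files.move(from, to);                    // ...then the rename succeeds
        System.out.println("moved: " + Files.exists(to));
    }
}
```
So in the Hadoop case, try fs.mkdirs() on the destination's parent before calling rename, and check the boolean it returns too.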
Is there an XMLOutputFormat in existence somewhere? I need to output Solr
XML change docs, I'm betting I'm not the first.
David
identify which block of IDs to assign each one?
Thanks,
David
From: Ted Dunning [mailto:tdunn...@maprtech.com]
Sent: Monday, October 29, 2012 12:58 PM
To: user@hadoop.apache.org
Subject: Re: Cluster wide atomic operations
On Sun, Oct 28, 2012 at 9:15 PM, David Parks davidpark
then you can
cut the evilness of global atomicity by a substantial factor.
Are you sure you need a global counter?
On Fri, Oct 26, 2012 at 11:07 PM, David Parks davidpark...@yahoo.com
wrote:
How can we manage cluster-wide atomic operations? Such as maintaining an
auto-increment counter.
Does
How can we manage cluster-wide atomic operations? Such as maintaining an
auto-increment counter.
Does Hadoop provide native support for these kinds of operations?
And in case the ultimate answer involves ZooKeeper, I'd love to work out doing
this in AWS/EMR.
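To sketch the block-of-IDs idea discussed in this thread: instead of one shared operation per ID, each worker reserves a whole block from a central counter, then assigns IDs locally with no further coordination. The AtomicLong below stands in for whatever shared store (ZooKeeper, a DB sequence, ...) would hold the counter across a real cluster:

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the "block of IDs" trick: each worker reserves a whole
// range from one central counter and hands out IDs locally, so the
// shared counter is touched once per 1000 IDs instead of once per ID.
// The AtomicLong stands in for a real cluster-wide store (ZooKeeper,
// a DB sequence, ...).
public class BlockIdAllocator {
    private static final long BLOCK_SIZE = 1000;
    private static final AtomicLong central = new AtomicLong(0); // "cluster-wide" counter

    private long next = -1, limit = -1; // this worker's current block

    public long nextId() {
        if (next >= limit) {                       // block exhausted: reserve a new one
            next = central.getAndAdd(BLOCK_SIZE);  // the only shared operation
            limit = next + BLOCK_SIZE;
        }
        return next++;
    }

    public static void main(String[] args) {
        BlockIdAllocator workerA = new BlockIdAllocator();
        BlockIdAllocator workerB = new BlockIdAllocator();
        System.out.println("A: " + workerA.nextId() + ", " + workerA.nextId());
        System.out.println("B: " + workerB.nextId() + ", " + workerB.nextId());
    }
}
```
IDs come out dense within a block but with gaps between workers, which is the usual trade for cutting the global-atomicity cost.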
I've got MultipleOutputs configured to generate 2 named outputs. I'd like to
send one to s3n:// and one to hdfs://
Is this possible? One is a final summary report, the other is input to the
next job.
Thanks,
David
Even after reading O'Reilly's book on Hadoop I don't feel like I have a clear
vision of how the map tasks get assigned.
They depend on splits right?
But I have 3 jobs running. And splits will come from various sources: HDFS,
S3, and slow HTTP sources.
So I've got some concern as to how
which Data Node gets the
job.
HTH
-Mike
On Oct 24, 2012, at 1:10 AM, David Parks davidpark...@yahoo.com wrote:
Even after reading O'Reilly's book on Hadoop I don't feel like I have a clear
vision of how the map tasks get assigned.
They depend on splits right?
But I have 3 jobs
are the constraints that you are working with?
On Mon, Oct 22, 2012 at 5:59 PM, David Parks davidpark...@yahoo.com wrote:
Would it make sense to write a map job that takes an unsplittable XML file
(which defines all of the files I need to download); that one map job then
kicks off the downloads
I want to create a MapReduce job which reads many multi-gigabyte input files
from various HTTP sources processes them nightly.
Is there a reasonably flexible way to do this in the Hadoop job itself? I
expect the initial downloads to take many hours and I'd hope I can optimize
the # of
I want to create a MapReduce job which reads many multi-gigabyte input files
from various HTTP sources processes them nightly.
Is there a reasonably flexible way to acquire the files in the Hadoop job
itself? I expect the initial downloads to take many hours and I'd hope I
can optimize the #
a list of files with tuples (host:port, filePath). Then use a map-only job
to pull each file using NLineInputFormat.
Another way is to write a HttpInputFormat and HttpRecordReader and stream
the data in a map-only job.
On Mon, Oct 22, 2012 at 1:54 AM, David Parks davidpark...@yahoo.com wrote:
I want