I'm on CDH4, and trying to recover both the namenode and cloudera manager
VMs from HDFS after losing the namenode.
All of our backup VMs are on HDFS, so for the moment I just want to hack
something together, copy the backup VMs off HDFS and get on with properly
reconfiguring via Cloudera Manager.
We have a box that's a bit overpowered for just running our namenode and
jobtracker on a 10-node cluster, and, like you, we also wanted to make use of
the storage and processor resources of that node.
What we did is use LXC containers to segregate the different processes. LXC
is a very light weig
2013 6:56 PM
To: user@hadoop.apache.org
Subject: Re: JobClient: Error reading task output - after instituting a DNS
server
Hi David. Can you explain in a bit more detail what the issue was? Thanks.
Shahab
On Tue, May 14, 2013 at 2:29 AM, David Parks wrote:
I just hate it when I fi
From: David Parks [mailto:davidpark...@yahoo.com]
Sent: Tuesday, May 14, 2013 1:20 PM
To: user@hadoop.apache.org
Subject: JobClient: Error reading task output - after instituting a DNS
server
So we just configured a local DNS server for hostname resolution and stopped
using a hosts file and now
Hadoop just runs as a standard Java process, so you should find something that
bridges between OpenCL and Java; a quick Google search yields:
http://www.jocl.org/
I expect that you'll find everything you need to accomplish the handoff from
your mapreduce code to OpenCL there.
As for HDFS, hado
Can I use the FairScheduler to limit the number of map/reduce tasks directly
from the job configuration? E.g. I have 1 job that I know should run a more
limited # of map/reduce tasks than is set as the default, I want to
configure a queue with a limited # of map/reduce tasks, but only apply it to
t
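For illustration, a minimal sketch of the queue/pool approach, assuming the Hadoop 1.x Fair Scheduler with an allocations file that caps a pool via maxMaps/maxReduces; the pool name "capped" and the job wiring are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CappedJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Route only this job into a pool whose maxMaps/maxReduces are capped
        // in the fair-scheduler allocations file; other jobs keep their default pool.
        conf.set("mapred.fairscheduler.pool", "capped");
        Job job = new Job(conf, "capped-job");
        // ... set mapper/reducer/input/output paths as usual ...
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}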
I have a job that's getting 600s task timeouts during the copy phase of the
reduce step. I see a lot of copy tasks all moving at about 2.5MB/sec, and
it's taking longer than 10 min to do that copy.
The process starts copying when the reduce step is 80% complete. This is a
very IO bound task as
We've got a cluster of 10x 8core/24gb nodes, currently with 1 4TB disk (3
disk slots max), they chug away ok currently, only slightly IO bound on
average.
I'm going to upgrade the disk configuration at some point (we do need more
space on HDFS) and I'm thinking about what's best hardware-wise:
I just realized another trick you might try. The Hadoop dfs client can read
input from STDIN, so you could use netcat to pipe the data across to HDFS
without hitting the hard drive. I haven't tried it, but here's what I would
think might work:
On the Hadoop box, open a listening port and feed
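Not from the thread, but here is a rough Java sketch of the same idea: open a listening socket on a Hadoop box and stream whatever arrives straight into HDFS, so the sender can just "cat file | nc host port". The port number and target path are made up.

import java.net.ServerSocket;
import java.net.Socket;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class SocketToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Wait for one sender, then stream its bytes straight into an HDFS file.
        ServerSocket server = new ServerSocket(12345);
        Socket sender = server.accept();
        FSDataOutputStream out = fs.create(new Path("/staging/incoming.dat"));
        IOUtils.copyBytes(sender.getInputStream(), out, 64 * 1024, true); // closes both streams
        server.close();
    }
}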
I think the problem here is that he doesn't have Hadoop installed in this
other location, so there's no Hadoop DFS client to do the put directly into
HDFS with; he would normally copy the file to one of the nodes in the cluster
where the client files are installed. I've had the same problem recently.
For a set of jobs to run I need to download about 100GB of data from the
internet (~1000 files of varying sizes from ~10 different domains).
Currently I do this in a simple linux script as it's easy to script FTP,
curl, and the like. But it's a mess to maintain a separate server for that
proces
on EC2 and amazon S3 ?
Looking forward to hear from you.
Thanks
Himanish
On Fri, Mar 29, 2013 at 10:34 AM, David Parks
wrote:
CDH4 can be either 1.x or 2.x Hadoop; are you using the 2.x line? I've used
it primarily with 1.0.3, which is what AWS uses, so I presume that's wh
ed to get all the jars (
>else it could cause version mismatch errors for HDFS - NoSuchMethodError etc
>etc )
>
>
>Appreciate your help regarding this.
>
>
>- Himanish
>
>
>
>
>On Fri, Mar 29, 2013 at 1:41 AM, David Parks wrote:
>
>None of that complexity, they distrib
: Friday, March 29, 2013 3:21 PM
To: user
Subject: Re: Which hadoop installation should I use on ubuntu server?
I recommend cloudera's CDH4 on ubuntu 12.04 LTS
On Thu, Mar 28, 2013 at 7:07 AM, David Parks wrote:
I'm moving off AWS MapReduce to our own cluster, I'm installing Hadoop on
U
apache.org
Subject: Re: Which hadoop installation should I use on ubuntu server?
apache bigtop has builds done for ubuntu
you can check them at jenkins mentioned on bigtop.apache.org
On Thu, Mar 28, 2013 at 11:37 AM, David Parks
wrote:
I'm moving off AWS MapReduce to our o
r corporate LAN. Could
you please provide some details on how i could use the s3distcp from amazon
to transfer data from our on-premises hadoop to amazon s3. Wouldn't some
kind of VPN be needed between the Amazon EMR instance and our on-premises
hadoop instance ? Did you mean use the jar from amaz
Have you tried using s3distcp from amazon? I used it many times to transfer
1.5TB between S3 and Hadoop instances. The process took 45 min, well over
the 10-minute timeout you're running into.
Dave
From: Himanish Kushary [mailto:himan...@gmail.com]
Sent: Thursday, March
I'm moving off AWS MapReduce to our own cluster, I'm installing Hadoop on
Ubuntu Server 12.10.
I see a .deb installer and installed that, but it seems like files are all
over the place `/usr/share/Hadoop`, `/etc/hadoop`, `/usr/bin/hadoop`. And
the documentation is a bit harder to follow:
ht
M, varun kumar wrote:
> Hope below link will be useful..
>
> http://hadoop.apache.org/docs/stable/hdfs_user_guide.html
>
>
> On Sat, Mar 23, 2013 at 12:29 PM, David Parks
> wrote:
>>
>> For a new installation of the current stable build (1.1.2 ), is there
>>
Can I suggest an answer of "Yes, but you probably don't want to"?
As a "typical user" of Hadoop you would not do this. Hadoop already chooses
the best server to do the work based on the location of the data (a server
that is available to do work and also has the data locally will generally be
ass
> http://hadoop.apache.org/docs/stable/hdfs_user_guide.html
>
>
> On Sat, Mar 23, 2013 at 12:29 PM, David Parks
> wrote:
>>
>> For a new installation of the current stable build (1.1.2 ), is there
>> any reason to use the CheckPointNode over the BackupNode?
>>
>&g
For a new installation of the current stable build (1.1.2 ), is there any
reason to use the CheckPointNode over the BackupNode?
It seems that we need to choose one or the other, and from the docs it seems
like the BackupNode is more efficient in its processes.
Good points all,
The mapreduce jobs are, well, intensive. We've got a whole variety, but
typically I see them use a lot of CPU, a lot of disk, and on occasion a
whole bunch of network bandwidth. Duh, right? :)
The master node is mostly CPU intensive right? We're using LXC to segregate
(ps
I want 20 servers, I got 7, so I want to make the most of the 7 I have. Each
of the 7 servers has: 24GB of RAM, 4TB of disk, and 8 cores.
Would it be terribly unwise of me to run such a configuration:
- Server #1: NameNode + Master + TaskTracker (reduced slots)
- Server #
From the release page on hadoop's website:
"This release, like previous releases in hadoop-2.x series is still
considered alpha primarily since some of APIs aren't fully-baked and we
expect some churn in future."
How "alpha" is the 2.x line? We're moving off AWS (1.0.3) onto our own
cluste
-failures
Task attempt_201303080219_0002_r_006026_0 failed to report status for 7201
seconds. Killing!
attempt_201303080219_0002_r_006026_0: [Fatal Error] :1:1: Premature end of file.
Too many fetch-failures
Too many fetch-failures
Too many fetch-failures
From: David
I can't explain this behavior, can someone help me here:
Kind % Complete Num Tasks Pending Running Complete Killed Failed/Killed
Task Attempts
map 100.00%23547 0 123546 0 247 / 0
reduce 62.40%13738 30 6232 0 336
We've taken to documenting our Hadoop jobs in a simple visual manner using
PPT (attached example). I wonder how others document their jobs?
We often add notes to the text section of the PPT slides as well.
e # of reducers needed?), but it doesn't affect
scheduling of a set number of reduce tasks nor does a scheduler care
currently if you add that step in or not.
On Mon, Feb 11, 2013 at 7:59 AM, David Parks wrote:
> I guess the FairScheduler is doing multiple assignments per heartbeat,
Are there any rules against writing results to Reducer.Context while in the
cleanup() method?
I've got a reducer that is downloading a few tens of millions of images from
a set of URLs fed to it.
To be efficient I run many connections in parallel, but limit connections
per domain and frequ
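Not from the thread, but for what it's worth: in the new API the framework calls cleanup(context) before it closes the reducer's output, so emitting from cleanup() generally works. A tiny sketch of the buffer-then-flush pattern (class and field names are made up):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class BufferedFetchReducer extends Reducer<Text, Text, Text, Text> {
    private final List<String> results = new ArrayList<String>();

    @Override
    protected void reduce(Text domain, Iterable<Text> urls, Context ctx) {
        // Queue work here (e.g. hand URLs to a thread pool) instead of writing.
        for (Text url : urls) {
            results.add(domain + "\t" + url);
        }
    }

    @Override
    protected void cleanup(Context ctx) throws IOException, InterruptedException {
        // The Context here is the same one reduce() received and is still open,
        // so writing the buffered results at the end is legal.
        for (String r : results) {
            ctx.write(new Text(r), new Text("fetched"));
        }
    }
}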
In the EncryptedWritableWrapper idea you would create an object that takes
any Writable object as its parameter.
Your EncryptedWritableWrapper would naturally implement Writable.
- When write(DataOutput out) is called on your object, create your
own DataOutputStream which reads da
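A hedged sketch of that wrapper, assuming javax.crypto and AES with a caller-supplied 16/24/32-byte key; cipher mode and key handling are deliberately oversimplified, and a real deployment would also need a no-arg constructor plus key distribution for the framework to deserialize it:

import java.io.*;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import org.apache.hadoop.io.Writable;

public class EncryptedWritableWrapper implements Writable {
    private final Writable wrapped;
    private final SecretKeySpec key;

    public EncryptedWritableWrapper(Writable wrapped, byte[] rawKey) {
        this.wrapped = wrapped;
        this.key = new SecretKeySpec(rawKey, "AES");  // rawKey must be 16/24/32 bytes
    }

    public void write(DataOutput out) throws IOException {
        try {
            // Serialize the wrapped Writable to a buffer, encrypt, write length + bytes.
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            wrapped.write(new DataOutputStream(buf));
            Cipher c = Cipher.getInstance("AES");
            c.init(Cipher.ENCRYPT_MODE, key);
            byte[] enc = c.doFinal(buf.toByteArray());
            out.writeInt(enc.length);
            out.write(enc);
        } catch (Exception e) {
            throw new IOException("encryption failed", e);
        }
    }

    public void readFields(DataInput in) throws IOException {
        try {
            // Read length + ciphertext, decrypt, then let the wrapped object deserialize.
            byte[] enc = new byte[in.readInt()];
            in.readFully(enc);
            Cipher c = Cipher.getInstance("AES");
            c.init(Cipher.DECRYPT_MODE, key);
            byte[] plain = c.doFinal(enc);
            wrapped.readFields(new DataInputStream(new ByteArrayInputStream(plain)));
        } catch (Exception e) {
            throw new IOException("decryption failed", e);
        }
    }
}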
ughly).
On Sat, Feb 9, 2013 at 9:24 AM, David Parks wrote:
I have a cluster of boxes with 3 reducers per node. I want to limit a
particular job to only run 1 reducer per node.
This job is network IO bound, gathering images from a set of webservers.
My job has certain parameters set to
I can't answer your question about the Decompressor interface, but I have a
query for you.
Why not just create an EncryptedWritable object? Encrypt/decrypt the bytes
on the read/write method, that should be darn near trivial. Then stick with
good 'ol SequenceFile, which, as you note, is splitta
llows you to do
that, you can modify mapred-site.xml to change it from 3 to 1
Best,
--
Nan Zhu
School of Computer Science,
McGill University
On Friday, 8 February, 2013 at 11:24 PM, David Parks wrote:
Hmm, odd, I’m using AWS Mapreduce, and this property is already set to 1 on my
clu
@hadoop.apache.org
Subject: Re: How can I limit reducers to one-per-node?
I think setting mapred.tasktracker.reduce.tasks.maximum to 1 may meet your requirement
Best,
--
Nan Zhu
School of Computer Science,
McGill University
On Friday, 8 February, 2013 at 10:54 PM, David Parks wrote:
I have
I have a cluster of boxes with 3 reducers per node. I want to limit a
particular job to only run 1 reducer per node.
This job is network IO bound, gathering images from a set of webservers.
My job has certain parameters set to meet "web politeness" standards (e.g.
limit connects and connect
ersioning perhaps - detecting if what they've
read is of an older time and decoding it appropriately (while handling newer
encoding separately, in the normal fashion).
This would be much better than going down the classloader hack paths I
think?
On Tue, Jan 29, 2013 at 1:11 PM, David Parks wr
Anyone have any good tricks for upgrading a sequence file?
We maintain a sequence file like a flat file DB and the primary object in
there changed in recent development.
It's trivial to write a job to read in the sequence file, update the object,
and write it back out in the new format.
Is it possible to use symbolic links in 1.0.3?
If yes: can I use symbolic links to create a single, final directory
structure of files from many locations; then use DistCp/S3DistCp to copy
that final directory structure to another filesystem such as S3?
Usecase:
I currently launch 4 S3Dist
Thinking here... if you submitted the task programmatically you should be
able to capture the failure of the task and gracefully move past it to your
next tasks.
To say it in a long-winded way: Let's say you submit a job to Hadoop, a
java jar, and your main class implements Tool. That code has th
Here’s an example of running distcp (actually in this case s3distcp, but it’s
about the same, just new DistCp()) from java:
ToolRunner.run(getConf(), new S3DistCp(), new String[] {
    "--src", "/src/dir/",
    "--srcPattern", ".*(itemtable)-r-[0-9]*.*",
    "--des
mapred.map.tasksperslot do?
David,
Could you please tell what version of Hadoop you are using ? I don't see
this parameter in the stable (1.x) or current branch. I only see references
to it with respect to EMR and with Hadoop 0.18 or so.
On Thu, Dec 27, 2012 at 1:51 PM, David Parks
I didn't come up with much in a google search.
In particular, what are the side effects of changing this setting? Memory?
Sort process?
I'm guessing it means that it'll feed 2 map tasks as input to each map slot;
a map task, in turn, is a self-contained JVM which consumes one map slot.
Th
I'm pretty consistently seeing a few reduce tasks fail with OutOfMemoryError
(below). It doesn't kill the job, but it slows it down.
In my current case the reducer is pretty darn simple, the algorithm
basically does:
1. Do you have 2 values for this key?
2. If so, build a json str
I've got 15 boxes in a cluster, 7.5GB of ram each on AWS (m1.large), 1
reducer per node.
I'm seeing this exception sometimes. It's not stopping the job from
completing, it's just failing 3 or 4 reduce tasks and slowing things down:
Error: java.lang.OutOfMemoryError: Java heap space
Client(job);
jc.submitJob(job);
Cheers!
Manoj.
On Fri, Dec 14, 2012 at 10:09 AM, David Parks
wrote:
I'm submitting unrelated jobs programmatically (using AWS EMR) so they run
in parallel.
I'd like to run an s3distcp job in parallel as well, but the interface to
tha
I'm submitting unrelated jobs programmatically (using AWS EMR) so they run
in parallel.
I'd like to run an s3distcp job in parallel as well, but the interface to
that job is a Tool, e.g. ToolRunner.run(...).
ToolRunner blocks until the job completes though, so presumably I'd need to
create a thre
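One way around the blocking call, sketched below: hand ToolRunner.run(...) to an executor and join on the Future later. The S3DistCp class and --src flag mirror the snippet earlier in the thread; the --dest value is made up, and the import for S3DistCp comes from Amazon's s3distcp jar (not shown here, package varies by release).

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;

public class ParallelDistCp {
    public static void main(String[] args) throws Exception {
        final Configuration conf = new Configuration();
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // Run the blocking copy in the background while other jobs are submitted.
        Future<Integer> copy = pool.submit(new Callable<Integer>() {
            public Integer call() throws Exception {
                // S3DistCp comes from Amazon's s3distcp jar on the classpath.
                return ToolRunner.run(conf, new S3DistCp(), new String[] {
                        "--src", "/src/dir/",
                        "--dest", "s3n://my-bucket/dest/dir/" });
            }
        });
        // ... submit the other, unrelated jobs here ...
        int exitCode = copy.get();   // block only when the copy result is needed
        pool.shutdown();
        System.exit(exitCode);
    }
}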
h the filesystem directly and was leaving a connection to
it open which was hanging the map tasks (without error) that used that code.
-Original Message-
From: David Parks [mailto:davidpark...@yahoo.com]
Sent: Thursday, December 13, 2012 11:23 AM
To: user@hadoop.apache.org
Subject: Shuf
I'm having exactly this problem, and it's causing my job to fail when I try
to process a larger amount of data (I'm attempting to process 30GB of
compressed CSVs and the entire job fails every time).
This issue is open for it:
https://issues.apache.org/jira/browse/MAPREDUCE-5
Anyone have any ide
alls. They aren't quite what I want to begin with.
Thanks for saving me more fruitless searches.
On Dec 11, 2012, at 10:04 PM, David Parks wrote:
If you use TextInputFormat, you'll get the following key,
value pairs in your mapper:
file_position, your_input
Example:
0,
"0\t[
I had the same problem yesterday, it sure does look to be dead on that
issue. I found another forum discussion on AWS that suggested more memory as
a stop-gap way to deal with it, or apply the patch. I checked the code on
hadoop 1.0.3 (the version on AWS) and it didn't have the fix, so it looks
lik
You're likely to find answers to your questions here, but you'll need
specific questions and some rudimentary subject matter knowledge. I'd
suggest starting off with a good book on Hadoop, you'll probably find a lot
of your questions are answered in a casual afternoon of reading. I was
pretty happy
If you use TextInputFormat, you'll get the following key,
value pairs in your mapper:
file_position, your_input
Example:
0,
"0\t[356:0.3481597,359:0.3481597,358:0.3481597,361:0.3481597,360:0.3481597]"
100,
"8\t[356:0.34786037,359:0.34786037,358:0.34786037,361:0.34786037,360:0.34786
037]"
200,
"25\t[
Assume for a moment that you have a large cluster of 500 AWS spot instance
servers running. And you want to keep the bid price low, so at some point
it's likely that the whole cluster will get axed until the spot price comes
down some.
In order to maintain HDFS continuity I'd want say 10 server
The map task may use a combiner 0+ times. Basically that means (as far as I
understand): if the map output data is below some internal Hadoop threshold,
it'll just send it to the reducer; if it's larger, it'll run it through
the combiner first. And at Hadoop's discretion, it may run the combiner
n the reason for using MultipleInputs ?
On Thu, Dec 6, 2012 at 2:59 PM, David Parks wrote:
Figured it out, it is, as usual, with my code. I had wrapped TextInputFormat
to replace the LongWritable key with a key representing the file name. It
was a bit tricky to do because of changing the generics
adding the same input format for all files, do
you need MultipleInputs ?
Thanks
Hemanth
On Thu, Dec 6, 2012 at 1:06 PM, David Parks wrote:
I believe I just tracked down the problem, maybe you can help confirm if
you're familiar with this.
I see that FileInputFormat is specifying
M
To: user@hadoop.apache.org
Subject: Re: Map tasks processing some files multiple times
Could it be due to spec-ex? Does it make a diffrerence in the end?
Raj
_
From: David Parks
To: user@hadoop.apache.org
Sent: Wednesday, December 5, 2012 10:15 PM
Subject: Map tasks processing
I've got a job that reads in 167 files from S3, but 2 of the files are being
mapped twice and 1 of the files is mapped 3 times.
This is the code I use to set up the mapper:
Path lsDir = new Path("s3n://fruggmapreduce/input/catalogs/linkshare_catalogs/*~*");
for (FileStatus f :
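For context, a hedged guess at the usual shape of that truncated loop: expand the glob and add each match as an input path. The original code may well have differed, and the rest of the thread notes the real culprit was a wrapped input format.

import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class GlobInputSetup {
    // Expand a glob such as ".../linkshare_catalogs/*~*" and register each
    // matching file as an input path on the job.
    public static void addMatchingInputs(Job job, String glob) throws IOException {
        Path pattern = new Path(glob);
        FileSystem fs = pattern.getFileSystem(job.getConfiguration());
        for (FileStatus f : fs.globStatus(pattern)) {
            FileInputFormat.addInputPath(job, f.getPath());
        }
    }
}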
First rule to be wary of is your use of the combiner. The combiner *might*
be run, it *might not* be run, and it *might be run multiple times*. The
combiner is only for reducing the amount of data going to the reducer, and
it will only be run *if and when* it's deemed likely to be useful by Hadoop.
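A small illustration of a combiner-safe function (not from the thread): partial sums compose, so the same class can serve as both combiner and reducer no matter how many times Hadoop decides to run it.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        ctx.write(key, new IntWritable(sum));
    }
    // Wiring (sketch): job.setCombinerClass(SumReducer.class);
    //                  job.setReducerClass(SumReducer.class);
    // An average, by contrast, would NOT be safe to use as a combiner as-is.
}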
I'm curious about profiling. I see some documentation about it (1.0.3 on
AWS), but the references to JobConf seem to be for the "old API" and I've
got everything running on the "new API".
I've got a job to handle processing of about 30GB of compressed CSVs and
it's taking over a day with 3 m1.m
I want to move a file in HDFS after a job using the Java API. I'm trying
this command but I always get false (could not rename):
Path from = new Path("hdfs://localhost/process-changes/itemtable-r-1");
Path to = new Path("hdfs://localhost/output/itemtable-r-1");
bool
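In case it helps, one common cause of rename() returning false is that the destination's parent directory doesn't exist; HDFS won't create it for you. A hedged sketch that creates the parent first (paths mirror the ones above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MoveJobOutput {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path from = new Path("hdfs://localhost/process-changes/itemtable-r-1");
        Path to = new Path("hdfs://localhost/output/itemtable-r-1");
        // Get the FileSystem from the path itself and make sure the target
        // directory exists before renaming; rename() returns false rather
        // than creating missing parent directories.
        FileSystem fs = from.getFileSystem(conf);
        fs.mkdirs(to.getParent());
        boolean renamed = fs.rename(from, to);
        System.out.println("renamed: " + renamed);
    }
}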
Is there an XMLOutputFormat in existence somewhere? I need to output Solr
XML change docs, I'm betting I'm not the first.
David
ch that I could
identify which block of IDs to assign each one?
Thanks,
David
From: Ted Dunning [mailto:tdunn...@maprtech.com]
Sent: Monday, October 29, 2012 12:58 PM
To: user@hadoop.apache.org
Subject: Re: Cluster wide atomic operations
On Sun, Oct 28, 2012 at 9:15 PM, David P
f you can batch these operations up then you can
cut the evilness of global atomicity by a substantial factor.
Are you sure you need a global counter?
On Fri, Oct 26, 2012 at 11:07 PM, David Parks
wrote:
How can we manage cluster-wide atomic operations? Such as maintaining an
auto-increment co
How can we manage cluster-wide atomic operations? Such as maintaining an
auto-increment counter.
Does Hadoop provide native support for these kinds of operations?
And in case the ultimate answer involves ZooKeeper, I'd love to work out doing
this on AWS/EMR.
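Since ZooKeeper comes up here: a hedged sketch using Apache Curator's DistributedAtomicLong recipe. The connection string, znode path, and retry settings are all made up, and per Ted's point above, reserving blocks of IDs rather than incrementing once per record keeps contention down.

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.atomic.AtomicValue;
import org.apache.curator.framework.recipes.atomic.DistributedAtomicLong;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class ClusterCounter {
    public static void main(String[] args) throws Exception {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk-host:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();
        DistributedAtomicLong counter = new DistributedAtomicLong(
                client, "/counters/item-id", new ExponentialBackoffRetry(1000, 3));
        // Each increment is an optimistic, version-checked update on a znode;
        // check succeeded() because a racing update can force a retry.
        AtomicValue<Long> result = counter.increment();
        if (result.succeeded()) {
            System.out.println("next id: " + result.postValue());
        }
        client.close();
    }
}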
I've got MultipleOutputs configured to generate 2 named outputs. I'd like to
send one to s3n:// and one to hdfs://
Is this possible? One is a final summary report, the other is input to the
next job.
Thanks,
David
ocal so it doesn't matter which Data Node gets the
job.
HTH
-Mike
On Oct 24, 2012, at 1:10 AM, David Parks wrote:
Even after reading O'Reilly's book on Hadoop I don't feel like I have a clear
vision of how the map tasks get assigned.
They depend on splits right?
Even after reading O'Reilly's book on Hadoop I don't feel like I have a clear
vision of how the map tasks get assigned.
They depend on splits right?
But I have 3 jobs running. And splits will come from various sources: HDFS,
S3, and slow HTTP sources.
So I've got some concern as to how t
What are the constraints that you are working with?
On Mon, Oct 22, 2012 at 5:59 PM, David Parks wrote:
Would it make sense to write a map job that takes an unsplittable XML file
(which defines all of the files I need to download); that one map job then
kicks off the downloads in multiple thr
list of files with tuples. Then use a map-only job to pull each file using NLineInputFormat.
Another way is to write a HttpInputFormat and HttpRecordReader and stream
the data in a map-only job.
On Mon, Oct 22, 2012 at 1:54 AM, David Parks wrote:
I want to create a MapReduce job which reads many
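A hedged sketch of that first suggestion: the job's input is a text file of "url<TAB>hdfs-target" lines, NLineInputFormat hands each mapper a slice of them, and the mapper streams each download straight into HDFS. Class names and the line layout are assumptions, not from the thread.

import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class UrlFetchMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        // Each input line: "<source url>\t<destination path in HDFS>"
        String[] parts = line.toString().split("\t");
        URL src = new URL(parts[0]);
        Path dst = new Path(parts[1]);
        FileSystem fs = dst.getFileSystem(ctx.getConfiguration());
        InputStream in = src.openStream();
        FSDataOutputStream out = fs.create(dst);
        IOUtils.copyBytes(in, out, 64 * 1024, true);   // streams and closes both ends
        ctx.write(new Text(parts[1]), NullWritable.get());
    }
    // Driver side (sketch): job.setInputFormatClass(NLineInputFormat.class)
    // with lines-per-split tuned so each map task handles a handful of files.
}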
I want to create a MapReduce job which reads many multi-gigabyte input files
from various HTTP sources & processes them nightly.
Is there a reasonably flexible way to acquire the files in the Hadoop job
itself? I expect the initial downloads to take many hours and I'd hope I
can optimize the #
I want to create a MapReduce job which reads many multi-gigabyte input files
from various HTTP sources & processes them nightly.
Is there a reasonably flexible way to do this in the Hadoop job itself? I
expect the initial downloads to take many hours and I'd hope I can optimize
the # of connecti