I'm interested in joining too. Let me know when and where. We can
always head down to a bar or coffee shop in Seattle too.
Tushar
On Mon, Apr 20, 2009 at 6:31 PM, Lauren Cooney laurencoo...@gmail.com wrote:
If you guys are interested in space over in Redmond, I can see if MSFT can
host. Let
Hi all!
I have some jobs: job1, job2, job3, ... . Each job works with a
group. To control the jobs, I have JobControllers; each JobController
controls the jobs belonging to the specified group.
Example:
- There are 2 groups: g1 and g2
- 2 JobControllers: jController1, jController2
+ jController1
You need to start each JobControl in its own thread so they can run
concurrently. Something like:
Thread t = new Thread(jobControl);
t.start();
Then poll the jobControl.allFinished() method.
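For the original example with two groups, a minimal sketch might look like
this (the group names come from the question; the rest is just one way to
wire it up):

import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class RunJobGroups {
  public static void main(String[] args) throws InterruptedException {
    // One JobControl per group
    JobControl jController1 = new JobControl("g1");
    JobControl jController2 = new JobControl("g2");
    // jController1.addJob(...);  // add each group's jobs here
    // jController2.addJob(...);

    // JobControl implements Runnable, so each one gets its own thread
    Thread t1 = new Thread(jController1);
    Thread t2 = new Thread(jController2);
    t1.start();
    t2.start();

    // Poll until both groups have finished, then stop the controller threads
    while (!(jController1.allFinished() && jController2.allFinished())) {
      Thread.sleep(1000);
    }
    jController1.stop();
    jController2.stop();
  }
}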
Tom
On Tue, Apr 21, 2009 at 10:02 AM, nguyenhuynh.mr
nguyenhuynh...@gmail.com wrote:
Hi all!
Ian Soboroff wrote:
Steve Loughran ste...@apache.org writes:
I think from your perspective it makes sense as it stops anyone getting
itchy fingers and doing their own RPMs.
Um, what's wrong with that?
It's really hard to do good RPM spec files. If Cloudera are willing to
pay Matt to do
Tim Wintle wrote:
On Fri, 2009-04-03 at 09:42 -0700, Ricky Ho wrote:
1) I can pick the language that offers a different programming
paradigm (e.g. I may choose functional language, or logic programming
if they suit the problem better). In fact, I can even choose Erlang
at the map() and
I would be interested in understanding what problems you are having;
we are using 0.19.0 in production on EC2, running Nutch and a set of
custom apps
in a mixed workload on a farm of 5 instances.
On 17 Apr 2009, at 18:05, Ted Coyle wrote:
Rakhi,
I'd suggest going to 0.19.1. hbase and
Andrew Newman wrote:
They are comparing an indexed system with one that isn't. Why is
Hadoop faster at loading than the others? Surely no one would be
surprised that it would be slower - I'm surprised at how well Hadoop
does. Who wants to write a paper for next year, grep vs. reverse
index?
On Mon, Apr 20, 2009 at 7:24 PM, Brian Bockelman bbock...@cse.unl.edu wrote:
Hey Jason,
Wouldn't this be avoided if you used a combiner to also perform the max()
operation? A minimal amount of data would be written over the network.
I can't remember if the map output gets written to disk
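For what it's worth, a minimal sketch of what such a combiner could look like
with the 0.19-era API; the value type and class name are assumptions, not
something from this thread:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class MaxCombiner extends MapReduceBase
    implements Reducer<Text, LongWritable, Text, LongWritable> {

  public void reduce(Text key, Iterator<LongWritable> values,
      OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    long max = Long.MIN_VALUE;
    while (values.hasNext()) {
      max = Math.max(max, values.next().get());
    }
    // Only the local maximum for each key crosses the network
    output.collect(key, new LongWritable(max));
  }
}

You would register it with conf.setCombinerClass(MaxCombiner.class); the same
class can usually serve as the reducer as well.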
Cam Macdonell wrote:
Well, for future googlers, I'll answer my own post. Watch out for the
hostname at the end of localhost lines on slaves. One of my slaves
was registering itself as localhost.localdomain with the jobtracker.
Is there a way that Hadoop could be made to not be so
Jim Twensky wrote:
Yes, here is how it looks:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/scratch/local/jim/hadoop-${user.name}</value>
</property>
so I don't know why it still writes to /tmp. As a temporary workaround, I
created a symbolic link from /tmp/hadoop-jim to
Aaron Kimball wrote:
Cam,
This isn't Hadoop-specific, it's how Linux treats its network configuration.
If you look at /etc/host.conf, you'll probably see a line that says "order
hosts, bind" -- this is telling Linux's DNS resolution library to first read
your /etc/hosts file, then check an
On Mon, Apr 20, 2009 at 1:14 PM, Stuart White stuart.whi...@gmail.comwrote:
Is this the best/only way to deal with this? It would be better if
hadoop offered the option of writing different outputs to different
output directories, or if getmerge offered the ability to specify a
file prefix
Hi there,
We're working on an image analysis project. The image processing code is
written in Matlab. If I invoke that code from a shell script and then use
that shell script within Hadoop streaming, will that work? Has anyone done
something along these lines?
Many thanks,
--ST.
Sameer,
I'd also be interested in that; we are constructing a Hadoop
cluster for energy data (PMU) for NERC, and we will potentially be
running jobs for a number of groups and researchers. I know some
researchers will know nothing of MapReduce, yet are very keen on
MatLab, so we're
Stuart,
I once used MultipleOutputFormat and created
(mapred.work.output.dir)/type1/part-_
(mapred.work.output.dir)/type2/part-_
...
And JobTracker took care of the renaming to
(mapred.output.dir)/type{1,2}/part-__
Would that work for you?
Koji
-Original
If you can compile the matlab code to an executable with the matlab
compiler and send it to the nodes with the distributed cache, that
should work... You probably want to avoid licensing fees for running
copies of matlab itself on the cluster.
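As a rough sketch of the distributed-cache part with the old API (the HDFS
path and class name are hypothetical, not from this thread):

import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class MatlabJobSetup {
  public static JobConf createConf() throws Exception {
    JobConf conf = new JobConf(MatlabJobSetup.class);
    // Ship the compiled executable (already copied into HDFS) to every node.
    // The fragment after '#' becomes a symlink in each task's working directory.
    DistributedCache.addCacheFile(
        new URI("/user/st/bin/analyze_image#analyze_image"), conf);
    DistributedCache.createSymlink(conf);
    return conf;
  }
}

Each task can then invoke ./analyze_image from its working directory.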
Sent from my iPhone
On Apr 21, 2009, at 1:55
On Tue, Apr 21, 2009 at 12:06 PM, Todd Lipcon t...@cloudera.com wrote:
Would dfs -cat do what you need? e.g:
./bin/hdfs dfs -cat /path/to/output/ExceptionDocuments-m-\*
/tmp/exceptions-merged
Yes, that would work. Thanks for the suggestion.
On Tue, Apr 21, 2009 at 1:00 PM, Koji Noguchi knogu...@yahoo-inc.com wrote:
I once used MultipleOutputFormat and created
(mapred.work.output.dir)/type1/part-_
(mapred.work.output.dir)/type2/part-_
...
And JobTracker took care of the renaming to
Resending the query with a different subject -
Was: FileSystem.listStatus() doesn't return list of files in hdfs directory
I have a single-node hadoop cluster. The hadoop version -
[patn...@ac4-dev-ims-211]~/dev/hadoop/hadoop-0.19.1% hadoop version
Hadoop 0.19.1
Subversion
Hi.
I have quite a strange issue where one of my datanodes
rejects any blocks with error messages.
I looked in the datanode logs, and found the following error:
2009-04-21 16:59:19,092 ERROR org.apache.hadoop.dfs.DataNode:
DatanodeRegistration(192.168.253.20:50010,
You can use any of these:
1. bin/hadoop dfs -get hdfsfile remote filename
2. Thrift API : http://wiki.apache.org/hadoop/HDFS-APIs
3. use fuse-mount to mount HDFS as a regular file system on a remote machine:
http://wiki.apache.org/hadoop/MountableHDFS
thanks,
dhruba
On Mon, Apr 20, 2009 at
I set my mapred.tasktracker.map.tasks.maximum to 10, but when I run a
task, it's only using 2 out of 10. Any way to know why it's only using 2?
thanks
Hi again.
Other tools, like the balancer or web browsing from the namenode, don't work
either, because other nodes complain about not being able to reach the
offending node as well.
I even tried netcat'ing the IP/port from another node - and it successfully
connected.
Any advice on this No route to
Something in the lines of
... class MyOutputFormat extends MultipleTextOutputFormat<Text, Text> {
    protected String generateFileNameForKeyValue(Text key,
        Text v, String name) {
      Path outpath = new Path(key.toString(), name);
      return
It's probably a silly question, but you do have more than 2 mappers on
your second job?
If yes, I have no idea what's happening.
Koji
-Original Message-
From: javateck javateck [mailto:javat...@gmail.com]
Sent: Tuesday, April 21, 2009 1:38 PM
To: core-user@hadoop.apache.org
Subject:
Hi Koji,
Thanks for helping.
I don't know why hadoop is just using 2 out of 10 map task slots.
Sure, I just cut and pasted from the job tracker web UI; clearly I set the max
tasks to 10 (which I can verify from hadoop-site.xml and from the individual
job configuration as well), and I did have the first
is your input data compressed? if so then you will get one mapper per file
Miles
2009/4/21 javateck javateck javat...@gmail.com:
Hi Koji,
Thanks for helping.
I don't know why hadoop is just using 2 out of 10 map tasks slots.
Sure, I just cut and paste the job tracker web UI, clearly I
No, it's a plain text file, tab-delimited. And I'm expecting one mapper per
file, because I have 175 files, and I got 189 map tasks from what I can see
in the web UI. My issue is that since I have 189 map tasks waiting, why is
hadoop just using 2 of my 10 map slots? And I assume that all map
I want to clarify something: for the max task slots, are these the
places to check:
1. hadoop-site.xml
2. the specific job's job.conf, which can be retrieved through the job, for
example, logs/job_200904212336_0002_conf.xml
Any other place that limits the map task count?
In my case, it's
Very naively looking at the code, the exception you see is happening in the
write path, on the way to sending a copy of your data to a second data
node. One data node is pipelining the data to another, and that connection
is failing. The fact that DatanodeRegistration is mentioned in the
Hi, What is the input data?
According to my understanding, you have a lot of images and want to
process all images using your matlab script. Then, you should write
some code yourself. I did a similar thing for plotting graphs with
gnuplot. However, if you want to do large-scale linear algebra
Anyone know why setting *mapred.tasktracker.map.tasks.maximum* is not working?
I set it to 10, but I still see only 2 map tasks running when running one job.
They are the places to check. A job can itself override the number
of mappers and reducers. For example, using streaming, I often state
the number of mappers and reducers I want to use:
-jobconf mapred.reduce.tasks=30
This would tell hadoop to use 30 reducers, for example.
If you don't have
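If you're driving the job from Java instead of streaming, the equivalent is
roughly this (the driver class name is a placeholder):

import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(MyDriver.class);
conf.setNumReduceTasks(30);  // same effect as -jobconf mapred.reduce.tasks=30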
Hi Edward,
Yes, we're building this for handling hundreds of thousands of images (at
least). We're thinking processing of individual images (or a set of images
together) will be done in Matlab itself. However, we can use the Hadoop
framework to process the data in a parallel fashion. One Matlab instance
Tom White wrote:
You need to start each JobControl in its own thread so they can run
concurrently. Something like:
Thread t = new Thread(jobControl);
t.start();
Then poll the jobControl.allFinished() method.
Tom
On Tue, Apr 21, 2009 at 10:02 AM, nguyenhuynh.mr
Hi,
Where will you store the images? How will you retrieve the images?
If you have metadata for the images, the map task can receive the
'filename' of an image as a key, and file properties (host, file path,
etc.) as its value. Then, I guess you can handle the matlab process
using a Runtime object on hadoop
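A very rough sketch of that idea with the old API, just to make it concrete;
the wrapper script name and the assumption that each input record is one image
filename are mine, not the poster's:

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class MatlabMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable offset, Text imagePath,
      OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // Shell out to a (hypothetical) wrapper around the compiled matlab code
    Process p = Runtime.getRuntime().exec(
        new String[] {"./run_matlab.sh", imagePath.toString()});
    try {
      int rc = p.waitFor();
      output.collect(imagePath, new Text("exit=" + rc));
    } catch (InterruptedException e) {
      throw new IOException("matlab process interrupted: " + e);
    }
  }
}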
There will be a short summary of the hadoop aggregation tools in ch08; it
got missed in the first pass through and is being added back in this week.
There are a number of howtos in the book, particularly in ch08 and ch09.
I hope you enjoy them.
On Tue, Apr 21, 2009 at 8:24 AM, Edward Capriolo
There is no reason to use a combiner in this case, as there is only a single
output record from the map.
Combiners buy you data reduction when you have output values in your map
that share keys, and your application allows you to do something with the
values that results in smaller/fewer records
There must be only 2 input splits being produced for your job.
Either you have 2 unsplittable files, or the input file(s) you have are not
large enough compared to the block size to be split.
Table 6-1 in chapter 06 gives a breakdown of all of the configuration
parameters that affect split size in
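If the files are splittable but smaller than a block, the usual knob is the
map-count hint, which lowers the goal split size (a sketch; the driver class
name is a placeholder):

import org.apache.hadoop.mapred.JobConf;

JobConf conf = new JobConf(MyDriver.class);
// A hint only: FileInputFormat divides the total input size by this number
// when computing the goal split size, so a larger value gives smaller splits.
conf.setNumMapTasks(10);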
For reasons that I have never bothered to investigate, I have never had a
cluster work when hadoop.tmp.dir was not identical on all of the nodes.
My solution has always been to just make a symbolic link so that
hadoop.tmp.dir was identical and on the machine in question really ended up
in the
Hey Jason,
We've never had the hadoop.tmp.dir identical on all our nodes.
Brian
On Apr 22, 2009, at 10:54 AM, jason hadoop wrote:
For reasons that I have never bothered to investigate I have never
had a
cluster work when the hadoop.tmp.dir was not identical on all of the
nodes.
My
Hi there,
I set up a small cluster for testing. When I start my cluster on my master
node, I have to type the password for starting each datanode and
tasktracker. That's pretty annoying and may be hard to handle when the
cluster grows. Any graceful way to handle this?
Best,
Arber
On Wed, Apr 22, 2009 at 9:26 AM, Yabo-Arber Xu arber.resea...@gmail.com wrote:
Hi there,
I set up a small cluster for testing. When I start my cluster on my master
node, I have to type the password for starting each datanode and
tasktracker. That's pretty annoying and may be hard to handle
I would recommend installing the Hadoop RPMs and avoiding the start-all scripts
altogether. The RPMs ship with init scripts, allowing you to start and
stop daemons with /sbin/service (or with a configuration management tool,
which I assume you'll be using as your cluster grows). Here's more info
Arber,
A. You first have to set up authorization keys
1. Execute the following command to generate keys: ssh-keygen
2. When prompted for filenames and pass phrases press ENTER to accept
default values.
3. After the command has finished generating keys, enter the following
command to change into
Thanks for all your help, especially Asteem's detailed instructions. It works
now!
Alex: I did not use RPMs, but several of my existing nodes have Ubuntu
installed. Is there any difference in running Hadoop on Ubuntu? I am thinking
of choosing one before I start scaling up the cluster, but not