Hadoop-2438

2008-01-22 Thread Miles Osborne
Has there been any progress / a work-around for this? Currently I'm experimenting with Streaming and I've encountered what looks like the same problem as described here: https://issues.apache.org/jira/browse/HADOOP-2438 So, I get much the same errors (see below). For this particular task, when

Re: Hadoop-2438

2008-01-22 Thread Miles Osborne
se unloved boxes) Miles On 22/01/2008, Arun C Murthy <[EMAIL PROTECTED]> wrote: > > > On Jan 22, 2008, at 6:26 AM, Miles Osborne wrote: > > > Has there been any progress / a work-around for this? > > > > Currently I'm experimenting with Streaming and I'

Re: Hadoop-2438

2008-01-22 Thread Miles Osborne
that the mapper > died. > > best thing to do is find the stderr log for the task (from the > jobtracker ui) > > and find if the mapper left something there before dying. > > > > > > if streaming gurus are reading this - i am curious about one unrelated > thing

Re: Hadoop-2438

2008-01-22 Thread Miles Osborne
There are machine-learning papers dealing with Map Reduce proper, eg: *Map-Reduce for Machine Learning on Multicore*. Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary Bradski, Andrew Y. Ng and Kunle Olukotun. In *NIPS 19*, 2007. [ps

Re: Hadoop-2438

2008-01-27 Thread Miles Osborne
educe. > Collaborative Filtering in particular. If there are some lists/groups/publications related to this subject I will appreciate any pointers. > Sincerely, > Vadim

Re: Hadoop future?

2008-02-01 Thread Miles Osborne
If MS buy Y!, then the natural thing would be for Google to take over the lead. Miles On 01/02/2008, Raghu Angadi <[EMAIL PROTECTED]> wrote: > > > A little bit of speculation is ok I guess. It was good to see what Doug > said above. > > I am obviously concerned about my favorite software both as

Re: Hadoop future?

2008-02-01 Thread Miles Osborne
le of this with Google supporting some US Universities using Hadoop: http://radar.oreilly.com/archives/2007/10/google_ibm_give.html Miles On 01/02/2008, Raghu Angadi <[EMAIL PROTECTED]> wrote: > > Miles Osborne wrote: > > If MS buy Y!, then the natural thing would be for Google to t

Re: hadoop: how to find top N frequently occurring words

2008-02-04 Thread Miles Osborne
This is exactly the same as word counting, except that you have a second pass to find the top n per block of data (this can be done in a mapper) and then a reducer can quite easily merge the results together. This wouldn't be homework, would it? Miles On 04/02/2008, Tarandeep Singh <[EMAIL PROTE
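
A minimal Java sketch of the two-pass idea (old mapred API; the class and the cut-off N are mine, and the per-block cut-off makes the merged answer approximate unless n is chosen generously):

  import java.io.IOException;
  import java.util.*;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.*;

  public class TopNMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, LongWritable> {

    private static final int N = 100;                       // assumed cut-off
    private final Map<String, Long> counts = new HashMap<String, Long>();
    private OutputCollector<Text, LongWritable> out;

    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, LongWritable> output,
                    Reporter reporter) throws IOException {
      out = output;                                         // kept for close()
      for (String w : line.toString().split("\\s+")) {      // first pass: count
        Long c = counts.get(w);
        counts.put(w, c == null ? 1L : c + 1L);
      }
    }

    public void close() throws IOException {
      // second pass, local to this mapper: emit only its n most frequent words
      List<Map.Entry<String, Long>> byCount =
          new ArrayList<Map.Entry<String, Long>>(counts.entrySet());
      Collections.sort(byCount, new Comparator<Map.Entry<String, Long>>() {
        public int compare(Map.Entry<String, Long> a, Map.Entry<String, Long> b) {
          return b.getValue().compareTo(a.getValue());
        }
      });
      for (Map.Entry<String, Long> e :
           byCount.subList(0, Math.min(N, byCount.size()))) {
        out.collect(new Text(e.getKey()), new LongWritable(e.getValue()));
      }
    }
  }

A single reducer can then sum the partial counts per word and keep its own top n.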

Re: hadoop: how to find top N frequently occurring words

2008-02-04 Thread Miles Osborne
(for example a Bloomier Filter) and count directly in that. This would lead to some quantifiable error rate, which may be acceptable for your application. Miles On 04/02/2008, Tarandeep Singh <[EMAIL PROTECTED]> wrote: > > On Feb 4, 2008 2:11 PM, Miles Osborne <[EMAIL PROTECTED]> wr

Re: hadoop: how to find top N frequently occurring words

2008-02-04 Thread Miles Osborne
How stable is the code? I could quite easily set some undergraduate project to do something with it, for example process query logs Miles On 04/02/2008, Ted Dunning <[EMAIL PROTECTED]> wrote: > > > This is a great opportunity for me to talk about the Groovy support that I > have just gotten runn

Re: hadoop: how to find top N frequently occurring words

2008-02-04 Thread Miles Osborne
sorry, I meant Groovy Miles On 04/02/2008, Tarandeep Singh <[EMAIL PROTECTED]> wrote: > > On Feb 4, 2008 2:40 PM, Miles Osborne <[EMAIL PROTECTED]> wrote: > > How stable is the code? I could quite easily set some undergraduate > project > > to do something w

Re: key/value after reduce

2008-02-12 Thread Miles Osborne
You may well have another Map operation operate over the Reducer output, in which case you'd want key-value pairs. Miles On 12/02/2008, Yuri Pradkin <[EMAIL PROTECTED]> wrote: > > Hi, > > I'm relatively new to Hadoop and I have what I hope is a simple > question: > > I don't understand why the ke

Re: key/value after reduce

2008-02-12 Thread Miles Osborne
d your point: if the reducer's > output is in a key/value format, you still can run another map over it > or another reduce, can't you? If the output isn't, you can't; it's up > to the user who coded up the Reducer. What am I missing? > > Thanks, > >

Re: Yahoo's production webmap is now on Hadoop

2008-02-19 Thread Miles Osborne
that 10k number is probably a large under-estimate; perhaps add an extra zero to get something closer. still, impressive stuff. Miles On 19/02/2008, Toby DiPasquale <[EMAIL PROTECTED]> wrote: > > On Feb 19, 2008 12:58 PM, Owen O'Malley <[EMAIL PROTECTED]> wrote: > > The link inversion and rank

Re: Add your project or company to the powered by page?

2008-02-20 Thread Miles Osborne
Please could you add this text: > At ICCS http://www.iccs.informatics.ed.ac.uk/ We are using Hadoop and Nutch to crawl Blog posts and later process them. Hadoop is also beginning to be used in our teaching and general research activities on natural language processing and machine learning. > On

Re: Problem with LibHDFS

2008-02-21 Thread Miles Osborne
Since you are compiling a C(++) program, why not add the -g switch and run it within gdb: that will tell people which line it crashes at (etc etc) Miles On 21/02/2008, Raghavendra K <[EMAIL PROTECTED]> wrote: > > Hi, > I am able to get Hadoop running and also able to compile the libhdfs. > But

Cross-data centre DFS communication?

2008-02-28 Thread Miles Osborne
Currently, we have the following setup:
--cluster A, running Nutch: small RAM per node
--cluster B, just running Hadoop: lots of RAM per node
At some point in the future we will want cluster B to talk to cluster A, and ideally this should be DFS-to-DFS. Is this possible? Or do we need to do so

Re: How can reducers start before the mappers have finished?

2008-03-03 Thread Miles Osborne
Reducers can copy the Mapper output prior to actual reducing (if you look at the GUI, you will see "copy", "sort" and actual reducing) Miles On 03/03/2008, Marc Harris <[EMAIL PROTECTED]> wrote: > > I noticed when reading http://wiki.apache.org/hadoop/HardwareBenchmarks > the following comment:

Re: What's the best way to get to a single key?

2008-03-03 Thread Miles Osborne
It should be possible to use the hash of a key to work out which shard it is present in; you would then search over all entries in the relevant shard. Miles On 03/03/2008, Xavier Stevens <[EMAIL PROTECTED]> wrote: > > I am curious how others might be solving this problem. I want to > retrieve a
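
A minimal sketch of that lookup, assuming the shards were written by a job using the default HashPartitioner (the helper itself is hypothetical; the arithmetic mirrors what assigned each key to a reduce partition, i.e. to a part-NNNNN file):

  int shardFor(Text key, int numShards) {
    // same computation as HashPartitioner.getPartition()
    return (key.hashCode() & Integer.MAX_VALUE) % numShards;
  }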

Re: Reply: clustering problem

2008-03-05 Thread Miles Osborne
Did you use exactly the same version of Hadoop on each and every node? Miles On 05/03/2008, Ved Prakash <[EMAIL PROTECTED]> wrote: > > Hi Zhang, > > Thanks for your reply, I tried this but no use. It still throws up > Incompatible build versions. > > I removed the dfs local directory on slave and

Re: Pipes task being killed

2008-03-05 Thread Miles Osborne
Is this also true for streaming? Miles On 05/03/2008, Richard Kasperski <[EMAIL PROTECTED]> wrote: > > I think you just need to write to stderr. My understanding is that > hadoop is happy as long as input is being consumed, output is being > generated or status is being generated. > > > Rahul Soo

Re: Howto?: Monitor File/Job allocation

2008-03-26 Thread Miles Osborne
From here: http://wiki.apache.org/hadoop/TaskExecutionEnvironment The following properties are localized for each task's JobConf:

mapred.job.id (String) - the job id
mapred.task.id (String) - the task id
mapred.task.is.map (boolean) - is this a map task
mapred.task.p

Re: compressed/encrypted file

2008-06-04 Thread Miles Osborne
You can compress / decompress at many points:
--prior to mapping
--after mapping
--after reducing
(I've been experimenting with all these options; we have been crawling blogs every day since Feb and we store on DFS compressed sets of posts) If your inputs to maps are compressed, then you don't
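
For the streaming case, the relevant settings look something like this (0.1x-era property names, which later releases renamed):

  -jobconf mapred.compress.map.output=true    (compress intermediate map output)
  -jobconf mapred.output.compress=true        (compress the final job output)
  -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec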

Re: compute document frequency with hadoop-streaming

2008-06-09 Thread Miles Osborne
Well, you could have one document per line and another field could easily be the filename eg name\tdocument\n name\tdocument\n etc Miles 2008/6/9 xinfan meng <[EMAIL PROTECTED]>: > In hadoop streaming, we accept input from stdin. If we want to compute the > document frequency of words, the somp

Re: does anyone have idea on how to run multiple sequential jobs with bash script

2008-06-10 Thread Miles Osborne
You have another problem in that Hadoop is still initialising --this will cause subsequent jobs to fail. I've not yet migrated to 17.0 (I still use 16.3), but all my jobs are done from nohuped scripts. If you really want to check on the running status and busy wait, you can look at the jobtracker
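
A sketch of the nohup'ed-script approach (jar and path names are made up): each bin/hadoop invocation blocks until its job finishes and exits non-zero on failure, so plain shell sequencing does the waiting for you:

  nohup sh -c '
    bin/hadoop jar job1.jar FirstPass  input  stage1 &&
    bin/hadoop jar job2.jar SecondPass stage1 output
  ' > pipeline.log 2>&1 &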

Streaming --counters question

2008-06-10 Thread Miles Osborne
Is there support for counters in streaming? In particular, it would be nice to be able to access these after a job has run. Thanks! Miles -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

Re: hadoop benchmarked, too slow to use

2008-06-10 Thread Miles Osborne
I suspect that many people are using Hadoop with a moderate number of nodes and expecting to see a win over a sequential, single node version. The result (and I've seen this too) is typically that the single node version wins hands-down. Apart from speeding-up the Hadoop job (eg via compression,

Re: hadoop benchmarked, too slow to use

2008-06-10 Thread Miles Osborne
Why not do a little experiment and see what the timing results are when using a range of reducers eg 1, 2, 5, 7, 13 Miles 2008/6/11 Elia Mazzawi <[EMAIL PROTECTED]>: > > yes there was only 1 reducer, how many should i try ? > > > > Joydeep Sen Sarma wrote: > >> how many reducers? Perhaps u are

Re: Streaming --counters question

2008-06-11 Thread Miles Osborne
great! looking forwards to 0.18 Miles 2008/6/11 Arun C Murthy <[EMAIL PROTECTED]>: > > On Jun 10, 2008, at 3:16 PM, Miles Osborne wrote: > > Is there support for counters in streaming? In particular, it would be >> nice >> to be able to access these af
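
For the record, the mechanism that shipped (0.18-era streaming, as I understand it; check the release notes for your version): a streaming task updates a counter by writing a specially formatted line to stderr, e.g. from a perl mapper:

  print STDERR "reporter:counter:MyGroup,BadRecords,1\n";

The totals then appear in the jobtracker UI alongside the built-in Java counters. (The group and counter names here are made up.)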

Re: Hadoop not stopping processes

2008-06-12 Thread Miles Osborne
(for 16.4), I've noticed that stop-all.sh sometimes doesn't work when the corresponding start-all was done as a cron job at system boot. also, if your USER differs, it won't work eg as root, with USER=root, start-all.sh needs a corresponding stop-all.sh, but also as root; if instead, you su to

Re: getting hadoop job status/progress outside of hadoop

2008-06-17 Thread Miles Osborne
try this: hadoop job -Dmapred.job.tracker=hermitage:9001 -status job_200806160820_0430 (and replace my job id with the one you want to track): > hadoop job -Dmapred.job.tracker=hermitage:9001 -status job_200806160820_0430 Job: job_200806160820_0430 file: /data/tmp/hadoop/mapred/system/job_2008

Re: getting hadoop job status/progress outside of hadoop

2008-06-17 Thread Miles Osborne
To get this from some other application rather than Hadoop, you just need to run this within a shell (I do this kind of thing within perl) Miles 2008/6/17 Miles Osborne <[EMAIL PROTECTED]>: > try this: > > hadoop job -Dmapred.job.tracker=hermitage:9001 -status > job_20080616

Re: getting hadoop job status/progress outside of hadoop

2008-06-17 Thread Miles Osborne
> current job is not stalled or failed? > Is there a way I can avoid specifying a job by the job ID? > I apologize if there's some commandline documentation I'm missing, > but the commands change a bit from point version to version. > > On Tue, Jun 17, 2008 at 1:41 PM, Mi

Re: best command line way to check up/down status of HDFS?

2008-06-27 Thread Miles Osborne
that won't work since the namenode may be down, but the secondary namenode may be up instead why not instead just look at the respective logs? Miles 2008/6/27 Meng Mao <[EMAIL PROTECTED]>: > Is running: > ps aux | grep [\\.]NameNode > > and looking for a non empty response a good way to test HD
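
Not suggested in this thread, but another command-line check is to ask the namenode directly:

  hadoop dfsadmin -report

which prints DFS capacity and the datanode list; if it errors out or hangs, HDFS is effectively down. (Available in the 0.1x releases as far as I know.)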

Re: best command line way to check up/down status of HDFS?

2008-06-27 Thread Miles Osborne
g if possible. Not that reading through a > log > is that intensive, > but it'd be cleaner if I could poll either Hadoop itself or inspect the > processes running. > > On Fri, Jun 27, 2008 at 1:23 PM, Miles Osborne <[EMAIL PROTECTED]> wrote: > > > that won't

Re: Hadoop Architecture Question: Distributed Information Retrieval

2008-07-10 Thread Miles Osborne
If you tell Hadoop to use a single reducer, it should produce a single file of output. btw, you do know about Nutch I presume? http://lucene.apache.org/nutch/ This is a distributed IR system built using Hadoop. Miles 2008/7/10 Kylie McCormick <[EMAIL PROTECTED]>: > Hello! > My name is Kylie Mc

Re: Is it possible to input two different files under same mapper

2008-07-11 Thread Miles Osborne
why not just pass the large file name as an argument to your mappers? each mapper could then access that file as it saw fit, without having to go through contortions. Miles 2008/7/11 Muhammad Ali Amer <[EMAIL PROTECTED]>: > Thanks Mori, > So far I cannot touch the large file, its just a very v

Re: can hadoop read files backwards

2008-07-18 Thread Miles Osborne
unless you have a gigantic number of items with the same id, this is straightforward. have a mapper emit items of the form: key=id, value = type,timestamp and your reducer will then see all ids that have the same value together. it is then a simple matter to process all items with the same id.
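
A minimal sketch of that mapper (old mapred API method body; the tab-separated field order is my assumption):

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    // assumed record layout: id \t type \t timestamp
    String[] f = line.toString().split("\t", 3);
    out.collect(new Text(f[0]), new Text(f[1] + "\t" + f[2]));
  }

The shuffle then groups everything with the same id in front of a single reduce() call.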

Re: Can a MapReduce task only consist of a Map step?

2008-07-21 Thread Miles Osborne
... or better still, set the number of reducers to zero Miles 2008/7/21 Christian Ulrik Søttrup <[EMAIL PROTECTED]>: > Hi, > > you can simply use the built in reducer that just copies the map output: > > conf.setReducerClass(org.apache.hadoop.mapred.lib.IdentityReducer.class); > > Cheers, > Chr

Re: Can a MapReduce task only consist of a Map step?

2008-07-21 Thread Miles Osborne
then just do what i said --set the number of reducers to zero. this should just run the mapper phase 2008/7/21 Zhou, Yunqing <[EMAIL PROTECTED]>: > since the whole data is 5TB. the Identity reducer still cost a lot of > time. > > On Mon, Jul 21, 2008 at 5:09 PM, Christian Ulrik Søttrup <[EMAIL
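
In the Java API that is a single call on the JobConf from the quoted snippet:

  conf.setNumReduceTasks(0);  // map output is written straight to HDFS: no sort, no shuffle, no reduce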

Re: newbie install

2008-07-22 Thread Miles Osborne
In the first instance make sure that all the relevant ports are actually open. I would also check that your conf files are ok. Looking at the example below, it seems that /work has a permissions problem. (Note that telnet has nothing to do with Hadoop as far as I'm aware --a better test would b

Re: How to move files from one location to another on hadoop

2008-07-30 Thread Miles Osborne
bin/hadoop dfs -mv Miles 2008/7/30 Rutuja Joshi <[EMAIL PROTECTED]>: > Hi all, > > Could anyone suggest any efficient way to move files from one location to > another on Hadoop. Please note that both the locations are on HDFS. > I tried looking for inbuilt file system APIs but couldn't find anyt
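
Usage, with hypothetical paths:

  bin/hadoop dfs -mv /user/rutuja/logs /user/rutuja/archive/logs

A DFS move is a namenode metadata operation, so no blocks are copied; it is cheap even for very large files.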

Re: reduce job did not complete in a long time

2008-08-07 Thread Miles Osborne
you should use the web UI --each mapper / reducer can be inspected and there is no need to ssh in. Miles 2008/8/7 Karl Anderson <[EMAIL PROTECTED]> > > On 28-Jul-08, at 6:33 PM, charles du wrote: > > Hi: >> >> I tried to run one of my map/reduce jobs on a cluster (hadoop 0.17.0). >> I used 10 r

Re: Hadoop on Suse

2008-08-21 Thread Miles Osborne
yes and it works out-of-the-box Miles 2008/8/21 Wasim Bari <[EMAIL PROTECTED]> > Hi, >Anyone experience with installing Hadoop or HDFS on Suse Linux? > > Thanks -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.

Re: Customize job name in command line

2008-08-22 Thread Miles Osborne
yes: -jobconf mapred.job.name is your friend Miles 2008/8/22 Kevin <[EMAIL PROTECTED]> > Hi group, > > Is it possible to customize the job name when using "bin/hadoop jar ..."? > > Best, > -Kevin > -- The University of Edinburgh is a charitable body, registered in Scotland, with registrati
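
For example, on a streaming job (everything here apart from the -jobconf flag is hypothetical):

  bin/hadoop jar hadoop-streaming.jar \
      -jobconf mapred.job.name="blog post dedup" \
      -input posts -output deduped \
      -mapper /bin/cat -reducer /usr/bin/uniq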

Re: Please help me: is there a way to "chown" in Hadoop?

2008-08-26 Thread Miles Osborne
Look at the other commands dfs provides and you will see the answer Miles 2008/8/26 Gopal Gandhi <[EMAIL PROTECTED]> > I need to change a file's owner from userA to userB. Is there such a > command? Thanks lot! > > % hadoop dfs -ls file > /user/userA/file  2008-08-25 20:00  rwxr-xr-x  user
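
Specifically:

  hadoop dfs -chown userB /user/userA/file     (add -R to recurse into directories)

(The chown/chgrp/chmod family arrived with the permissions work in the 0.16 timeframe, if memory serves.)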

Re: how use only a reducer without a mapper

2008-08-27 Thread Miles Osborne
Streaming has the ability to accept as input multiple directories, so that would enable you to merge two directories (--is this an assignment? ...) Miles 2008/8/27 Leandro Alvim <[EMAIL PROTECTED]> > Hi, I need help if it's possible. > > My name is Leandro Alvim and I'm a graduate in computer

Re: Timeouts at reduce stage

2008-08-29 Thread Miles Osborne
The problem here is that when a mapper fails, it may either be due to some bug within that mapper OR it may be due to hardware problems of one kind and another (disks getting full etc etc). if you configure hadoop to use job replication, then in either case, a failing job will get resubmitted mult

Re: Hadoop for computationally intensive tasks (no data)

2008-09-04 Thread Miles Osborne
have a look at the various machine learning applications of Map Reduce: they do lots of computations and here, the data corresponds to intermediate values being used to update counts etc. bedtime reading: Mahout: (machine learning under Hadoop) http://lucene.apache.org/mahout/ some machine lea

Re: Serving contents of large MapFiles/SequenceFiles from memory across many machines

2008-09-17 Thread Miles Osborne
hello Chris! (if you are talking about serving language models and/or phrase tables) i had a student look at using HBase for LMs this summer. i don't think it is sufficiently quick to deal with millions of queries per second, but that may be due to blunders on our part. it may be possible that

Re: Do all Mapper outputs with same key go to same Reducer?

2008-09-19 Thread Miles Osborne
> So here's my question -- does Hadoop guarantee that all records with the same key will end up in the same Reducer task? If that's true, > yes --think of the record as being sent to the task by hashing over the key Miles 2008/9/19 Stuart Sierra <[EMAIL PROTECTED]>: > Hi all, > The short versio

Re: Serving contents of large MapFiles/SequenceFiles from memory across many machines

2008-09-19 Thread Miles Osborne
the problem here is that you don't want each mapper/reducer to have a copy of the data. you want that data --which can be very large-- stored in a distributed manner over your cluster and allow random access to it during computation. (this is what HBase etc do) Miles 2008/9/19 Stuart Sierra <[E

Re: no speed-up with parallel matrix calculation

2008-09-19 Thread Miles Osborne
if each mapper only sees a relatively small chunk of the data, then why not have each one compute the counting of 2-perms in memory. you would then get the reducer to merge these partial results together. (details are left to the reader ...) Miles 2008/9/19 Sandy <[EMAIL PROTECTED]>: > Hi, > >

Re: no speed-up with parallel matrix calculation

2008-09-19 Thread Miles Osborne
to disk, and instead just store it and > place it directly as input for the second reduce? > > Thanks, > > -SM > > On Fri, Sep 19, 2008 at 3:13 PM, Miles Osborne <[EMAIL PROTECTED]> wrote: > >> if each mapper only sees a relatively small chunk of the data, then >>

Re: Could not find any valid local directory for task_200809041356_0042_r_000000_2/intermediate.9

2008-09-29 Thread Miles Osborne
check that you are not getting disk full errors Miles 2008/9/29 Elia Mazzawi <[EMAIL PROTECTED]>: > in more detail, my program is happily chugging along until the reducer fails > with that exception, then it looks like it retries and fails by itself. > the same hadoop program works fine on a subs

Re: Gets sum of all integers between map tasks

2008-10-07 Thread Miles Osborne
this is a well known problem. basically, you want to aggregate values computed at some previous step. --emit (category, probability) pairs and have the reducer simply sum up the probabilities for a given category (it is the same task as summing up the word counts) Miles 2008/10/7 Edward J. Yoon <[EMAIL PROTECTED]>:
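
A minimal sketch of that reducer (old mapred API method body; DoubleWritable values are an assumption, since the thread doesn't fix the types):

  public void reduce(Text category, Iterator<DoubleWritable> values,
                     OutputCollector<Text, DoubleWritable> out,
                     Reporter reporter) throws IOException {
    double sum = 0.0;
    while (values.hasNext()) {
      sum += values.next().get();   // one partial value per map-side emission
    }
    out.collect(category, new DoubleWritable(sum));
  }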

Re: Can I startup 2 datanodes on 1 machine?

2008-10-07 Thread Miles Osborne
you can specify multiple data directories in your conf file > dfs.data.dir Comma separated list of paths on the local filesystem of a DataNode where it should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typical
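
In conf/hadoop-site.xml that looks like the following (the paths are hypothetical):

  <property>
    <name>dfs.data.dir</name>
    <value>/disk1/dfs/data,/disk2/dfs/data</value>
  </property>

Note this gives you one datanode process spreading its blocks over several disks, which is usually what is actually wanted, rather than two datanode processes on one box.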

Re: Newbie doubt: Where are the files/directories?

2008-10-11 Thread Miles Osborne
data under Hadoop is stored as blocks and is not visible using normal Unix commands such as "ls" etc. to see your files, use hadoop dfs -ls your files will actually be stored as follows: > Specify directories for dfs.name.dir and dfs.data.dir in conf/hadoop-site.xml. These are used to hold distr

Re: Question regarding reduce tasks

2008-11-03 Thread Miles Osborne
you can't guarantee that a reducer (or mapper for that matter) will be executed exactly once unless you turn off speculative execution. but, a distinct key gets sent to a single reducer, so yes, only one reducer will see a particular key + associated values Miles 2008/11/3 Ryan LeCompte <[EMAIL

Re: Question regarding reduce tasks

2008-11-03 Thread Miles Osborne
writing the data for a > particular key to HDFS, it won't somehow get re-executed again for the > same key right? > > > On Mon, Nov 3, 2008 at 11:28 AM, Miles Osborne <[EMAIL PROTECTED]> wrote: >> you can't guarantee that a reducer (or mapper for that matter) wil

Re: reduce more than one way

2008-11-07 Thread Miles Osborne
why not just merge the two reducers 2008/11/7 Elia Mazzawi <[EMAIL PROTECTED]>: > Hello, > > I'm writing hadoop programs in Java, > I have 2 hadooop map/reduce programs that have the same map, but a different > reduce methods. > > can i run them in a way so that the map only happens once? > > maybe

Re: reading input for a map function from 2 different files?

2008-11-12 Thread Miles Osborne
unless you really care about getting exact averages etc, i would suggest simply sampling the input and computing your statistics from that --it will be a lot faster and you won't have to deal with under/overflow etc if your sample is reasonably large then your results will be pretty close to the

Re: Hadoop Error Message

2009-01-19 Thread Miles Osborne
that is a timing / space report Miles 2009/1/19 Deepak Diwakar : > Hi friends, > > could somebody tell me what does the following quoted message mean? > > " 3154.42user 76.09system 44:47.21elapsed 120%CPU (0avgtext+0avgdata > 0maxresident)k > 0inputs+0outputs (15major+6092226minor)pagefaults 0swa

Re: Finding longest path in a graph

2009-01-29 Thread Miles Osborne
this is perhaps worth watching: http://www.youtube.com/watch?v=BT-piFBP4fE&feature=channel it deals with finding the shortest path in a graph using MR. here at work i don't have audio working so i'm not 100% sure that this is the best way to do it, but it is a start. Miles 2009/1/29 Mark Kerzn

Reducers stuck in Shuffle ...

2009-01-30 Thread Miles Osborne
i've been seeing a lot of jobs where large numbers of reducers keep failing at the shuffle phase due to timeouts (see a sample reducer syslog entry below). our setup consists of 8-core machines, with one box acting as both a slave and a namenode. the load on the namenode is not at full capacity s

Re: Finding longest path in a graph

2009-02-01 Thread Miles Osborne
one thing that people seem to routinely forget is HBase --random access for MR jobs. the basic problem with MR is that mappers are independent of each other and you need to marshall the overall flow to overcome this (for example, rekey items so that they become localised within a single mapper/red

Re: Finding small subset in very large dataset

2009-02-12 Thread Miles Osborne
Bloom Filters are one of the greatest things ever, so it is nice to see another application. Remember that your filter may make mistakes - you will see items that are not in the set. Also, instead of setting a single bit per item (in the A set), set k distinct bits. You can analytically work out
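
For reference, the standard analysis of that error rate: with m bits, k hash functions and n inserted items, the false-positive probability is approximately

  p \approx \left(1 - e^{-kn/m}\right)^{k}, \qquad k_{\mathrm{opt}} = \frac{m}{n}\ln 2 \;\Rightarrow\; p \approx 0.6185^{\,m/n}

so about 10 bits per item with the optimal k gives p of roughly 1%.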

Re: Finding small subset in very large dataset

2009-02-18 Thread Miles Osborne
just re-represent the associated data as a bit vector and set of hash functions. you then just copy this around, rather than the raw items themselves. Miles 2009/2/18 Thibaut_ : > > Hi, > > The bloomfilter solution works great, but I still have to copy the data > around sometimes. > > I'm still

Re: Finding small subset in very large dataset

2009-02-18 Thread Miles Osborne
reducer read in that big file only once. As all > the keys are sorted, I can get all the needed values in one big read step > (skipping those entries I don't need). > > > Thibaut > > > > Miles Osborne wrote: >> >> just re-represent the associated data as

Re: Splittable lzo files

2009-03-03 Thread Miles Osborne
that's very interesting. for us poor souls using streaming, would we be able to use it? (right now i'm looking at a 100+ GB gzipped file ...) Miles 2009/3/3 Johan Oskarsson : > Hi, > > thought I'd pass on this blog post I just wrote about how we compress our > raw log data in Hadoop using Lzo a

Re: OT: How to search mailing list archives?

2009-03-08 Thread Miles Osborne
posts tend to get indexed by Google, so try that Miles 2009/3/8 Stuart White : > This is slightly off-topic, and I realize this question is not > specific to Hadoop, but what is the best way to search the mailing > list archives? Here's where I'm looking: > > http://mail-archives.apache.org/mod_

Re: how to preserve original line order?

2009-03-13 Thread Miles Osborne
associate with each line an identifier (eg line number) and afterwards re-sort the data by that Miles 2009/3/13 Roldano Cattoni : > The task should be simple, I want to put in uppercase all the words of a > (large) file. > > I tried the following: >  - streaming mode >  - the mapper is a perl scri

Re: Will Hadoop help for my application?

2009-03-19 Thread Miles Osborne
yes, this is perfectly fine: make each mapper one of your runs and simply emit the final result, along with the conditions leading to that result. you won't need any reducers. Miles 2009/3/19 John Bergstrom : > Hi, > > Can anyone tell me if Hadoop is appropriate for the following application. >

Re: Amazon Elastic MapReduce

2009-04-02 Thread Miles Osborne
... and only in the US Miles 2009/4/2 zhang jianfeng : > Does it support pig ? > > > On Thu, Apr 2, 2009 at 3:47 PM, Chris K Wensel wrote: > >> >> FYI >> >> Amazons new Hadoop offering: >> http://aws.amazon.com/elasticmapreduce/ >> >> And Cascading 1.0 supports it: >> http://www.cascading.org/20

Re: Hardware - please sanity check?

2009-04-02 Thread Miles Osborne
make sure you also have a fast switch, since you will be transmitting data across your network and this will come to bite you otherwise (roughly, you need one core per hadoop-related job, each mapper, task tracker etc; the per-core memory may be too small if you are doing anything memory-intensiv

Re: Checking if a streaming job failed

2009-04-02 Thread Miles Osborne
here is how i do it (in perl). hadoop streaming is actually called by a shell script, which in this case expects compressed input and produces compressed output. but you get the idea: (the mailer had messed-up the formatting somewhat) > sub runStreamingCompInCompOut { my $mapper = shift @_;

Re: No space left on device Exception

2009-04-16 Thread Miles Osborne
it may be that intermediate results are filling your disks and when the jobs crash, this all gets deleted. so it would look like you have spare space when in reality you don't. i would check on the file system as your jobs run and see if indeed they are filling-up. Miles 2009/4/16 Rakhi Khatwan

Re: mapred.tasktracker.map.tasks.maximum

2009-04-21 Thread Miles Osborne
is your input data compressed? if so then you will get one mapper per file Miles 2009/4/21 javateck javateck : > Hi Koji, > > Thanks for helping. > > I don't know why hadoop is just using 2 out of 10 map tasks slots. > > Sure, I just cut and paste the job tracker web UI, clearly I set the max >

Re: mapred.tasktracker.map.tasks.maximum

2009-04-21 Thread Miles Osborne
they are the places to check. a job can itself over-ride the number of mappers and reducers. for example, using streaming, i often state the number of mappers and reducers i want to use: -jobconf mapred.reduce.tasks=30 this would tell hadoop to use 30 reducers, for example. if you don't have

Re: TextInputFormat unique key across files

2009-05-04 Thread Miles Osborne
if you can tolerate errors then a simple idea is to generate a random number in the range 0 ... 2^n and use that as the key. if the number of lines is small relative to 2^n then with high probability you won't get the same key twice. Miles 2009/5/4 Rares Vernica : > Hello, > > TextInputFormat
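
The standard birthday-bound estimate makes "high probability" precise: with L lines and keys drawn uniformly from {0, ..., 2^n - 1},

  P(\text{collision}) \approx 1 - e^{-L^{2}/2^{n+1}}

so, for example, L = 10^6 lines with n = 64 gives a collision probability of roughly 3 \times 10^{-8}.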

Re: All keys went to single reducer in WordCount program

2009-05-07 Thread Miles Osborne
with such a small data set who knows what will happen: you are probably hitting minimal limits of some kind repeat this with more data Miles 2009/5/7 Foss User : > I have two reducers running on two different machines. I ran the > example word count program with some of my own System.out.printl

Re: hadoop performance with very small cluster

2009-05-21 Thread Miles Osborne
if you mean "hadoop does not give a speed-up compared with a sequential version" then this is because of overhead associated with running the framework: your job will need to be scheduled, JVMs instantiated, data copied, data sorted etc etc. if your jobs can be parallelised and you have enough ma