RE: problem with IdentityMapper

2008-01-10 Thread Joydeep Sen Sarma
class). Mike Forrest wrote: I'm using Text for the keys and MapWritable for the values. Joydeep Sen Sarma wrote: > what are the key value types in the

RE: problem with IdentityMapper

2008-01-10 Thread Joydeep Sen Sarma
what are the key value types in the SequenceFile? seems that the MapRunner calls createKey and createValue just once. so if the value serializes out everything it has accumulated in memory (and not just what it last read) - it would cause this problem. (I have periodically shot myself in the foot with this b
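A minimal sketch of the reuse pitfall described here, against the old org.apache.hadoop.mapred API. The class name and the buffering logic are illustrative, not code from this thread; the point is only that the framework hands the same value instance to every call, so anything retained must be copied.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical mapper that buffers values: because the runner reuses one
// MapWritable instance for every record, a stored reference must be a copy.
public class BufferingMapper extends MapReduceBase
    implements Mapper<Text, MapWritable, Text, MapWritable> {

  private final List<MapWritable> buffered = new ArrayList<MapWritable>();

  public void map(Text key, MapWritable value,
                  OutputCollector<Text, MapWritable> out, Reporter reporter)
      throws IOException {
    // Wrong: buffered.add(value) would store many references to one object.
    buffered.add(new MapWritable(value));   // copy before keeping a reference
    out.collect(key, value);                // emitting immediately is fine
  }
}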

RE: Question on running simultaneous jobs

2008-01-10 Thread Joydeep Sen Sarma
Joydeep Sen Sarma wrote: > being paged out is sad - but the worst case is still no worse than killing the job (where all the data has to be *recomputed* back into memory on restart - not just swapped in from disk) In my experience, once a large process is paged out

RE: Question on running simultaneous jobs

2008-01-10 Thread Joydeep Sen Sarma
are blessed to be in this state). Doug Cutting wrote, quoting Joydeep Sen Sarma: > can we suspend jobs (just unix susp

RE: Question on running simultaneous jobs

2008-01-10 Thread Joydeep Sen Sarma
can we suspend jobs (just unix suspend) instead of killing them? if we can - perhaps we don't even have to bother delaying the use of additional slots beyond the limit.

RE: Question on running simultaneous jobs

2008-01-10 Thread Joydeep Sen Sarma
this may be simple - but is this the right solution? (and i have the same concern about HOD) if the cluster is unused - why restrict parallelism? if someone's willing to wake up at 4am to beat the crowd - they would just absolutely hate this.

RE: Question on running simultaneous jobs

2008-01-09 Thread Joydeep Sen Sarma
> that can run (per job) at any given time. not possible afaik - but i will be happy to hear otherwise. priorities are a good substitute though. there's no point needlessly restricting concurrency if there's nothing else to run. if there is something else more important to run - then in most
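For reference, a hedged sketch of submitting a job at a higher priority from a driver class. The property name and its allowed values are assumptions to verify against the release in use; the class name and elided job setup are placeholders.

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class HighPriorityJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(HighPriorityJob.class);
    // Assumed property name and values (VERY_LOW, LOW, NORMAL, HIGH, VERY_HIGH);
    // check the Hadoop version you run before relying on this.
    conf.set("mapred.job.priority", "HIGH");
    // ... set mapper/reducer classes and input/output paths here ...
    JobClient.runJob(conf);
  }
}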

RE: Limit the space used by hadoop on a slave node

2008-01-08 Thread Joydeep Sen Sarma
>> I think I have seen related bad behavior on 15.1. >> On 1/8/08 11:49 AM, "Hairong Kuang" <[EMAIL PROTECTED]> wrote: >>> Has anybody tried 15.0? Please check https://issue

RE: Limit the space used by hadoop on a slave node

2008-01-08 Thread Joydeep Sen Sarma
at least up until 14.4, these options are broken. see https://issues.apache.org/jira/browse/HADOOP-2549 (there's a trivial patch - but i am still testing).

RE: missing VERSION files leading to failed datanodes

2008-01-08 Thread Joydeep Sen Sarma
Block files are not needed of course. In any case I am interested in how it happened and why automatic recovery is not happening. Do you have any log messages from the time the data-node first failed? Was it upgrading at that time? Any information would be useful. Thank you, --Konstantin Joy

RE: missing VERSION files leading to failed datanodes

2008-01-08 Thread Joydeep Sen Sarma
never mind. the storageID is logged in the namenode logs. i am able to restore the VERSION files and add the datanodes back. phew.

RE: missing VERSION files leading to failed datanodes

2008-01-08 Thread Joydeep Sen Sarma
well - at least i know why this happened. (still looking for a way to restor

RE: missing VERSION files leading to failed datanodes

2008-01-08 Thread Joydeep Sen Sarma
first partition is always selected. The free space parameters do not appear to be honored in any case. The good news is that aggressive rebalancing seems to put things in the right place. On 1/8/08 9:34 AM, "Joydeep Sen Sarma" <[EMAIL PROTECTED]> wrote: > well - at least i know wh

RE: missing VERSION files leading to failed datanodes

2008-01-08 Thread Joydeep Sen Sarma
(DataStorage.java:146) at org.apache.hadoop.dfs.DataNode.startDataNode(DataNode.java:243)

missing VERSION files leading to failed datanodes

2008-01-08 Thread Joydeep Sen Sarma
2008-01-08 08:36:20,045 ERROR org.apache.hadoop.dfs.DataNode: org.apache.hadoop.dfs.InconsistentFSStateException: Directory /var/hadoop/tmp/dfs/data is in an inconsistent state: file VERSION is invalid. [EMAIL PROTECTED] data]# ssh hadoop003.sf2p cat /var/hadoop/tmp/dfs/data/current/VERSION [

RE: Is there an rsyncd for HDFS

2008-01-02 Thread Joydeep Sen Sarma
hdfs doesn't allow random overwrites or appends. so even if hdfs were mountable - i am guessing we couldn't just do an rsync to a dfs mount (never looked at rsync code - but assuming it does appends/random-writes). any emulation of rsync would end up having to delete and recreate changed files in
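A rough sketch of the delete-and-recreate emulation being described, written against the FileSystem client API. The length-only change check, class name and argument handling are illustrative simplifications, not a real rsync equivalent.

import java.io.File;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CrudeHdfsMirror {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    File src = new File(args[0]);            // local (e.g. NFS-mounted) file
    Path dst = new Path(args[1]);            // destination path in HDFS

    boolean changed = !fs.exists(dst)
        || fs.getFileStatus(dst).getLen() != src.length();

    if (changed) {
      if (fs.exists(dst)) {
        fs.delete(dst, false);               // no in-place overwrite: remove first
      }
      fs.copyFromLocalFile(new Path(src.getAbsolutePath()), dst);
    }
  }
}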

RE: Appropriate use of Hadoop for non-map/reduce tasks?

2007-12-25 Thread Joydeep Sen Sarma
in many cases - long running tasks are of low cpu util. i have trouble imagining how these can mix well with cpu intensive short/batch tasks. afaik - hadoop's job scheduling is not resource usage aware. long background tasks would consume per-machine task slots that would block out other tasks f

RE: DFS Block Allocation

2007-12-20 Thread Joydeep Sen Sarma
i presume you meant that the act of 'mounting' itself is not bad - but letting the entire cluster start reading from a hapless filer is :-) i have actually found it very useful to upload files through map-reduce. we have periodic jobs that are in effect tailing nfs files and copying data to hdfs

RE: Some Doubts of hadoop functionality

2007-12-20 Thread Joydeep Sen Sarma
agreed - i think anyone who is thinking of using hadoop as a place from which data is served has to be disturbed by the lack of data protection. replication in hadoop provides protection against hardware failures. not software failures. backups (and depending on how they are implemented - s

RE: Re: point in time snapshot

2007-12-19 Thread Joydeep Sen Sarma
This is one of the hottest issues for us as well. Only thing we have been able to think of is running on Solaris+ZFS and doing coordinated snapshots of the underlying filesystems. Not sure if even that would work (since the membership of the cluster can change from the time the snapshot was taken to

RE: advanced map/reduce tutorials?

2007-12-14 Thread Joydeep Sen Sarma
brute force: let the input be splittable. in each map job, open the original file and for each line in the split, iterate over all preceding lines in the input file. this will at least get you the parallelism. but a better approach would be to try and cast your problem as a sorting/grouping problem. d

RE: finalize upgrade

2007-12-12 Thread Joydeep Sen Sarma
it consumes real space though. we were disk full on the drive hosting control/tmp data and got the space back once the finalizeUpgrade finished.

RE: finalizeUpgrade on 0.14.4

2007-12-11 Thread Joydeep Sen Sarma
Never mind - things did clean up eventually. Would be nice if it logged something to the log file.

finalizeUpgrade on 0.14.4

2007-12-11 Thread Joydeep Sen Sarma
Hi folks, The finalizeUpgrade command does not seem to be doing anything for us. Some drives of the datanodes are 100% disk full due to the 'previous' folder. Namenode logs say: 2007-12-11 14:39:48,158 INFO org.apache.hadoop.dfs.Storage: Finalizing upgrade for storage directory /var/hadoop

RE: Mapper Out of Memory

2007-12-06 Thread Joydeep Sen Sarma
Can control the heap size using the 'mapred.child.java.opts' option. Check your program logic though. Personal experience is that running out of heap space in a map task usually suggests some runaway logic somewhere.
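A minimal sketch of setting the child JVM heap from a job driver; the 512 MB figure and the class name are placeholders - pick a value that fits the number of task slots per machine.

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class BiggerHeapJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(BiggerHeapJob.class);
    // Passed to each map/reduce child JVM when it is launched.
    conf.set("mapred.child.java.opts", "-Xmx512m");
    // ... configure mapper/reducer and input/output paths ...
    JobClient.runJob(conf);
  }
}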

RE: more on reduce copy speed

2007-12-05 Thread Joydeep Sen Sarma
there's an open bug on stalled/slow reduce copy speeds (don't have the number handy). but it's targeted for 0.16 from what i remember. we get stuck in this once in a while and killing the reduce task fixes things (admittedly, not a real solution). speculative execution might help as well - but

RE: starting merges before shuffle completion

2007-11-21 Thread Joydeep Sen Sarma
rk. Have you observed anything different? We may have a bug or 3 to fix here. Joydeep Sen Sarma wrote: >> Hi folks, I searched around JIRA and didn't find anything that resembled this. Is this something on the roadmap?

starting merges before shuffle completion

2007-11-19 Thread Joydeep Sen Sarma
Hi folks, I searched around JIRA and didn't find anything that resembled this. Is this something on the roadmap? For normal aggregations, this is never an issue. But in some cases (typically joins) - the map phase can emit a lot of data and take quite a bit of time doing it. Meanwhile the reducer

RE: Hadoop 0.15.0 - Reporter issue w/ timing out

2007-11-10 Thread Joydeep Sen Sarma
Did anyone consider the impact of making such a change on existing applications? Curious how it didn't fail any regression test? (the pattern that is reported to be broken is so common). (I suffer from upgrade-phobia and this doesn't help.)
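For context, a hedged sketch of the reporting pattern this thread is about: a reduce that does long stretches of work between output calls and pings the Reporter so the tasktracker does not declare it dead. All names and the per-100k heartbeat interval are illustrative.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SlowAggregatingReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> out, Reporter reporter)
      throws IOException {
    long sum = 0;
    long seen = 0;
    while (values.hasNext()) {
      sum += values.next().get();
      if (++seen % 100000 == 0) {
        reporter.progress();                                  // keep-alive
        reporter.setStatus("processed " + seen + " values for " + key);
      }
    }
    out.collect(key, new IntWritable((int) sum));
  }
}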

RE: Tech Talk: Dryad

2007-11-09 Thread Joydeep Sen Sarma
I think we have to think harder about how to address the problems with managing errors and keeping track of too much state/rolling back etc. This field is new to me - but I do remember from grad school that checkpointing is a very relevant and researched topic in parallel computing in general (whic

RE: sort speeds under java, c++, and streaming

2007-11-08 Thread Joydeep Sen Sarma
On Nov 8, 2007, at 5:35 PM, Joydeep Sen Sarma wrote: > Doesn't the sorting and merging all still happen in Java-land? Yes, that is why it surprised me. -- Owen

RE: sort speeds under java, c++, and streaming

2007-11-08 Thread Joydeep Sen Sarma
Doesn't the sorting and merging all still happen in Java-land? Owen O'Malley wrote: I set up a little benchmark on a

RE: performance of multiple map-reduce operations

2007-11-06 Thread Joydeep Sen Sarma
On 11/6/07, Doug Cutting <[EMAIL PROTECTED]> wrote: > Joydeep Sen Sarma wrote: > > One of the controversies is whether in the presence of failures, this makes performance worse rather than better (kind of li

RE: performance of multiple map-reduce operations

2007-11-06 Thread Joydeep Sen Sarma
This has come up a few times. There was an interesting post a while back on a prototype to chain map-reduce jobs together - which is what you are looking for really. See: http://www.mail-archive.com/hadoop-user@lucene.apache.org/msg02773.html curious how mature this prototype is and any plans to in
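Until something like that prototype is integrated, the plain approach is to run the jobs back to back and hand the first job's output directory to the second. A hedged sketch; class names, paths and the elided job setup are placeholders.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class TwoPassDriver {
  public static void main(String[] args) throws Exception {
    Path input = new Path(args[0]);
    Path intermediate = new Path(args[1]);   // throwaway directory between jobs
    Path output = new Path(args[2]);

    JobConf first = new JobConf(TwoPassDriver.class);
    first.setJobName("first-pass");
    FileInputFormat.setInputPaths(first, input);
    FileOutputFormat.setOutputPath(first, intermediate);
    // ... mapper/reducer for the first pass ...
    JobClient.runJob(first);                 // blocks until the job completes

    JobConf second = new JobConf(TwoPassDriver.class);
    second.setJobName("second-pass");
    FileInputFormat.setInputPaths(second, intermediate);
    FileOutputFormat.setOutputPath(second, output);
    // ... mapper/reducer for the second pass ...
    JobClient.runJob(second);
  }
}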

RE: Very weak mapred performance on small clusters with a massive amount of small files

2007-11-06 Thread Joydeep Sen Sarma
Would it help if the MultiFileInputFormat bundled files into splits based on their location? (wondering if remote copy speed is a bottleneck in the map) If you are going to access the files many times after they are generated - writing a job to bundle the data once upfront may be worthwhile.
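A hedged sketch of the 'bundle the data once upfront' step: pack a directory of small files into one SequenceFile keyed by filename, so later jobs read a few large files instead of thousands of tiny ones. Paths, the class name and the file-fits-in-memory assumption are all illustrative.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFileBundler {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path inputDir = new Path(args[0]);
    Path bundle = new Path(args[1]);

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, bundle, Text.class, BytesWritable.class);
    try {
      for (FileStatus status : fs.listStatus(inputDir)) {
        if (status.isDir()) continue;
        byte[] contents = new byte[(int) status.getLen()];
        FSDataInputStream in = fs.open(status.getPath());
        try {
          IOUtils.readFully(in, contents, 0, contents.length);
        } finally {
          in.close();
        }
        // key = filename, value = raw bytes of the small file
        writer.append(new Text(status.getPath().getName()),
                      new BytesWritable(contents));
      }
    } finally {
      writer.close();
    }
  }
}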

RE: problems reading compressed sequencefiles in streaming (0.13.1)

2007-10-26 Thread Joydeep Sen Sarma
. SequenceFileAsTextInputFormat Runping

problems reading compressed sequencefiles in streaming (0.13.1)

2007-10-26 Thread Joydeep Sen Sarma
I was hoping to use -inputformat SequenceFileAsTextInputFormat to process compressed sequencefiles in streaming jobs. However, using a python mapper that just echoes out each line as it gets it, and numreducetasks=0 - here's what the streaming job output looks like: SEQ^F org.apache.hadoop.io

repeated reduce task timeouts (false alarms)

2007-10-20 Thread Joydeep Sen Sarma
Running 0.13.1 - running into this very predictably (some tasks seem to keep timing out). The pattern is like this: - tasktracker says reduce task is not responding: 2007-10-20 18:40:28,225 INFO org.apache.hadoop.mapred.TaskTracker: task_0006_r_00_38 0.0% reduce > copy >

RE: HDFS vs. CIFS

2007-10-15 Thread Joydeep Sen Sarma
Not a valid comparison. CIFS is a remote file access protocol only. HDFS is a file system (that comes bundled with a remote file access protocol). It may be possible to build a CIFS gateway for HDFS. One interesting point of comparison at the protocol level is the level of parallelism. Compared t

RE: HBase performance

2007-10-12 Thread Joydeep Sen Sarma
As Doug pointed out - Vertica is for warehouse processing, HBase for real-time online processing. Compression of on-disk data helps in the former case since the queries scan large amounts of data and disk/bus/memory serial bandwidth bottlenecks are common. It's akin to map-reduce. Also data sizes

RE: large reduce group sizes

2007-10-11 Thread Joydeep Sen Sarma
. It isn't hard. But it also definitely isn't obvious. I can't wait to get Pig out there so we don't have to know all of this. On 10/11/07 2:32 PM, "Joydeep Sen Sarma" <[EMAIL PROTECTED]> wrote: > great! Didn't realize that the iterator was disk based

RE: large reduce group sizes

2007-10-11 Thread Joydeep Sen Sarma
The other is to write the values out to disk to do a merge sort, then read the sorted values in sequentially. It would be nice if somebody could contribute a patch. Runping

large reduce group sizes

2007-10-11 Thread Joydeep Sen Sarma
Hi all, I am facing a problem with aggregations where reduce groups are extremely large. It's a very common usage scenario - for example someone might want the equivalent of 'count(distinct e.field2) from events e group by e.field1'. the natural thing to do is emit e.field1 as the map-key a
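A hedged sketch of the usual composite-key workaround (not necessarily what this thread settled on): push field2 into the map key so the framework's sort does the de-duplication, then count lines per field1 in a cheap second pass. Field extraction and all class names are made up.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class DistinctPairJob {

  public static class PairMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, NullWritable> {
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, NullWritable> out, Reporter reporter)
        throws IOException {
      String[] fields = line.toString().split("\t");   // assumed record layout
      // composite key (field1 <tab> field2); nothing is held in memory
      out.collect(new Text(fields[0] + "\t" + fields[1]), NullWritable.get());
    }
  }

  public static class DedupReducer extends MapReduceBase
      implements Reducer<Text, NullWritable, Text, NullWritable> {
    public void reduce(Text pair, Iterator<NullWritable> values,
                       OutputCollector<Text, NullWritable> out, Reporter reporter)
        throws IOException {
      // each distinct (field1, field2) pair comes out exactly once;
      // a trivial follow-up job then counts pairs per field1
      out.collect(pair, NullWritable.get());
    }
  }
}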

RE: Multi-threaded Reduce

2007-10-04 Thread Joydeep Sen Sarma
I got the impression that the original question was about MultithreadedMapRunner.java and how something like that could be implemented in the reduce phase as well. Nguyen - your code looks alright - but there's no limit on the number of threads you would end up spawning (something that the MapRunner a
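A hedged sketch of what a bounded version might look like: a reducer that funnels per-value work through a fixed-size pool and drains it in close(). The class name, pool size and processValue() are made up; note it deliberately does not collect output from the worker threads.

import java.io.IOException;
import java.util.Iterator;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class BoundedThreadedReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private final ExecutorService pool = Executors.newFixedThreadPool(8);

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    final Text keyCopy = new Text(key);
    while (values.hasNext()) {
      final Text valueCopy = new Text(values.next());   // instances are reused
      pool.execute(new Runnable() {
        public void run() {
          processValue(keyCopy, valueCopy);
        }
      });
    }
  }

  void processValue(Text key, Text value) {
    // placeholder for the real per-value work (e.g. an external lookup)
  }

  public void close() throws IOException {
    pool.shutdown();                 // wait for workers before the task exits
    try {
      pool.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      throw new IOException("interrupted while draining worker threads");
    }
  }
}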

RE: hardware specs for hadoop nodes

2007-09-25 Thread Joydeep Sen Sarma
Am curious how folks are sizing memory for Task nodes. It didn't seem to me that either map (memory needed ~ chunk size) or reduce (memory needed ~ io.sort.mb - yahoo's benchmark sort run sets it to low hundreds) tasks consumed a lot of memory in the normal course of affairs. (there could be exce

RE: Hadoop User Get Together SF Bay Area

2007-09-24 Thread Joydeep Sen Sarma
+1. Milind Bhandarkar wrote: We are definitely interested in such an informal get-together, and depending on the interes

RE: Multiple output files, and controlling output file name...

2007-09-21 Thread Joydeep Sen Sarma
Why don't you create/write HDFS files directly from the reduce job (don't depend on the default reduce output dir/files)? Like the cases where the input is not homogeneous, this seems (at least to me) to be another common pattern (the output is not homogeneous). I have run into this when loading data into ha
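A hedged sketch of a reducer writing its own HDFS files. The configuration property, directory layout and per-key file naming are assumptions for illustration; a real version needs task-attempt-scoped names (or speculative execution turned off) so two attempts don't fight over a file.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SideFileReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private FileSystem fs;
  private Path outDir;
  private String taskId;

  public void configure(JobConf job) {
    try {
      fs = FileSystem.get(job);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
    outDir = new Path(job.get("sidefile.output.dir", "/tmp/sidefiles"));
    taskId = job.get("mapred.task.id", "unknown-task");
  }

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    // one file per (key, task attempt); an assumed layout, not a Hadoop convention
    Path file = new Path(outDir, key.toString() + "-" + taskId);
    FSDataOutputStream stream = fs.create(file);
    try {
      while (values.hasNext()) {
        stream.write(values.next().toString().getBytes("UTF-8"));
        stream.write('\n');
      }
    } finally {
      stream.close();
    }
  }
}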

RE: rack-awareness for hdfs

2007-09-18 Thread Joydeep Sen Sarma
I used to think that the notion of a rack would be useful to exploit in a rack-level combiner (aggregate data before shipping it off the rack). But apparently Google doesn't do this (at least that's what some people told me). Any thoughts on this list?

RE: ipc.client.timeout

2007-09-13 Thread Joydeep Sen Sarma
Thanks, dhruba. Joydeep Sen Sarma wrote: Learning the hard way :-) Second Ted's last mail (all the way back to Sun RPC - serv

RE: ipc.client.timeout

2007-09-13 Thread Joydeep Sen Sarma
Joydeep Sen Sarma wrote: > Quite likely it's because the namenode is also a data/task node. That doesn't sound like a "best practice"... Doug

RE: ipc.client.timeout

2007-09-13 Thread Joydeep Sen Sarma
calls. Can you tell me if a CPU bottleneck on your Namenode is causing you to encounter all these timeouts? Thanks, dhruba

RE: ipc.client.timeout

2007-09-13 Thread Joydeep Sen Sarma
gh. Setting the timeout to a large value ensures that RPCs won't time out that often and thereby potentially leads to fewer failures (e.g., a map/reduce task kills itself when it fails to invoke an RPC on the tasktracker three times in a row) and retries.

RE: Multiple HDFS paths and Multiple Masters...

2007-09-13 Thread Joydeep Sen Sarma
I used to work in the NetApp HA group - so I can explain the single drive failure stuff a bit (although the right forum is the [EMAIL PROTECTED] mailing list). The shelves are supposed to bypass failed drives (the shelf re-routes the FC loop when it detects failed drives). However, there were rare drive

RE: JOIN-type operations with Hadoop...

2007-09-13 Thread Joydeep Sen Sarma
We use the directory namespace to distinguish different types of files. Wrote a simple wrapper around TextInputFormat/SequenceFileInputFormat - such that the key returned is the pathname (or some component of the pathname). That way you can look at the key - and then decide what kind of record struc
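A hedged sketch of such a wrapper (illustrative, not the wrapper from this message): delegate record reading to the plain line reader but hand the mapper the file's path as the key.

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.LineRecordReader;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class PathKeyedTextInputFormat extends FileInputFormat<Text, Text> {

  public RecordReader<Text, Text> getRecordReader(
      InputSplit split, JobConf job, Reporter reporter) throws IOException {
    final FileSplit fileSplit = (FileSplit) split;
    final LineRecordReader lines = new LineRecordReader(job, fileSplit);
    final Text pathKey = new Text(fileSplit.getPath().toString());

    return new RecordReader<Text, Text>() {
      private final LongWritable offset = lines.createKey();

      public boolean next(Text key, Text value) throws IOException {
        if (!lines.next(offset, value)) {
          return false;
        }
        key.set(pathKey);            // the key is the pathname, not the offset
        return true;
      }
      public Text createKey() { return new Text(); }
      public Text createValue() { return new Text(); }
      public long getPos() throws IOException { return lines.getPos(); }
      public float getProgress() throws IOException { return lines.getProgress(); }
      public void close() throws IOException { lines.close(); }
    };
  }
}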

RE: Overhead of Java?

2007-09-06 Thread Joydeep Sen Sarma
Came across an interesting site on this topic: http://shootout.alioth.debian.org/ There's a 'sum-file' benchmark - that's probably what a lot of people do in hadoop. The difference across different processors is interesting.

RE: Sort benchmark on 2000 nodes

2007-09-05 Thread Joydeep Sen Sarma
It would be very useful to see the hadoop/job config settings and get some sense of the underlying hardware config.

ipc.client.timeout

2007-09-04 Thread Joydeep Sen Sarma
The default is set to 60s. Many of my dfs -put commands would seem to hang - and lowering the timeout (to 1s) seems to have made things a whole lot better. General curiosity - isn't 60s just huge for an rpc timeout? (a web search indicates that nutch may be setting it to 10s - and even that see
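For reference, a minimal sketch of overriding the timeout from client code before doing a put; the 10s value is only illustrative (the thread mentions nutch reportedly using 10s), and normally this would live in hadoop-site.xml rather than in code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LowTimeoutPut {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("ipc.client.timeout", 10000);   // milliseconds
    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path(args[0]), new Path(args[1]));
  }
}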

RE: Re: Compression using Hadoop...

2007-08-31 Thread Joydeep Sen Sarma
Isn't that what the distcp script does? Thanks, Stu. Joydeep Sen Sarma wrote: One thing I had done to speed up copy/put speeds was write a s

Re: Compression using Hadoop...

2007-08-31 Thread Joydeep Sen Sarma
One thing I had done to speed up copy/put speeds was write a simple map-reduce job to do parallel copies of files from an input directory (in our case the input directory is nfs mounted from all task nodes). It gives us a huge speed-bump. It's trivial to roll your own - but would be happy to share as
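A hedged sketch of the core of such a job: a map-only pass whose input is a text file listing source paths visible on every task node (e.g. an NFS mount), where each map call copies one file into HDFS. The destination property, class name and output records are made up for illustration.

import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ParallelPutMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, NullWritable, Text> {

  private FileSystem fs;
  private Path destDir;

  public void configure(JobConf job) {
    try {
      fs = FileSystem.get(job);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
    destDir = new Path(job.get("parallelput.dest.dir", "/data/incoming"));
  }

  public void map(LongWritable offset, Text line,
                  OutputCollector<NullWritable, Text> out, Reporter reporter)
      throws IOException {
    Path src = new Path(line.toString().trim());     // local / NFS-visible path
    Path dst = new Path(destDir, src.getName());
    reporter.setStatus("copying " + src);
    fs.copyFromLocalFile(src, dst);
    out.collect(NullWritable.get(), new Text("copied " + src + " -> " + dst));
  }
}

Run with zero reduces, and with speculative execution for maps turned off so a file is not copied twice concurrently.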

RE: looking for some help with pig syntax

2007-08-29 Thread Joydeep Sen Sarma
ss elegant imho.) Ted Dunning wrote: How do you c

RE: looking for some help with pig syntax

2007-08-28 Thread Joydeep Sen Sarma
> d = foreach d generate group, c.b::$4; -- <1, {,,}> where <> represents a tuple and {} a bag. I'm not 100% sure of the syntax c.b::$4 for d, you may have to fiddle with that to get it right. Alan. Joydeep Sen Sarma wrote: > Will it? > Trying an exampl

RE: secondary namenode errors

2007-08-28 Thread Joydeep Sen Sarma
Could you please describe what exactly the problem with the upgrade is. If a malfunctioning secondary name-node messes up the image and/or edits files then we should fix the problem asap. Thanks, Konstantin. Joydeep Sen Sarma wrote: >Ju

RE: looking for some help with pig syntax

2007-08-28 Thread Joydeep Sen Sarma
t1 = load table1 as id, listOfId; t2 = load table2 as id, f1; t1a = foreach t1 generate flatten(listOfId); -- flattens the listOfId into a set of ids b = join t1a by $0, t2 by id; -- join the two together. c = foreach b generate t2.id, t2.f1; -- project just the ids and f1 entries. Alan. Joydeep Sen Sarma wrote: > Speci

looking for some help with pig syntax

2007-08-27 Thread Joydeep Sen Sarma
Specifically, how can we express this query: Table1 contains: id, (list of ids) Table2 contains: id, f1 Where the Table1:list is a variable-length list of foreign keys (id) into Table2. We would like to join every element of Table1:list with the corresponding Table2:id. I.e. the final output

RE: secondary namenode errors

2007-08-24 Thread Joydeep Sen Sarma
I wish I had read the bug more carefully - thought that the issue was fixed in 0.13.1. Of course not, the issue persis

RE: secondary namenode errors

2007-08-24 Thread Joydeep Sen Sarma
https://issues.apache.org/jira/browse/HADOOP-1076 In any case, as Raghu suggested, please use 0.13.1 and not 0.13. Koji. Raghu Angadi wrote: > Joydeep Sen Sarma wrote: >> Thanks for replying. Can you please clarify - is it the case that the secondary namenode stuff only works in 0.13.1

RE: Poly-reduce?

2007-08-24 Thread Joydeep Sen Sarma
Would be cool to get an option to reduce the replication factor for reduce outputs. Hard to buy the argument that there's gonna be no performance win with direct streaming between jobs. Currently reduce jobs start reading map outputs before all maps are complete - and I am sure this results in signifi

RE: secondary namenode errors

2007-08-24 Thread Joydeep Sen Sarma
y namenode actually works, then it will result in all the replications being set to 1. Raghu. Joydeep Sen Sarma wrote: > Hi folks, would be grateful if someone can help us understand why our secondary namenodes don't seem to be doing anything:

secondary namenode errors

2007-08-23 Thread Joydeep Sen Sarma
Hi folks, Would be grateful if someone can help us understand why our secondary namenodes don't seem to be doing anything: 1. running 0.13.0 2. secondary namenode logs continuously spew: at org.apache.hadoop.ipc.Client.call(Client.java:471) at org.apache.hadoo

RE: Poly-reduce?

2007-08-23 Thread Joydeep Sen Sarma
Completely agree. We are seeing the same pattern - need a series of map-reduce jobs for most stuff. There are a few different alternatives that may help: 1. The output of the intermediate reduce phases can be written to files that are not replicated. Not sure whether we can do this through map-red
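A hedged sketch of alternative 1: lower the replication factor for an intermediate job's output so the extra copies are never written. Setting dfs.replication in the job's configuration should apply to the files its tasks create, but treat that as an assumption to verify; replication 1 also means one lost disk forces a recompute.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class IntermediateJobDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(IntermediateJobDriver.class);
    conf.setJobName("intermediate-pass");
    conf.setInt("dfs.replication", 1);     // output blocks get a single replica
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    // ... mapper/reducer for this pass ...
    JobClient.runJob(conf);
  }
}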

RE: missing combiner output

2007-08-21 Thread Joydeep Sen Sarma
Ah - never mind - the 'combiner output record' metric reported by mapred is lying. The reduce job does see all the records. (I guess this is a bug)

missing combiner output

2007-08-21 Thread Joydeep Sen Sarma
Hi folks, I am a little puzzled by what looks to me like records that I am emitting from my combiner but that are not showing up under 'combine output records' (and seem to be disappearing). Here's some evidence: Mapred says: Combine input records 230,803,567 Combine output rec

RE: data nodes imbalanced

2007-08-17 Thread Joydeep Sen Sarma
Joydeep Sen Sarma wrote: > The only thing is that I have often done 'dfs -put' of large files (from an NFS mount) from this node. Would this cause local storage to be allocated by HDFS? Yes. The local node stores one copy as long as it has space, if it is also part of the cluster. Raghu.

data nodes imbalanced

2007-08-17 Thread Joydeep Sen Sarma
Hi folks, We had a weird thing where one of our data nodes was 100% disk full (all hdfs data) and the other nodes were uniformly at 20% space utilization. Just wondering if this is a bug or whether we made an operator error. The node in question is not one of the primary or secondary namenodes. It'

RE: Specifying external jars in the classpath for Hadoop

2007-08-14 Thread Joydeep Sen Sarma
i found that depositing the required jars into the lib directory works just great (all those jars are prepended to the classpath by the hadoop script). Any flaws in doing it this way?

RE: extremely slow reduce jobs

2007-08-07 Thread Joydeep Sen Sarma
p-dev/200702.mbox/% [EMAIL PROTECTED] Thanks, Stu. Joydeep Sen Sarma wrote: I have a fairly simple job with a map, a local combiner and a reduc

extremely slow reduce jobs

2007-08-03 Thread Joydeep Sen Sarma
I have a fairly simple job with a map, a local combiner and a reduce. The combiner and the reduce do the equivalent of a group_concat (mysql). I have horrible performance in the reduce stage: - the map jobs are done - all the reduce jobs claim they are copying data - but the copy rate is abysma