Re: hadoop on EC2

2008-06-03 Thread James Moore
On Tue, Jun 3, 2008 at 5:04 PM, Andreas Kostyrka <[EMAIL PROTECTED]> wrote: > Plus, to make it even more painful, you cannot easily run it with one simple > SOCKS server, because you need to defer DNS resolution to inside the > cluster, because VM names do resolve to external IPs, while the webs

Re: setrep

2008-06-03 Thread lohit
> It seems that setrep won't force the replication change to the specified number immediately; it changes really slowly. Just wondering if this is the expected behavior? What's the rationale for this behavior? Is there a way to speed it up? Yes, it won't force replication to be instant. Once you inc
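(For anyone who wants to drive this from code rather than the shell, a minimal sketch using the Java API; the path and the target factor are made up for the example. setReplication() only records the new target at the namenode, and the extra copies are created asynchronously, which is why the change looks slow.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SetRep {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // Returns true if the target factor was changed; datanodes then
        // copy blocks as the namenode schedules work on heartbeats.
        boolean changed = fs.setReplication(new Path("/user/haijun/data"), (short) 5);
        System.out.println("accepted: " + changed);
      }
    }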

setrep

2008-06-03 Thread Haijun Cao
It seems that setrep won't force the replication change to the specified number immediately; it changes really slowly. Just wondering if this is the expected behavior? What's the rationale for this behavior? Is there a way to speed it up? Thanks Haijun

Re: hadoop on EC2

2008-06-03 Thread Andreas Kostyrka
Well, the basic "trouble" with EC2 is that clusters usually are not networks in the TCP/IP sense. This makes it painful to decide which URLs should be resolved where. Plus, to make it even more painful, you cannot easily run it with one simple SOCKS server, because you need to defer DNS resoluti

Re: Stackoverflow

2008-06-03 Thread Andreas Kostyrka
Ok, I've tried it out; the example sort bombs exactly like streaming => http://heaven.kostyrka.org/test.log Any recommendations? Thanks, Andreas

Re: Stackoverflow

2008-06-03 Thread Andreas Kostyrka
On Tuesday 03 June 2008 22:16:05 Andreas Kostyrka wrote:
> On Tuesday 03 June 2008 21:00:49 Runping Qi wrote:
> > ${hadoop} jar hadoop-0.17-examples.jar sort -m \
> >     -r 88 \
> >     -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
> >     -outFormat org.apache.hadoop.mapred

Re: Stackoverflow

2008-06-03 Thread Andreas Kostyrka
On Tuesday 03 June 2008 21:00:49 Runping Qi wrote:
> ${hadoop} jar hadoop-0.17-examples.jar sort -m \
>     -r 88 \
>     -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
>     -outFormat org.apache.hadoop.mapred.lib.NullOutputFormat \
>     -outKey org.apache.hadoop.io.Text \
>

Re: Stackoverflow

2008-06-03 Thread Chris Douglas
Ah; you're right, of course. Sorry about that. -C

On Jun 3, 2008, at 12:00 PM, Runping Qi wrote:
Chris, your version will use LongWritable as the map output key type, which changes the nature of the job completely. You should use
${hadoop} jar hadoop-0.17-examples.jar sort -m \
    -r 88 \
    -inFo

RE: Stackoverflow

2008-06-03 Thread Runping Qi
Chris, your version will use LongWritable as the map output key type, which changes the nature of the job completely. You should use
${hadoop} jar hadoop-0.17-examples.jar sort -m \
    -r 88 \
    -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
    -outFormat org.apache.hadoop.mapred.

Re: Stackoverflow

2008-06-03 Thread Andreas Kostyrka
On Tuesday 03 June 2008 20:35:03 Chris Douglas wrote:
> >> By "not exactly small," do you mean each line is long or that there are many records?
> >
> > Well, not small in the sense that even if I could get my boss to allow me to give you the data, transferring it might be painful. (E.g.

Re: Adding new disk to DNs - FAQ #15 clarification

2008-06-03 Thread Ted Dunning
You can also play with aggressive rebalancing. If you decommission the node before adding the disk, then the namenode will make sure that you don't have any data on that machine. Then when you restore the machine, it will fill the volumes more sanely than if you start with a full partition. In m
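(A rough sketch of that decommission/restore cycle; it assumes dfs.hosts.exclude in hadoop-site.xml already points at an exclude file, and the hostname and paths here are invented:)

    echo dn07.example.com >> /etc/hadoop/exclude   # mark the node for draining
    hadoop dfsadmin -refreshNodes                  # namenode re-replicates its blocks elsewhere
    hadoop dfsadmin -report                        # wait until the node shows as decommissioned
    # add the new disk, list both partitions in dfs.data.dir, then re-admit it:
    sed -i '/dn07.example.com/d' /etc/hadoop/exclude
    hadoop dfsadmin -refreshNodes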

Re: Stackoverflow

2008-06-03 Thread Chris Douglas
By "not exactly small, do you mean each line is long or that there are many records? Well, not small in the meaning, that even I could get my boss to allow me to give you the data, transfering it might be painful. (E.g. the job that aborted had about 12M lines with with ~2.6GB data => the lin

Re: Stackoverflow

2008-06-03 Thread Andreas Kostyrka
Ok, a new dead job ;( This time after 2.4GB/11.3M lines ;( Any idea what I could do to debug this? (No idea how to go about debugging a Java process that is distributed and handles GBs of data. How does one stabilize that kind of stuff to generate a reproducible situation?) Andreas

Re: how to deserialize the contents of hadoop output (sequencefileoutputformat)

2008-06-03 Thread Chris Douglas
If your keys and values have meaningful toString methods, hadoop fs -text will print the contents to stdout. -C On Jun 3, 2008, at 3:17 AM, Lin Guo wrote: I am wondering whether it is possible to deserialize the keys and values in a hadoop output file where the output format is SequenceFi

Re: Adding new disk to DNs - FAQ #15 clarification

2008-06-03 Thread Konstantin Shvachko
This is an old problem. We use a round-robin algorithm to determine on which local volume (disk/partition) a block should be placed. This does not work well in some cases, including when a new volume is added. This was discussed in particular in http://issues.apache.org/jira/browse/HADOOP-

Re: Input Data from DB or Memory rather than HDFS

2008-06-03 Thread Owen O'Malley
On Jun 3, 2008, at 4:56 AM, smallufo wrote: What if my data comes from a DB or memory? I should implement a DatabaseInputFormat that implements InputFormat<rowIndex, MyData value>, right? Yes. But how do I implement getSplits() and getRecordReader()? I looked into the sample source code fo
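(To make those pointers concrete, a bare-bones sketch against the 0.17-era org.apache.hadoop.mapred API. The class names, the db.total.rows key, and the stand-in row fetch are invented for illustration; a real implementation would open a JDBC connection or read its in-memory table inside the record reader.)

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.*;

    public class DatabaseInputFormat implements InputFormat<LongWritable, Text> {

      // A split is just a serializable description of a chunk of work:
      // here, a half-open range of row ids.
      public static class RowRangeSplit implements InputSplit {
        long start, end;
        public RowRangeSplit() {}
        RowRangeSplit(long start, long end) { this.start = start; this.end = end; }
        public long getLength() { return end - start; }
        public String[] getLocations() { return new String[0]; } // no locality for a DB
        public void write(DataOutput out) throws IOException {
          out.writeLong(start); out.writeLong(end);
        }
        public void readFields(DataInput in) throws IOException {
          start = in.readLong(); end = in.readLong();
        }
      }

      public void validateInput(JobConf job) { } // still on the interface in this era

      public InputSplit[] getSplits(JobConf job, int numSplits) {
        long totalRows = job.getLong("db.total.rows", 1000); // invented config key
        long chunk = totalRows / numSplits;
        InputSplit[] splits = new InputSplit[numSplits];
        for (int i = 0; i < numSplits; i++) {
          splits[i] = new RowRangeSplit(i * chunk,
              i == numSplits - 1 ? totalRows : (i + 1) * chunk);
        }
        return splits;
      }

      public RecordReader<LongWritable, Text> getRecordReader(
          InputSplit split, JobConf job, Reporter reporter) {
        final RowRangeSplit s = (RowRangeSplit) split;
        return new RecordReader<LongWritable, Text>() {
          long row = s.start;
          public boolean next(LongWritable key, Text value) {
            if (row >= s.end) return false;
            key.set(row);
            value.set("row-" + row); // stand-in for the real DB/memory fetch
            row++;
            return true;
          }
          public LongWritable createKey() { return new LongWritable(); }
          public Text createValue() { return new Text(); }
          public long getPos() { return row - s.start; }
          public float getProgress() { return (row - s.start) / (float) (s.end - s.start); }
          public void close() { }
        };
      }
    }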

Re: how to execute two consecutive map-reduce pairs?

2008-06-03 Thread Christophe Taton
Maybe you can check org.apache.hadoop.mapred.jobcontrol.*. I did not try it myself, but it looks like this is what you need. Cheers, Christophe On Tue, Jun 3, 2008 at 5:55 PM, Ted Dunning <[EMAIL PROTECTED]> wrote: > No. > > At least you need to call runJob twice. Typically, it is safer to create

Re: Adding new disk to DNs - FAQ #15 clarification

2008-06-03 Thread Ted Dunning
I have had problems with multiple volumes while using ancient versions of Hadoop. If I put the smaller partition first, I would get an overfull partition because hadoop was allocating by machine rather than by partition. If you feel energetic, go ahead and try putting the smaller partition first in

hadoop 0.16.3 running with java 1.6

2008-06-03 Thread Martin Schaaf
Hi, are there any known issues with hadoop 0.16.3 and java 1.6? We have some hanging jobs. Thanks in advance. Bye, martin

Re: how to execute two consecutive map-reduce pairs?

2008-06-03 Thread Ted Dunning
No. At least you need to call runJob twice. Typically, it is safer to create two job configurations so you don't forget to change something from the first job. It isn't a big deal. Just do it! On Tue, Jun 3, 2008 at 8:31 AM, hong <[EMAIL PROTECTED]> wrote: > Hi all > > A job must be done in
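(A minimal sketch of that approach: two separate JobConfs run back to back, with the first job's output directory feeding the second. IdentityMapper/IdentityReducer stand in for real job classes, and the paths are placeholders.)

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class TwoPassDriver {
      public static void main(String[] args) throws Exception {
        Path in = new Path("in"), mid = new Path("mid"), out = new Path("out");

        JobConf job1 = new JobConf(TwoPassDriver.class);
        job1.setJobName("pass-1");
        job1.setMapperClass(IdentityMapper.class);   // substitute map1
        job1.setReducerClass(IdentityReducer.class); // substitute reduce1
        job1.setOutputKeyClass(LongWritable.class);  // TextInputFormat's key type
        job1.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job1, in);
        FileOutputFormat.setOutputPath(job1, mid);
        JobClient.runJob(job1);                      // blocks until pass 1 finishes

        JobConf job2 = new JobConf(TwoPassDriver.class);
        job2.setJobName("pass-2");
        job2.setMapperClass(IdentityMapper.class);   // substitute map2
        job2.setReducerClass(IdentityReducer.class); // substitute reduce2
        job2.setOutputKeyClass(LongWritable.class);
        job2.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job2, mid);    // pass 1 output feeds pass 2
        FileOutputFormat.setOutputPath(job2, out);
        JobClient.runJob(job2);
      }
    }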

Adding new disk to DNs - FAQ #15 clarification

2008-06-03 Thread Otis Gospodnetic
Hi, I'm about to add a new disk (under a new partition) to some existing DataNodes that are nearly full. I see FAQ #15: 15. HDFS. How do I set up a hadoop node to use multiple volumes? Data-nodes can store blocks in multiple directories typically allocated on different local disk drives. In o
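(For reference, the relevant knob is dfs.data.dir in hadoop-site.xml on each datanode; a sketch with made-up paths, where blocks get spread across all listed directories per Konstantin's round-robin note above:)

    <property>
      <name>dfs.data.dir</name>
      <value>/disk1/hdfs/data,/disk2/hdfs/data</value>
      <!-- comma-separated list; restart the datanode after editing -->
    </property>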

how to execute two consecutive map-reduce pairs?

2008-06-03 Thread hong
Hi all, a job must be done in two map-reduce passes. That is, map1 ==> reduce1 ==> map2 ==> reduce2, where "==>" means the output file of the left is the input of the right. To do that job, can I just create only one JobConf instance and invoke JobClient.runJob(conf) once? Is there any similar examp

Re: how to deserialize the contents of hadoop output (sequencefileoutputformat)

2008-06-03 Thread Stuart Sierra
On Tue, Jun 3, 2008 at 6:17 AM, Lin Guo <[EMAIL PROTECTED]> wrote: > I am wondering whether it is possible to deserialize the keys and values in a > hadoop output file where the output format is SequenceFileOutputFormat. I wrote some code to do this, samples attached. -Stuart /* SeqKeyList.java -

Re: Stackoverflow

2008-06-03 Thread Andreas Kostyrka
On Tuesday 03 June 2008 08:35:10 Chris Douglas wrote:
> > I have no Java implementation of my job, sorry.
>
> Since it's all in the map side, IdentityMapper/IdentityReducer is fine, as long as both the splits and the number of reduce tasks are the same.
>
> > The data is a representation for lo

RE: how to deserialize the contents of hadoop output (sequencefileoutputformat)

2008-06-03 Thread Chen, Young
Do you mean reading an output file created by SequenceFile.createWriter? If so, maybe the code below will be useful. It reads long integer values out of a sequence file. SequenceFile.Reader reader = new SequenceFile.Reader(fileSys, inFile, jobConf); LongWritable numInside = new
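(The snippet above is cut off in the archive; a self-contained sketch completing the same pattern. The path and the key/value classes are assumptions and must match whatever the job actually wrote.)

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class ReadSeqFile {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path inFile = new Path("/user/out/part-00000"); // assumed output path
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, inFile, conf);
        Text key = new Text();                    // must match the file's key class
        LongWritable value = new LongWritable();  // must match the value class
        while (reader.next(key, value)) {         // returns false at end of file
          System.out.println(key + "\t" + value);
        }
        reader.close();
      }
    }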

Input Data from DB or Memory rather than HDFS

2008-06-03 Thread smallufo
Hi, I have a question: what if my data is not originally located in HDFS? What if my data comes from a DB or memory? I should implement a DatabaseInputFormat that implements InputFormat, right? But how do I implement getSplits() and getRecordReader()? I looked into the sample source code for a l

how to deserialize the contents of hadoop output (sequencefileoutputformat)

2008-06-03 Thread Lin Guo
I am wondering whether it is possible to deserialize the keys and values in a hadoop output file where the output format is SequenceFileOutputFormat. many thanks!