How does MergeQueue.merge actually sort from different segments ??

2010-06-18 Thread elton sky
Hello everyone, I am going through the source code of MapReduce. In MergeQueue.merge, I can only see that the SEGMENTS are combined and sorted by length into a list for merging. However, I could not find the procedure that sorts the (key, value) pairs in the segments by key... here is the function: 1. RawKeyValueIter

Set LOG.info in FileInputFormat.java doesn't work!?

2010-06-18 Thread elton sky
Hello fellas, problem here. I am trying to get some runtime info in FileInputFormat.java and I tried to use LOG.info(). However, it never seems to write anything into the log file. I tried to grep the "logs" directory and could not find the info that was supposed to be printed out. Any idea on this? BTW, any other
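
For context, a minimal sketch of the Commons Logging pattern Hadoop classes use (the class name here is made up for illustration). LOG.info() only lands in the log files if the effective log4j level for that logger is INFO or lower, so conf/log4j.properties is the first thing to check:

    import org.apache.commons.logging.Log;
    import org.apache.commons.logging.LogFactory;

    public class LogLevelCheck {
      private static final Log LOG = LogFactory.getLog(LogLevelCheck.class);

      public static void main(String[] args) {
        // visible only if this logger's effective level is INFO or lower
        LOG.info("an INFO message");
        // WARN is above INFO, so this survives a stricter level setting
        LOG.warn("a WARN message");
      }
    }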

Re: How does MergeQueue.merge actually sort from different segments ??

2010-06-20 Thread elton sky
Yu wrote: > Please note this: > > private static class MergeQueue<K, V> > extends PriorityQueue<Segment<K, V>> implements RawKeyValueIterator { > > a priority queue is used to accomplish sorting. > > On Fri, Jun 18, 2010 at 8:14 PM, elton sky wrote: > > > Hello everyone, >
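
A minimal sketch of the priority-queue merge the reply points at, using plain strings instead of Hadoop's raw keys: each segment is already sorted, so repeatedly popping the smallest head yields a globally sorted stream without ever re-sorting records.

    import java.util.Arrays;
    import java.util.Iterator;
    import java.util.List;
    import java.util.PriorityQueue;

    public class KWayMerge {
      // one entry per segment: its current head plus the rest of the segment
      static class Entry implements Comparable<Entry> {
        final String head;
        final Iterator<String> rest;
        Entry(String head, Iterator<String> rest) { this.head = head; this.rest = rest; }
        public int compareTo(Entry o) { return head.compareTo(o.head); }
      }

      public static void merge(List<Iterator<String>> segments) {
        PriorityQueue<Entry> pq = new PriorityQueue<Entry>();
        for (Iterator<String> it : segments)
          if (it.hasNext()) pq.add(new Entry(it.next(), it));
        while (!pq.isEmpty()) {
          Entry e = pq.poll();
          System.out.println(e.head);   // globally next-smallest key
          if (e.rest.hasNext()) pq.add(new Entry(e.rest.next(), e.rest));
        }
      }

      public static void main(String[] args) {
        merge(Arrays.asList(
            Arrays.asList("a", "c", "e").iterator(),
            Arrays.asList("b", "d").iterator()));   // prints a b c d e
      }
    }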

Re: Set LOG.info in FileInputFormat.java doesn't work!?

2010-06-20 Thread elton sky
Thank you. > through snmp metrics by using snmp4j > > On Fri, Jun 18, 2010 at 8:15 PM, elton sky wrote: > > > Hello fellas, > > > > problem here. I am trying to get some runtime info in > FileInputFormat.java > > and I try to use LOG.info(). However it never seems wr

Can we modify an existing file in HDFS?

2010-06-22 Thread elton sky
hello everyone, I noticed there are 6 operations in HDFS: OP_WRITE_BLOCK OP_READ_BLOCK OP_READ_METADATA OP_REPLACE_BLOCK OP_COPY_BLOCK OP_BLOCK_CHECKSUM and, as far as I know, there's no way to modify an arbitrary part of an existing file in HDFS. So what if I create a, say, 2 petabyte file and would like to
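
For reference, a hedged sketch of the closest thing to modification HDFS offered at the time: append, which is gated by dfs.support.append (off by default in 0.20, as I understand it) and only ever adds bytes at the end; arbitrary in-place edits are not supported.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("dfs.support.append", true);     // era-specific switch
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.append(new Path(args[0]));
        out.write("extra bytes at the tail".getBytes()); // append only; no seek-and-overwrite
        out.close();
      }
    }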

Could not get FileSystem obj, got java.lang.NullPointerException !!

2010-06-23 Thread elton sky
Hi, I am new to hadoop programming. I am trying to copy a local file to HDFS. My code snippet is: . . Configuration conf = new Configuration(); InputStream in=null; OutputStream out = null; try { in = new BufferedInputStream(new FileInputStream(src));
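
Since the snippet is cut off, here is a complete sketch of the same copy (not the poster's exact code). The usual cause of the NullPointerException is a Configuration that cannot see core-site.xml on the classpath, so FileSystem.get() resolves the wrong filesystem, or none at all:

    import java.io.BufferedInputStream;
    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.io.OutputStream;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class CopyToHdfs {
      public static void main(String[] args) throws Exception {
        String src = args[0];                      // local source file
        String dst = args[1];                      // HDFS destination path
        Configuration conf = new Configuration();  // must pick up core-site.xml
        FileSystem fs = FileSystem.get(conf);
        InputStream in = new BufferedInputStream(new FileInputStream(src));
        OutputStream out = fs.create(new Path(dst));
        IOUtils.copyBytes(in, out, 4096, true);    // copies, then closes both streams
      }
    }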

Re: Could not get FileSystem obj, got java.lang.NullPointerException !!

2010-06-24 Thread elton sky
Thanks for reply, On Thu, Jun 24, 2010 at 7:35 PM, Steve Loughran wrote: > elton sky wrote: > >> Hi, >> I am new to hadoop programming. >> > > > OK. First hint of advice: don't spam the dev lists with user problems, you > will only get ignored

Problem with calling FSDataOutputStream.sync() ~

2010-06-25 Thread elton sky
Hello, I am trying a simple code snippet to create a new file. After creating and writing to the file, I want to use "sync()" to synchronize all replicas. However, I got a "LeaseExpiredException" in FSNameSystem.checkLease(): my code: . . InputStream in=null; OutputStream out = null;
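
A hedged sketch of the intended sequence (not the poster's exact code). One common trigger for LeaseExpiredException, as I understand it, is writing through one FileSystem handle while another handle or client deletes, renames, or recreates the same path, since the namenode checks the lease against the path on every write:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SyncDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.create(new Path(args[0]));
        out.write("hello".getBytes());
        out.sync();    // 0.20.x API; later versions renamed this to hflush()
        out.close();
      }
    }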

Re: Problem with calling FSDataOutputStream.sync() ~

2010-06-27 Thread elton sky
Hey Ted, >The line numbers don't match those from hadoop 0.20.2 >What version are you using ? I am using 0.20.2. I added some extra LOG/print lines in between, which makes the line numbers differ from the stock source~ But my modifications are just printouts for debugging this problem. >I don't see how co

problem with rack-awareness

2010-07-01 Thread elton sky
hello, I am trying to separate my 6 nodes onto 2 different racks. For testing purposes, I wrote a bash file which simply returns "rack0" all the time, and I added the property "topology.script.file.name" in core-site.xml. When I restart with start-dfs.sh, the namenode cannot find any datanode at all. All d
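
Besides the script route, Hadoop of this era also accepts a Java mapping class via the topology.node.switch.mapping.impl property (my understanding; verify for your version). A toy mapping equivalent to the test script; note that rack names are conventionally path-like, with a leading slash:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    import org.apache.hadoop.net.DNSToSwitchMapping;

    public class SingleRackMapping implements DNSToSwitchMapping {
      public List<String> resolve(List<String> names) {
        List<String> racks = new ArrayList<String>(names.size());
        for (int i = 0; i < names.size(); i++) {
          racks.add("/rack0");   // path-like rack name, e.g. /rack0, not bare rack0
        }
        return racks;
      }

      public static void main(String[] args) {
        System.out.println(new SingleRackMapping()
            .resolve(Arrays.asList("host1", "host2")));  // [/rack0, /rack0]
      }
    }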

NotificationTestCase -- "Child" processes still running after return

2010-07-07 Thread elton sky
hello all, I was running a test from the Hadoop source code, test/org/apache/hadoop/mapred/NotificationTestCase.java. I ran it with TestClusterMRNotification. In my driver's main() function: " TestClusterMRNotification t = new TestClusterMRNotification(); t.setUp(); t.testMR(); t.tearDown(); " This pr

io.file.buffer.size, how does Hadoop use it?

2010-07-28 Thread elton sky
I am a bit confused about how this attribute is used. My understanding is that it's related to file read/write. And I can see that in LineReader.java it's used as the default buffer size for each line, and in BlockReader.newBlockReader() it's used as the internal buffer size of the BufferedInputStream. Also,
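
For reference, a small sketch of how the property is typically consumed (the exact call sites vary by version): it is read once from the Configuration and passed as the buffer size to stream wrappers.

    import org.apache.hadoop.conf.Configuration;

    public class BufferSizeDemo {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        int bufferSize = conf.getInt("io.file.buffer.size", 4096);  // default 4KB
        System.out.println("buffer size = " + bufferSize);
        // typical use inside Hadoop: new BufferedInputStream(rawStream, bufferSize)
      }
    }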

how does io.file.buffer.size work?

2010-07-29 Thread elton sky
I think my question was ignored, so I am posting it again: I am a bit confused about how this attribute is used. My understanding is that it's related to file read/write. And I can see that in LineReader.java it's used as the default buffer size for each line, and in BlockReader.newBlockReader() it's used as the

how does dfs.write.packet.size impact write throughput of HDFS, strange result?

2010-08-02 Thread elton sky
Hello everyone, I am doing some evaluation on my 6-node mini cluster. Each node has a 4-core Intel(R) Xeon(R) CPU 5130 @ 2.00GHz, 8GB memory, and a 500GB disk, running Linux version 2.6.18-164.11.1.el5 (Red Hat 4.1.2-46). I was trying to use different packet sizes (dfs.write.packet.size) and bytePerChun
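
For readers reproducing this, a sketch of the two knobs being varied (property names as in 0.20; the defaults noted are my understanding):

    import org.apache.hadoop.conf.Configuration;

    public class WriteTuning {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setInt("dfs.write.packet.size", 64 * 1024); // packet sent down the write pipeline (default 64KB)
        conf.setInt("io.bytes.per.checksum", 512);       // bytes covered by one CRC, i.e. bpc (default 512B)
        System.out.println(conf.getInt("dfs.write.packet.size", -1));
      }
    }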

Read() blocks mysteriously when using a big BytesPerChecksum size

2010-10-07 Thread elton sky
Hello experts, I was benchmarking the sequential write throughput of HDFS. To test the effect of the bytesPerChecksum (bpc) size on write performance, I am using different bpc sizes: 2M, 256K, 32K, 4K, 512B. My cluster has 1 name node and 5 data nodes. They are Xen VMs and each of them is configured with 56

Does increasing BytesPerChecksum decrease write performance??

2010-10-08 Thread elton sky
Hello, I was benchmarking write/read of HDFS. I changed the chunk size, i.e. bytesPerChecksum or bpc, and created a 1GB file with a 128MB block size. The bpc values I used: 512B, 32KB, 64KB, 256KB, 512KB, 2MB, 8MB. The results surprised me. The performance for 512B, 32KB and 64KB is quite similar, and then, as

Why is Hadoop written in Java?

2010-10-09 Thread elton sky
I have always had this question but couldn't find a proper answer. For system-level applications, C/C++ is usually preferable. So why is this one written in Java?

BUG: Anyone used a block size of more than 2GB before?

2010-10-18 Thread elton sky
Hello, in hdfs.org.apache.hadoop.hdfs.DFSClient.DFSOutputStream.writeChunk(byte[] b, int offset, int len, byte[] checksum), the second-to-last line is: int psize = Math.min((int)(blockSize-bytesCurBlock), writePacketSize); When I use a blockSize bigger than 2GB, which is outside the range of an integer
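
A tiny standalone demo (not Hadoop code) of the overflow being described: casting a long difference larger than Integer.MAX_VALUE to int wraps to a negative number, so Math.min picks the garbage value instead of writePacketSize.

    public class CastOverflow {
      public static void main(String[] args) {
        long blockSize = 3L * 1024 * 1024 * 1024;   // a 3GB block
        long bytesCurBlock = 0;                     // at the start of the block
        int writePacketSize = 64 * 1024;
        int psize = Math.min((int) (blockSize - bytesCurBlock), writePacketSize);
        System.out.println(psize);                  // prints -1073741824, not 65536
      }
    }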

Re: BUG: Anyone used a block size of more than 2GB before?

2010-10-18 Thread elton sky
>Why would you want to use a block size of > 2GB? For keeping a map's input split in a single block~ On Tue, Oct 19, 2010 at 9:07 AM, Michael Segel wrote: > > Ok, I'll bite. > Why would you want to use a block size of > 2GB? > > > > > Date: Mon, 18 Oct 2010 21:33:34 +1100 > > Subject: BUG: Anyone

Re: BUG: Anyone used a block size of more than 2GB before?

2010-10-18 Thread elton sky
I am curious, is there any specific reason to make it smaller than 2**31? On Tue, Oct 19, 2010 at 10:27 AM, Owen O'Malley wrote: > Block sizes larger than 2**31 are known to not work. I haven't ever tracked > down the problem, just set my block size to be smaller than that. > > -- Owen >
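
One hedged way to patch the writeChunk line quoted earlier (my sketch, not the project's actual fix): do the min in long arithmetic first, so the narrowing cast can no longer overflow.

    public class SafeMin {
      public static void main(String[] args) {
        long blockSize = 3L * 1024 * 1024 * 1024;
        long bytesCurBlock = 0;
        int writePacketSize = 64 * 1024;
        // the min in long space is <= writePacketSize, so the cast back to int is safe
        int psize = (int) Math.min(blockSize - bytesCurBlock, (long) writePacketSize);
        System.out.println(psize);   // 65536, even for blocks over 2GB
      }
    }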

Error when try to build mumak

2010-10-21 Thread elton sky
Hello, I am trying to build mumak with "ant package" in the mapred/ dir, and I hit an error: [ivy:resolve] :: problems summary :: [ivy:resolve] WARNINGS [ivy:resolve] module not found: org.apache.hadoop#hadoop-common;0.21.0 [ivy:resolve] apache-snapshot: tried [ivy:resolve] https:

Re: BUG: Anyone used a block size of more than 2GB before?

2010-10-21 Thread elton sky
Milind, You are right. But that only happens when your client is one of the data nodes in HDFS. Otherwise a random node will be picked for the first replica. On Fri, Oct 22, 2010 at 3:37 PM, Milind A Bhandarkar wrote: > If a file of say, 12.5 GB were produced by a single task with replication

Re: Read() blocks mysteriously when using a big BytesPerChecksum size

2010-11-11 Thread elton sky
d come up again. I'll try the blockdev as well and see if anything improves though. On Thu, Nov 11, 2010 at 6:32 AM, Scott Carey wrote: > > On Oct 7, 2010, at 2:35 AM, elton sky wrote: > > > Hello experts, > > > > I was benchmarking sequential write throughput of

why quick sort when spilling map output?

2011-02-28 Thread elton sky
Hello forumers, Before spilling the data in kvbuffer to local disk in a map task, the k/v pairs are sorted using quicksort. The average complexity of quicksort is O(n log n) but the worst case is O(n^2). Why use quicksort? Regards
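
Worth noting (my understanding, not from the thread): quicksort sorts the spill buffer in place with good cache behavior, and pivot selection makes the O(n^2) case rare in practice. Hadoop also lets you swap the sorter, since both QuickSort and HeapSort implement IndexedSorter; a sketch, assuming the 0.20-era property name map.sort.class:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.HeapSort;
    import org.apache.hadoop.util.IndexedSorter;

    public class SwapSpillSorter {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // HeapSort has an O(n log n) worst case, at some constant-factor cost
        conf.setClass("map.sort.class", HeapSort.class, IndexedSorter.class);
        System.out.println(conf.get("map.sort.class"));
      }
    }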

Re: Do Mappers run on different machines?

2011-03-03 Thread elton sky
Well, I think you are asking: if you have 3 machines and you want to start 3 maps, one for each input file, will each map reside on a different machine? The answer is: not necessarily. When responding to a heartbeat from a task tracker, the task scheduler tries to assign one local map task for each job

Re: We are looking for the root of the problem that caused us an IOException

2011-04-05 Thread elton sky
check the FAQ ( http://wiki.apache.org/hadoop/FAQ#What_does_.22file_could_only_be_replicated_to_0_nodes.2C_instead_of_1.22_mean.3F ) On Tue, Apr 5, 2011 at 4:53 PM, Guy Doulberg wrote: > Hey guys, > > We are trying to figure out why many of our Map/Reduce job on the cluster > are failing. > In lo

Re: How do I split input on fixed length keys

2011-04-05 Thread elton sky
Agree with Harsh, I think you need to write your own RecordReader. On Tue, Apr 5, 2011 at 3:37 PM, Harsh Chouraria wrote: > Hello Kevin, > > On Fri, Mar 25, 2011 at 12:52 AM, wrote: > > -Dstream.map.output.field.separator= \ > > -Dstream.num.map.output.key.fields=13 \ > > > > I have searched th
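
If a full RecordReader feels heavy, a hedged alternative for fixed-width keys: read whole lines with the stock TextInputFormat and slice the key off in the mapper. The names here (FixedKeyMapper, KEY_LEN) are mine, not from the thread.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class FixedKeyMapper extends Mapper<LongWritable, Text, Text, Text> {
      private static final int KEY_LEN = 13;   // fixed key width from the thread

      protected void map(LongWritable offset, Text line, Context ctx)
          throws IOException, InterruptedException {
        String s = line.toString();
        if (s.length() < KEY_LEN) return;      // skip malformed records
        ctx.write(new Text(s.substring(0, KEY_LEN)),
                  new Text(s.substring(KEY_LEN)));
      }
    }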

Applications that create bigger output than input?

2011-04-29 Thread elton sky
One of the assumptions MapReduce makes, I think, is that the size of a map's output is smaller than its input, although many applications produce output the same size as their input, like sort, merge, etc. For my benchmarking purposes, I am looking for some non-trivial, real-life applications which create *b

Re: Applications that create bigger output than input?

2011-04-29 Thread elton sky
kload for spill and merge on map's local disk is heavy. -Elton On Sat, Apr 30, 2011 at 11:22 AM, Owen O'Malley wrote: > On Fri, Apr 29, 2011 at 5:02 AM, elton sky wrote: > > > For my benchmark purpose, I am looking for some non-trivial, real life > > applications which

questions about Hadoop MapReduce and compute-intensive applications

2011-04-30 Thread elton sky
I've got 2 questions: 1. I am wondering how Hadoop MR performs when it runs compute-intensive applications, e.g. the Monte Carlo method for computing PI. There's an example in 0.21, QuasiMonteCarlo, but that example doesn't use random numbers and it generates pseudo input upfront. If we use distributed random num
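
For question 1, a hedged sketch of the per-task piece of such a job (plain Java, no MR wiring): seed the generator from the task id so mappers do not all produce the identical sample stream. The seeding scheme is an illustration, not Hadoop's.

    import java.util.Random;

    public class MonteCarloPi {
      public static void main(String[] args) {
        int taskId = Integer.parseInt(args[0]);     // e.g. from mapred.task.partition
        long samples = 1000000L;
        Random rng = new Random(42L + taskId);      // distinct stream per task
        long inside = 0;
        for (long i = 0; i < samples; i++) {
          double x = rng.nextDouble(), y = rng.nextDouble();
          if (x * x + y * y <= 1.0) inside++;       // point fell inside the quarter circle
        }
        System.out.println(4.0 * inside / samples); // one task's estimate of pi
      }
    }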

Re: questions about Hadoop MapReduce and compute-intensive applications

2011-04-30 Thread elton sky
Ted, MPI supports node-to-node communications in ways that map-reduce does not, > however, which requires that you iterate map-reduce steps for many > algorithms. With Hadoop's current implementation, this is horrendously > slow (minimum 20-30 seconds per iteration). > > Sometimes you can avoid

how do I keep reduce tmp files in mapred.local.dir

2011-05-10 Thread elton sky
hello all, I am trying to keep the output and copied files of reduce tasks after a job finishes. I commented out a lot of the "remove"/"delete" style code in TaskTracker, Task, etc., but I still cannot keep them. Any idea?
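
A hedged alternative to patching TaskTracker: the 0.20-era JobConf already exposes knobs for preserving task-local files (I believe these are the relevant ones; verify for your version).

    import org.apache.hadoop.mapred.JobConf;

    public class KeepReduceFiles {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        conf.setKeepFailedTaskFiles(true);         // preserve files of failed tasks
        conf.setKeepTaskFilesPattern(".*_r_.*");   // regex over task ids: keep reduce tasks' files
        // submit the job with this conf; matching task dirs stay under mapred.local.dir
      }
    }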

Re: Applications that create bigger output than input?

2011-05-19 Thread elton sky
intensive in map phase, for sampling all possible combinations of items. I am still looking for more applications which create bigger output and are not CPU bound. Any further ideas? I'd appreciate it. On Tue, May 3, 2011 at 3:10 AM, Steve Loughran wrote: > On 30/04/2011 05:31, elton sky wr

Re: Applications that create bigger output than input?

2011-05-20 Thread elton sky
"this" > "this" "is" > "is" "a" > "a" "test" > ... > > > You may also be extracting all kinds of other features form the text, but > the tokenization/n-gram is not that CPU intensive. > > --Bo

applications that only require processing a partial dataset?

2011-06-01 Thread elton sky
I am looking for applications that can be implemented in MR for which, in order to get the final result, only a partial dataset needs to be processed (i.e. you don't have to process the whole dataset). The only thing I can think of is searching for the top K records (without ordering), like in an SQL query. Is the

Why is inter-rack communication in MapReduce slow?

2011-06-06 Thread elton sky
hello everyone, As I don't have experience with large-scale clusters, I cannot figure out why inter-rack communication in a MapReduce job is "significantly" slower than intra-rack. I saw that a Cisco Catalyst 4900 series switch can reach up to 320Gbps of forwarding capacity. Connected with 48 nodes with 1Gb
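
A back-of-the-envelope with illustrative numbers (not from the thread): 48 nodes at 1Gbps can offer up to 48Gbps of demand inside one rack, which the switch backplane absorbs easily, but if the rack's uplink to the core is, say, 2 x 10Gbps, at most 20Gbps can leave the rack, an oversubscription of about 48:20, or 2.4:1. During a shuffle every rack pushes into its uplink at once, so inter-rack transfers queue at the uplink long before the quoted 320Gbps forwarding capacity matters.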

Re: Why is inter-rack communication in MapReduce slow?

2011-06-06 Thread elton sky
28 PM, Steve Loughran wrote: > On 06/06/11 08:22, elton sky wrote: > >> hello everyone, >> >> As I don't have experience with big scale cluster, I cannot figure out why >> the inter-rack communication in a mapreduce job is "significantly" slower >> tha

Re: Why is inter-rack communication in MapReduce slow?

2011-06-06 Thread elton sky
andwidth to push all of those ports at full tilt between > racks. That's why Hadoop has the ability to take advantage of rack > locality. It will try to schedule I/O local to a rack where it's less > likely to block. > > -Joey > > On Mon, Jun 6, 2011 at 7:04 AM, elton sky

Re: Why is inter-rack communication in MapReduce slow?

2011-06-06 Thread elton sky
Hi John, Because for map tasks, the job tracker tries to assign them to local data nodes, so there's not much network traffic. Then the only potential issue will be, as you said, the reducers, which copy data from all maps. So in other words, if the application only creates small intermediate output, e.g. gre

Re: Why is inter-rack communication in MapReduce slow?

2011-06-06 Thread elton sky
text processing purpose. (i.e. few stages, low dependency, > > highly parallel). > > > > Its when one tries to solve general purpose algorithms of modest > > complexity that map/reduce gets into I/O churning problems. > > > > On Mon, 6 Jun 2011 23:58:53 +1000, elton