How dfs.write.packet.size impacts write throughput of HDFS, strange result

2010-08-02 Thread elton sky
Hello everyone, I am doing some evaluation on my 6-node mini cluster. Each node has a 4-core Intel(R) Xeon(R) CPU 5130 @ 2.00GHz, 8GB memory and a 500GB disk, running Linux 2.6.18-164.11.1.el5 (Red Hat 4.1.2-46). I was trying different packet sizes (dfs.write.packet.size) and bytesPerChecksum …
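A minimal sketch of this kind of write benchmark, assuming a 0.20-era client where the relevant property is dfs.write.packet.size; the class name and HDFS path are made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class PacketSizeBench {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Packet size under test; this is a client-side setting.
            conf.setInt("dfs.write.packet.size", 256 * 1024);
            FileSystem fs = FileSystem.get(conf);
            byte[] buf = new byte[64 * 1024];
            long total = 1024L * 1024 * 1024; // write 1GB
            long start = System.currentTimeMillis();
            FSDataOutputStream out = fs.create(new Path("/bench/packet-test"));
            for (long n = 0; n < total; n += buf.length) {
                out.write(buf);
            }
            out.close();
            double secs = (System.currentTimeMillis() - start) / 1000.0;
            System.out.println("write MB/s: " + (total / (1024.0 * 1024.0)) / secs);
        }
    }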

HDFS: buffer before contacting Namenode?

2010-08-09 Thread elton sky
Hello folks, The HDFS design doc says the client will buffer a block's worth of data before contacting the namenode for datanode info, as this is optimal for network throughput. However, I could not find this buffering in the source code. In DFSClient.DataStreamer, it …

Re: HDFS: buffer before contacting Namenode?

2010-08-10 Thread elton sky
… network-bound. Is this the reason? On Wed, Aug 11, 2010 at 2:55 AM, Hairong Kuang wrote: > The DFSClient only buffers a packet before it contacts the NameNode for allocating DataNodes to place the block. The doc you read might be too old. Hairong > On 8/9/10 7:14 PM, "el…
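A compilable but heavily simplified sketch of the behaviour Hairong describes; this is not the real DFSClient.DataStreamer, and every name in it is illustrative:

    import java.util.ArrayDeque;
    import java.util.Queue;

    // Illustrates that a block's worth of data is never buffered:
    // the client queues packets, and a block is requested from the
    // namenode lazily, once there is a packet ready to stream.
    class DataStreamerSketch {
        static final int PACKET_SIZE = 64 * 1024;
        Queue<byte[]> dataQueue = new ArrayDeque<byte[]>();
        boolean blockAllocated = false;

        void write(byte[] packet) { dataQueue.add(packet); }

        void run() {
            while (!dataQueue.isEmpty()) {
                byte[] pkt = dataQueue.poll();  // one packet, not a whole block
                if (!blockAllocated) {
                    // Stands in for namenode.addBlock(...): ask for a pipeline
                    // of datanodes when only a packet has been buffered so far.
                    blockAllocated = true;
                }
                send(pkt);                      // stream down the pipeline
            }
        }

        void send(byte[] pkt) { /* write to the first datanode's socket */ }

        public static void main(String[] args) {
            DataStreamerSketch s = new DataStreamerSketch();
            s.write(new byte[PACKET_SIZE]);
            s.run();
        }
    }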

remove "append" in 0.21?

2010-08-19 Thread elton sky
I heard some gossip about this. Is this true?

Increasing bytesPerChecksum decreases write performance??

2010-10-08 Thread elton sky
Hello, I was benchmarking write/read of HDFS. I changed the chunk size, i.e. bytesPerChecksum (bpc), and created a 1GB file with a 128MB block size. The bpc values I used: 512B, 32KB, 64KB, 256KB, 512KB, 2MB, 8MB. The results surprised me. The performance for 512B, 32KB and 64KB is quite similar, and then, as …
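For reference, a hedged sketch of how such a run could be configured on a 0.20-era client; io.bytes.per.checksum and dfs.block.size are the old-style property names, and the path is made up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BpcBench {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            conf.setInt("io.bytes.per.checksum", 512 * 1024);   // chunk size under test
            conf.setLong("dfs.block.size", 128L * 1024 * 1024); // 128MB blocks
            FileSystem fs = FileSystem.get(conf);
            byte[] buf = new byte[512 * 1024];
            FSDataOutputStream out = fs.create(new Path("/bench/bpc-test"));
            for (long n = 0; n < 1024L * 1024 * 1024; n += buf.length) {
                out.write(buf); // 1GB total
            }
            out.close();
        }
    }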

Re: Increasing bytesPerChecksum decreases write performance??

2010-10-11 Thread elton sky
… to datanodes in packets. > The default packet size is 64K. If the chunk size is bigger than 64K, the packet size automatically adjusts to include at least one chunk. Please set the packet size to 8MB by configuring dfs.client-write-packet-size (in trunk) and rerun your experiment …
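A hypothetical illustration of the adjustment described in the reply: a packet always carries at least one whole chunk, so a large bytesPerChecksum silently inflates the effective packet size (checksum and header overheads ignored):

    public class EffectivePacketSize {
        public static void main(String[] args) {
            int bytesPerChecksum = 8 * 1024 * 1024; // 8MB chunk
            int configuredPacket = 64 * 1024;       // default packet size
            int effectivePacket = Math.max(configuredPacket, bytesPerChecksum);
            System.out.println("effective packet bytes: " + effectivePacket);
            // One whole 8MB chunk per packet, regardless of the 64K setting.
        }
    }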

Re: datanode memory requirement

2011-03-09 Thread elton sky
I don't think the datanode needs much memory, as Stu suggested. Memory matters more when running a MapReduce job: more memory helps with in-memory sorting (on the map side) and in-memory copying of map outputs (on the reduce side). The namenode needs memory to hold its metadata, though. On Thu, Mar 10, 2011 …
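If it helps, the old-style (0.20/1.x era) MapReduce properties for those two buffers, in the same flat style used elsewhere in these threads; the values are just examples:

    io.sort.mb 200 (map-side in-memory sort buffer, in MB)
    mapred.job.shuffle.input.buffer.percent 0.70 (share of reduce heap used for in-memory copy of map outputs)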

Re: Could not obtain block

2011-03-10 Thread elton sky
> Caused by: java.io.IOException: Could not obtain block: blk_-3695352030358969086_130839 file=/user/emeij/icwsm-data-test/01-26-?>SOCIAL_MEDIA.tar.gz at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1977) A question for you: does the exception always compl…
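One way to narrow this down is the standard fsck report for that file; <path> is a placeholder for the file in question:

    hadoop fsck <path> -files -blocks -locations

If every listed location is missing the block, the replicas themselves are gone, rather than a datanode being temporarily unreachable.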

Remove one directory from multiple dfs.data.dir, how?

2011-04-03 Thread elton sky
I have an HDFS cluster with 10 nodes. Each node has 4 disks attached, so I assigned 4 directories to HDFS in the configuration: dfs.data.dir /data1/hdfs-data,/data2/hdfs-data,/data3/hdfs-data,/data4/hdfs-data Now I want to remove 1 disk from each node, say /data4/hdfs-data. What should I do to keep the data intact …
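Presumably the target configuration on every node would then be (same flat style as above):

    dfs.data.dir /data1/hdfs-data,/data2/hdfs-data,/data3/hdfs-data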

Re: Remove one directory from multiple dfs.data.dir, how?

2011-04-04 Thread elton sky
… Why doesn't Hadoop have this functionality? On Mon, Apr 4, 2011 at 5:05 PM, Harsh Chouraria wrote: > Hello Elton, On Mon, Apr 4, 2011 at 11:44 AM, elton sky wrote: > > Now I want to remove 1 disk from each node, say /data4/hdfs-data. What should I do to keep the data int…

Re: Remove one directory from multiple dfs.data.dir, how?

2011-04-04 Thread elton sky
… back (although the simpler version still stands). On Mon, Apr 4, 2011 at 4:51 PM, elton sky wrote: > Thanks Harsh, I will give it a go as you suggested. But I feel it's not convenient in my case. Decommission is for taking down a node. What …
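For completeness, a sketch of the decommission-style route discussed in this thread, using standard knobs; the exclude-file path is an example:

    dfs.hosts.exclude /etc/hadoop/conf/dfs.exclude (in hdfs-site.xml)

Add the datanode's hostname to the exclude file and run hadoop dfsadmin -refreshNodes. Once the node reports as decommissioned, stop it, drop /data4/hdfs-data from its dfs.data.dir, restart the datanode, and re-include it with another -refreshNodes.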

Re: Read files from HDFS

2011-05-08 Thread elton sky
Hassen, Reads in HDFS are sequential, i.e. one block after another: each time, the client connects to one datanode to read a block, then connects to another (or the same) datanode to read the next block. The reason for this sequential design, I guess, is to avoid a network traffic explosion in a heav…
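A minimal sketch of the client-side read loop; the path is illustrative, and the block-to-block switching happens inside the stream, invisible to the caller:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SequentialRead {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FSDataInputStream in = fs.open(new Path("/user/hassen/input.dat"));
            byte[] buf = new byte[64 * 1024];
            long total = 0;
            int n;
            while ((n = in.read(buf)) > 0) {
                total += n; // the stream fetches block locations and moves
                            // from datanode to datanode one block at a time
            }
            in.close();
            System.out.println("read " + total + " bytes");
        }
    }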

Re: Block Size

2011-06-17 Thread elton sky
This is a tradition from native file systems, to avoid wasting disk space. In Linux, each data block is 4K. A file is sliced into data blocks and stored on disk. If the tail block holds less than 4K of data, the rest of that block's space is wasted. So if all your files are multiples of 4K in Linux, you …
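A quick worked example of the tail-block waste being described: a 10KB file on a 4KB-block file system occupies 3 blocks = 12KB, wasting 2KB. HDFS avoids this, since the last block of a file only occupies its actual length on the datanode's local disk: a 130MB file with 128MB blocks consumes 128MB + 2MB, not 256MB.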