Block placement question

2019-04-21 Thread Shuubham Ojha
Hello everyone, I am trying to devise my own block placement policy in
Hadoop 3. My query is: when I choose favored nodes, is that choice valid
for a particular file or a particular block? What if I wish to send
different blocks of the same file to different nodes? How can I control
that? Any inputs will be greatly appreciated.

Warm Regards,
Shuubham Ojha


Re: block placement

2019-04-19 Thread Shuubham Ojha
Hello Hari, can you give me some insight as to how I can put the IP
addresses or rack numbers of datanodes in the chooseTarget method?

Warm regards,
Shuubham

On Fri, Apr 19, 2019 at 10:05 PM Hariharan  wrote:

> Favoured nodes is a per-file property that you can set at the time of
> creation. Note that future rebalancing may not respect this. You can read
> more about it here -
> https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSClient.java#L1171
>
> If there is a fixed relation between your files and the datanodes they need
> to be on, then as Hilmi suggested you are better off writing a custom policy
> that extends the default one. You would need to override the chooseTarget
> method to apply your logic there.
>
> ~ Hari
>
> On Fri, Apr 19, 2019 at 12:44 PM Shuubham Ojha 
> wrote:
>
>> Hi Hilmi, thanks for your response. I have already done the following:
>> 1. Defined the rack topology in a script and added that to core-site.xml.
>> I think this ensures that Hadoop on the namenode knows the IP addresses of
>> datanodes and their corresponding rack numbers.
>> 2. I have gone through the default block placement policy code multiple
>> times and I can see that a parameter called favored nodes has been used. I
>> believe this parameter is of interest to me.
>>
>> The trouble is that I have a set of nodes with known IP addresses, but I
>> can't see any specific IP addresses being used in the placement policy code.
>> I specifically want to know how to tell my Hadoop code to place a
>> particular block of data on a particular datanode given its IP address. I
>> think this problem can be resolved by setting the given datanode as the
>> favored node for the time when the data block of interest is being dequeued
>> to be sent to datanodes, but I can't see how that should be done.
>>
>> Warm regards,
>> Shuubham Ojha
>>
>> On Fri, Apr 19, 2019 at 4:12 AM Hilmi Egemen Ciritoğlu <
>> hilmi.egemen.cirito...@gmail.com> wrote:
>>
>>> Hi Shuubham,
>>>
>>> You can simply create your own block placement class by extending
>>> BlockPlacementPolicy. As a starting point, you may want to have a look at
>>> the Block Placement Default Policy source code:
>>> <https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java>
>>>
>>> Regards,
>>> H. Egemen Ciritoglu
>>>
>>> On Thu, 18 Apr 2019 at 22:34, Shuubham Ojha 
>>> wrote:
>>>
>>>> Can anyone give me some idea as to how I can write my own block
>>>> placement strategy in Hadoop 3?
>>>>
>>>>
>>>> Shuubham Ojha
>>>>
>>>


Re: block placement

2019-04-19 Thread Hariharan
Favoured nodes is a per-file property that you can set at the time of
creation. Note that future rebalancing may not respect this. You can read
more about it here -
https://github.com/apache/hadoop/blob/a55d6bba71c81c1c4e9d8cd11f55c78f10a548b0/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSClient.java#L1171

If there is a fixed relation between your files and the datanodes they need
to be on, then as Hilmi suggested you are better off writing a custom policy
that extends the default one. You would need to override the chooseTarget
method to apply your logic there.

~ Hari

On Fri, Apr 19, 2019 at 12:44 PM Shuubham Ojha 
wrote:

> Hi Hilmi, thanks for your response. I have already done the following:
> 1. Defined the rack topology in a script and added that to core-site.xml.
> I think this ensures that Hadoop on the namenode knows the IP addresses of
> datanodes and their corresponding rack numbers.
> 2. I have gone through the default block placement policy code multiple
> times and I can see that a parameter called favored nodes has been used. I
> believe this parameter is of interest to me.
>
> The trouble is that I have a set of nodes with known IP addresses, but I
> can't see any specific IP addresses being used in the placement policy code.
> I specifically want to know how to tell my Hadoop code to place a
> particular block of data on a particular datanode given its IP address. I
> think this problem can be resolved by setting the given datanode as the
> favored node for the time when the data block of interest is being dequeued
> to be sent to datanodes, but I can't see how that should be done.
>
> Warm regards,
> Shuubham Ojha
>
> On Fri, Apr 19, 2019 at 4:12 AM Hilmi Egemen Ciritoğlu <
> hilmi.egemen.cirito...@gmail.com> wrote:
>
>> Hi Shuubham,
>>
>> You can simply create your own block placement class by extending
>> BlockPlacementPolicy. As a starting point, you may want to have a look at
>> the Block Placement Default Policy source code:
>> <https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java>
>>
>> Regards,
>> H. Egemen Ciritoglu
>>
>> On Thu, 18 Apr 2019 at 22:34, Shuubham Ojha 
>> wrote:
>>
>>> Can anyone give me some idea as to how I can write my own block
>>> placement strategy in Hadoop 3?
>>>
>>>
>>> Shuubham Ojha
>>>
>>
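
For reference, the favored-nodes hint described above is passed at file
creation time. Below is a minimal sketch, assuming the
DistributedFileSystem#create overload that accepts an InetSocketAddress[]
of favored datanodes; the path, addresses, and port here are hypothetical:

    import java.net.InetSocketAddress;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.fs.permission.FsPermission;
    import org.apache.hadoop.hdfs.DistributedFileSystem;

    public class FavoredNodesWrite {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path file = new Path("/data/example.bin");  // hypothetical path
        DistributedFileSystem dfs =
            (DistributedFileSystem) file.getFileSystem(conf);

        // Hypothetical datanode addresses; 9866 is the usual data transfer
        // port in Hadoop 3 (50010 in Hadoop 2).
        InetSocketAddress[] favored = {
            new InetSocketAddress("10.0.0.11", 9866),
            new InetSocketAddress("10.0.0.12", 9866),
        };

        try (FSDataOutputStream out = dfs.create(file,
            FsPermission.getFileDefault(),
            true,                               // overwrite
            conf.getInt("io.file.buffer.size", 4096),
            (short) 2,                          // replication
            128L * 1024 * 1024,                 // block size
            null,                               // no progress callback
            favored)) {
          out.writeBytes("hello");
        }
      }
    }

Note that the namenode treats favored nodes as a hint, not a guarantee: if
a favored datanode is unavailable or full, other nodes are chosen, and
rebalancing may later move the blocks.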


Re: block placement

2019-04-19 Thread Shuubham Ojha
Hi Hilmi, thanks for your response. I have already done the following:
1. Defined the rack topology in a script and added that to core-site.xml. I
think this ensures that Hadoop on the namenode knows the IP addresses of
datanodes and their corresponding rack numbers.
2. I have gone through the default block placement policy code multiple
times and I can see that a parameter called favored nodes has been used. I
believe this parameter is of interest to me.

The trouble is that I have a set of nodes with known IP addresses, but I
can't see any specific IP addresses being used in the placement policy code.
I specifically want to know how to tell my Hadoop code to place a
particular block of data on a particular datanode given its IP address. I
think this problem can be resolved by setting the given datanode as the
favored node for the time when the data block of interest is being dequeued
to be sent to datanodes, but I can't see how that should be done.

Warm regards,
Shuubham Ojha

On Fri, Apr 19, 2019 at 4:12 AM Hilmi Egemen Ciritoğlu <
hilmi.egemen.cirito...@gmail.com> wrote:

> Hi Shuubham,
>
> You can simply create your own block placement class by extending
> BlockPlacementPolicy. As a starting point, you may want to have a look at
> the Block Placement Default Policy source code:
> <https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java>
>
> Regards,
> H. Egemen Ciritoglu
>
> On Thu, 18 Apr 2019 at 22:34, Shuubham Ojha 
> wrote:
>
>> Can anyone give me some idea as to how I can write my own block
>> placement strategy in Hadoop 3?
>>
>>
>> Shuubham Ojha
>>
>
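
As an aside, the topology script mentioned in step 1 above follows a simple
contract (configured via net.topology.script.file.name in Hadoop 2+): the
namenode invokes it with one or more datanode IPs or hostnames as
arguments, and it must print one rack path per argument. A minimal sketch,
with hypothetical subnets:

    #!/bin/bash
    # Map each argument (datanode IP or hostname) to a rack path.
    for host in "$@"; do
      case "$host" in
        10.0.1.*) echo "/rack1" ;;
        10.0.2.*) echo "/rack2" ;;
        *)        echo "/default-rack" ;;
      esac
    done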


Re: block placement

2019-04-18 Thread Hilmi Egemen Ciritoğlu
Hi Shuubham,

You can simply create your own block placement class by extending
BlockPlacementPolicy. As a starting point, you may want to have a look at
the Block Placement Default Policy source code:
<https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicyDefault.java>

Regards,
H. Egemen Ciritoglu

On Thu, 18 Apr 2019 at 22:34, Shuubham Ojha  wrote:

> Can anyone give me some idea as to how I can write my own block placement
> strategy in Hadoop 3?
>
>
> Shuubham Ojha
>
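
To make the suggestion concrete, here is a minimal sketch of such a
subclass. It assumes the Hadoop 3 chooseTarget signature (which has changed
across releases) and a hypothetical rule that pins files under one path
prefix to one datanode IP; everything else falls through to the default
rack-aware logic:

    import java.util.EnumSet;
    import java.util.List;
    import java.util.Set;
    import org.apache.hadoop.hdfs.AddBlockFlag;
    import org.apache.hadoop.hdfs.protocol.BlockStoragePolicy;
    import org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault;
    import org.apache.hadoop.hdfs.server.blockmanagement.DatanodeDescriptor;
    import org.apache.hadoop.hdfs.server.blockmanagement.DatanodeStorageInfo;
    import org.apache.hadoop.net.Node;

    public class IpPinnedPlacementPolicy extends BlockPlacementPolicyDefault {

      // Hypothetical rule; in practice this could be read from the
      // Configuration passed to initialize().
      private static final String PINNED_PREFIX = "/pinned/";
      private static final String PINNED_IP = "10.0.0.11";

      @Override
      public DatanodeStorageInfo[] chooseTarget(String srcPath,
          int numOfReplicas, Node writer, List<DatanodeStorageInfo> chosen,
          boolean returnChosenNodes, Set<Node> excludedNodes, long blocksize,
          BlockStoragePolicy storagePolicy, EnumSet<AddBlockFlag> flags) {
        if (srcPath != null && srcPath.startsWith(PINNED_PREFIX)) {
          // host2datanodeMap is a field initialized by the default policy's
          // initialize(); it maps an IP/hostname to its DatanodeDescriptor.
          DatanodeDescriptor dn = host2datanodeMap.getDatanodeByHost(PINNED_IP);
          if (dn != null && dn.getStorageInfos().length > 0) {
            return new DatanodeStorageInfo[] { dn.getStorageInfos()[0] };
          }
        }
        // All other paths: fall back to the default rack-aware placement.
        return super.chooseTarget(srcPath, numOfReplicas, writer, chosen,
            returnChosenNodes, excludedNodes, blocksize, storagePolicy, flags);
      }
    }

The policy would then be activated by pointing dfs.block.replicator.classname
at this class in hdfs-site.xml on the namenode.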


block placement

2019-04-18 Thread Shuubham Ojha
Can anyone give me some idea as to how I can write my own block placement
strategy in Hadoop 3?


Shuubham Ojha


Re: HDFS Block placement policy

2016-05-22 Thread Gurmukh Singh
The best practice is to have an Edge/Gateway node, so that there is no
local copy of data. It is also good from a security perspective.

I think this video of mine can help you understand this better:
https://www.youtube.com/watch?v=t20niJDO1f4


Regards
Gurmukh

On 20/05/16 12:29 AM, Ruhua Jiang wrote:

Hi all,

I have a question related to the HDFS block placement policy. The default is:
"The default block placement policy is as follows: Place the first
replica somewhere – either a random node (if the HDFS client is
outside the Hadoop/DataNode cluster) or on the local node (if the HDFS
client is running on a node inside the cluster). Place the second
replica in a different rack."

Let's consider the situation where the data is on one datanode's local disk,
and an *hdfs -put* command is used (which means the HDFS client is on a
datanode) to ingest this data into HDFS.
- What will happen (in terms of block placement) if this datanode's
local disk is full?
- Is there a list of alternative block placement policies already
implemented, such that hdfs -put can use one just by changing the
hdfs-site.xml config? I noticed the
https://issues.apache.org/jira/browse/HDFS-385 JIRA ticket, but it does
not seem to be what we want.
- I understand that placing the first block on the local machine can improve
performance, and that we can use the HDFS balancer to solve the imbalance
problem afterwards. However, I just want to explore alternative
solutions that avoid this problem from the beginning.



Thanks

Ruhua Jiang


--
--
Thanks and Regards

Gurmukh Singh



Controlling the block placement and the file placement in HDFS writes

2014-12-13 Thread Ananth Gundabattula
Hello All,


I was wondering if the following issues can be solved by extending HDFS
classes with custom implementations.

Here are my requirements:

1. Is there a way to ensure that all file blocks belonging to a particular
HDFS directory/file go to the same physical datanode (and their
corresponding replicas as well)?

2. Is there a way to control the volume that is being used to write a file
block?

Here are the finer details to give some background for the above two
queries:

We are using the Impala engine to analyze our data, and to do so we are
generating the files in Parquet format using our custom data processing
pipelines. It is ideal that all of the files belonging to the same
partition be processed by the same node (as our queries can be
partitioned by, say, a key that is common across all queries). In this
regard, we would like to ensure that a given file block in a particular path
always lands on the same physical node, so that the Impala workers that need
to process data for a given query send less data across nodes to assemble
the result. Hence the first question.

The second aspect is that our queries are time based, and this time window
follows a familiar pattern of old data not being queried much. Hence we
would like to keep the most recent data in the HDFS cache (Impala is
helping us manage this aspect via its command set), but we would like the
next most recent chunks of data to land on an SSD that is present on
every datanode. The remaining set of blocks, which are very old but in
large quantities, would land on spinning disks. The decision to choose a
given volume is based on the file name, as we can control the filename that
is being used to generate the file.

Please note that the criterion for both of the above is the name of the
file and not the size of the file.

For the first query, I have looked into the BlockPlacementPolicy interface.
The method chooseTarget() in the above interface gives me a list of
datanodes to choose from, and I return one of them as the target. My dilemma
is whether, given a path for a directory, the input
DataNodeDescriptors to the chooseTarget function will remain the same for
each invocation on the same directory, or whether that is not controlled.

For the second query, I have looked at the VolumeChoosingPolicy interface.
On looking at it, it seems the only handle I get is the list of current
volumes, with no information about the incoming file.

Any pointers regarding the above two aspects would be immensely helpful.

Thanks a lot for your time.

Regards,
Ananth


Block placement without rack aware

2014-10-02 Thread SF Hadoop
What is the block placement policy Hadoop follows when rack awareness is not
enabled?

Does it just round robin?

Thanks.


RE: Block placement without rack aware

2014-10-02 Thread Liu, Yi A
It’s still random.
If rack awareness is not enabled, all nodes are in “default-rack”, and random
nodes are chosen for block replicas.

Regards,
Yi Liu

From: SF Hadoop [mailto:sfhad...@gmail.com]
Sent: Friday, October 03, 2014 7:12 AM
To: user@hadoop.apache.org
Subject: Block placement without rack aware

What is the block placement policy Hadoop follows when rack awareness is not
enabled?

Does it just round robin?

Thanks.


Re: Block placement without rack aware

2014-10-02 Thread Pradeep Gollakota
It appears to be randomly chosen. I just came across this blog post from
Lars George about HBase file locality in HDFS
http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html

On Thu, Oct 2, 2014 at 4:12 PM, SF Hadoop sfhad...@gmail.com wrote:

 What is the block placement policy Hadoop follows when rack awareness is not
 enabled?

 Does it just round robin?

 Thanks.



Re: Block placement without rack aware

2014-10-02 Thread SF Hadoop
Thanks for the info.  Exactly what I needed.

Cheers.

On Thu, Oct 2, 2014 at 4:21 PM, Pradeep Gollakota pradeep...@gmail.com
wrote:

 It appears to be randomly chosen. I just came across this blog post from
 Lars George about HBase file locality in HDFS
 http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html

 On Thu, Oct 2, 2014 at 4:12 PM, SF Hadoop sfhad...@gmail.com wrote:

 What is the block placement policy Hadoop follows when rack awareness is not
 enabled?

 Does it just round robin?

 Thanks.





Support multiple block placement policies

2014-09-15 Thread Zesheng Wu
Hi there,

According to the code, the current implementation of HDFS only supports one
specific block placement policy, which is BlockPlacementPolicyDefault by
default.
The default policy is enough for most circumstances, but under some
special circumstances it does not work so well.

For example, on a shared cluster, we want to erasure-code all the files
under some specified directories. So the files under these directories need
to use a new placement policy.
But at the same time, other files still use the default placement policy.
Here we need to support multiple placement policies in HDFS.

One plain thought is that the default placement policy stays configured
as the default, while HDFS lets the user specify a customized
placement policy through extended attributes (xattrs). When HDFS
chooses the replica targets, it first checks the customized placement
policy; if none is specified, it falls back to the default one.

Any thoughts?

-- 
Best Wishes!

Yours, Zesheng
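
The client-side half of this idea works today: directories can already
carry user xattrs, so tagging is possible even though the namenode-side
dispatch would be new code. A minimal sketch of tagging a directory with a
desired policy name (the xattr key, value, and path are hypothetical):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TagPlacementPolicy {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path dir = new Path("/warehouse/cold");  // hypothetical directory
        // Tag the directory; a custom placement policy on the namenode
        // would have to look this tag up and dispatch accordingly.
        fs.setXAttr(dir, "user.block.placement.policy",
            "erasure-coded".getBytes("UTF-8"));
        System.out.println(new String(
            fs.getXAttr(dir, "user.block.placement.policy"), "UTF-8"));
      }
    }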


Examining effect of changing block placement policy.

2014-08-11 Thread Arjun Bakshi

Hi,

I've made some changes to the default block placement policy and want to
see how it affects a cluster. Any suggestions on how I can compare the
before and after of a cluster after making these changes?

I read up a bit on Rumen and GridMix in my search for tools that would
help me benchmark things on a cluster. As far as I know, I need some job
traces to get the ball rolling. I've googled for sample job traces but
didn't find anything. I found this page:
http://ftp.pdl.cmu.edu/pub/datasets/hla/dataset.html but I'm not sure
how to use the data there.

I don't have a ton of data, or a bunch of queries I could run on it. My
best idea till now is to run a bunch of sorts on different input sizes,
and word counts on different combinations of files, all while following
an exponential inter-job arrival time. I'm planning to do this on AWS
EC2's free tier.


Any suggestions on how to observe the effects of changing the policy 
would be appreciated.


Thank you,

Arjun
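
One low-tech way to observe the before and after, whatever benchmark is
chosen: hadoop fsck can dump the datanode location of every block, so the
placement distribution can be compared directly across runs (the path here
is hypothetical; output format varies by version):

    hdfs fsck /benchmark/input -files -blocks -locations -racks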


Building custom block placement policy. What is srcPath?

2014-07-24 Thread Arjun Bakshi

Hi,

I want to write a block placement policy that takes the size of the file
being placed into account, something like what is done in the CoHadoop or
BEEMR papers. I have the following questions:

1- What is srcPath in chooseTarget? Is it the path to the original
un-chunked file, or is it a path to a single block, or something else? I
added some code to BlockPlacementPolicyDefault to print out the value of
srcPath, but the results look odd.

2- Will a simple new File(srcPath) do?

3- I've spent time looking at the Hadoop source code. I can't find a way to
go from srcPath in chooseTarget to a file size. Every function I think
can do it, in FSNamesystem, FSDirectory, etc., is either non-public or
cannot be called from inside the blockmanagement package or the
block placement class.

How do I go from srcPath in the block placement class to the size of the
file being placed?


Thank you,

AB


Re: Building custom block placement policy. What is srcPath?

2014-07-24 Thread Harsh J
Hello,

(Inline)

On Thu, Jul 24, 2014 at 11:11 PM, Arjun Bakshi baksh...@mail.uc.edu wrote:
 Hi,

 I want to write a block placement policy that takes the size of the file
 being placed into account. Something like what is done in CoHadoop or BEEMR
 paper. I have the following questions:

 1- What is srcPath in chooseTarget? Is it the path to the original
 un-chunked file, or is it a path to a single block, or something else? I
 added some code to blockplacementpolicydefault to print out the value of
 srcPath but the results look odd.

The arguments are documented in the interface javadoc:
https://github.com/apache/hadoop-common/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicy.java#L61

The srcPath is the file path of the file on HDFS for which the block
placement targets are being requested.

 2- Will a simple new File(srcPath) do?

Please rephrase? The srcPath is not a local file, if that's what you meant.

 3- I've spent time looking at hadoop source code. I can't find a way to go
 from srcPath in chooseTarget to a file size. Every function I think can do
 it, in FSNamesystem, FSDirectory, etc., is either non-public, or cannot be
 called from inside the blockmanagement package or blockplacement class.

The block placement is something that, within a context of a new file
creation, is called when requesting a new block. At this point the
file is not complete, so there is no way to determine its actual
length, but only the requested block size. I'm not certain if
BlockPlacementPolicy is what will solve your goal.

 How do I go from srcPath in blockplacement class to size of the file being
 placed?

Are you targeting in-progress files or completed files? The latter
form of files would result in placement policy calls iff there's an
under-replication/losses/etc. to block replicas of the original set.
Only for such operations would you have a possibility to determine the
actual full length of file (as explained above).

 Thank you,

 AB



-- 
Harsh J


Re: Building custom block placement policy. What is srcPath?

2014-07-24 Thread Arjun Bakshi

Hi,

Thanks for the reply. It cleared up a few things.

I hadn't thought of situations of under-replication, but I'll give it 
some thought now. It should be easier since, as you've mentioned, by 
that time the namenode knows all the blocks that came from the same file 
as the under-replicated block.


For the most part, I was thinking of when a new file is being placed on
the cluster. I think this is what you called in-progress files. Say a
new 1GB file needs to be placed onto the cluster. I want to make the
system take the fact that the file is 1GB in size into account while
placing all its blocks onto nodes in the cluster.

I'm not clear on where the file is broken down into blocks/chunks, in
terms of which class, which file system (local or HDFS), or where in the
process flow. Knowing that will help me come up with a solution. Where
is the last place, in terms of a function or point in the process, at which
I can find the name of the original file that is being placed on the system?


I'm reading the namenode and fsnamesystem code just to see if I can do 
what I want from there. Any suggestions will be appreciated.


Thank you,

AB



On 07/24/2014 02:12 PM, Harsh J wrote:

Hello,

(Inline)

On Thu, Jul 24, 2014 at 11:11 PM, Arjun Bakshi baksh...@mail.uc.edu wrote:

Hi,

I want to write a block placement policy that takes the size of the file
being placed into account. Something like what is done in CoHadoop or BEEMR
paper. I have the following questions:

1- What is srcPath in chooseTarget? Is it the path to the original
un-chunked file, or is it a path to a single block, or something else? I
added some code to blockplacementpolicydefault to print out the value of
srcPath but the results look odd.

The arguments are documented in the interface javadoc:
https://github.com/apache/hadoop-common/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicy.java#L61

The srcPath is the file path of the file on HDFS for which the block
placement targets are being requested.


2- Will a simple new File(srcPath) do?

Please rephrase? The srcPath is not a local file, if that's what you meant.


3- I've spent time looking at hadoop source code. I can't find a way to go
from srcPath in chooseTarget to a file size. Every function I think can do
it, in FSNamesystem, FSDirectory, etc., is either non-public, or cannot be
called from inside the blockmanagement package or blockplacement class.

The block placement is something that, within a context of a new file
creation, is called when requesting a new block. At this point the
file is not complete, so there is no way to determine its actual
length, but only the requested block size. I'm not certain if
BlockPlacementPolicy is what will solve your goal.


How do I go from srcPath in blockplacement class to size of the file being
placed?

Are you targeting in-progress files or completed files? The latter
form of files would result in placement policy calls iff there's an
under-replication/losses/etc. to block replicas of the original set.
Only for such operations would you have a possibility to determine the
actual full length of file (as explained above).


Thank you,

AB







Re: Building custom block placement policy. What is srcPath?

2014-07-24 Thread Harsh J
Hi,

Inline.

On Fri, Jul 25, 2014 at 2:55 AM, Arjun Bakshi baksh...@mail.uc.edu wrote:
 Hi,

 Thanks for the reply. It cleared up a few things.

 I hadn't thought of situations of under-replication, but I'll give it some
 thought now. It should be easier since, as you've mentioned, by that time
 the namenode knows all the blocks that came from the same file as the
 under-replicated block.

 For the most part, I was thinking of when a new file is being placed on the
 cluster. I think this is what you called in-progress files. Say a new 1GB
 file needs to be placed onto the cluster. I want to make the system take
 the fact that the file is 1GB in size into account while placing all its
 blocks onto nodes in the cluster.

You are assuming that all files are loaded into the cluster from an
existing file on another FS, such as a local FS and the command
'hadoop fs -put', and thereby you can know what the file length is
entirely going to be. This is incorrect as an assumption.

Programs can also write streams of arbitrary data based on their needs.
An HDFS writer can simply create a new file and write any number of
bytes to its output stream. To HDFS this is no different than a load;
it treats both such writes the same way - it's merely the client's goal
that differs.

 I'm not clear on where the file is broken down into blocks/chunks; in terms
 of which class, which file system(local or hdfs), or where in the process
 flow. Knowing that will help me come up with a solution. Where is the last
 place, in terms of a function or point in process that I can find the name
 of the original file that is being placed on the system?

The client program chunks the file as it writes. You can look at the
DFSOutputStream class for the client implementation.

 I'm reading the namenode and fsnamesystem code just to see if I can do what
 I want from there. Any suggestions will be appreciated.

 Thank you,

 AB




 On 07/24/2014 02:12 PM, Harsh J wrote:

 Hello,

 (Inline)

 On Thu, Jul 24, 2014 at 11:11 PM, Arjun Bakshi baksh...@mail.uc.edu
 wrote:

 Hi,

 I want to write a block placement policy that takes the size of the file
 being placed into account. Something like what is done in CoHadoop or
 BEEMR
 paper. I have the following questions:

 1- What is srcPath in chooseTarget? Is it the path to the original
 un-chunked file, or is it a path to a single block, or something else? I
 added some code to blockplacementpolicydefault to print out the value of
 srcPath but the results look odd.

 The arguments are documented in the interface javadoc:

 https://github.com/apache/hadoop-common/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockPlacementPolicy.java#L61

 The srcPath is the file path of the file on HDFS for which the block
 placement targets are being requested.

  2- Will a simple new File(srcPath) do?

 Please rephrase? The srcPath is not a local file, if that's what you meant.

 3- I've spent time looking at hadoop source code. I can't find a way to
 go
 from srcPath in chooseTarget to a file size. Every function I think can
 do
 it, in FSNamesystem, FSDirectory, etc., is either non-public, or cannot
 be
 called from inside the blockmanagement package or blockplacement class.

 The block placement is something that, within a context of a new file
 creation, is called when requesting a new block. At this point the
 file is not complete, so there is no way to determine its actual
 length, but only the requested block size. I'm not certain if
 BlockPlacementPolicy is what will solve your goal.

 How do I go from srcPath in blockplacement class to size of the file
 being
 placed?

 Are you targeting in-progress files or completed files? The latter
 form of files would result in placement policy calls iff there's an
 under-replication/losses/etc. to block replicas of the original set.
 Only for such operations would you have a possibility to determine the
 actual full length of file (as explained above).

 Thank you,

 AB







-- 
Harsh J


HDFS block placement

2013-07-26 Thread Lukas Kairies

Hey,

I am a bit confused about the block placement in Hadoop. Assume that
there is no replication and a task (map or reduce) writes a file to
HDFS; will all blocks be stored on the same local node (the node on
which the task runs)? I think yes, but I am not sure.


Kind Regards,
Lukas Kairies


Re: HDFS block placement

2013-07-26 Thread Harsh J
Your thought is correct. If space is available locally, then it is
automatically stored locally.

On Fri, Jul 26, 2013 at 5:14 PM, Lukas Kairies
lukas.xtree...@googlemail.com wrote:
 Hey,

 I am a bit confused about the block placement in Hadoop. Assume that there
 is no replication and a task (map or reduce) writes a file to HDFS; will
 all blocks be stored on the same local node (the node on which the task
 runs)? I think yes, but I am not sure.

 Kind Regards,
 Lukas Kairies



-- 
Harsh J


Re: Pluggable Block placement policy

2013-05-21 Thread Mohammad Mustaqeem
Hi Harsh,
 Here is a silly question for you :)
I want to ask where the PluggableBlockPlacementPolicy code should be
put.
As Ivan described how to configure it, I think that it should be put in
org.apache.hadoop.hdfs.server.namenode.
But when I put the file in this location and execute the mvn command, it
fails :(
Please explain the procedure.


On Mon, May 20, 2013 at 12:39 PM, Harsh J ha...@cloudera.com wrote:

 Hi Mohammad,

 I believe we've already answered this earlier to an almost exact question:
 http://search-hadoop.com/m/b8FAa2eI6kj1. Does that not suffice? Is there
 something more specific you are looking for?


 On Mon, May 20, 2013 at 12:30 PM, Mohammad Mustaqeem 
 3m.mustaq...@gmail.com wrote:

 Has anybody used a pluggable block placement policy?
 If yes, please give me directions on how to use this feature in Hadoop
 2.0.3-alpha; I also need the code of a pluggable block placement policy.

 Thanks in advance.

 --
 *With regards ---*
 *Mohammad Mustaqeem*,
  M.Tech (CSE)
 MNNIT Allahabad
 9026604270





 --
 Harsh J




-- 
*With regards ---*
*Mohammad Mustaqeem*,
M.Tech (CSE)
MNNIT Allahabad
9026604270


Pluggable Block placement policy

2013-05-20 Thread Mohammad Mustaqeem
Has anybody used a pluggable block placement policy?
If yes, please give me directions on how to use this feature in Hadoop
2.0.3-alpha; I also need the code of a pluggable block placement policy.

Thanks in advance.

-- 
*With regards ---*
*Mohammad Mustaqeem*,
M.Tech (CSE)
MNNIT Allahabad
9026604270


Re: Pluggable Block placement policy

2013-05-20 Thread Harsh J
Hi Mohammad,

I believe we've already answered this earlier to an almost exact question:
http://search-hadoop.com/m/b8FAa2eI6kj1. Does that not suffice? Is there
something more specific you are looking for?


On Mon, May 20, 2013 at 12:30 PM, Mohammad Mustaqeem 3m.mustaq...@gmail.com
 wrote:

 Has anybody used a pluggable block placement policy?
 If yes, please give me directions on how to use this feature in Hadoop
 2.0.3-alpha; I also need the code of a pluggable block placement policy.

 Thanks in advance.

 --
 *With regards ---*
 *Mohammad Mustaqeem*,
  M.Tech (CSE)
 MNNIT Allahabad
 9026604270





-- 
Harsh J


Block placement Policy

2013-05-04 Thread Mohammad Mustaqeem
I have read somewhere that a user can specify his own ReplicaPlacementPolicy.
How can I specify my own ReplicaPlacementPolicy?
If you have any sample ReplicaPlacementPolicy, please share it.

-- 
*With regards ---*
*Mohammad Mustaqeem*,
M.Tech (CSE)
MNNIT Allahabad
9026604270


Re: Block placement Policy

2013-05-04 Thread Mohammad Tariq
You might find this useful : https://issues.apache.org/jira/browse/HDFS-385

Warm Regards,
Tariq
https://mtariq.jux.com/
cloudfront.blogspot.com


On Sat, May 4, 2013 at 8:57 PM, Mohammad Mustaqeem
3m.mustaq...@gmail.comwrote:

 I have read somewhere that a user can specify his own
 ReplicaPlacementPolicy.
 How can I specify my own ReplicaPlacementPolicy?
 If you have any sample ReplicaPlacementPolicy, please share it.

 --
 *With regards ---*
 *Mohammad Mustaqeem*,
 M.Tech (CSE)
 MNNIT Allahabad
 9026604270



Re: Block placement Policy

2013-05-04 Thread Mohammad Mustaqeem
Do you know how to use it?
Sorry, I am very new :(

-- 
*With regards ---*
*Mohammad Mustaqeem*,
M.Tech (CSE)
MNNIT Allahabad
9026604270


Re: Block placement Policy

2013-05-04 Thread Rajat Sharma
Have a look at
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.
This is the default placement policy; you should be able to extend it to
implement your own policy and set the config parameter
dfs.block.replicator.classname to point to your class.

On Sat, May 4, 2013 at 8:57 PM, Mohammad Mustaqeem
3m.mustaq...@gmail.com wrote:
 I have read somewhere that a user can specify his own
 ReplicaPlacementPolicy.
 How can I specify my own ReplicaPlacementPolicy?
 If you have any sample ReplicaPlacementPolicy, please share it.

 --
 *With regards ---*
 *Mohammad Mustaqeem*,
 M.Tech (CSE)
 MNNIT Allahabad
 9026604270
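
A minimal sketch of the corresponding hdfs-site.xml entry on the namenode,
with a hypothetical class name:

    <property>
      <name>dfs.block.replicator.classname</name>
      <value>com.example.MyBlockPlacementPolicy</value>
    </property>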


Can a jobtracker directly access datanode information (block placement) on namenode?

2012-07-20 Thread Kyungyong Lee
Hello all,

I want to get, from a jobtracker, the datanode information (related to block
placement) that is kept at the namenode. As far as I understand, the
jobtracker uses data locality for job scheduling, so I believe the
jobtracker keeps this information somewhere in the source code.
However, I could not find the location. Can anyone give me a starting point
(source code) where the jobtracker has access to block placement
information? Thanks.


Re: Can a jobtracker directly access datanode information (block placement) on namenode?

2012-07-20 Thread Harsh J
Hi,

Any HDFS client can request a list of block locations for a given file
path (node-level detail of where blocks are placed for a file), via
the FileSystem#getFileBlockLocations API:
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/fs/FileSystem.html#getFileBlockLocations(org.apache.hadoop.fs.FileStatus,%20long,%20long)

MR too gets this info via the user's InputFormat#getSplits method, and
schedules with these locations.

On Sat, Jul 21, 2012 at 2:06 AM, Kyungyong Lee iamkyungy...@gmail.com wrote:
 Hello all,

 I want to get, from a jobtracker, the datanode information (related to block
 placement) that is kept at the namenode. As far as I understand, the
 jobtracker uses data locality for job scheduling, so I believe the jobtracker
 keeps this information somewhere in the source code. However, I could not
 find the location. Can anyone give me a starting point (source code) where
 the jobtracker has access to block placement information? Thanks.




-- 
Harsh J
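
A minimal sketch of that API from any HDFS client (the path argument is
hypothetical):

    import java.util.Arrays;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocations {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path(args[0]));
        // One BlockLocation per block, with the hosts holding its replicas.
        for (BlockLocation loc :
            fs.getFileBlockLocations(status, 0, status.getLen())) {
          System.out.println("offset=" + loc.getOffset()
              + " length=" + loc.getLength()
              + " hosts=" + Arrays.toString(loc.getHosts()));
        }
      }
    }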


Data block placement in HDFS

2012-07-03 Thread Zheng Da
Hello,

It's interesting to see HDFS provided an interface BlockPlacementPolicy for
users to define their own data placement policy in Hadoop v0.21, v0.22, and
I think it's very useful for some applications. However, I can't find this
interface in the latest stable version of Hadoop v1.0. Is the interface
replaced by something else or abandoned completely? If it's abandoned,
could anyone tell me why?

Thanks,
Da


Re: Data block placement in HDFS

2012-07-03 Thread Robert Evans
I think this deals more with our messed-up version numbering than anything.
Hadoop-1.0 is not a release that is derived from Hadoop 0.21 or 0.22; it comes
from the 0.20 line. Hadoop-2.0 is what comes from v0.21 and 0.22. So Hadoop-2
is going to have this, but Hadoop-1 will not.

--Bobby Evans

From: Zheng Da zhengda1...@gmail.com
Reply-To: hdfs-user@hadoop.apache.org
To: hdfs-user@hadoop.apache.org
Subject: Data block placement in HDFS

Hello,

It's interesting to see HDFS provided an interface BlockPlacementPolicy for 
users to define their own data placement policy in Hadoop v0.21, v0.22, and I 
think it's very useful for some applications. However, I can't find this 
interface in the latest stable version of Hadoop v1.0. Is the interface 
replaced by something else or abandoned completely? If it's abandoned, could 
anyone tell me why?

Thanks,
Da


Re: Data block placement in HDFS

2012-07-03 Thread Zheng Da
Is v0.21 or v0.22 a stable version? The website doesn't say it explicitly.

Thanks,
Da

On Tue, Jul 3, 2012 at 12:06 PM, Robert Evans ev...@yahoo-inc.com wrote:

 I think this deals more with our messed-up version numbering than
 anything.  Hadoop-1.0 is not a release that is derived from Hadoop 0.21 or
 0.22; it comes from the 0.20 line.  Hadoop-2.0 is what comes from v0.21 and
 0.22. So Hadoop-2 is going to have this, but Hadoop-1 will not.

 --Bobby Evans

 From: Zheng Da zhengda1...@gmail.com
 Reply-To: hdfs-user@hadoop.apache.org
 To: hdfs-user@hadoop.apache.org
 Subject: Data block placement in HDFS

 Hello,

 It's interesting to see HDFS provided an interface
 BlockPlacementPolicy for users to define their own data placement policy in
 Hadoop v0.21, v0.22, and I think it's very useful for some applications.
 However, I can't find this interface in the latest stable version of Hadoop
 v1.0. Is the interface replaced by something else or abandoned completely?
 If it's abandoned, could anyone tell me why?

 Thanks,
 Da



RE: block placement

2011-07-01 Thread Evert Lammerts
Well, here's my first Hadoop Jira :-)

https://issues.apache.org/jira/browse/MAPREDUCE-2636

Cheers,
Evert
 

From: Harsh J [ha...@cloudera.com]
Sent: Thursday, June 30, 2011 4:59 PM
To: hdfs-user@hadoop.apache.org
Subject: Re: block placement

Evert,

With the default behavior, every block request is handed a storage
device in a round robin fashion. So yes, parallel writes should be
well spread over the configured amount of disks. But that's not to say
that the file is perfectly distributed across the DNs as you describe.
The DNs are chosen randomly for writes (if no local one is available).

Regd. MR, I do not believe it does any such optimization right now (in
fact, the MR code is quite FS-agnostic). Right now, tasks are run on
nodes where blocks are located but metadata about which disk the block
may reside on is not maintained by the NN, so MR can't naturally know
this to do anything about. This would be good to discuss, however
(Search or file a new JIRA?)

On Thu, Jun 30, 2011 at 6:33 PM, Evert Lammerts evert.lamme...@sara.nl wrote:
 Hi list,

 How does the NN place blocks on the disks within a single node? Does it
 spread out adjacent blocks of a single file horizontally over the disks? For
 example, let's say I have four DNs and each has 4 disks. (And forget about
 replication.) If I copy a file consisting of 16 blocks of 128MB each to the
 cluster, will each disk have exactly one block of the file?

 If I run some job over this file with its sixteen blocks this is important,
 since the cluster would use its maximum I/O capabilities.

 This leads me to another question (which might be better off on mapred-user).
 Does the JT schedule its tasks to maximally use I/O capabilities? Would it
 try to process blocks that reside on a disk that is not currently being read
 from or written to? Or does it just use a randomized strategy?

 Cheers,
 Evert



--
Harsh J


Re: block placement

2011-06-30 Thread Harsh J
Evert,

With the default behavior, every block request is handed a storage
device in a round robin fashion. So yes, parallel writes should be
well spread over the configured amount of disks. But that's not to say
that the file is perfectly distributed across the DNs as you describe.
The DNs are chosen randomly for writes (if no local one is available).

Regd. MR, I do not believe it does any such optimization right now (in
fact, the MR code is quite FS-agnostic). Right now, tasks are run on
nodes where blocks are located but metadata about which disk the block
may reside on is not maintained by the NN, so MR can't naturally know
this to do anything about. This would be good to discuss, however
(Search or file a new JIRA?)

On Thu, Jun 30, 2011 at 6:33 PM, Evert Lammerts evert.lamme...@sara.nl wrote:
 Hi list,

 How does the NN place blocks on the disks within a single node? Does it
 spread out adjacent blocks of a single file horizontally over the disks? For
 example, let's say I have four DNs and each has 4 disks. (And forget about
 replication.) If I copy a file consisting of 16 blocks of 128MB each to the
 cluster, will each disk have exactly one block of the file?

 If I run some job over this file with its sixteen blocks this is important,
 since the cluster would use its maximum I/O capabilities.

 This leads me to another question (which might be better off on mapred-user).
 Does the JT schedule its tasks to maximally use I/O capabilities? Would it
 try to process blocks that reside on a disk that is not currently being read
 from or written to? Or does it just use a randomized strategy?

 Cheers,
 Evert



-- 
Harsh J
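
A follow-up note: later releases (Hadoop 2.1 onward) did make the per-disk
choice pluggable via dfs.datanode.fsdataset.volume.choosing.policy. For
example, the shipped AvailableSpaceVolumeChoosingPolicy biases new replicas
toward the less-full disks instead of pure round-robin:

    <property>
      <name>dfs.datanode.fsdataset.volume.choosing.policy</name>
      <value>org.apache.hadoop.hdfs.server.datanode.fsdataset.AvailableSpaceVolumeChoosingPolicy</value>
    </property>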


Local block placement policy, request

2011-05-26 Thread Jason Rutherglen
Is there a way to send a request to the name node to replicate
block(s) to a specific DataNode?  If not, what would be a way to do
this?  -Thanks


Re: Local block placement policy, request

2011-05-26 Thread Todd Lipcon
Hey Jason,

There are some non-public APIs to do this -- have a look at how the
Balancer works - the dispatch() function is the guts you're looking
for. It might be nice to expose this functionality as a limited
private evolving API.

In general, though, keep in mind that, whenever you write data, you'll
get a local copy first, if the writer is in the cluster. That's how
HBase gets locality for most of its accesses.

-Todd

On Thu, May 26, 2011 at 11:36 AM, Jason Rutherglen
jason.rutherg...@gmail.com wrote:
 Is there a way to send a request to the name node to replicate
 block(s) to a specific DataNode?  If not, what would be a way to do
 this?  -Thanks




-- 
Todd Lipcon
Software Engineer, Cloudera


Re: Block placement in HDFS

2008-11-25 Thread Dhruba Borthakur
Hi Dennis,

There were some discussions on this topic earlier:

http://issues.apache.org/jira/browse/HADOOP-3799

Do you have any specific use-case for this feature?

thanks,
dhruba

On Mon, Nov 24, 2008 at 10:22 PM, Owen O'Malley [EMAIL PROTECTED] wrote:


 On Nov 24, 2008, at 8:44 PM, Mahadev Konar wrote:

  Hi Dennis,
  I don't think that is possible to do.


 No, it is not possible.

   The block placement is determined
 by HDFS internally (which is local, rack local and off rack).


 Actually, it was changed in 0.17 or so to be node-local, off-rack, and a
 second node off rack.

 -- Owen



Re: Block placement in HDFS

2008-11-25 Thread Pete Wyckoff

Fyi - Owen is referring to:

https://issues.apache.org/jira/browse/HADOOP-2559


On 11/24/08 10:22 PM, Owen O'Malley [EMAIL PROTECTED] wrote:



On Nov 24, 2008, at 8:44 PM, Mahadev Konar wrote:

 Hi Dennis,
  I don't think that is possible to do.

No, it is not possible.

  The block placement is determined
 by HDFS internally (which is local, rack local and off rack).

Actually, it was changed in 0.17 or so to be node-local, off-rack, and
a second node off rack.

-- Owen




Re: Block placement in HDFS

2008-11-25 Thread Hyunsik Choi
Hi All,

I am trying to divide some data into partitions explicitly (like regions of
HBase). I wonder whether the following way to do it is the best method.

For example, if we assume a block size of 64MB, is a file portion
corresponding to 0~63MB allocated to the first block?

I have three questions as follows:

Is the above method valid?
Is it the best method?
Is there an alternative method?

Thanks in advance.

-- 
Hyunsik Choi
Database & Information Systems Group
Dept. of Computer Science & Engineering, Korea University


On Mon, 2008-11-24 at 20:44 -0800, Mahadev Konar wrote:
 Hi Dennis,
   I don't think that is possible to do.  The block placement is determined
 by HDFS internally (which is local, rack local and off rack).
 
 
 mahadev
 
 
 On 11/24/08 6:59 PM, dennis81 [EMAIL PROTECTED] wrote:
 
  
  Hi everyone,
  
  I was wondering whether it is possible to control the placement of the
  blocks of a file in HDFS. Is it possible to instruct HDFS about which nodes
  will hold the block replicas?
  
  Thanks!




Re: Block placement in HDFS

2008-11-24 Thread Owen O'Malley


On Nov 24, 2008, at 8:44 PM, Mahadev Konar wrote:


Hi Dennis,
 I don't think that is possible to do.


No, it is not possible.


 The block placement is determined
by HDFS internally (which is local, rack local and off rack).


Actually, it was changed in 0.17 or so to be node-local, off-rack, and  
a second node off rack.


-- Owen


Random block placement

2008-08-12 Thread John DeTreville
My understanding is that HDFS places blocks randomly. As I would expect,
then, when I use hadoop fsck to look at block placements for my files, I
see that some nodes have more blocks than the average. I would expect
that these hot spots would cause a performance hit relative to a more
even placement of blocks.

I'd like to experiment with non-random block placement to see if this
can provide a performance improvement. Where in the code would I start
looking to find the existing code for random placement?

Cheers,
John


Re: Random block placement

2008-08-12 Thread lohit
Hi John,

This file should be a good starting point for you:
src/hdfs/org/apache/hadoop/hdfs/server/namenode/ReplicationTargetChooser.java

There have been discussions about a pluggable block placement policy:
https://issues.apache.org/jira/browse/HADOOP-3799

Thanks,
Lohit

- Original Message 
From: John DeTreville [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Tuesday, August 12, 2008 3:31:49 PM
Subject: Random block placement

My understanding is that HDFS places blocks randomly. As I would expect,
then, when I use hadoop fsck to look at block placements for my files, I
see that some nodes have more blocks than the average. I would expect
that these hot spots would cause a performance hit relative to a more
even placement of blocks.

I'd like to experiment with non-random block placement to see if this
can provide a performance improvement. Where in the code would I start
looking to find the existing code for random placement?
Cheers,
John



Re: Question on DFS block placement and 'what is a rack' wrt DFS block placement

2008-02-12 Thread Colin Evans
When you factor in colo, power, configuration, and administration costs, 
fewer boxes is always cheaper.  I'm not expecting a 2x speedup for the 
extra CPU, but I'm curious what the hit is.




Ted Dunning wrote:

Doesn't the incremental CPU cost you as much as an entire extra box?


On 2/12/08 12:19 PM, Colin Evans [EMAIL PROTECTED] wrote:

  

The big question for me is how well a dual-CPU 4-core (8 cores per box)
configuration will do.  Has anyone tried out this configuration with
Intel or AMD CPUs?  Is the memory throughput sufficient?



  




RE: Question on DFS block placement and 'what is a rack' wrt DFS block placement

2008-02-12 Thread Joydeep Sen Sarma
There may still be remaining issues with this. One I am aware of is
https://issues.apache.org/jira/browse/HADOOP-2677, where smaller-capacity
nodes become too highly utilized to store mapred intermediate output.


-Original Message-
From: Jason Venner [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, February 12, 2008 12:02 PM
To: core-user@hadoop.apache.org
Subject: Re: Question on DFS block placement and 'what is a rack' wrt
DFS block placement

We are currently running 15.3, and hope to move to 16.1 when it comes
out...
Were the heterogeneous disk space issues fixed in 15.3?

Ted Dunning wrote:
 I have had issues with machines that are highly disparate in terms of disk
 space.  I expect that some of those issues have been mitigated in recent
 releases.


 On 2/12/08 11:51 AM, Jason Venner [EMAIL PROTECTED] wrote:

   
 We are starting to build larger clusters, and want to better understand
 how to configure the network topology.
 Up to now we have just been setting up a private vlan for the small
 clusters.

 We have been thinking about the following machine configurations:
 Compute nodes with a number of spindles and medium disk, that also serve DFS
 For every 4-8 of the above, one compute node with a large number of
 spindles with a large number of disks, to bulk out the DFS capacity.

 We are wondering what the best practices are for network topology in
 clusters that are built out of the above building blocks.
 We can readily have 2 or 4 network cards in each node.
 

   


Re: Question on DFS block placement and 'what is a rack' wrt DFS block placement

2008-02-12 Thread Ted Dunning

It isn't popular much anymore, but once upon a time, network topology for
clustering was a big topic.  Since then, switches have gotten pretty fast
and worrying about these things has gone out of fashion a bit other than
something on the level of the current rack-aware locality in Hadoop.

With 4 NIC's, you could replay some history, however, by building what
amounts to a 4-dimensional hyper-torus.  I *think* you could do pretty well by
having four parallel two-level switch networks and assign boxes to second
level switches according to a systematic pattern so that you would have
local access to much of the cluster.  As a simple example, suppose that you
have 16 machines M1 through M16, each with two interfaces.  You would have
two top level switches T1 and T2 and each of those would have four second
level switches S1.1 ... S1.4 connected to T1 and S2.1 ... S2.4 connected to
T2.  The machine connectivity on each interface would look like this:

Machine   eth0   eth1
  1       S1.1   S2.1
  2       S1.1   S2.2
  3       S1.1   S2.3
  4       S1.1   S2.4
  5       S1.2   S2.1
  6       S1.2   S2.2
  7       S1.2   S2.3
  8       S1.2   S2.4
  9       S1.3   S2.1
 10       S1.3   S2.2
 11       S1.3   S2.3
 12       S1.3   S2.4
 13       S1.4   S2.1
 14       S1.4   S2.2
 15       S1.4   S2.3
 16       S1.4   S2.4

With this arrangement machine M1 is on a local switch with M2, M3, M4, M5,
M9, and M13, which is twice as many machines as would be local if only one
interface were used.  With four interfaces and four machines on a local
switch, the entire cluster is local and T1 and T2 should see no traffic.  In
a larger cluster, you get the same 16x benefit in locality.

The cost is that your ops guys will kill you if you suggest something this
elaborate.  The wiring between racks alone will make this a nightmare.


On 2/12/08 3:01 PM, Jason Venner [EMAIL PROTECTED] wrote:

 In terms of more exotic situations we were discussing having 4 NIC's 1
 for the local subset, 1 for a pair of local subsets, 1 for another pair
 of local subsets, 1 for the backbone.



Re: Question on DFS block placement and 'what is a rack' wrt DFS block placement

2008-02-12 Thread Ted Dunning


Doesn't the incremental CPU cost you as much as an entire extra box?


On 2/12/08 12:19 PM, Colin Evans [EMAIL PROTECTED] wrote:

 The big question for me is how well a dual-CPU 4-core (8 cores per box)
 configuration will do.  Has anyone tried out this configuration with
 Intel or AMD CPUs?  Is the memory throughput sufficient?



Re: Question on DFS block placement and 'what is a rack' wrt DFS block placement

2008-02-12 Thread Ted Dunning

I would concur that it is much better to have sufficient storage in the
compute farm for DFS files to be local for the compute tasks.

Also, a 16 disk machine typically costs a good bit more than a 6 disk
machine + 10 disks because you usually require a second chassis.  Sun's
Thumper would be an interesting counter-example of this.

I have found (in my limited experience) that you want as many disk
controllers as you can get and that you want the disk as close to your
compute power as possible.  For me, that means that my ideal machine is a
moderate CPU or two attached to 1-3 TB of storage.  My smallest machines
have slow CPU with two SATA drives (could be 2 x 500GB, but mostly are 500GB
+ 73GB for historical reasons).  These machines can be had for $500
second-hand and $1000 new from reputable vendors.  My larger machines have
6 disks and dual Xeons, but cost about $3-4K and only have about twice the
net Hadoop throughput and take up twice the rack space.  I would *much*
rather have 6 times as many of the little boxes.


On 2/12/08 1:01 PM, Doug Cutting [EMAIL PROTECTED] wrote:

 From my reading, I conjecture that an ideal configuration would be 1
 local disk per cpu for local data/reducing, and some number of separate
 disks for dfs.
 Is this an accurate assessment?
 
 DFS storage is typically local on compute nodes.



Re: Question on DFS block placement and 'what is a rack' wrt DFS block placement

2008-02-12 Thread Doug Cutting

Jason Venner wrote:
Is disk arm contention (seek) a problem in a 6 disk configuration as 
most likely all of the disks would be serving /local/ and /dfs/?


It should not be.  MapReduce I/O is sequential, in chunks large
enough that seeks should not dominate.


Doug


Re: Question on DFS block placement and 'what is a rack' wrt DFS block placement

2008-02-12 Thread Ted Dunning

Why not down-grade the CPU power and increase the number of chassis to get
more disks (and controllers and network interfaces)?

On 2/12/08 12:53 PM, Jason Venner [EMAIL PROTECTED] wrote:

 We have 3 types of machines we can get: 2-disk, 6-disk, and 16-disk
 machines. *They all have 4 dual-core CPUs.*



Re: Question on DFS block placement and 'what is a rack' wrt DFS block placement

2008-02-12 Thread Doug Cutting

Jason Venner wrote:
We have 3 types of machines we can get: 2-disk, 6-disk, and 16-disk
machines. They all have 4 dual-core CPUs.

The 2-disk machines have about 1TB, the 6-disk about 3TB, and the 16-disk
about 8TB. The 16-disk machines have about 25% slower CPUs than
the 2/6-disk machines.

We handle a lot of bulky data, and don't think we can fit it all on the
3TB machines if those are our sole compute/DFS nodes.


Your performance will be better if you buy enough of the 6 disk nodes to 
hold all your data than if you intermix 16 disk nodes.  Are the 16 disk 
nodes considerably cheaper per byte stored than the 6 disk boxes?


From my reading, I conjecture that an ideal configuration would be 1 
local disk per cpu for local data/reducing, and some number of separate 
disks for dfs.

Is this an accurate assessment?


DFS storage is typically local on compute nodes.

Doug


Re: Question on DFS block placement and 'what is a rack' wrt DFS block placement

2008-02-12 Thread Jason Venner
We run Intel CPUs. The newer NorthBridges seem to have more memory
bandwidth than the older ones.
We have a mix of special-purpose assembly we call from Java, so we are
locked into Intel right now.
The performance benchmarks I have seen suggest Java runs about 30%
faster on the AMDs due to the higher memory bandwidth.


Colin Evans wrote:
Because of acquiring servers of different capacities at different 
times, we have 2 servers with 1TB of disk each, and 11 servers with 
~300GB each.  The 1TB servers tend to be under-utilized by HDFS given 
their capacity.  This makes sense, as block replicas need to be 
relatively evenly distributed across the cluster in order to allow 
tasks to be run close to data.  For our next cluster, we're going with
uniform disk, CPU, and memory configurations.
The big question for me is how well a dual-CPU 4-core (8 cores per 
box) configuration will do.  Has anyone tried out this configuration 
with Intel or AMD CPUs?  Is the memory throughput sufficient?



Jason Venner wrote:
We are starting to build larger clusters, and want to better 
understand how to configure the network topology.
Up to now we have just been setting up a private vlan for the small 
clusters.


We have been thinking about the following machine configurations
Compute nodes with a number of spindles and medium disk, that also 
serve DFS
For every 4-8 of the above, one compute node with a large number of 
spindles with a large number of disks, to bulk out the DFS capacity.


We are wondering what the best practices are for network topology in 
clusters that are built out of the above building blocks.

We can readily have 2 or 4 network cards in each node.




--
Jason Venner
Attributor - Publish with Confidence http://www.attributor.com/
Attributor is hiring Hadoop Wranglers, contact if interested


Re: Question on DFS block placement and 'what is a rack' wrt DFS block placement

2008-02-12 Thread Colin Evans
Because of acquiring servers of different capacities at different times, 
we have 2 servers with 1TB of disk each, and 11 servers with ~300GB 
each.  The 1TB servers tend to be under-utilized by HDFS given their 
capacity.  This makes sense, as block replicas need to be relatively 
evenly distributed across the cluster in order to allow tasks to be run 
close to data.  For our next cluster, we're going with uniform disk,
CPU, and memory configurations. 

The big question for me is how well a dual-CPU 4-core (8 cores per box) 
configuration will do.  Has anyone tried out this configuration with 
Intel or AMD CPUs?  Is the memory throughput sufficient?



Jason Venner wrote:
We are starting to build larger clusters, and want to better 
understand how to configure the network topology.
Up to now we have just been setting up a private vlan for the small 
clusters.


We have been thinking about the following machine configurations
Compute nodes with a number of spindles and medium disk, that also 
serve DFS
For every 4-8 of the above, one compute node with a large number of 
spindles with a large number of disks, to bulk out the DFS capacity.


We are wondering what the best practices are for network topology in 
clusters that are built out of the above building blocks.

We can readily have 2 or 4 network cards in each node.




Re: Question on DFS block placement and 'what is a rack' wrt DFS block placement

2008-02-12 Thread Ted Dunning


I have had issues with machines that are highly disparate in terms of disk
space.  I expect that some of those issues have been mitigated in recent
releases.


On 2/12/08 11:51 AM, Jason Venner [EMAIL PROTECTED] wrote:

 We are starting to build larger clusters, and want to better understand
 how to configure the network topology.
 Up to now we have just been setting up a private vlan for the small
 clusters.
 
 We have been thinking about the following machine configurations
 Compute nodes with a number of spindles and medium disk, that also serve DFS
 For every 4-8 of the above, one compute node with a large number of
 spindles with a large number of disks, to bulk out the DFS capacity.
 
 We are wondering what the best practices are for network topology in
 clusters that are built out of the above building blocks.
 We can readily have 2 or 4 network cards in each node.



Re: Question on DFS block placement and 'what is a rack' wrt DFS block placement

2008-02-12 Thread Jason Venner

We are currently running 15.3, and hope to move to 16.1 when it comes out...
Were the heterogeneous disk space issues fixed in 15.3?

Ted Dunning wrote:

I have had issues with machines that are highly disparate in terms of disk
space.  I expect that some of those issues have been mitigated in recent
releases.


On 2/12/08 11:51 AM, Jason Venner [EMAIL PROTECTED] wrote:

  

We are starting to build larger clusters, and want to better understand
how to configure the network topology.
Up to now we have just been setting up a private vlan for the small
clusters.

We have been thinking about the following machine configurations
Compute nodes with a number of spindles and medium disk, that also serve DFS
For every 4-8 of the above, one compute node with a large number of
spindles with a large number of disks, to bulk out the DFS capacity.

We are wondering what the best practices are for network topology in
clusters that are built out of the above building blocks.
We can readily have 2 or 4 network cards in each node.