settings for dfs.block.size. MR programs should carry it
as well; you can verify that by checking the job.xml of a job. If it
doesn't have the proper value, ensure the submitting user has client
configs with the block size you want them to use.
However, folks can still override client configs if they want to.
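For illustration only, a minimal sketch of carrying the setting in a job, assuming the era's property name dfs.block.size (value in bytes); files the job writes to HDFS would then use 128 MB blocks:

    import org.apache.hadoop.conf.Configuration;

    Configuration conf = new Configuration();
    // 128 MB in bytes; this is the value that shows up in the job.xml
    conf.setLong("dfs.block.size", 128L * 1024 * 1024); // 134217728
    // ... build and submit the job with this conf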
Hi Anurag,
The easiest option would be to set dfs.block.size to 128 MB in your MapReduce job.
--Original Message--
From: Anurag Tangri
To: hdfs-u...@hadoop.apache.org
To: common-user@hadoop.apache.org
ReplyTo: common-user@hadoop.apache.org
Subject: change hdfs block size for file existing on HDFS
Hi,
We have a situation where all our files have a 64 MB block size.
I want to change these files (mainly the output of map jobs) to 128 MB blocks.
What would be a good way to do this migration from 64 MB to 128 MB block
files?
Thanks,
Anurag Tangri
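A hedged sketch of one migration route (the distcp approach suggested later in this archive); the paths are hypothetical and the value is 128 MB in bytes:

    # rewrite the files with the new block size; the originals are untouched
    hadoop distcp -D dfs.block.size=134217728 /data/64mb-files /data/128mb-files
    # spot-check the per-file block size of a copied file
    hadoop fs -stat %o /data/128mb-files/part-00000

Since the block size is a per-file property, rewriting the data is what actually changes it.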
hi,
Here is some useful info:
A small file is one which is significantly smaller than the HDFS block size
(default 64MB). If you’re storing small files, then you probably have lots of
them (otherwise you wouldn’t turn to Hadoop), and the problem is that HDFS
can’t handle lots of files.
Every file, directory and block in HDFS is represented as an object in the
namenode's memory, each of which occupies about 150 bytes, as a rule of thumb.

On 29 September 2011 18:39, lessonz wrote:
I'm new to Hadoop, and I'm trying to understand the implications of a 64M
block size in the HDFS. Is there a good reference that enumerates the
implications of this decision and its effects on files stored in the system
as well as map-reduce jobs?
Thanks.
Hi, Joey:
Thanks for your help!
2011-09-21
hao.wang
From: Joey Echeverria
Sent: 2011-09-21 10:10:54
To: common-user
Cc:
Subject: Re: block size
HDFS blocks are stored as files in the underlying filesystem of your
datanodes. Those files do not take a fixed amount of space, so if you
store a file smaller than the block size, it only consumes as much
underlying disk as the file actually needs. The namenode does, however,
incur overhead by having to track a larger number of small files. So, if
you can merge files, it's best practice to do so.
-Joey
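A rough sketch of the merge step, assuming plain text files where simple concatenation preserves record boundaries (paths are illustrative):

    # copy-merge the small files to local disk, then re-upload as one file
    hadoop fs -getmerge /data/small-files merged.txt
    hadoop fs -put merged.txt /data/merged/merged.txt

For formats where concatenation isn't safe (e.g. SequenceFiles), an identity MapReduce job over the inputs is the usual alternative.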
On Tue, Sep 20, 2011 at 9:54 PM, hao.wang wrote:
Hi All:
I have lots of small files stored in HDFS. My HDFS block size is 128M. Each
file is significantly smaller than the HDFS block size. I want to know
whether each small file still takes up 128M in HDFS.
regards
2011-09-21
hao.wang
I like your second solution. But I am not sure whether the namenode
will divide those 128MB blocks into smaller ones in the future or not.

Chen

On Wed, May 4, 2011 at 3:00 PM, Harsh J wrote:
> Your client (put) machine must have the same block size configuration
> during upload as well.
>
> Alternatively, you may do something explicit like `hadoop dfs
> -Ddfs.block.size=size -put file file`
Your client (put) machine must have the same block size configuration
during upload as well.
Alternatively, you may do something explicit like `hadoop dfs
-Ddfs.block.size=size -put file file`
On Thu, May 5, 2011 at 12:59 AM, He Chen wrote:
Hi all
I met a problem when changing the block size from 64M to 128M. I am sure I
modified the correct configuration file (hdfs-site.xml), because I can change
the replication number correctly. However, the block size change does not
take effect.
For example:
I changed dfs.block.size to 134217728
will be combined, if there is a combine function. So the data size is really
uncertain during the process. From HDFS's perspective, it just sees the data
arrive group by group, with no idea about io.sort.mb, which is the buffer's
total size.
That's why I think setting the HDFS block size to
Hi Walker,
Thanks for your feedback. I was actually thinking that io.sort.mb could be
some factor of block size and not equal to block size. This will avoid
re-tuning of sort buffer sizes and spill threshold values for different HDFS
block sizes. Am I missing something?
Thanks,
-Shrinivas
shrunk to some degree which we are not sure of. In a word, the data's final
size is uncertain, so using it to configure the HDFS block size is kind of
meaningless.
Good Luck
Walker Gu.

2011/4/13 Shrinivas Joshi
I'm using the append 0.20.3 branch and am wondering why the following
fails, where setting the block size either in the Configuration or the
DFSClient.create method causes a failure later on when writing a file
out.
Configuration conf = new Configuration();
long blockSize = (long)32 * 1024 * 1024; // presumably 32 MB; the trailing factor is truncated in the archive
Looking at workloads like TeraSort where intermediate map output is
proportional to HDFS block size, I was wondering whether it would be
beneficial to have a mechanism for setting buffer spaces like io.sort.mb to
be a certain factor of HDFS block size? I am sure there are other config
parameters
Harsh, thanks, and sounds good!
On Tue, Apr 12, 2011 at 7:08 AM, Harsh J wrote:
Hey Jason,
On Tue, Apr 12, 2011 at 7:06 PM, Jason Rutherglen wrote:
> Are there performance implications to setting the block size to 1 GB
> or higher (via the DFSClient.create method)?

You'll be streaming 1 complete GB per block to a DN with that value
(before the next block gets scheduled).
Are there performance implications to setting the block size to 1 GB
or higher (via the DFSClient.create method)?
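For reference, a hedged sketch of doing this through the public FileSystem API rather than DFSClient directly (the path and replication factor are illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    long blockSize = 1024L * 1024 * 1024;             // 1 GB; note the long literal
    FSDataOutputStream out = fs.create(
        new Path("/tmp/bigblock"),                    // hypothetical path
        true,                                         // overwrite
        conf.getInt("io.file.buffer.size", 4096),     // io buffer size
        (short) 3,                                    // replication
        blockSize);
    out.close();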
not large enough to occupy the full size of the block.

From the statement (cited from the book) "Unlike a filesystem for a single
disk, a file in HDFS that is smaller than a single block does not occupy a
full block's worth of underlying storage," I can understand that the physical
space left over from the initial block size will be free. My question is: can
the underlying operating system reuse/write this remaining free space?
I'll look forward to your answers.
Thank you,
Florin
Currently, I have a problem with reducing the output of the mappers.
11/02/23 09:57:45 INFO input.FileInputFormat: Total input paths to
process : 4157
11/02/23 09:57:47 WARN conf.Configuration: mapred.map.tasks is
deprecated. Instead, use mapreduce.job.maps
11/02/23 09:57:47 INFO mapreduce.JobSubmitter:
Yeah,
That's not gonna work. You need to pre-process your input files to
concatenate them into larger files and then set your dfs.blocksize
accordingly. Otherwise your jobs will be slow, slow slow.
On Tue, Feb 22, 2011 at 3:57 AM, Jun Young Kim wrote:
hi, all.
I know the dfs.blocksize key can affect the performance of Hadoop.
In my case, I have thousands of directories containing many different-sized
input files (file sizes range from 10K to 1G).
In this case, how can I choose dfs.blocksize to get the best performance?
> (mapred.min.split.size can only be set larger than the HDFS block size)

I haven't tried this with the new mapreduce API, but with
-Dmapred.min.split.size= -Dmapred.map.tasks=1
I think this would let you set a split size smaller than the HDFS block size :)
Koji
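For context, the old mapred FileInputFormat computes split sizes roughly like this (paraphrased, not verbatim from the source), which is why mapred.min.split.size can only push splits above the block size, while a large mapred.map.tasks shrinks the goal size below it:

    long goalSize  = totalSize / Math.max(1, numMapTasks);  // from mapred.map.tasks
    long splitSize = Math.max(minSplitSize,                 // mapred.min.split.size
                              Math.min(goalSize, blockSize));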
Generally, if you have large files, setting the block size to 128M or larger is
helpful. You can do that on a per file basis or set the block size for the
whole filesystem. The larger block size cuts down on the number of map tasks
required to handle the overall data size. I've experim
Hi,
I'm recently benchmarking Hadoop. I know two ways to control the input data
size for each map task: by changing the HDFS block size (which requires
reloading the data into HDFS), or by setting mapred.min.split.size.
For my benchmarking task, I need to change the input size for a map
That's correct. That is why teragen, the program that generates data to be
sorted in terasort, is an MR program :-)
- Milind
On Oct 21, 2010, at 9:47 PM, elton sky wrote:
Milind,
You are right. But that only happens when your client is one of the datanodes
in HDFS; otherwise a random node will be picked for the first replica.
On Fri, Oct 22, 2010 at 3:37 PM, Milind A Bhandarkar wrote:
> If a file of say, 12.5 GB were produced by a single task with replication
If a file of say, 12.5 GB were produced by a single task with replication 3,
the default replication policy will ensure that the first replica of each block
will be created on local datanode. So, there will be one datanode in the
cluster that contains one replica of all blocks of that file. Map
Hmm, this is interesting: how did it manage to keep the blocks local? Why
was performance better?
On Thu, Oct 21, 2010 at 11:43 AM, Owen O'Malley wrote:
> The block sizes were 2G. The input format made splits that were more than a
> block because that led to better performance.
>
> -- Owen
>
The block sizes were 2G. The input format made splits that were more than a
block because that led to better performance.
-- Owen
I thought the petasort benchmark you published used 12.5G block sizes. How
did you make that work?
On Mon, Oct 18, 2010 at 4:27 PM, Owen O'Malley wrote:
> Block sizes larger than 2**31 are known to not work. I haven't ever tracked
> down the problem, just set my block size to be smaller than that.
On 18/10/10 23:07, Michael Segel wrote:
> Ok, I'll bite.
> Why would you want to use a block size of > 2GB?

1. Some of the events coming off large physics devices are single
self-contained files of 3+ GB size; having a block size which has an
event in a single block guarantees locality
I am curious, any specific reason to make it smaller than 2**31?
On Tue, Oct 19, 2010 at 10:27 AM, Owen O'Malley wrote:
> Block sizes larger than 2**31 are known to not work. I haven't ever tracked
> down the problem, just set my block size to be smaller than that.
>
> -- Owen
>
On Oct 18, 2010, at 4:08 PM, elton sky wrote:
>> Why would you want to use a block size of > 2GB?
> For keeping a map's input split in a single block~
Just use mapred.min.split.size + MultiFileInputFormat.
If there is a hard requirement for input split being one block you could just
make your input split fit a smaller block size.
Just saying, in case you can't overcome the 2G ceiling
J
Sent from my mobile. Please excuse the typos.
On 2010-10-18, at 5:08 PM, "elton sky" wrote:
Block sizes larger than 2**31 are known to not work. I haven't ever
tracked down the problem, just set my block size to be smaller than
that.
-- Owen
> Why would you want to use a block size of > 2GB?

For keeping a map's input split in a single block~
On Tue, Oct 19, 2010 at 9:07 AM, Michael Segel wrote:
>
> Ok, I'll bite.
> Why would you want to use a block size of > 2GB?
Ok, I'll bite.
Why would you want to use a block size of > 2GB?
> Date: Mon, 18 Oct 2010 21:33:34 +1100
> Subject: BUG: Anyone use block size more than 2GB before?
> From: eltonsky9...@gmail.com
> To: common-user@hadoop.apache.org
On Oct 18, 2010, at 3:33 AM, elton sky wrote:
> When I use blockSize bigger than 2GB, which is out of the boundary of
> integer, something weird would happen. For example, for a 3GB block it will
> create more than 2 million packets.
>
> Anyone noticed this before?

https://issues.apache.org
Hello,
In hdfs.org.apache.hadoop.hdfs.DFSClient.DFSOutputStream.writeChunk(byte[]
b, int offset, int len, byte[] checksum), the second-to-last line is:

int psize = Math.min((int)(blockSize-bytesCurBlock), writePacketSize);

When I use a blockSize bigger than 2GB, which is out of the boundary of
integer, something weird would happen. For example, for a 3GB block it will
create more than 2 million packets. Anyone noticed this before?
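A tiny standalone illustration of the overflow being described (not the HDFS code itself):

    // (blockSize - bytesCurBlock) no longer fits in 31 bits for a 3 GB block,
    // so the (int) cast wraps to a negative number and the min() goes wrong.
    long blockSize = 3L * 1024 * 1024 * 1024;  // 3 GB
    long bytesCurBlock = 0;
    int writePacketSize = 65536;
    int psize = Math.min((int) (blockSize - bytesCurBlock), writePacketSize);
    System.out.println(psize);  // prints -1073741824, not 65536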
That makes sense. Thanks Alex and Jeff.
-Gang
-- Original Message --
From: Alex Kozlov
To: common-user@hadoop.apache.org
Sent: 2010/9/8 (Wed) 1:31:14 PM
Subject: Re: change HDFS block size
The block size is a per-file property, so it will change only for the newly
created files. If you want to
The block size is a per-file property, so it will change only for the newly
created files. If you want to change the block size for the 'legacy' files,
you'll need to recreate them, for example with the distcp command (for the
new block size 512M):
hadoop distcp -D dfs.block.size=536870912 <src> <dst>
Those legacy files won't change block size (the NameNode keeps the mapping
between blocks and files); only newly added files will get the new block size.

On Tue, Sep 7, 2010 at 7:27 PM, Gang Luo wrote:
Hi all,
I need to change the block size (from 128m to 64m) and have to shut down the
cluster first. I was wondering what will happen to the current files on HDFS
(with 128M block size). Are they still there and usable? If so, what is the
block size of those legacy files?
Thanks,
-Gang
>> central catalog. If you start with a POSIX filesystem namespace (and the
>> guarantees it implies), what rules must you relax in order to arrive at DNS?
>> On the scale of managing million (billion? ten billion? trillion?) files,
>> are any of the assumptions relevant?

I don't know the answers to these questions, but I suspect they become
important over the next 10 years.

Brian

PS - I started thinking along these lines during MSST when the LLNL guy was
speculating about what it meant to "fsck" a file system with 1 trillion files.

On May 18, 2010, at 12:56 PM, Konstantin Shvachko wrote:
Okay, sorry then, I misunderstood.
I think I could as well run it on empty files; I would only get task startup
overhead.
Thank you.
On Tue, May 18, 2010 at 11:36 PM, Patrick Angeles wrote:
> shouldn't a HAR (Hadoop Archive) work well in this situation?
Yes, or CombineFileInputFormat. JVM reuse also helps somewhat, so long as
you're not talking about hundreds of thousands of files (in which case it
starts to hurt JT load with that many tasks in jobs)
There are a number of ways
I'm not familiar with how to use/create them, but shouldn't a HAR (Hadoop
Archive) work well in this situation? I thought it was designed to collect
several small files together through another level of indirection, to avoid
the NN load without decreasing the HDFS block size.
Nick Jones
That wasn't sarcasm. This is what you do:
- Run your mapreduce job on 30k small files.
- Consolidate your 30k small files into larger files.
- Run mapreduce on the larger files.
- Compare the running times.
The difference in runtime is made up by your task startup and seek overhead.
If you want to
Thanks for the sarcasm, but with 30k small files and so 30k Mapper
instantiations, even though it's not (and never did I say it was) the only
metric that matters, it seems to me like something very interesting to check
out...
I have hierarchy over me and they will be happy to understand my choices
I had an experiment with a block size of 10 bytes (sic!). This was _very_ slow
on the NN side. Writing 5 MB took 25 minutes or so :( No fun, to say the
least...
On Tue, May 18, 2010 at 10:56 AM, Konstantin Shvachko wrote:
You can also get some performance numbers and answers to the block size dilemma
problem here:
http://developer.yahoo.net/blogs/hadoop/2010/05/scalability_of_the_hadoop_dist.html
I remember some people were using Hadoop for storing or streaming videos.
Don't know how well that worked.
It
Hey Hassan,
1) The overhead is pretty small, measured in a small number of milliseconds on
average
2) HDFS is not designed for "online latency". Even though the average is
small, if something "bad happens", your clients might experience a lot of
delays while going through the retry stack. The
This is a very interesting thread to us, as we are thinking about deploying
HDFS as a massive online storage for an online university, and then
serving the video files to students who want to view them.
We cannot control the size of the videos (and some class work files), as
they will mostly be
If you know how to use AspectJ to do aspect-oriented programming, you can
write an aspect class and let it monitor the whole process of MapReduce.
On Tue, May 18, 2010 at 10:00 AM, Patrick Angeles wrote:
Should be evident in the total job running time... that's the only metric
that really matters :)
On Tue, May 18, 2010 at 10:39 AM, Pierre ANCELOT wrote:
Thank you,
Any way I can measure the startup overhead in terms of time?
On Tue, May 18, 2010 at 4:27 PM, Patrick Angeles wrote:
Pierre,
Adding to what Brian has said (some things are not explicitly mentioned in
the HDFS design doc)...
- If you have small files that take up < 64MB you do not actually use the
entire 64MB block on disk.
- You *do* use up RAM on the NameNode, as each block represents meta-data
that needs to be held in memory.
Okay, thank you :)
On Tue, May 18, 2010 at 2:48 PM, Brian Bockelman wrote:
On May 18, 2010, at 7:38 AM, Pierre ANCELOT wrote:
> Hi, thanks for this fast answer :)
> If so, what do you mean by blocks? If a file has to be split, will it be
> split when larger than 64MB?
>
For every 64MB of the file, Hadoop will create a separate block. So, if you
have a 32KB file, that's a single block holding just 32KB.
... and by slices of 64MB then I mean...
?
On Tue, May 18, 2010 at 2:38 PM, Pierre ANCELOT wrote:
Hi, thanks for this fast answer :)
If so, what do you mean by blocks? If a file has to be split, will it be
split when larger than 64MB?
On Tue, May 18, 2010 at 2:34 PM, Brian Bockelman wrote:
Hey Pierre,
These are not traditional filesystem blocks - if you save a file smaller than
64MB, you don't lose 64MB of file space.
Hadoop will use 32KB to store a 32KB file (ok, plus a KB of metadata or so),
not 64MB.
Brian
On May 18, 2010, at 7:06 AM, Pierre ANCELOT wrote:
Hi,
I'm porting a legacy application to Hadoop and it uses a bunch of small
files.
I'm aware that having such small files ain't a good idea, but I'm not making
the technical decisions and the port has to be done for yesterday...
Of course such small files are a problem; loading 64MB blocks for a few
Hi,
Pass the -D property on the command line, e.g.:
hadoop fs -Ddfs.block.size= .
You can check if it's actually set the way you needed by hadoop fs -stat %o
HTH,
Amogh
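A concrete instance of the above, with an illustrative 128 MB value and a hypothetical path:

    hadoop fs -Ddfs.block.size=134217728 -put myfile /user/foo/myfile
    hadoop fs -stat %o /user/foo/myfile    # prints 134217728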
On 4/14/10 9:01 AM, "Andrew Nguyen" wrote:
I thought I saw a way to specify the block size for individual files using the
command-line using "hadoop dfs -put/copyFromLocal..." However, I can't seem to
find the reference anywhere.
I see that I can do it via the API but no references to a command-line
mechanism. Am I just missing it?
Can I just change the block size in the config and restart, or do I have to
reformat? It's okay if what is currently in the file system stays at the old
block size, if that's possible.
> Cloudera has a pretty detailed blog on this.
Indeed. See http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/.
The post is getting a bit long in the tooth but should contain some useful
information for you.
Regards,
Jeff
Replies inline.
On 11/14/09 9:55 PM, "Hrishikesh Agashe" wrote:
Hi,
Default DFS block size is 64 MB. Does this mean that if I put a file smaller
than 64 MB on HDFS, it will not be divided any further?
--Yes, the file will be stored in a single block per replica.
I have lots and lots of
Hi,
Default DFS block size is 64 MB. Does this mean that if I put a file smaller
than 64 MB on HDFS, it will not be divided any further?
I have lots and lots of XMLs and I would like to process them directly.
Currently I am converting them to Sequence files (10 XMLs per sequence file)
and the