Inconsistent state in JobTracker (cdh)
Hi all, from time to time we are experiencing some odd behavior in the JobTracker (using the CDH release, currently cdh3u3, but I suppose this affects at least all CDH3 releases so far). What we are seeing is an M/R job being stuck between the map and reduce phases, with 100% of the maps completed but the web UI reporting 1 running map task; and since we have mapred.reduce.slowstart.completed.maps set to 1.0 (for better job throughput), the reduce phase will never start and the job has to be killed. I have investigated this a bit and I think I have found the reason. The JobTracker log shows:

12/11/20 01:05:10 INFO mapred.JobInProgress: Task 'attempt_201211011002_1852_m_007638_0' has completed task_201211011002_1852_m_007638 successfully.
12/11/20 01:05:10 WARN hdfs.DFSClient: DataStreamer Exception: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on <some output path> File does not exist. [Lease. Holder: DFSClient_408514838, pendingcreates: 1]
12/11/20 01:05:10 WARN hdfs.DFSClient: Error Recovery for block blk_-1434919284750099885_670717751 bad datanode[0] nodes == null
12/11/20 01:05:10 WARN hdfs.DFSClient: Could not get block locations. Source file <some output path> - Aborting...
12/11/20 01:05:10 INFO mapred.JobHistory: Logging failed for job job_201211011002_1852, removing PrintWriter from FileManager
12/11/20 01:05:10 ERROR security.UserGroupInformation: PriviledgedActionException as:mapred (auth:SIMPLE) cause:java.io.IOException: java.util.ConcurrentModificationException
12/11/20 01:05:10 INFO ipc.Server: IPC Server handler 7 on 9001, call heartbeat(org.apache.hadoop.mapred.TaskTrackerStatus@1256e5f6, false, false, true, -17988) from 10.2.73.35:44969: error: java.io.IOException: java.util.ConcurrentModificationException

When I look at the source code of JobInProgress.completedTask(), I see the log line about the successful completion of the task, followed by the logging to HDFS (JobHistory.Task.logFinished()). I suppose that if this call throws an exception (as in the case above), the call to completedTask() is aborted *before* the counters runningMapTasks and finishedMapTasks are updated accordingly. I took a heap dump of the JobTracker and indeed found runningMapTasks set to 1, while finishedMapTasks was equal to numMapTasks - 1. Now the questions: should this be handled in the JobTracker (say, by moving the logging code after the counter manipulation)? Or should the TaskTracker re-report the completed task when the JobTracker hits an error? What can cause the LeaseExpiredException? Should a JIRA be filed? :) Thanks for comments, Jan
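To make the suspected ordering concrete, here is a hypothetical, heavily simplified sketch of the failure mode described above. It is not the actual JobInProgress code, just a self-contained toy showing why an exception thrown by the history logging step would leave the counters untouched:

import java.io.IOException;

public class CounterOrderingSketch {
    static int runningMapTasks = 1;
    static int finishedMapTasks = 0;

    // stand-in for JobHistory.Task.logFinished(); imagine the LeaseExpiredException here
    static void logFinishedToHdfs() throws IOException {
        throw new IOException("LeaseExpiredException: No lease on output path");
    }

    static void completedTask() throws IOException {
        System.out.println("Task has completed successfully.");  // the log line we do see
        logFinishedToHdfs();                                      // throws
        runningMapTasks -= 1;                                     // never reached
        finishedMapTasks += 1;
    }

    public static void main(String[] args) {
        try {
            completedTask();
        } catch (IOException e) {
            System.out.println("completedTask aborted: " + e.getMessage());
        }
        // matches the heap dump: runningMapTasks is still 1
        System.out.println("runningMapTasks=" + runningMapTasks
                + ", finishedMapTasks=" + finishedMapTasks);
    }
}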
Start time, end time, and task tracker of individual tasks of a job
Hello, Is there a way to obtain information about each individual task of a map-reduce job, including start time, end time, which task tracker ran the task and so on? I know this information can be found through the web interface running on the JobTracker. But is it possible to redirect the information to a nicely formatted log file for each job? By the way, I'm running hadoop 0.20.2-cdh3u5. Thanks in advance for the help. Cheers Jeff
block size
Guys, After changing the block size property from 64 to 128 MB, will I need to re-import data, or will running the hadoop balancer resize the existing blocks in HDFS? Thanks, AK
RE: block size
Cheers! From: Kai Voigt [mailto:k...@123.org] Sent: Tuesday, November 20, 2012 11:34 AM To: user@hadoop.apache.org Subject: Re: block size Hi, On 20.11.2012 at 17:31, Kartashov, Andy <andy.kartas...@mpac.ca> wrote: "After changing property of block size from 64 to 128Mb, will I need to re-import data or will running hadoop balancer resize blocks in hdfs?" The block size affects new files only; existing files will not be modified. As you said, you need to re-import those old files if you want to store them with the new block size. Kai -- Kai Voigt k...@123.org
Re: Start time, end time, and task tracker of individual tasks of a job
Hey Jeff, Yes, we expose some information for each task completion event. For the old API, use RunningJob, specifically: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapred/RunningJob.html#getTaskCompletionEvents(int) For the new API, use Job, specifically: http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/Job.html#getTaskCompletionEvents(int) On Tue, Nov 20, 2012 at 9:43 PM, Jeff LI <uniquej...@gmail.com> wrote: Hello, Is there a way to obtain information about each individual task of a map-reduce job, including start time, end time, which task tracker ran the task and so on? I know this information can be found through the web interface running on the JobTracker. But is it possible to redirect the information to a nicely formatted log file for each job? By the way, I'm running hadoop 0.20.2-cdh3u5. Thanks in advance for the help. Cheers Jeff -- Harsh J
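For illustration, a minimal old-API sketch of polling those completion events and dumping them to stdout (the job ID is passed in as a placeholder argument, and the printed fields are just examples of what TaskCompletionEvent exposes):

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;
import org.apache.hadoop.mapred.TaskCompletionEvent;

public class TaskEventDump {
    public static void main(String[] args) throws Exception {
        JobClient client = new JobClient(new JobConf());
        RunningJob job = client.getJob(JobID.forName(args[0]));  // e.g. job_201211011002_1852
        int from = 0;
        TaskCompletionEvent[] events;
        while ((events = job.getTaskCompletionEvents(from)).length > 0) {
            for (TaskCompletionEvent e : events) {
                // attempt id, final status, task tracker URL, task run time in ms
                System.out.println(e.getTaskAttemptId() + "\t" + e.getTaskStatus()
                        + "\t" + e.getTaskTrackerHttp() + "\t" + e.getTaskRunTime());
            }
            from += events.length;
        }
    }
}

Redirecting this output to a file gives a simple per-job log; per-task start and end times are also available from the job history files mentioned later in this thread.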
Re: debugging hadoop streaming programs (first code)
The MapReduce web UI gives you all the information you need for debugging your code. Depending on where your JobTracker is, you should go hit $JT_HOST_NAME:50030 and check the job link as well as the task, task attempt and log pages. HTH +Vinod Kumar Vavilapalli Hortonworks Inc. http://hortonworks.com/ On Nov 20, 2012, at 5:33 AM, jamal sasha wrote: Hi, If I just use pipes, then the code runs just fine.. the issue is when I deploy it on the cluster... :( Any suggestions on how to debug it? On Tue, Nov 20, 2012 at 7:42 AM, Mahesh Balija <balijamahesh@gmail.com> wrote: Hi Jamal, You can debug your MapReduce program, if it is written in Java, by running your MR job in LocalRunner mode via Eclipse. Or you can have some debug statements (or even System.out.println statements) written in your code so that you can check where your job fails. I am NOT sure about Python, but one suggestion is: can you run your Python code (the map and reduce parts) locally on your input data and see whether your logic has any issues? Best, Mahesh Balija, Calsoft Labs. On Tue, Nov 20, 2012 at 6:50 AM, jamal sasha <jamalsha...@gmail.com> wrote: Hi, This is my first attempt to learn the map reduce abstraction. My problem is as follows. I have a text file with records like this:

id1, id2, date, time, mrps, code, code2
3710100022400, 1350219887, 2011-09-10, 12:39:38.000, 99.00, 1, 0
3710100022400, 5045462785, 2011-09-06, 13:23:00.000, 70.63, 1, 0

Now what I want to do is count the number of transactions happening in every half hour between 7 am and 11 am. So here are the intervals:

7:00-7:30 -> 0
7:30-8:00 -> 1
8:00-8:30 -> 2
...
10:30-11:00 -> 7

So ultimately what I am doing is creating a 2D dictionary d[id2][interval] = count_transactions. My mapper and reducer are attached (sample input also). The code runs just fine if I run it via: cat input.txt | python mapper.py | sort | python reducer.py That gives me the output, but when I run it on the cluster it throws an error which is not helpful (basically the terminal just says "job unsuccessful, reason NA"). Any suggestion on what I am doing wrong? Jamal
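As a concrete illustration of the LocalRunner suggestion for a Java job (streaming jobs are a different story), the configuration looks roughly like this; a sketch, assuming the default identity mapper/reducer and local "input"/"output" directories:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class LocalDebugRunner {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(LocalDebugRunner.class);
        conf.set("mapred.job.tracker", "local");   // run in-process with the LocalJobRunner
        conf.set("fs.default.name", "file:///");   // read and write the local filesystem
        FileInputFormat.setInputPaths(conf, new Path("input"));    // local input directory
        FileOutputFormat.setOutputPath(conf, new Path("output"));  // must not exist yet
        // set your real mapper/reducer classes here, put breakpoints in them,
        // and launch this main() from the IDE debugger
        JobClient.runJob(conf);
    }
}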
Re: Start time, end time, and task tracker of individual tasks of a job
Most of this information is already available in the JobHistory files, and there are parsers to read from these files. HTH +Vinod Kumar Vavilapalli Hortonworks Inc. http://hortonworks.com/ On Nov 20, 2012, at 8:13 AM, Jeff LI wrote: Hello, Is there a way to obtain information about each individual task of a map-reduce job, including start time, end time, which task tracker ran the task and so on? I know this information can be found through the web interface running on the JobTracker. But is it possible to redirect the information to a nicely formatted log file for each job? By the way, I'm running hadoop 0.20.2-cdh3u5. Thanks in advance for the help. Cheers Jeff
number of reducers
Hi, I wrote a simple map reduce job in hadoop streaming. I am wondering if I am doing something wrong: while the number of mappers is projected to be around 1700, the number of reducers is just 1? It's a couple of TBs' worth of data. What can I do to address this? Basically the mapper looks like this:

for line in sys.stdin:
    print line

And the reducer:

for line in sys.stdin:
    new_line = process_line(line)
    print new_line

Thanks
Re: number of reducers
Awesome, thanks. Works great now. On Tuesday, November 20, 2012, Bejoy KS <bejoy.had...@gmail.com> wrote: Hi Sasha, By default the number of reducers is set to 1. If you want more you need to specify it, as in: hadoop jar myJar.jar myClass -D mapred.reduce.tasks=20 ... Regards Bejoy KS Sent from handheld, please excuse typos. From: jamal sasha <jamalsha...@gmail.com> Date: Tue, 20 Nov 2012 14:38:54 -0500 To: user@hadoop.apache.org ReplyTo: user@hadoop.apache.org Subject: number of reducers [original question quoted above]
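For a streaming job in particular, the -D generic option has to come before the streaming-specific options; a sketch of the full command (the streaming jar path, HDFS paths and script names below are placeholders):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -D mapred.reduce.tasks=20 \
    -input /user/jamal/input \
    -output /user/jamal/output \
    -mapper mapper.py \
    -reducer reducer.py \
    -file mapper.py \
    -file reducer.py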
RE: number of reducers
I specify mine inside mapred-site.xml:

<property>
  <name>mapred.reduce.tasks</name>
  <value>20</value>
</property>

Rgds, AK47 From: Bejoy KS [mailto:bejoy.had...@gmail.com] Sent: Tuesday, November 20, 2012 3:10 PM To: user@hadoop.apache.org Subject: Re: number of reducers Hi Sasha, By default the number of reducers is set to 1. If you want more you need to specify it, as in: hadoop jar myJar.jar myClass -D mapred.reduce.tasks=20 ... Regards Bejoy KS Sent from handheld, please excuse typos. From: jamal sasha <jamalsha...@gmail.com> Date: Tue, 20 Nov 2012 14:38:54 -0500 To: user@hadoop.apache.org ReplyTo: user@hadoop.apache.org Subject: number of reducers [original question quoted above]
Re: number of reducers
What is the relationship between the number of reducers and the number of CPU cores in your setup? I read somewhere that it should be about 0.5 times the number of CPU cores. Thanks. Alex. -----Original Message----- From: Kartashov, Andy <andy.kartas...@mpac.ca> To: user <user@hadoop.apache.org>; bejoy.hadoop <bejoy.had...@gmail.com> Sent: Tue, Nov 20, 2012 1:51 pm Subject: RE: number of reducers [Andy's and Bejoy's replies and the original question quoted above]
problem with upgrading from HDFS 0.21 to HDFS 1.0.4
hi all, It seems it's not supported to upgrade hadoop from 0.21 to the stable version 1.0.4. The 'linkBlocks' function in 'DataStorage.java' (v1.0.4) cannot work properly, because the datanode storage structure of the former is different from the latter's: there are 'finalized' and 'rbw' directories under $dfs.datanode.data.dir/current. Do you have any suggestions for dealing with this problem? 2012-11-20 rongshen.long
Re: number of reducers
Hey Jamal, I'd recommend first going over the whole tutorial to get a good grip on how Hadoop MR is designed to work: http://hadoop.apache.org/docs/stable/mapred_tutorial.html On Wed, Nov 21, 2012 at 1:08 AM, jamal sasha <jamalsha...@gmail.com> wrote: [original question quoted above] -- Harsh J
Supplying a jar for a map-reduce job
Hi, I am running map-reduce jobs on a Hadoop 0.23 cluster. Right now I supply the jar to use for running the map-reduce job using the setJarByClass function on org.apache.hadoop.mapreduce.Job. This makes my code depend on a class in the MR job at compile time. What I want is to be able to run an MR job without being dependent on it at compile time. It would be great if I could use a jar that contains the Mapper and Reducer classes and just pass it in to run the map reduce job. That would make it easy to choose an MR job to run at runtime. Is that possible? Thanks in Advance, Pankaj
Re: Supplying a jar for a map-reduce job
Hi Pankaj, AFAIK you can do that. Just provide the properties like mapper class, reducer class, input format, output format etc. using the -D option at run time. Regards Bejoy KS Sent from handheld, please excuse typos. -----Original Message----- From: Pankaj Gupta <pan...@brightroll.com> Date: Tue, 20 Nov 2012 20:49:29 To: user@hadoop.apache.org <user@hadoop.apache.org> Reply-To: user@hadoop.apache.org Subject: Supplying a jar for a map-reduce job [original question quoted above]
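A minimal sketch of that idea using the old mapred API (the jar path, class names and mapred.* property keys below are assumptions for illustration, not something confirmed in the thread): the job jar and the mapper/reducer classes are referenced only as strings at run time, so nothing from the job jar is needed on the compile-time classpath.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class RunExternalJar {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        conf.setJar("/path/to/my-mr-job.jar");                     // placeholder path
        conf.set("mapred.mapper.class", "com.example.MyMapper");   // placeholder class names
        conf.set("mapred.reducer.class", "com.example.MyReducer");
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}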
ISSUE while configuring ECLIPSE with MAP-REDUCE
Hi Hadoop Champs, I am facing this issue while trying to configure Eclipse with Map-Reduce.

Exception in thread "main" java.lang.Error: Unresolved compilation problems:
    The method setInputFormat(Class<? extends InputFormat>) in the type JobConf is not applicable for the arguments (Class<TextInputFormat>)
    The method setOutputFormat(Class<? extends OutputFormat>) in the type JobConf is not applicable for the arguments (Class<TextOutputFormat>)
    The method setInputPaths(Job, String) in the type FileInputFormat is not applicable for the arguments (JobConf, Path)
    The method setOutputPath(Job, Path) in the type FileOutputFormat is not applicable for the arguments (JobConf, Path)
        at TestDriver.main(TestDriver.java:30)

I have these classes and flow pattern:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;

public class TestDriver {

    public static void main(String[] args) {
        JobClient client = new JobClient();
        JobConf conf = new JobConf(TestDriver.class);

        // TODO: specify output types
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // TODO: specify input and output DIRECTORIES (not files)
        //conf.setInputPath(new Path("src"));
        //conf.setOutputPath(new Path("out"));
        conf.setInputFormat(TextInputFormat.class);
        /* ERROR shown is :: The method setInputFormat(Class<? extends InputFormat>) in the
           type JobConf is not applicable for the arguments (Class<TextInputFormat>) */
        conf.setOutputFormat(TextOutputFormat.class);
        /* ERROR shown is :: The method setOutputFormat(Class<? extends OutputFormat>) in the
           type JobConf is not applicable for the arguments (Class<TextOutputFormat>) */
        FileInputFormat.setInputPaths(conf, new Path("In"));
        /* ERROR shown is :: The method setInputPaths(Job, String) in the type FileInputFormat
           is not applicable for the arguments (JobConf, Path) */
        FileOutputFormat.setOutputPath(conf, new Path("Out"));
        /* ERROR shown is :: The method setOutputPath(Job, Path) in the type FileOutputFormat
           is not applicable for the arguments (JobConf, Path) */

        // TODO: specify a mapper
        conf.setMapperClass(org.apache.hadoop.mapred.lib.IdentityMapper.class);
        // TODO: specify a reducer
        conf.setReducerClass(org.apache.hadoop.mapred.lib.IdentityReducer.class);

        client.setConf(conf);
        try {
            JobClient.runJob(conf);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Please suggest / help. Thanks Regards Yogesh Kumar
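Judging only from those error messages, a likely cause (an assumption, not something confirmed in the thread) is that the old mapred API (JobConf, setInputFormat/setOutputFormat) is being mixed with the new org.apache.hadoop.mapreduce.lib.* input/output classes. One way to make the snippet above compile against the old API would be to replace the four mapreduce.lib.* imports with the matching old-API ones, roughly:

import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;
import org.apache.hadoop.mapred.FileInputFormat;   // the old-API versions take a JobConf
import org.apache.hadoop.mapred.FileOutputFormat;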
RE: ISSUE while configuring ECLIPSE with MAP-REDUCE
I am using Apache Hadoop-0.20.2. Regards Yogesh Kumar From: yogeshdh...@live.com To: user@hadoop.apache.org Subject: ISSUE while configuring ECLIPSE with MAP-REDUCE Date: Wed, 21 Nov 2012 11:17:42 +0530 [original message quoted above]
Is there an additional overhead when storing data in HDFS?
Hi All, I'm wondering if there is additional overhead when storing data in HDFS. For example, I have a 2GB file and the replication factor of HDFS is 2; when the file is uploaded to HDFS, should HDFS use 4GB to store it, or more than 4GB? If it takes more than 4GB of space, why? Thanks, Ramon
Re: Is there an additional overhead when storing data in HDFS?
HDFS uses 4GB for the file + checksum data. The default is that for every 512 bytes of data, 4 bytes of checksum are stored; in this case that is an additional 32MB. On Tue, Nov 20, 2012 at 11:00 PM, WangRamon <ramon_w...@hotmail.com> wrote: Hi All, I'm wondering if there is additional overhead when storing data in HDFS. For example, I have a 2GB file and the replication factor of HDFS is 2; when the file is uploaded to HDFS, should HDFS use 4GB to store it, or more than 4GB? If it takes more than 4GB of space, why? Thanks Ramon -- http://hortonworks.com/download/
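A quick back-of-the-envelope check of those numbers (plain arithmetic, not an HDFS API call):

public class ChecksumOverhead {
    public static void main(String[] args) {
        long fileBytes = 2L * 1024 * 1024 * 1024;  // 2 GB file
        int replication = 2;
        long stored = fileBytes * replication;     // 4 GB of block data on disk
        long checksum = stored / 512 * 4;          // default: 4 bytes of CRC per 512-byte chunk
        System.out.println(stored + " bytes of data, " + checksum + " bytes of checksum");
        // prints: 4294967296 bytes of data, 33554432 bytes of checksum (= 32 MB)
    }
}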
RE: Is there an additional overhead when storing data in HDFS?
Thanks. Besides the checksum data, is there anything else? Data in the name node? Date: Tue, 20 Nov 2012 23:14:06 -0800 Subject: Re: Is there an additional overhead when storing data in HDFS? From: sur...@hortonworks.com To: user@hadoop.apache.org HDFS uses 4GB for the file + checksum data. The default is that for every 512 bytes of data, 4 bytes of checksum are stored; in this case that is an additional 32MB. On Tue, Nov 20, 2012 at 11:00 PM, WangRamon <ramon_w...@hotmail.com> wrote: [original question quoted above] -- http://hortonworks.com/download/
Re: Is there an additional overhead when storing data in HDFS?
Hello Ramon, Why don't you go through this link once: http://www.aosabook.org/en/hdfs.html Suresh and the guys have explained everything beautifully there. HTH Regards, Mohammad Tariq On Wed, Nov 21, 2012 at 12:58 PM, Suresh Srinivas <sur...@hortonworks.com> wrote: The Namenode will have a trivial amount of data stored in the journal/fsimage. On Tue, Nov 20, 2012 at 11:21 PM, WangRamon <ramon_w...@hotmail.com> wrote: Thanks. Besides the checksum data, is there anything else? Data in the name node? [earlier messages in the thread quoted above] -- http://hortonworks.com/download/