Help with adjusting Hadoop configuration files
Hi Everyone, We are a start-up company that has been using the Hadoop cluster platform (version 0.20.2) in an Amazon EC2 environment. We tried to set up a cluster in two different forms: Cluster 1: 1 master (namenode) + 5 datanodes - all of the machines are small EC2 instances (1.6 GB RAM) Cluster 2: 1 master (namenode) + 2 datanodes - the master is a small EC2 instance and the two datanodes are large EC2 instances (7.5 GB RAM) We made changes to the configuration files (core-site, hdfs-site and mapred-site XML files) and expected to see a significant improvement in the performance of cluster 2; unfortunately this has yet to happen. Are there any special parameters in the configuration files that we need to change in order to adjust Hadoop to a larger hardware environment? Are there any best practices you recommend? Thanks in advance. Avi
Re: Setting up a Single Node Hadoop Cluster
What is the log content? It's the best place to see what's going wrong. If you share the logs it's easier to point out the problem. On Tue, Jun 21, 2011 at 9:06 AM, Kumar Kandasami kumaravel.kandas...@gmail.com wrote: Hi Ziyad: Do you see any errors in the log file? I have installed CDH3 in the past on Ubuntu machines using the two links below: https://ccp.cloudera.com/display/CDHDOC/Before+You+Install+CDH3+on+a+Single+Node https://ccp.cloudera.com/display/CDHDOC/Installing+CDH3+on+a+Single+Linux+Node+in+Pseudo-distributed+Mode Also the blog link below explains how to install from the tarball files, which works on my Ubuntu too (even though it is written for Mac). http://knowledgedonor.blogspot.com/2011/05/installing-cloudera-hadoop-hadoop-0202.html Hope these links help you proceed further. Kumar_/|\_ www.saisk.com ku...@saisk.com making a profound difference with knowledge and creativity... On Mon, Jun 20, 2011 at 6:22 PM, Ziyad Mir ziyad...@gmail.com wrote: Hi, I have been attempting to set up a single node Hadoop cluster (by following http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ ) on my personal computer (running Ubuntu 10.10); however, I have run into some roadblocks. Specifically, there appear to be issues starting the required Hadoop processes after running 'bin/hadoop/start-all.sh' (jps only returns itself). In addition, if I run 'bin/hadoop/stop-all.sh', I often see 'no namenode to stop, no jobtracker to stop'. I have attempted looking into the hadoop/log files; however, I'm not sure what specifically I am looking for. Any suggestions would be much appreciated. Thanks, Ziyad
Re: Help with adjusting Hadoop configuration files
If you reduce the default DFS block size (which is set in the configuration file) and use the default input format, more mappers are created at a time, which may help you use the RAM effectively. Another way is to create as many parallel jobs as possible programmatically, so that all available RAM is used. On Tue, Jun 21, 2011 at 3:17 PM, Avi Vaknin avivakni...@gmail.com wrote: Hi Madhu, First of all, thanks for the quick reply. I've searched the net about the properties of the configuration files and I specifically wanted to know if there is a property that is related to memory tuning (as you can see I have 7.5 GB RAM on each datanode and I really want to use it properly). Also, I've changed mapred.tasktracker.reduce/map.tasks.maximum to 10 (the number of cores on the datanodes) and unfortunately I haven't seen any change in the performance or duration of running jobs. Avi -Original Message- From: madhu phatak [mailto:phatak@gmail.com] Sent: Tuesday, June 21, 2011 12:33 PM To: common-user@hadoop.apache.org Subject: Re: Help with adjusting Hadoop configuration files The utilization of the cluster depends upon the number of jobs and the number of mappers and reducers. The configuration files only help you set up the cluster; you can also specify details like block size and replication in the configuration files, which may help you in job management. You can read all the available configuration properties here: http://hadoop.apache.org/common/docs/current/cluster_setup.html
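A minimal sketch of the memory-related knobs discussed in this thread, using 0.20-era property names in mapred-site.xml; the values below are placeholders for illustration, not recommendations for any particular workload:

    <!-- mapred-site.xml (illustrative values only, 0.20-era property names) -->
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx1024m</value>   <!-- heap given to each map/reduce task JVM -->
    </property>
    <property>
      <name>io.sort.mb</name>
      <value>200</value>         <!-- map-side sort buffer in MB; default is 100 -->
    </property>
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>           <!-- concurrent map slots per TaskTracker -->
    </property>

The sum of (map slots + reduce slots) x task heap has to fit in the node's physical RAM alongside the DataNode and TaskTracker daemons.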
Re: Starting an HDFS node (standalone) programmatically by API
HDFS has to be available to the DataNodes in order to run jobs, and bin/hdfs just uses Hadoop's own classes to access HDFS on the datanodes. So if you want to read a file from HDFS inside a job, you have to start the datanodes when the cluster comes up. On Fri, Jun 17, 2011 at 4:12 PM, punisher punishe...@hotmail.it wrote: Hi all, hdfs nodes can be started using the sh scripts provided with hadoop. I read that it's all based on script files. Is it possible to start an HDFS node (standalone) from a java application by API? Thanks -- View this message in context: http://hadoop-common.472056.n3.nabble.com/Starting-an-HDFS-node-standalone-programmatically-by-API-tp3075693p3075693.html Sent from the Users mailing list archive at Nabble.com.
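The start-up scripts ultimately call the same Java entry points, so an in-process start is possible, but these are internal server classes whose signatures have shifted between releases; the sketch below is an assumption-laden illustration (it presumes fs.default.name, dfs.name.dir and dfs.data.dir are already set in the configuration files on the classpath), not a supported embedding API. For tests, MiniDFSCluster from the Hadoop test jar is the usual choice.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hdfs.server.datanode.DataNode;
    import org.apache.hadoop.hdfs.server.namenode.NameNode;

    public class EmbeddedHdfs {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Start a namenode and a datanode inside this JVM, reading the usual *-site.xml files.
        NameNode nameNode = NameNode.createNameNode(new String[] {}, conf);
        DataNode dataNode = DataNode.createDataNode(new String[] {}, conf);
        // ... use the in-process cluster ...
        dataNode.shutdown();
        nameNode.stop();
      }
    }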
Re: HDFS File Appending
HDFS does not support appending, I think. I'm not sure about Pig; if you are using Hadoop directly you can zip the files and use the zip as the input to the jobs. On Fri, Jun 17, 2011 at 6:56 AM, Xiaobo Gu guxiaobo1...@gmail.com wrote: please refer to FileUtil.copyMerge On Fri, Jun 17, 2011 at 8:33 AM, jagaran das jagaran_...@yahoo.co.in wrote: Hi, We have a requirement where there would be a huge number of small files to be pushed to HDFS, which we then analyse with Pig. To get around the classic small-file issue we merge the files and push a bigger file into HDFS, but we are losing time in this merging step of our pipeline. If we could directly append to an existing file in HDFS we could save this merging time. Can you please suggest whether there is a newer stable version of Hadoop where we can go for appending? Thanks and Regards, Jagaran
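For the FileUtil.copyMerge suggestion above, a small sketch; the paths are hypothetical examples, and the final String argument is appended between the merged files:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.FileUtil;
    import org.apache.hadoop.fs.Path;

    public class MergeSmallFiles {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Merge every file under /user/data/small into a single HDFS file.
        FileUtil.copyMerge(fs, new Path("/user/data/small"),
                           fs, new Path("/user/data/merged/all-events"),
                           false,   // do not delete the source files
                           conf, "\n");
      }
    }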
Re: Heap Size is 27.25 MB/888.94 MB
It is related to the amount of memory available to the Java Virtual Machine that is created for the Hadoop daemons and jobs. On Fri, Jun 17, 2011 at 1:18 AM, Harsh J ha...@cloudera.com wrote: The 'heap size' is a Java/program and memory (RAM) thing; unrelated to the physical disk space that HDFS may occupy (which can be seen in configured capacity). More reading on what a Java heap size is about: http://en.wikipedia.org/wiki/Java_Virtual_Machine#Heap On Fri, Jun 17, 2011 at 1:07 AM, jeff.schm...@shell.com wrote: So it's saying my heap size is (Heap Size is 27.25 MB/888.94 MB) but my configured capacity is 971GB (4 nodes). Is heap size on the main page just for the namenode, or do I need to increase it to include the datanodes? Cheers - Jeffery Schmitz Projects and Technology 3737 Bellaire Blvd Houston, Texas 77001 Tel: +1-713-245-7326 Fax: +1 713 245 7678 Email: jeff.schm...@shell.com TK-421, why aren't you at your post? -- Harsh J
Re: ClassNotFoundException while running quick start guide on Windows.
I think the jar has some issue where the main class cannot be read from its manifest. Try unpacking the jar, check META-INF/MANIFEST.MF for the main class, and then run it as follows: bin/hadoop jar hadoop-*-examples.jar <fully qualified main class> grep input output 'dfs[a-z.]+' On Thu, Jun 16, 2011 at 10:23 AM, Drew Gross drew.a.gr...@gmail.com wrote: Hello, I'm trying to run the example from the quick start guide on Windows and I get this error: $ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+' Exception in thread main java.lang.NoClassDefFoundError: Caused by: java.lang.ClassNotFoundException: at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) Could not find the main class: . Program will exit. Exception in thread main java.lang.NoClassDefFoundError: Gross\Documents\Projects\discom\hadoop-0/21/0\logs Caused by: java.lang.ClassNotFoundException: Gross\Documents\Projects\discom\hadoop-0.21.0\logs at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) Could not find the main class: Gross\Documents\Projects\discom\hadoop-0.21.0\logs. Program will exit. Does anyone know what I need to change? Thank you. From, Drew -- Forget the environment. Print this e-mail immediately. Then burn it.
Re: Handling external jars in EMR
It's better to merge the library with your code; otherwise you have to copy the library into the Hadoop lib folder on every node of the cluster. -libjars is not working for me either. I used the Maven shade plugin (from Eclipse) to get the merged jar. On Wed, Jun 15, 2011 at 12:20 AM, Mehmet Tepedelenlioglu mehmets...@gmail.com wrote: I am using the Guava library in my hadoop code through a jar file. With hadoop one has the -libjars option (although I could not get that working on 0.2 for some reason). Are there any easy options with EMR short of using a utility like jarjar or bootstrapping magic? Or is that what I'll need to do? Thanks, Mehmet T.
Re: Hadoop Runner
Define your own custom RecordReader; that is the efficient way to do it. On Sun, Jun 12, 2011 at 10:12 AM, Harsh J ha...@cloudera.com wrote: Mark, I may not have gotten your question exactly, but you can do further processing inside of your FileInputFormat derivative's RecordReader implementation (just before it loads the value for a next() form of call -- which the MapRunner would use to read). If you're looking to dig into Hadoop's source code to understand the flow yourself, MapTask.java is what you may be looking for (run* methods). On Sun, Jun 12, 2011 at 3:25 AM, Mark question markq2...@gmail.com wrote: Hi, 1) Where can I find the main class of hadoop? The one that calls the InputFormat then the MapperRunner and ReducerRunner and others? This will help me understand what is in memory or still on disk, and the exact flow of data between splits and mappers. My problem is, assuming I have a TextInputFormat and would like to modify the input in memory before it is read by the RecordReader... where shall I do that? InputFormat was my first guess, but unfortunately, it only defines the logical splits... So, the only way I can think of is to use the recordReader to read all the records in a split into another variable (with the format I want) and then process that variable in the map functions. But is that efficient? So, to understand this, I hope someone can give an answer to Q(1). Thank you, Mark -- Harsh J
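To make the "transform inside the RecordReader" idea concrete, here is a minimal sketch using the old org.apache.hadoop.mapred API; the upper-casing is only a stand-in for whatever in-memory rewrite is actually needed, and the class name is made up for the example:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.LineRecordReader;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    public class TransformingInputFormat extends FileInputFormat<LongWritable, Text> {
      @Override
      public RecordReader<LongWritable, Text> getRecordReader(
          InputSplit split, JobConf job, Reporter reporter) throws IOException {
        final LineRecordReader lines = new LineRecordReader(job, (FileSplit) split);
        return new RecordReader<LongWritable, Text>() {
          public boolean next(LongWritable key, Text value) throws IOException {
            if (!lines.next(key, value)) return false;
            value.set(value.toString().toUpperCase()); // rewrite the record before map() sees it
            return true;
          }
          public LongWritable createKey() { return lines.createKey(); }
          public Text createValue() { return lines.createValue(); }
          public long getPos() throws IOException { return lines.getPos(); }
          public float getProgress() throws IOException { return lines.getProgress(); }
          public void close() throws IOException { lines.close(); }
        };
      }
    }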
Re: Append to Existing File
When you say bugs pending, are you referring to HDFS-265 (which links to HDFS-1060, HADOOP-6239 and HDFS-744)? Are there other issues related to append than the ones above? Tks, Eric https://issues.apache.org/jira/browse/HDFS-265 On 21/06/11 12:36, madhu phatak wrote: It's not stable. There are some bugs pending. According to one of the discussions, to date append is not ready for production. On Tue, Jun 14, 2011 at 12:19 AM, jagaran das jagaran_...@yahoo.co.in wrote: I am using the hadoop-0.20.203.0 version. I have set dfs.support.append to true and then used the append method. It is working, but I need to know how stable it is to deploy and use in production clusters. Regards, Jagaran From: jagaran das jagaran_...@yahoo.co.in To: common-user@hadoop.apache.org Sent: Mon, 13 June, 2011 11:07:57 AM Subject: Append to Existing File Hi All, Is append to an existing file now supported in Hadoop for production clusters? If yes, please let me know which version and how. Thanks Jagaran -- Eric
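A short sketch of the call being discussed, assuming a build where append is actually enabled (0.20-append or a release that honours dfs.support.append); the file path is hypothetical, and fs.append() will throw on builds where append is disabled:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendExample {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.setBoolean("dfs.support.append", true); // must also be enabled cluster-side
        FileSystem fs = FileSystem.get(conf);
        FSDataOutputStream out = fs.append(new Path("/user/jagaran/events.log"));
        out.write("another record\n".getBytes("UTF-8"));
        out.close();
      }
    }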
RE: Help with adjusting Hadoop configuration files
Hi, The block size is configured to 128MB; I've read that it is recommended to increase it in order to get better performance. What value do you recommend setting it to? Avi -Original Message- From: madhu phatak [mailto:phatak@gmail.com] Sent: Tuesday, June 21, 2011 12:54 PM To: common-user@hadoop.apache.org Subject: Re: Help with adjusting Hadoop configuration files If you reduce the default DFS block size (which is set in the configuration file) and use the default input format, more mappers are created at a time, which may help you use the RAM effectively. Another way is to create as many parallel jobs as possible programmatically, so that all available RAM is used.
Re: Append to Existing File
Please refer to this discussion: http://search-hadoop.com/m/rnG0h1zCZcL1/Re%253A+HDFS+File+Appending+URGENTsubj=Fw+HDFS+File+Appending+URGENT On Tue, Jun 21, 2011 at 4:23 PM, Eric Charles eric.char...@u-mangate.com wrote: When you say bugs pending, are you referring to HDFS-265 (which links to HDFS-1060, HADOOP-6239 and HDFS-744)? Are there other issues related to append than the ones above? Tks, Eric https://issues.apache.org/jira/browse/HDFS-265 -- Eric
Re: Help with adjusting Hadoop configuration files
Yeah, it will increase performance by reducing the number of mappers and letting a single mapper use more memory. The value depends upon the application and the RAM available; for your use case I think 512 MB - 1 GB will be a better value. On Tue, Jun 21, 2011 at 4:28 PM, Avi Vaknin avivakni...@gmail.com wrote: Hi, The block size is configured to 128MB; I've read that it is recommended to increase it in order to get better performance. What value do you recommend setting it to? Avi
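For reference, the block size is set in hdfs-site.xml (or per file at write time); a sketch with the 0.20-era property name, with the value in bytes, affecting only files written after the change:

    <!-- hdfs-site.xml -->
    <property>
      <name>dfs.block.size</name>
      <value>536870912</value>   <!-- 512 MB -->
    </property>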
Re: one-to-many Map Side Join without reducer
I think Hive is best suited for your use case; it gives you an SQL-based interface on top of Hadoop for exactly these kinds of joins. On Fri, Jun 10, 2011 at 2:39 AM, Shi Yu sh...@uchicago.edu wrote: Hi, I have two datasets: dataset 1 has the format: MasterKey1 SubKey1 SubKey2 SubKey3; MasterKey2 SubKey4 SubKey5 SubKey6. Dataset 2 has the format: SubKey1 Value1; SubKey2 Value2; ... I want to have a one-to-many join based on the SubKey, and the final goal is to have an output like: MasterKey1 Value1 Value2 Value3; MasterKey2 Value4 Value5 Value6; ... After studying and experimenting with some example code, I understand that it is doable if I transform the first data set as SubKey1 MasterKey1; SubKey2 MasterKey1; SubKey3 MasterKey1; SubKey4 MasterKey2; SubKey5 MasterKey2; SubKey6 MasterKey2 and then use an inner join with dataset 2 on SubKey. Then I probably need a reducer to perform a secondary sort on MasterKey to get the result. However, the bottleneck is still on the reducer if each MasterKey has lots of SubKeys. My question is, supposing that dataset 2 contains all the SubKeys and is never split, is it possible to join the key of dataset 2 with multiple values of dataset 1 on the Mapper side? Any hint is highly appreciated. Shi
Re: Running Back to Back Map-reduce jobs
You can use ControlledJob's addDependingJob to handle dependencies between multiple jobs. On Tue, Jun 7, 2011 at 4:15 PM, Adarsh Sharma adarsh.sha...@orkash.com wrote: Harsh J wrote: Yes, I believe Oozie does have Pipes and Streaming action helpers as well. On Thu, Jun 2, 2011 at 5:05 PM, Adarsh Sharma adarsh.sha...@orkash.com wrote: Ok, is it valid for running jobs through Hadoop Pipes too? Thanks Harsh J wrote: Oozie's workflow feature may exactly be what you're looking for. It can also do much more than just chain jobs. Check out additional features at: http://yahoo.github.com/oozie/ On Thu, Jun 2, 2011 at 4:48 PM, Adarsh Sharma adarsh.sha...@orkash.com wrote: After following the below points, I am confused about the examples used in the documentation: http://yahoo.github.com/oozie/releases/3.0.0/WorkflowFunctionalSpec.html#a3.2.2.3_Pipes What I want to achieve is to terminate a job under my control, i.e. if I want to run another map-reduce job after the completion of one, it runs and then terminates after my code's execution. I struggled to find a simple example that proves this concept. In the Oozie documentation, they are just setting parameters and using them. For example, a simple Hadoop Pipes job is executed by: int main(int argc, char *argv[]) { return HadoopPipes::runTask(HadoopPipes::TemplateFactory<WordCountMap, WordCountReduce>()); } Now if I want to run another job after this on the reduced data in HDFS, how could this be possible? Do I need to add some code? Thanks Dear all, I ran several map-reduce jobs in a Hadoop cluster of 4 nodes. Now this time I want a map-reduce job to be run again after one finishes. For example, to clarify my point, suppose a wordcount is run on the gutenberg file in HDFS and after completion: 11/06/02 15:14:35 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String). 11/06/02 15:14:35 INFO mapred.FileInputFormat: Total input paths to process : 3 11/06/02 15:14:36 INFO mapred.JobClient: Running job: job_201106021143_0030 11/06/02 15:14:37 INFO mapred.JobClient: map 0% reduce 0% 11/06/02 15:14:50 INFO mapred.JobClient: map 33% reduce 0% 11/06/02 15:14:59 INFO mapred.JobClient: map 66% reduce 11% 11/06/02 15:15:08 INFO mapred.JobClient: map 100% reduce 22% 11/06/02 15:15:17 INFO mapred.JobClient: map 100% reduce 100% 11/06/02 15:15:25 INFO mapred.JobClient: Job complete: job_201106021143_0030 11/06/02 15:15:25 INFO mapred.JobClient: Counters: 18 Again a map-reduce job is started on the output or original data, say again: 11/06/02 15:14:36 INFO mapred.JobClient: Running job: job_201106021143_0030 11/06/02 15:14:37 INFO mapred.JobClient: map 0% reduce 0% 11/06/02 15:14:50 INFO mapred.JobClient: map 33% reduce 0% Is it possible, or are there any parameters to achieve it? Please guide. Thanks
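A minimal sketch of the ControlledJob/JobControl approach (class names as in the new-API org.apache.hadoop.mapreduce.lib.jobcontrol package; the older org.apache.hadoop.mapred.jobcontrol classes work similarly). The two Job objects are assumed to be fully configured elsewhere:

    import java.util.ArrayList;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
    import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

    public class ChainedJobs {
      public static void runChain(Job first, Job second) throws Exception {
        ControlledJob step1 = new ControlledJob(first, new ArrayList<ControlledJob>());
        ControlledJob step2 = new ControlledJob(second, new ArrayList<ControlledJob>());
        step2.addDependingJob(step1);            // step2 starts only after step1 succeeds

        JobControl control = new JobControl("chain");
        control.addJob(step1);
        control.addJob(step2);

        Thread runner = new Thread(control);     // JobControl is a Runnable
        runner.start();
        while (!control.allFinished()) {
          Thread.sleep(1000);
        }
        control.stop();
      }
    }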
Re: Append to Existing File
Hi Madhu, Tks for the pointer. Even after reading the section on 0.21/22/23 written by Tsz-Wo, I still remain in the fog... Will HDFS-265 (and its mentioned Jiras) provide a solution for append (whatever release it lands in)? Another way of asking is: are there today other Jiras than the ones mentioned on HDFS-265 to take into consideration to have working hadoop append? Tks, Eric On 21/06/11 12:58, madhu phatak wrote: Please refer to this discussion: http://search-hadoop.com/m/rnG0h1zCZcL1/Re%253A+HDFS+File+Appending+URGENTsubj=Fw+HDFS+File+Appending+URGENT -- Eric
RE: Help with adjusting Hadoop configuration files
Thanks Madhu, I'll check it. -Original Message- From: madhu phatak [mailto:phatak@gmail.com] Sent: Tuesday, June 21, 2011 2:02 PM To: common-user@hadoop.apache.org Subject: Re: Help with adjusting Hadoop configuration files Yeah, it will increase performance by reducing the number of mappers and letting a single mapper use more memory. The value depends upon the application and the RAM available; for your use case I think 512 MB - 1 GB will be a better value.
Poor scalability with map reduce application
Hello, I'm working with an application that calculates the temperatures of a square board. I divide the board into a mesh, and represent the board as a list of (key, value) pairs, with the key being the linear position of a cell within the mesh and the value its temperature. I distribute the data during the map and calculate the temperature for the next step in the reduce. You can see a more detailed explanation here, http://code.google.com/p/heat-transfer/source/browse/trunk/informe/Informe.pdf but the basic idea is the one I have just mentioned. The funny thing is that the more nodes I add, the slower it runs! With 7 nodes it takes 16 minutes, but with 4 nodes it takes only 8 minutes. You can see the code in the file HeatTransfer.java which is found here, http://code.google.com/p/heat-transfer/source/browse/#svn%2Ftrunk%2Ffine%253Fstate%253Dclosed thanks in advance! Alberto. -- José Pablo Alberto Andreotti. Tel: 54 351 4730292 Móvil: 54351156526363. MSN: albertoandreo...@gmail.com Skype: andreottialberto
Re: Poor scalability with map reduce application
Hi Harsh, thanks for your answer! The cluster is homogeneous; every node has the same number of cores and amount of memory, and is equally reachable on the network. The data is generated specifically for each run. I mean, I write the input data on 4 nodes for one run and on 7 nodes for another, so the input file will be replicated on 4 nodes when running the map reduce with 4 nodes, and on 7 nodes when running it with 7. I don't know if speculative maps are on, I'll check it. One thing I observed is that reduces begin before all maps have finished. Let me also check whether the difference is on the map side or in the reduce. I believe it's balanced, both are slower when adding more nodes, but I'll confirm that. I would appreciate any other comment, thanks again On 21 June 2011 13:33, Harsh J ha...@cloudera.com wrote: Alberto, Please add more practical-related info like whether your cluster is homogeneous, whether the number of maps and reduces in both runs is consistent (i.e., same data and same number of reducers on 4 vs. 7?), and whether map speculative execution is on. Also, do you notice a difference of time for a single map task across the two runs? Or is the difference on the reduce task side? -- Harsh J -- José Pablo Alberto Andreotti. Tel: 54 351 4730292 Móvil: 54351156526363. MSN: albertoandreo...@gmail.com Skype: andreottialberto
RE: Poor scalability with map reduce application
Harsh, Is it possible for mapred.reduce.slowstart.completed.maps to even play a significant role in this? The only benefit he would find in tweaking that for his problem would be to spread network traffic from the shuffle over a longer period of time at a cost of having the reducer using resources earlier. Either way he would see this effect across both sets of runs if he is using the default parameters. I guess it would all depend on what kind of network layout the cluster is on. Matt -Original Message- From: Harsh J [mailto:ha...@cloudera.com] Sent: Tuesday, June 21, 2011 12:09 PM To: common-user@hadoop.apache.org Subject: Re: Poor scalability with map reduce application Alberto, On Tue, Jun 21, 2011 at 10:27 PM, Alberto Andreotti albertoandreo...@gmail.com wrote: I don't know if speculatives maps are on, I'll check it. One thing I observed is that reduces begin before all maps have finished. Let me check also if the difference is on the map side or in the reduce. I believe it's balanced, both are slower when adding more nodes, but i'll confirm that. Maps and reduces are speculative by default, so must've been ON. Could you also post a general input vs. output record counts and statistics like that between your job runs, to correlate? The reducers get scheduled early but do not exactly reduce() until all maps are done. They just keep fetching outputs. Their scheduling can be controlled with some configurations (say, to start only after X% of maps are done -- by default it starts up when 5% of maps are done). -- Harsh J
Re: Poor scalability with map reduce application
Thank you guys, I really appreciate your answers. I don't have access to the cluster right now, I'll check the info you are asking and come back in a couple of hours. BTW, I tried the app on two clusters with similar results. I'm using 0.21.0. thanks again, Alberto. On 21 June 2011 14:16, GOEKE, MATTHEW (AG/1000) matthew.go...@monsanto.com wrote: Harsh, Is it possible for mapred.reduce.slowstart.completed.maps to even play a significant role in this? The only benefit he would find in tweaking that for his problem would be to spread network traffic from the shuffle over a longer period of time at a cost of having the reducer using resources earlier. Either way he would see this effect across both sets of runs if he is using the default parameters. I guess it would all depend on what kind of network layout the cluster is on. Matt -- José Pablo Alberto Andreotti. Tel: 54 351 4730292 Móvil: 54351156526363. MSN: albertoandreo...@gmail.com Skype: andreottialberto
Re: Poor scalability with map reduce application
I saw that the link I sent you may not be working, please take a look here to see what it is all about, https://docs.google.com/viewer?a=vpid=explorerchrome=truesrcid=0B5AOpwg8IzVANjJlODZhZDctNWUzMS00MmNhLWI3OWMtMWNhMTdjODQwNjVlhl=en_US thanks again! On 21 June 2011 14:22, Alberto Andreotti albertoandreo...@gmail.com wrote: Thank you guys, I really appreciate your answers. I don't have access to the cluster right now, I'll check the info you are asking and come back in a couple of hours. BTW, I tried the app on two clusters with similar results. I'm using 0.21.0. thanks again, Alberto. -- José Pablo Alberto Andreotti. Tel: 54 351 4730292 Móvil: 54351156526363. MSN: albertoandreo...@gmail.com Skype: andreottialberto
Re: Append to Existing File
Hi All, Does CDH3 support Existing File Append? Regards, Jagaran From: Eric Charles eric.char...@u-mangate.com To: common-user@hadoop.apache.org Sent: Tue, 21 June, 2011 3:53:33 AM Subject: Re: Append to Existing File When you say bugs pending, are you referring to HDFS-265 (which links to HDFS-1060, HADOOP-6239 and HDFS-744)? Are there other issues related to append than the ones above? Tks, Eric https://issues.apache.org/jira/browse/HDFS-265 -- Eric
Re: Poor scalability with map reduce application
Matt, You're right that it (slowstart) does not / would not affect much. I was merely explaining the reason behind his observation of reducers getting scheduled early, not really recommending a tweak for performance changes there. On Tue, Jun 21, 2011 at 10:46 PM, GOEKE, MATTHEW (AG/1000) matthew.go...@monsanto.com wrote: Harsh, Is it possible for mapred.reduce.slowstart.completed.maps to even play a significant role in this? The only benefit he would find in tweaking that for his problem would be to spread network traffic from the shuffle over a longer period of time at a cost of having the reducer using resources earlier. Either way he would see this effect across both sets of runs if he is using the default parameters. I guess it would all depend on what kind of network layout the cluster is on. Matt -- Harsh J
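For anyone looking up the property being discussed, a sketch of where it lives; the value is only an example (the default is 0.05, i.e. reducers are scheduled once 5% of maps have completed):

    <!-- mapred-site.xml -->
    <property>
      <name>mapred.reduce.slowstart.completed.maps</name>
      <value>0.80</value>   <!-- schedule reducers only after 80% of maps finish -->
    </property>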
Re: Append to Existing File
Yes. -Joey On Jun 21, 2011 1:47 PM, jagaran das jagaran_...@yahoo.co.in wrote: Hi All, Does CDH3 support Existing File Append ? Regards, Jagaran
Deserializing a MapWritable entry set.
I want to extract the key-value pairs from a MapWritable, cast them into Integer (key) and Double (value) types, and add them to another collection. I'm attempting the following but this code is incorrect. // initialDistributionStripe is a MapWritable<IntWritable, DoubleWritable> // initialProbabilities is of type Vector which can have (Integer, Double) entries in it for (Map.Entry<Writable, Writable> entry : initialDistributionStripe.entrySet()) { initialProbabilities.set(entry.getKey(), entry.getValue()); } Is there a convenient way to do this?
Re: Deserializing a MapWritable entry set.
Never worked with maps before, btw what are you trying to calculate? alberto. On 21 June 2011 17:14, Dhruv Kumar dku...@ecs.umass.edu wrote: I want to extract the key-value pairs from a MapWritable, cast them into Integer (key) and Double (value) types, and add them to another collection. I'm attempting the following but this code is incorrect. // initialDistributionStripe is a MapWritable<IntWritable, DoubleWritable> // initialProbabilities is of type Vector which can have (Integer, Double) entries in it for (Map.Entry<Writable, Writable> entry : initialDistributionStripe.entrySet()) { initialProbabilities.set(entry.getKey(), entry.getValue()); } Is there a convenient way to do this? -- José Pablo Alberto Andreotti. Tel: 54 351 4730292 Móvil: 54351156526363. MSN: albertoandreo...@gmail.com Skype: andreottialberto
Re: Deserializing a MapWritable entry set.
Dhruv, If the <Writable, Writable> pairs are IntWritable and DoubleWritable underneath, simply cast them properly to those types after get{Key,Value}() and then use the appropriate method to get the underlying value (a simple .get() in most cases). Is this what you're looking for? On Wed, Jun 22, 2011 at 1:44 AM, Dhruv Kumar dku...@ecs.umass.edu wrote: I want to extract the key-value pairs from a MapWritable, cast them into Integer (key) and Double (value) types, and add them to another collection. I'm attempting the following but this code is incorrect. // initialDistributionStripe is a MapWritable<IntWritable, DoubleWritable> // initialProbabilities is of type Vector which can have (Integer, Double) entries in it for (Map.Entry<Writable, Writable> entry : initialDistributionStripe.entrySet()) { initialProbabilities.set(entry.getKey(), entry.getValue()); } Is there a convenient way to do this? -- Harsh J
Re: Deserializing a MapWritable entry set.
The exact problem I'm facing is the following: entry.getKey() and entry.getValue() return Writable types. How do I extract the buried int and double? If the return types were IntWritable and DoubleWritable, I could have used entry.getKey().get() and entry.getValue().get() and it would have been fine. On Tue, Jun 21, 2011 at 4:18 PM, Alberto Andreotti albertoandreo...@gmail.com wrote: Never worked with maps before, btw what are you trying to calculate? There is no calculation in this loop, it is just a conversion from one type (MapWritable) produced by the reducer(s) to another type (Vector) which can be consumed by some legacy code for actual processing.
Re: Deserializing a MapWritable entry set.
((IntWritable) entry.getKey()).get(); and similar. On Wed, Jun 22, 2011 at 2:00 AM, Dhruv Kumar dku...@ecs.umass.edu wrote: The exact problem I'm facing is the following: entry.getKey() and entry.getValue() return Writable types. How do I extract the buried int and double? If the return types were IntWritable and DoubleWritable, I could have used entry.getKey().get() and entry.getValue().get() and it would have been fine. -- Harsh J
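Putting Harsh's cast into the original loop, with a plain double array standing in for whatever Vector type the legacy code expects (the class and method names here are made up for illustration):

    import java.util.Map;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.MapWritable;
    import org.apache.hadoop.io.Writable;

    public class StripeConverter {
      // Copies an IntWritable -> DoubleWritable stripe into an indexed double array.
      public static double[] toArray(MapWritable stripe, int size) {
        double[] probabilities = new double[size];
        for (Map.Entry<Writable, Writable> entry : stripe.entrySet()) {
          int index = ((IntWritable) entry.getKey()).get();
          double value = ((DoubleWritable) entry.getValue()).get();
          probabilities[index] = value;
        }
        return probabilities;
      }
    }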
Configuration settings
We have a small 4-node cluster where the machines have 12GB of RAM and the CPUs are quad-core Xeons. I'm assuming the defaults aren't that generous, so what are some configuration changes I should make to take advantage of this hardware? Max map tasks? Max reduce tasks? Anything else? Thanks
Re: Configuration settings
Hi Mark, You can take a look at http://allthingshadoop.com/2010/04/28/map-reduce-tips-tricks-your-first-real-cluster/ and http://www.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/ to configure your cluster. Along with the task slots, you can change the child JVM heap size, datanode xceivers, etc. A good practice is to understand what kind of map reduce programming you will be doing, whether your tasks are CPU bound or memory bound, and accordingly change your base cluster settings. Best Regards, Sonal https://github.com/sonalgoyal/hiho Hadoop ETL and Data Integration Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Wed, Jun 22, 2011 at 6:16 AM, Mark static.void@gmail.com wrote: We have a small 4-node cluster where the machines have 12GB of RAM and the CPUs are quad-core Xeons. I'm assuming the defaults aren't that generous, so what are some configuration changes I should make to take advantage of this hardware? Max map tasks? Max reduce tasks? Anything else? Thanks
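As a rough sizing illustration for 12 GB quad-core nodes (placeholder numbers, not a recommendation): 8 map slots + 4 reduce slots at 768 MB each is about 9 GB, leaving headroom for the DataNode and TaskTracker daemons and the OS. In 0.20-era mapred-site.xml terms:

    <!-- mapred-site.xml (illustrative values only) -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>8</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>4</value>
    </property>
    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx768m</value>
    </property>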
TableOutputFormat not efficient than direct HBase API calls?
Hi, I am writing a Hadoop application that uses HBase as both source and sink. There is no reducer in my application. I am using TableOutputFormat as the output format class. I have read on the Internet that it is experimentally faster to directly instantiate HTable and use HTable.batch() in the map than to use TableOutputFormat as the map's output. So I looked into the source code of org.apache.hadoop.hbase.mapreduce.TableOutputFormat. It looks like TableRecordWriter does not support batch updates, since TableRecordWriter.write() calls HTable.put() with a single Put. Am I right on this matter? Or does TableOutputFormat automatically do batch updates somehow? Or is there a specific way to do batch updates with TableOutputFormat? Any explanation is greatly appreciated. Ed
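One hedged sketch of the "direct HTable in the mapper" alternative being asked about, using 0.90-era client calls (the table and column names are made up); whether this beats TableOutputFormat depends on how that format configures its own HTable in your version, so it is worth measuring rather than assuming:

    import java.io.IOException;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.mapreduce.Mapper;

    public class DirectWriteMapper
        extends Mapper<ImmutableBytesWritable, Result, ImmutableBytesWritable, Put> {
      private HTable table;

      @Override
      protected void setup(Context context) throws IOException {
        table = new HTable(HBaseConfiguration.create(context.getConfiguration()), "sink_table");
        table.setAutoFlush(false);                    // buffer puts client-side
        table.setWriteBufferSize(8 * 1024 * 1024);    // 8 MB write buffer
      }

      @Override
      protected void map(ImmutableBytesWritable key, Result row, Context context)
          throws IOException {
        Put put = new Put(key.get());
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("col"), Bytes.toBytes("value"));
        table.put(put);                               // lands in the write buffer, not on the wire
      }

      @Override
      protected void cleanup(Context context) throws IOException {
        table.flushCommits();                         // push any remaining buffered puts
        table.close();
      }
    }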
Re: Append to Existing File
On Tue, Jun 21, 2011 at 11:53 AM, Joey Echeverria j...@cloudera.com wrote: Yes. Sort-of kind-of... we support it only for the use case that HBase uses it. Mostly, we support sync() which was implemented at the same time. I know of several bugs in existing-file-append in CDH3 and 0.20-append. -Todd -Joey On Jun 21, 2011 1:47 PM, jagaran das jagaran_...@yahoo.co.in wrote: Hi All, Does CDH3 support Existing File Append ? Regards, Jagaran -- Todd Lipcon Software Engineer, Cloudera
Re: ClassNotFoundException while running quick start guide on Windows.
Thanks Jeff, it was a problem with JAVA_HOME. I have another problem now though, I have this: $JAVA: /cygdrive/c/Program Files/Java/jdk1.6.0_26/bin/java $JAVA_HEAP_MAX: -Xmx1000m $HADOOP_OPTS: -Dhadoop.log.dir=C:\Users\Drew Gross\Documents\Projects\discom\hadoop-0.21.0\logs -Dhadoop.log.file=hadoop.log -Dhadoop.home.dir=C:\Users\Drew Gross\Documents\Projects\discom\hadoop-0.21.0\ -Dhadoop.id.str= -Dhadoop.root.logger=INFO,console -Djava.library.path=/cygdrive/c/Users/Drew Gross/Documents/Projects/discom/hadoop-0.21.0/lib/native/ -Dhadoop.policy.file=hadoop-policy.xml $CLASS: org.apache.hadoop.util.RunJar Exception in thread main java.lang.NoClassDefFoundError: Gross\Documents\Projects\discom\hadoop-0/21/0\logs Caused by: java.lang.ClassNotFoundException: Gross\Documents\Projects\discom\hadoop-0.21.0\logs at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) Could not find the main class: Gross\Documents\Projects\discom\hadoop-0.21.0\logs. Program will exit. (This is with some extra debugging info added by me in bin/hadoop) It looks like the windows style file names are causing problems, especially the spaces. Has anyone encountered this before, and know how to fix? I tried escaping the spaces and surrounding the file paths with quotes (not at the same time), but that didn't help. Drew On Tue, Jun 21, 2011 at 6:24 AM, madhu phatak phatak@gmail.com wrote: I think the jar have some issuses where its not able to read the Main class from manifest . try unjar the jar and see in Manifest.xml what is the main class and then run as follows bin/hadoop jar hadoop-*-examples.jar Full qualified main class grep input output 'dfs[a-z.]+' On Thu, Jun 16, 2011 at 10:23 AM, Drew Gross drew.a.gr...@gmail.com wrote: Hello, I'm trying to run the example from the quick start guide on Windows and I get this error: $ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+' Exception in thread main java.lang.NoClassDefFoundError: Caused by: java.lang.ClassNotFoundException: at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) Could not find the main class: . Program will exit. Exception in thread main java.lang.NoClassDefFoundError: Gross\Documents\Projects\discom\hadoop-0/21/0\logs Caused by: java.lang.ClassNotFoundException: Gross\Documents\Projects\discom\hadoop-0.21.0\logs at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:306) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:247) Could not find the main class: Gross\Documents\Projects\discom\hadoop-0.21.0\logs. Program will exit. Does anyone know what I need to change? Thank you. From, Drew -- Forget the environment. Print this e-mail immediately. Then burn it. -- Forget the environment. 
Print this e-mail immediately. Then burn it.
Hadoop eclipse plugin stopped working after replacing hadoop-0.20.2 jar files with hadoop-0.20-append jar files
Guys, I was using the Hadoop Eclipse plugin on a hadoop 0.20.2 cluster and it was working fine for me. I was using Eclipse SDK Helios 3.6.2 with the plugin hadoop-eclipse-plugin-0.20.3-SNAPSHOT.jar downloaded from JIRA MAPREDUCE-1280. Now for the HBase installation I had to use hadoop-0.20-append compiled jars, and I had to replace the old jar files with the new 0.20-append compiled jar files. But now, after replacing them, my Hadoop Eclipse plugin is not working for me. Whenever I try to connect to my hadoop master node from it and try to see the DFS locations, it gives me the following error: * Error : Protocol org.apache.hadoop.hdfs.protocol.clientprotocol version mismatch (client 41 server 43)* However the hadoop cluster is working fine if I go directly to the hadoop namenode and use hadoop commands: I can add files to HDFS and run jobs from there, and the HDFS web console and Map-Reduce web console are also working fine, but I am not able to use my previous Hadoop Eclipse plugin. Any suggestions or help for this issue? Thanks, Praveenesh