RE: Limitation of key-value pairs for a particular key.
You are right. Actually, we were expecting the values to be sorted. We tried to reproduce the problem with this simple code:

    private final IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        int N = 3;
        for (int i = 0; i < N; i++) {
            word.set(i + "");
            System.out.println(i);
            context.write(one, word);
        }
    }

For smaller N the numbers were in order, but for N > 300 the order was not maintained.

From: Harsh J [mailto:ha...@cloudera.com]
Sent: Thursday, January 17, 2013 1:57 AM
To: mapreduce-user
Subject: RE: Limitation of key-value pairs for a particular key.

We don't sort values (only keys) nor apply any manual limits in MR. Can you post a reproducible test case to support your suspicion?

On Jan 16, 2013 4:34 PM, Utkarsh Gupta utkarsh_gu...@infosys.com wrote:
Hi, Thanks for the response. There were some issues with my code. I have checked that in detail. All the values of the map are present in the reducer, but not in sorted order. This happens if the number of values for a key is too large. Thanks, Utkarsh

From: Vinod Kumar Vavilapalli [mailto:vino...@hortonworks.com]
Sent: Thursday, January 10, 2013 11:00 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Limitation of key-value pairs for a particular key.

There isn't any limit like that. Can you reproduce this consistently? If so, please file a ticket. It will definitely help if you can provide a test case which can reproduce this issue. Thanks, +Vinod

On Thu, Jan 10, 2013 at 12:41 AM, Utkarsh Gupta utkarsh_gu...@infosys.com wrote:
Hi, I am using Apache Hadoop 1.0.4 on a 10-node cluster of commodity machines with Ubuntu 12.04 Server edition. I am having an issue with my MapReduce code. While debugging I found that the reducer can take 262145 values for a particular key; if more values are there, they seem to be corrupted. I checked the values while emitting from the map and again checked them in the reducer. I am wondering whether there is any such limitation in Hadoop, or whether it is a configuration problem. Thanks and Regards, Utkarsh Gupta

--
+Vinod
Hortonworks Inc.
http://hortonworks.com/
RE: Limitation of key-value pairs for a particular key.
Hi, I think I know what's going on here. It has to do with how many spills the map task performs. You are emitting the numbers in order, so if there is only one spill, they stay in order. For a larger number of records, the map task will create more than one spill, which must be merged. During the merge, the original order is not preserved. If you want the original order to be preserved, you must set io.sort.mb and/or io.sort.record.percent such that the map task requires only a single spill.

Cheers, Sven

From: Utkarsh Gupta [mailto:utkarsh_gu...@infosys.com]
Sent: 18 January 2013 18:25
To: mapreduce-user@hadoop.apache.org
Subject: RE: Limitation of key-value pairs for a particular key.

You are right. Actually, we were expecting the values to be sorted. We tried to reproduce the problem with this simple code:

    private final IntWritable one = new IntWritable(1);
    private Text word = new Text();

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        int N = 3;
        for (int i = 0; i < N; i++) {
            word.set(i + "");
            System.out.println(i);
            context.write(one, word);
        }
    }

For smaller N the numbers were in order, but for N > 300 the order was not maintained.

From: Harsh J [mailto:ha...@cloudera.com]
Sent: Thursday, January 17, 2013 1:57 AM
To: mapreduce-user
Subject: RE: Limitation of key-value pairs for a particular key.

We don't sort values (only keys) nor apply any manual limits in MR. Can you post a reproducible test case to support your suspicion?

On Jan 16, 2013 4:34 PM, Utkarsh Gupta utkarsh_gu...@infosys.com wrote:
Hi, Thanks for the response. There were some issues with my code. I have checked that in detail. All the values of the map are present in the reducer, but not in sorted order. This happens if the number of values for a key is too large. Thanks, Utkarsh

From: Vinod Kumar Vavilapalli [mailto:vino...@hortonworks.com]
Sent: Thursday, January 10, 2013 11:00 PM
To: mapreduce-user@hadoop.apache.org
Subject: Re: Limitation of key-value pairs for a particular key.

There isn't any limit like that. Can you reproduce this consistently? If so, please file a ticket. It will definitely help if you can provide a test case which can reproduce this issue. Thanks, +Vinod

On Thu, Jan 10, 2013 at 12:41 AM, Utkarsh Gupta utkarsh_gu...@infosys.com wrote:
Hi, I am using Apache Hadoop 1.0.4 on a 10-node cluster of commodity machines with Ubuntu 12.04 Server edition. I am having an issue with my MapReduce code. While debugging I found that the reducer can take 262145 values for a particular key; if more values are there, they seem to be corrupted. I checked the values while emitting from the map and again checked them in the reducer. I am wondering whether there is any such limitation in Hadoop, or whether it is a configuration problem. Thanks and Regards, Utkarsh Gupta

--
+Vinod
Hortonworks Inc.
http://hortonworks.com/
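As a rough illustration of Sven's suggestion, a job driver might enlarge the map-side sort buffer so that this particular test produces a single spill. This is only a sketch against the Hadoop 1.x property names mentioned above; the buffer size and record-metadata fraction are illustrative values, not recommendations. Relying on a single spill for value order is fragile; a secondary sort on a composite key is the robust way to get values in order at the reducer.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Sketch only (Hadoop 1.x property names, illustrative values): make the
    // in-memory sort buffer large enough that the map output of this test job
    // fits in a single spill, so the emission order survives to the reducer.
    Configuration conf = new Configuration();
    conf.setInt("io.sort.mb", 200);                 // map-side sort buffer, in MB
    conf.setFloat("io.sort.record.percent", 0.30f); // share of the buffer kept for record metadata
    Job job = new Job(conf, "ordered-values-test");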
RE: On a lighter note
Awesome Tariq!! You made my day!! :-D Fabio Pitzolu www.gr-ci.com From: Anand Sharma [mailto:anand2sha...@gmail.com] Sent: venerdì 18 gennaio 2013 04:10 To: user@hadoop.apache.org Subject: Re: On a lighter note Awesome one Tariq!! On Fri, Jan 18, 2013 at 6:39 AM, Mohammad Tariq donta...@gmail.com wrote: You are right Michael, as always :) Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Fri, Jan 18, 2013 at 6:33 AM, Michael Segel michael_se...@hotmail.com wrote: I'm thinking 'Downfall'. But I could be wrong. On Jan 17, 2013, at 6:56 PM, Yongzhi Wang wang.yongzhi2...@gmail.com wrote: Who can tell me what is the name of the original film? Thanks! Yongzhi On Thu, Jan 17, 2013 at 3:05 PM, Mohammad Tariq donta...@gmail.com wrote: I am sure you will suffer from severe stomach ache after watching this :) http://www.youtube.com/watch?v=hEqQMLSXQlY Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com
Re: building a department GPU cluster
Thiago Vieira wrote: I've seen some academic research in this direction, with good results. Some computations can be expressed with GPGPU, but it is still a restricted number of cases. If it is not easy to solve problems using MapReduce, solving some problems with SIMD is harder.

Ok.. Thank you all for your time.. I'll keep searching. Best regards. Robi

-- Thiago Vieira

On Thu, Jan 17, 2013 at 9:24 PM, Russell Jurney russell.jur...@gmail.com wrote: Hadoop streaming can do this, and there's been some discussion in the past, but it's not a core use case. Check the list archives. Russell Jurney http://datasyndrome.com

On Jan 17, 2013, at 9:25 AM, Jeremy Lewi jer...@lewi.us wrote: I don't think running hadoop on a GPU cluster is a common use case; the types of workloads for a hadoop vs. gpu cluster are very different, although a quick google search did turn up some. So this is probably not the best mailing list for your question. J

On Thu, Jan 17, 2013 at 5:18 AM, Roberto Nunnari roberto.nunn...@supsi.ch wrote:
Roberto Nunnari wrote: Hi all. I'm writing to you to ask for advice or a hint in the right direction. In our department, more and more researchers ask us (IT administrators) to assemble (or to buy) GPGPU-powered workstations to do parallel computing. As I already manage a small CPU cluster (resources managed using SGE), with my boss we talked about building a new GPU cluster. The problem is that I have no experience at all with GPU clusters. Apart from the already running GPU workstations, we already have some new HW that looks promising to me as a starting point for a GPU cluster.
- 1x Dell PowerEdge R720
- 1x Dell PowerEdge C410x
- 1x NVIDIA M2090 PCIe x16
- 1x NVIDIA iPASS Cable Kit (Dell forgot to include the iPASS adapter for the R720!! :-D)
I'd be grateful if you could kindly give me some advice and/or a hint in the right direction. In particular I'm interested in your opinion on:
1) is the above HW suitable for a small (2 to 4/6 GPUs) GPU cluster?
2) is Apache Hadoop suitable (or what could we use?) as a queuing and resource management system? We would like the cluster to be usable by many users at once in a way that no user has to worry about resources, just like we do on the CPU cluster with SGE.
3) What distribution of Linux would be more appropriate?
4) necessary stack of sw? (CUDA, Hadoop, other?)
Thank you very much for your valuable insight! Best regards. Robi

Anybody on this, please? Robi
Re: On a lighter note
LOL Thanks, Josh Long Spring Developer Advocate SpringSource, a Division of VMware http://www.joshlong.com || joshlong.com || http://twitter.com/starbuxman On Fri, Jan 18, 2013 at 5:06 PM, Mohammad Tariq donta...@gmail.com wrote: lol :) Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Fri, Jan 18, 2013 at 1:54 PM, Fabio Pitzolu fabio.pitz...@gr-ci.com wrote: Awesome Tariq!! You made my day!! :-D Fabio Pitzolu www.gr-ci.com From: Anand Sharma [mailto:anand2sha...@gmail.com] Sent: venerdì 18 gennaio 2013 04:10 To: user@hadoop.apache.org Subject: Re: On a lighter note Awesome one Tariq!! On Fri, Jan 18, 2013 at 6:39 AM, Mohammad Tariq donta...@gmail.com wrote: You are right Michael, as always :) Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Fri, Jan 18, 2013 at 6:33 AM, Michael Segel michael_se...@hotmail.com wrote: I'm thinking 'Downfall' But I could be wrong. On Jan 17, 2013, at 6:56 PM, Yongzhi Wang wang.yongzhi2...@gmail.com wrote: Who can tell me what is the name of the original film? Thanks! Yongzhi On Thu, Jan 17, 2013 at 3:05 PM, Mohammad Tariq donta...@gmail.com wrote: I am sure you will suffer from severe stomach ache after watching this :) http://www.youtube.com/watch?v=hEqQMLSXQlY Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com
Re: On a lighter note
Awesome :) Regards Prabhjot
Re: OutofMemoryError when running an YARN application with 25 containers
Hi Anil, Thanks for the reply. I was searching Google for how to increase the heap size, and found that the option -Xmx1500m has to be passed as a command line argument for java. Is that the way you are suggesting? If so, how can I pass it for the Application Master, because it is the Client program that actually launches the AM... Or is there any other way of doing it?

Thanks, Kishore

On Tue, Jan 15, 2013 at 11:48 AM, anil gupta anilgupt...@gmail.com wrote:
The following log tells you the exact error:

JVMDUMP013I Processed dump event systhrow, detail java/lang/OutOfMemoryError.
Exception in thread Thread-7 java.lang.OutOfMemoryError
at ApplicationMaster.readMessage(ApplicationMaster.java:241)
at ApplicationMaster$SectionLeaderRunnable.run(ApplicationMaster.java:825)
at java.lang.Thread.run(Thread.java:736)

You might need to increase the HeapSize of the ApplicationMaster. HTH, Anil Gupta

On Mon, Jan 14, 2013 at 4:35 AM, Krishna Kishore Bonagiri write2kish...@gmail.com wrote:
Hi, I am getting the following error in ApplicationMaster.stderr when running an application with around 25 container launches. How can I resolve this issue?

JVMDUMP006I Processing dump event systhrow, detail java/lang/OutOfMemoryError - please wait.
JVMDUMP032I JVM requested Heap dump using '/tmp/nm-local-dir/usercache/dsadm/appcache/application_1355219238448_0461/container_1355219238448_0461_01_01/heapdump.20130114.044646.16631.0001.phd' in response to an event
JVMDUMP010I Heap dump written to /tmp/nm-local-dir/usercache/dsadm/appcache/application_1355219238448_0461/container_1355219238448_0461_01_01/heapdump.20130114.044646.16631.0001.phd
JVMDUMP032I JVM requested Java dump using '/tmp/nm-local-dir/usercache/dsadm/appcache/application_1355219238448_0461/container_1355219238448_0461_01_01/javacore.20130114.044646.16631.0002.txt' in response to an event
JVMDUMP010I Java dump written to /tmp/nm-local-dir/usercache/dsadm/appcache/application_1355219238448_0461/container_1355219238448_0461_01_01/javacore.20130114.044646.16631.0002.txt
JVMDUMP032I JVM requested Snap dump using '/tmp/nm-local-dir/usercache/dsadm/appcache/application_1355219238448_0461/container_1355219238448_0461_01_01/Snap.20130114.044646.16631.0003.trc' in response to an event
JVMDUMP010I Snap dump written to /tmp/nm-local-dir/usercache/dsadm/appcache/application_1355219238448_0461/container_1355219238448_0461_01_01/Snap.20130114.044646.16631.0003.trc
JVMDUMP013I Processed dump event systhrow, detail java/lang/OutOfMemoryError.
Exception in thread Thread-7 java.lang.OutOfMemoryError
at ApplicationMaster.readMessage(ApplicationMaster.java:241)
at ApplicationMaster$SectionLeaderRunnable.run(ApplicationMaster.java:825)
at java.lang.Thread.run(Thread.java:736)

Thanks, Kishore

-- Thanks & Regards, Anil Gupta
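For reference, a minimal sketch of where that flag goes in a YARN client (API as in the 2.0.x alphas; the class name ApplicationMaster, the 1024 MB figure, and the surrounding variable names are illustrative assumptions, not the poster's actual code). The client owns the AM's launch command, so the heap option is added to the command it places in the AM's ContainerLaunchContext:

    import java.util.Collections;
    import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
    import org.apache.hadoop.yarn.util.Records;

    // Fragment of a client's submission path: the -Xmx value here becomes the
    // Application Master's heap size, because this is the command the
    // NodeManager uses to start the AM container.
    ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
    amContainer.setCommands(Collections.singletonList(
            "${JAVA_HOME}/bin/java -Xmx1024m ApplicationMaster"));
    // ... then set local resources and environment, attach amContainer to the
    // ApplicationSubmissionContext, and submit as before.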
Re: OutofMemoryError when running an YARN application with 25 containers
Hi Arun, Thanks for the reply. I am not running a MapReduce application; I am running a custom distributed application. And I am using 2.0.0-alpha. Also, I have one more query: from the time the ApplicationMaster is submitted by the Client to the ASM, it is taking around 7 seconds for the AM to come up. Is there a way to improve that time?

Thanks, Kishore

On Tue, Jan 15, 2013 at 5:43 PM, Arun C Murthy a...@hortonworks.com wrote:
How many maps and reduces did your job have? Also, what release are you using? I'd recommend at least 2.0.2-alpha, though we should be able to release 2.0.3-alpha very soon. Arun

On Jan 14, 2013, at 4:35 AM, Krishna Kishore Bonagiri wrote:
Hi, I am getting the following error in ApplicationMaster.stderr when running an application with around 25 container launches. How can I resolve this issue?

JVMDUMP006I Processing dump event systhrow, detail java/lang/OutOfMemoryError - please wait.
JVMDUMP032I JVM requested Heap dump using '/tmp/nm-local-dir/usercache/dsadm/appcache/application_1355219238448_0461/container_1355219238448_0461_01_01/heapdump.20130114.044646.16631.0001.phd' in response to an event
JVMDUMP010I Heap dump written to /tmp/nm-local-dir/usercache/dsadm/appcache/application_1355219238448_0461/container_1355219238448_0461_01_01/heapdump.20130114.044646.16631.0001.phd
JVMDUMP032I JVM requested Java dump using '/tmp/nm-local-dir/usercache/dsadm/appcache/application_1355219238448_0461/container_1355219238448_0461_01_01/javacore.20130114.044646.16631.0002.txt' in response to an event
JVMDUMP010I Java dump written to /tmp/nm-local-dir/usercache/dsadm/appcache/application_1355219238448_0461/container_1355219238448_0461_01_01/javacore.20130114.044646.16631.0002.txt
JVMDUMP032I JVM requested Snap dump using '/tmp/nm-local-dir/usercache/dsadm/appcache/application_1355219238448_0461/container_1355219238448_0461_01_01/Snap.20130114.044646.16631.0003.trc' in response to an event
JVMDUMP010I Snap dump written to /tmp/nm-local-dir/usercache/dsadm/appcache/application_1355219238448_0461/container_1355219238448_0461_01_01/Snap.20130114.044646.16631.0003.trc
JVMDUMP013I Processed dump event systhrow, detail java/lang/OutOfMemoryError.
Exception in thread Thread-7 java.lang.OutOfMemoryError
at ApplicationMaster.readMessage(ApplicationMaster.java:241)
at ApplicationMaster$SectionLeaderRunnable.run(ApplicationMaster.java:825)
at java.lang.Thread.run(Thread.java:736)

Thanks, Kishore

-- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/
RE: On a lighter note
LOL just amazing... I remember having a similar conversation with someone who didn't understand meaning of secondary namenode :-) Viral -- From: iwannaplay games Sent: 1/18/2013 1:24 AM To: user@hadoop.apache.org Subject: Re: On a lighter note Awesome :) Regards Prabhjot
Re: how to restrict the concurrent running map tasks?
You will need to use an alternative scheduler for this. Look at the minMaps/maxMaps/etc. properties in the FairScheduler at http://hadoop.apache.org/docs/stable/fair_scheduler.html#Allocation+File+%28fair-scheduler.xml%29

Alternatively, look at resource-based scheduling in the CapacityScheduler at http://hadoop.apache.org/docs/stable/capacity_scheduler.html#Resource+based+scheduling

P.S. Do not use the general@ list for user-level queries. The right list is user@hadoop.apache.org.

On Fri, Jan 18, 2013 at 3:52 PM, hwang joe.haiw...@gmail.com wrote:
Hi all: My Hadoop version is 1.0.2. Now I want at most 10 map tasks running at the same time. I have found 2 parameters related to this question.

a) mapred.job.map.capacity -- but in my Hadoop version, this parameter seems abandoned.

b) mapred.jobtracker.taskScheduler.maxRunningTasksPerJob (http://grepcode.com/file/repo1.maven.org/maven2/com.ning/metrics.collector/1.0.2/mapred-default.xml)

I set this variable like below:

    Configuration conf = new Configuration();
    conf.set("date", date);
    conf.set("mapred.job.queue.name", "hadoop");
    conf.set("mapred.jobtracker.taskScheduler.maxRunningTasksPerJob", "10");
    DistributedCache.createSymlink(conf);
    Job job = new Job(conf, "ConstructApkDownload_" + date);
    ...

The problem is that it doesn't work. There are still more than 50 maps running as the job starts. I'm not sure whether I set this parameter in the wrong way, or misunderstand it. After looking through the Hadoop documentation, I can't find another parameter to limit the concurrently running map tasks. Hope someone can help me. Thanks.

-- Harsh J
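As a concrete illustration of the first suggestion above, a fair-scheduler.xml allocation file can cap a pool's concurrent map tasks. This is only a sketch against the Hadoop 1.x FairScheduler allocation-file format linked above; the pool name and the limits are assumed values for illustration:

    <?xml version="1.0"?>
    <allocations>
      <!-- Cap the pool this job is submitted to at 10 concurrent map tasks -->
      <pool name="hadoop">
        <maxMaps>10</maxMaps>
        <maxReduces>10</maxReduces>
      </pool>
    </allocations>

The JobTracker must also be configured to use the FairScheduler, and the job must actually be submitted to that pool, for the cap to take effect.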
Re: On a lighter note
Folks quite often get confused by the name. But this one is just unbeatable :) Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Fri, Jan 18, 2013 at 4:52 PM, Viral Bajaria viral.baja...@gmail.comwrote: LOL just amazing... I remember having a similar conversation with someone who didn't understand meaning of secondary namenode :-) Viral -- From: iwannaplay games Sent: 1/18/2013 1:24 AM To: user@hadoop.apache.org Subject: Re: On a lighter note Awesome :) Regards Prabhjot
Re: Estimating disk space requirements
Hi Panshul, If you have 20 GB with a replication factor set to 3, you have only 6.6 GB available, not 11 GB. You need to divide the total space by the replication factor.

Also, if you store your JSON into HBase, you need to add the key size to it. If your key is 4 bytes, or 1024 bytes, it makes a difference.

So roughly, 24 000 000 * 5 * 1024 bytes = 114 GB. You don't have the space to store it, without even including the key size. Even with a replication factor set to 5 you don't have the space. Now, you can add some compression, but even with a lucky factor of 50% you still don't have the space. You will need something like a 90% compression factor to be able to store this data in your cluster.

A 1 TB drive is now less than $100... so you might think about replacing your 20 GB drives with something bigger.

To reply to your last question: for your data here, you will need AT LEAST 350 GB of overall storage. But that's a bare minimum. Don't go under 500 GB. IMHO

JM

2013/1/18, Panshul Whisper ouchwhis...@gmail.com:
Hello, I was estimating how much disk space I need for my cluster. I have 24 million JSON documents, approx. 5 KB each. The JSON is to be stored into HBase with some identifying data in columns, and I also want to store the JSON for later retrieval based on the ID data as keys in HBase. I have my HDFS replication set to 3. Each node has Hadoop, HBase and Ubuntu installed on it, so approx 11 GB is available for use on my 20 GB node. I have no idea, if I have not enabled HBase replication, whether the HDFS replication is enough to keep the data safe and redundant. How much total disk space will I need for the storage of the data? Please help me estimate this. Thank you so much. -- Regards, Ouch Whisper 010101010101
Re: Estimating disk space requirements
Hi, some comments are inside your message...

2013/1/18 Panshul Whisper ouchwhis...@gmail.com
Hello, I was estimating how much disk space I need for my cluster. I have 24 million JSON documents, approx. 5 KB each. The JSON is to be stored into HBase with some identifying data in columns, and I also want to store the JSON for later retrieval based on the ID data as keys in HBase. I have my HDFS replication set to 3. Each node has Hadoop, HBase and Ubuntu installed on it, so approx 11 GB is available for use on my 20 GB node.

11 GB is quite small - or is there a typo? The amount of raw data is about 115 GB:

    nr of items:       24 * 1.00E+06
    size of an item:    5 * 1.02E+03 bytes
    total:             approx. 1.23E+11 bytes = 114.44 GB
    (without additional key and metadata)

Depending on the amount of overhead this could be about 200 GB, and x 3 is 600 GB just for distributed storage. And then you need some capacity to store intermediate processing data; 20% to 30% of the processed data is recommended. So you might prepare a capacity of 1 TB, or even more if your dataset grows.

I have no idea, if I have not enabled HBase replication, whether the HDFS replication is enough to keep the data safe and redundant.

The replication on the HDFS level is sufficient for keeping the data safe; there is no need to replicate the HBase tables separately.

How much total disk space will I need for the storage of the data? Please help me estimate this. Thank you so much. -- Regards, Ouch Whisper 010101010101

Best wishes
Mirko
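Pulling the thread's arithmetic together, a tiny back-of-the-envelope sketch (the document count, document size, replication factor, and headroom percentage are the assumptions discussed above, not measured values; keys and HBase metadata are excluded):

    // Capacity estimate using the figures from this thread: 24 M JSON documents
    // of ~5 KB each, HDFS replication factor 3, ~25% headroom for intermediate data.
    public class CapacityEstimate {
        public static void main(String[] args) {
            long docs = 24000000L;
            long bytesPerDoc = 5L * 1024;
            double rawGb = docs * bytesPerDoc / Math.pow(1024, 3);  // ~114 GB of raw JSON
            double replicatedGb = rawGb * 3;                        // ~343 GB on disk with replication
            double withHeadroomGb = replicatedGb * 1.25;            // ~429 GB including temp space
            System.out.printf("raw=%.1f GB, replicated=%.1f GB, with headroom=%.1f GB%n",
                    rawGb, replicatedGb, withHeadroomGb);
        }
    }

This is consistent with the "at least 350 GB, don't go under 500 GB" advice earlier in the thread.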
FW: HBase Master not getting started
Hi, Could you please guide me? Regards, Deepak

-----Original Message-----
From: Kumar, Deepak8 [CCC-OT_IT NE]
Sent: Thursday, January 17, 2013 1:36 PM
To: 'cdh-u...@cloudera.org'
Cc: Kumar, Deepak8 [CCC-OT_IT NE]
Subject: HBase Master not getting started

Hi, Something abnormal happened in my cluster. The default location of the snapshot dataDir for ZooKeeper is /var/lib/zookeeper in CDH4. The disk on which the /var location is configured became full and the cluster went down (ZooKeeper and HBase were in ERROR status). I have cleaned the /var location, but it seems the snapshot dataDir location for ZooKeeper is not getting updated and the HBase Master is not able to connect to ZooKeeper. Could you please guide me?

Regards, Deepak
Re: On a lighter note
Someone should make one about unsubscribing from this mailing list! :D

*Fabio Pitzolu* Consultant - BI & Infrastructure Mob. +39 3356033776 Telefono 02 87157239 Fax. 02 93664786 *Gruppo Consulenza Innovazione - http://www.gr-ci.com*

2013/1/18 Mohammad Tariq donta...@gmail.com
Folks quite often get confused by the name. But this one is just unbeatable :) Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com

On Fri, Jan 18, 2013 at 4:52 PM, Viral Bajaria viral.baja...@gmail.com wrote:
LOL just amazing... I remember having a similar conversation with someone who didn't understand meaning of secondary namenode :-) Viral

-- From: iwannaplay games Sent: 1/18/2013 1:24 AM To: user@hadoop.apache.org Subject: Re: On a lighter note
Awesome :) Regards Prabhjot
Re: Estimating disk space requirements
Thank you for the replies. So I take it that I should have at least 800 GB of total free space on HDFS (combined free space of all the nodes connected to the cluster). So I can connect 20 nodes having 40 GB of HDD on each node to my cluster. Will this be enough for the storage? Please confirm.

Thanking You, Regards, Panshul.

On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote:
Hi Panshul, If you have 20 GB with a replication factor set to 3, you have only 6.6 GB available, not 11 GB. You need to divide the total space by the replication factor. Also, if you store your JSON into HBase, you need to add the key size to it. If your key is 4 bytes, or 1024 bytes, it makes a difference. So roughly, 24 000 000 * 5 * 1024 bytes = 114 GB. You don't have the space to store it, without even including the key size. Even with a replication factor set to 5 you don't have the space. Now, you can add some compression, but even with a lucky factor of 50% you still don't have the space. You will need something like a 90% compression factor to be able to store this data in your cluster. A 1 TB drive is now less than $100... so you might think about replacing your 20 GB drives with something bigger. To reply to your last question: for your data here, you will need AT LEAST 350 GB of overall storage. But that's a bare minimum. Don't go under 500 GB. IMHO JM

2013/1/18, Panshul Whisper ouchwhis...@gmail.com:
Hello, I was estimating how much disk space I need for my cluster. I have 24 million JSON documents, approx. 5 KB each. The JSON is to be stored into HBase with some identifying data in columns, and I also want to store the JSON for later retrieval based on the ID data as keys in HBase. I have my HDFS replication set to 3. Each node has Hadoop, HBase and Ubuntu installed on it, so approx 11 GB is available for use on my 20 GB node. I have no idea, if I have not enabled HBase replication, whether the HDFS replication is enough to keep the data safe and redundant. How much total disk space will I need for the storage of the data? Please help me estimate this. Thank you so much. -- Regards, Ouch Whisper 010101010101

-- Regards, Ouch Whisper 010101010101
Re: Estimating disk space requirements
20 nodes with 40 GB will do the work. After that you will have to consider performances based on your access pattern. But that's another story. JM 2013/1/18, Panshul Whisper ouchwhis...@gmail.com: Thank you for the replies, So I take it that I should have atleast 800 GB on total free space on HDFS.. (combined free space of all the nodes connected to the cluster). So I can connect 20 nodes having 40 GB of hdd on each node to my cluster. Will this be enough for the storage? Please confirm. Thanking You, Regards, Panshul. On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Hi Panshul, If you have 20 GB with a replication factor set to 3, you have only 6.6GB available, not 11GB. You need to divide the total space by the replication factor. Also, if you store your JSon into HBase, you need to add the key size to it. If you key is 4 bytes, or 1024 bytes, it makes a difference. So roughly, 24 000 000 * 5 * 1024 = 114GB. You don't have the space to store it. Without including the key size. Even with a replication factor set to 5 you don't have the space. Now, you can add some compression, but even with a lucky factor of 50% you still don't have the space. You will need something like 90% compression factor to be able to store this data in your cluster. A 1T drive is now less than $100... So you might think about replacing you 20 GB drives by something bigger. to reply to your last question, for your data here, you will need AT LEAST 350GB overall storage. But that's a bare minimum. Don't go under 500GB. IMHO JM 2013/1/18, Panshul Whisper ouchwhis...@gmail.com: Hello, I was estimating how much disk space do I need for my cluster. I have 24 million JSON documents approx. 5kb each the Json is to be stored into HBASE with some identifying data in coloumns and I also want to store the Json for later retrieval based on the Id data as keys in Hbase. I have my HDFS replication set to 3 each node has Hadoop and hbase and Ubuntu installed on it.. so approx 11 GB is available for use on my 20 GB node. I have no idea, if I have not enabled Hbase replication, is the HDFS replication enough to keep the data safe and redundant. How much total disk space I will need for the storage of the data. Please help me estimate this. Thank you so much. -- Regards, Ouch Whisper 010101010101 -- Regards, Ouch Whisper 010101010101
Re: On a lighter note
:) ∞ Shashwat Shriparv On Fri, Jan 18, 2013 at 6:43 PM, Fabio Pitzolu fabio.pitz...@gr-ci.comwrote: Someone should made one about unsubscribing from this mailing list ! :D *Fabio Pitzolu* Consultant - BI Infrastructure Mob. +39 3356033776 Telefono 02 87157239 Fax. 02 93664786 *Gruppo Consulenza Innovazione - http://www.gr-ci.com* 2013/1/18 Mohammad Tariq donta...@gmail.com Folks quite often get confused by the name. But this one is just unbeatable :) Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Fri, Jan 18, 2013 at 4:52 PM, Viral Bajaria viral.baja...@gmail.comwrote: LOL just amazing... I remember having a similar conversation with someone who didn't understand meaning of secondary namenode :-) Viral -- From: iwannaplay games Sent: 1/18/2013 1:24 AM To: user@hadoop.apache.org Subject: Re: On a lighter note Awesome :) Regards Prabhjot
Re: Estimating disk space requirements
If we look at it with performance in mind, is it better to have 20 Nodes with 40 GB HDD or is it better to have 10 Nodes with 80 GB HDD? they are connected on a gigabit LAN Thnx On Fri, Jan 18, 2013 at 2:26 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: 20 nodes with 40 GB will do the work. After that you will have to consider performances based on your access pattern. But that's another story. JM 2013/1/18, Panshul Whisper ouchwhis...@gmail.com: Thank you for the replies, So I take it that I should have atleast 800 GB on total free space on HDFS.. (combined free space of all the nodes connected to the cluster). So I can connect 20 nodes having 40 GB of hdd on each node to my cluster. Will this be enough for the storage? Please confirm. Thanking You, Regards, Panshul. On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Hi Panshul, If you have 20 GB with a replication factor set to 3, you have only 6.6GB available, not 11GB. You need to divide the total space by the replication factor. Also, if you store your JSon into HBase, you need to add the key size to it. If you key is 4 bytes, or 1024 bytes, it makes a difference. So roughly, 24 000 000 * 5 * 1024 = 114GB. You don't have the space to store it. Without including the key size. Even with a replication factor set to 5 you don't have the space. Now, you can add some compression, but even with a lucky factor of 50% you still don't have the space. You will need something like 90% compression factor to be able to store this data in your cluster. A 1T drive is now less than $100... So you might think about replacing you 20 GB drives by something bigger. to reply to your last question, for your data here, you will need AT LEAST 350GB overall storage. But that's a bare minimum. Don't go under 500GB. IMHO JM 2013/1/18, Panshul Whisper ouchwhis...@gmail.com: Hello, I was estimating how much disk space do I need for my cluster. I have 24 million JSON documents approx. 5kb each the Json is to be stored into HBASE with some identifying data in coloumns and I also want to store the Json for later retrieval based on the Id data as keys in Hbase. I have my HDFS replication set to 3 each node has Hadoop and hbase and Ubuntu installed on it.. so approx 11 GB is available for use on my 20 GB node. I have no idea, if I have not enabled Hbase replication, is the HDFS replication enough to keep the data safe and redundant. How much total disk space I will need for the storage of the data. Please help me estimate this. Thank you so much. -- Regards, Ouch Whisper 010101010101 -- Regards, Ouch Whisper 010101010101 -- Regards, Ouch Whisper 010101010101
Re: Query: Hadoop's threat to Informatica
Informatica's take on the question: http://www.informatica.com/hadoop/

My take on the question: Hadoop is definitely disruptive, and there have been times where we've been able to blow missed data pipeline SLAs out of the water using Hadoop where tools like Informatica were not able to. But Informatica's take on metadata management, mixed workloads, and governance is somewhat well taken. It's not that this stuff isn't doable with Hadoop, but the maturity of enterprise tools like Informatica is a little farther along. Jeff

On Thu, Jan 17, 2013 at 10:51 PM, Mohammad Tariq donta...@gmail.com wrote:
Hello Sameer, Please find my comments embedded below. Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com

On Fri, Jan 18, 2013 at 11:21 AM, Sameer Jain sameer.j...@evalueserve.com wrote:
Hi, I am trying to understand the different data analysis algorithms available in the market. Analyst opinion suggests that Informatica and Hadoop have the best offerings in this space. However, I am not very clear as to how the two are different and how they compete, because Hadoop is being used by IBM etc. Since you appear to be a fairly seasoned expert in this domain, I would like to get your perspective on the following. I would hugely appreciate any thoughts/insights around:

· The workings of Hadoop/MapReduce
Hadoop is an open source platform that allows us to store and process huge, really huge, amounts of data over a network of machines (which need not be very sophisticated). It has 2 layers, viz. HDFS and MapReduce, for storage and processing respectively.

· Informatica's product offering
They can tell you better. This list is specific to the Hadoop ecosystem.

· A comparison of which one of these is better
Depends upon the particular use case. One size doesn't fit all.

· A view of whether Hadoop can be, and/or is, in competition with Informatica
I don't think so. Informatica is basically an ETL thing (if I am not wrong), while we leverage Hadoop's power to create ETL tools with the help of different Hadoop sub-projects. Though it is possible to use them together.

Regards, Sameer

*Sameer Jain* -- Research Lead Evalueserve Office: + 91 124 4621615 Mob: + 91 7827256066 Fax: + 91 124 406 3430 www.evalueserve.com
Re: Estimating disk space requirements
It all depends what you want to do with this data and the power of each single node. There is no one-size-fits-all rule. The more nodes you have, the more CPU power you will have to process the data... But if your 80 GB boxes' CPUs are faster than your 40 GB boxes' CPUs, maybe you should take the 80 GB ones then.

If you want to get better advice from the list, you will need to better define your needs and the nodes you can have.

JM

2013/1/18, Panshul Whisper ouchwhis...@gmail.com:
If we look at it with performance in mind, is it better to have 20 nodes with 40 GB HDD or is it better to have 10 nodes with 80 GB HDD? They are connected on a gigabit LAN. Thanks

On Fri, Jan 18, 2013 at 2:26 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote:
20 nodes with 40 GB will do the work. After that you will have to consider performance based on your access pattern. But that's another story. JM

2013/1/18, Panshul Whisper ouchwhis...@gmail.com:
Thank you for the replies. So I take it that I should have at least 800 GB of total free space on HDFS (combined free space of all the nodes connected to the cluster). So I can connect 20 nodes having 40 GB of HDD on each node to my cluster. Will this be enough for the storage? Please confirm. Thanking You, Regards, Panshul.

On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote:
Hi Panshul, If you have 20 GB with a replication factor set to 3, you have only 6.6 GB available, not 11 GB. You need to divide the total space by the replication factor. Also, if you store your JSON into HBase, you need to add the key size to it. If your key is 4 bytes, or 1024 bytes, it makes a difference. So roughly, 24 000 000 * 5 * 1024 bytes = 114 GB. You don't have the space to store it, without even including the key size. Even with a replication factor set to 5 you don't have the space. Now, you can add some compression, but even with a lucky factor of 50% you still don't have the space. You will need something like a 90% compression factor to be able to store this data in your cluster. A 1 TB drive is now less than $100... so you might think about replacing your 20 GB drives with something bigger. To reply to your last question: for your data here, you will need AT LEAST 350 GB of overall storage. But that's a bare minimum. Don't go under 500 GB. IMHO JM

2013/1/18, Panshul Whisper ouchwhis...@gmail.com:
Hello, I was estimating how much disk space I need for my cluster. I have 24 million JSON documents, approx. 5 KB each. The JSON is to be stored into HBase with some identifying data in columns, and I also want to store the JSON for later retrieval based on the ID data as keys in HBase. I have my HDFS replication set to 3. Each node has Hadoop, HBase and Ubuntu installed on it, so approx 11 GB is available for use on my 20 GB node. I have no idea, if I have not enabled HBase replication, whether the HDFS replication is enough to keep the data safe and redundant. How much total disk space will I need for the storage of the data? Please help me estimate this. Thank you so much. -- Regards, Ouch Whisper 010101010101

-- Regards, Ouch Whisper 010101010101

-- Regards, Ouch Whisper 010101010101
Re: Estimating disk space requirements
Thank you for the reply. It would be great if someone can suggest whether setting up my cluster on Rackspace is good, or on Amazon using EC2 servers, keeping in mind Amazon services have been having a lot of downtimes... My main point of concern is performance and availability. My cluster has to be very Highly Available.

Thanks.

On Fri, Jan 18, 2013 at 3:12 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote:
It all depends what you want to do with this data and the power of each single node. There is no one-size-fits-all rule. The more nodes you have, the more CPU power you will have to process the data... But if your 80 GB boxes' CPUs are faster than your 40 GB boxes' CPUs, maybe you should take the 80 GB ones then. If you want to get better advice from the list, you will need to better define your needs and the nodes you can have. JM

2013/1/18, Panshul Whisper ouchwhis...@gmail.com:
If we look at it with performance in mind, is it better to have 20 nodes with 40 GB HDD or is it better to have 10 nodes with 80 GB HDD? They are connected on a gigabit LAN. Thanks

On Fri, Jan 18, 2013 at 2:26 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote:
20 nodes with 40 GB will do the work. After that you will have to consider performance based on your access pattern. But that's another story. JM

2013/1/18, Panshul Whisper ouchwhis...@gmail.com:
Thank you for the replies. So I take it that I should have at least 800 GB of total free space on HDFS (combined free space of all the nodes connected to the cluster). So I can connect 20 nodes having 40 GB of HDD on each node to my cluster. Will this be enough for the storage? Please confirm. Thanking You, Regards, Panshul.

On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote:
Hi Panshul, If you have 20 GB with a replication factor set to 3, you have only 6.6 GB available, not 11 GB. You need to divide the total space by the replication factor. Also, if you store your JSON into HBase, you need to add the key size to it. If your key is 4 bytes, or 1024 bytes, it makes a difference. So roughly, 24 000 000 * 5 * 1024 bytes = 114 GB. You don't have the space to store it, without even including the key size. Even with a replication factor set to 5 you don't have the space. Now, you can add some compression, but even with a lucky factor of 50% you still don't have the space. You will need something like a 90% compression factor to be able to store this data in your cluster. A 1 TB drive is now less than $100... so you might think about replacing your 20 GB drives with something bigger. To reply to your last question: for your data here, you will need AT LEAST 350 GB of overall storage. But that's a bare minimum. Don't go under 500 GB. IMHO JM

2013/1/18, Panshul Whisper ouchwhis...@gmail.com:
Hello, I was estimating how much disk space I need for my cluster. I have 24 million JSON documents, approx. 5 KB each. The JSON is to be stored into HBase with some identifying data in columns, and I also want to store the JSON for later retrieval based on the ID data as keys in HBase. I have my HDFS replication set to 3. Each node has Hadoop, HBase and Ubuntu installed on it, so approx 11 GB is available for use on my 20 GB node. I have no idea, if I have not enabled HBase replication, whether the HDFS replication is enough to keep the data safe and redundant. How much total disk space will I need for the storage of the data? Please help me estimate this. Thank you so much.
-- Regards, Ouch Whisper 010101010101 -- Regards, Ouch Whisper 010101010101 -- Regards, Ouch Whisper 010101010101 -- Regards, Ouch Whisper 010101010101
Re: Estimating disk space requirements
I have been using AWS for quite some time and I have never faced any issues. Personally speaking, I found AWS really flexible. You get a great deal of flexibility in choosing services depending upon your requirements.

Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com

On Fri, Jan 18, 2013 at 7:54 PM, Panshul Whisper ouchwhis...@gmail.com wrote:
Thank you for the reply. It would be great if someone can suggest whether setting up my cluster on Rackspace is good, or on Amazon using EC2 servers, keeping in mind Amazon services have been having a lot of downtimes... My main point of concern is performance and availability. My cluster has to be very Highly Available. Thanks.

On Fri, Jan 18, 2013 at 3:12 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote:
It all depends what you want to do with this data and the power of each single node. There is no one-size-fits-all rule. The more nodes you have, the more CPU power you will have to process the data... But if your 80 GB boxes' CPUs are faster than your 40 GB boxes' CPUs, maybe you should take the 80 GB ones then. If you want to get better advice from the list, you will need to better define your needs and the nodes you can have. JM

2013/1/18, Panshul Whisper ouchwhis...@gmail.com:
If we look at it with performance in mind, is it better to have 20 nodes with 40 GB HDD or is it better to have 10 nodes with 80 GB HDD? They are connected on a gigabit LAN. Thanks

On Fri, Jan 18, 2013 at 2:26 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote:
20 nodes with 40 GB will do the work. After that you will have to consider performance based on your access pattern. But that's another story. JM

2013/1/18, Panshul Whisper ouchwhis...@gmail.com:
Thank you for the replies. So I take it that I should have at least 800 GB of total free space on HDFS (combined free space of all the nodes connected to the cluster). So I can connect 20 nodes having 40 GB of HDD on each node to my cluster. Will this be enough for the storage? Please confirm. Thanking You, Regards, Panshul.

On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote:
Hi Panshul, If you have 20 GB with a replication factor set to 3, you have only 6.6 GB available, not 11 GB. You need to divide the total space by the replication factor. Also, if you store your JSON into HBase, you need to add the key size to it. If your key is 4 bytes, or 1024 bytes, it makes a difference. So roughly, 24 000 000 * 5 * 1024 bytes = 114 GB. You don't have the space to store it, without even including the key size. Even with a replication factor set to 5 you don't have the space. Now, you can add some compression, but even with a lucky factor of 50% you still don't have the space. You will need something like a 90% compression factor to be able to store this data in your cluster. A 1 TB drive is now less than $100... so you might think about replacing your 20 GB drives with something bigger. To reply to your last question: for your data here, you will need AT LEAST 350 GB of overall storage. But that's a bare minimum. Don't go under 500 GB. IMHO JM

2013/1/18, Panshul Whisper ouchwhis...@gmail.com:
Hello, I was estimating how much disk space I need for my cluster. I have 24 million JSON documents, approx. 5 KB each. The JSON is to be stored into HBase with some identifying data in columns, and I also want to store the JSON for later retrieval based on the ID data as keys in HBase. I have my HDFS replication set to 3. Each node has Hadoop, HBase and Ubuntu installed on it, so approx 11 GB is available for use on my 20 GB node.
I have no idea, if I have not enabled Hbase replication, is the HDFS replication enough to keep the data safe and redundant. How much total disk space I will need for the storage of the data. Please help me estimate this. Thank you so much. -- Regards, Ouch Whisper 010101010101 -- Regards, Ouch Whisper 010101010101 -- Regards, Ouch Whisper 010101010101 -- Regards, Ouch Whisper 010101010101
Cohesion of Hadoop team?
Hi, looking at the derivation of the 0.23.x 2.0.x branches on one hand, and the 1.x branches on the other, as described here: http://mail-archives.apache.org/mod_mbox/hadoop-user/201301.mbox/%3CCD0CAB8B.1098F%25evans%40yahoo-inc.com%3E One gets the impression the Hadoop committers are split into two teams, with one team working on 0.23.x/2.0.2 and another team working on 1.x, running the risk of increasingly diverging products eventually competing with each other. Is that the case? Is there expected to be a Hadoop 3.0 where the results of the two lines of development will merge or is it increasingly likely the subteams will continue their separate routes? Thanks, Glen -- Glen Mazza Talend Community Coders - coders.talend.com blog: www.jroller.com/gmazza
Re: Execution of udf
No, but the query execution shows a reducer running... and in fact I feel that a reduce phase can be there.

On Friday, January 18, 2013, Dean Wampler wrote:
There is no reduce phase needed in this query.

On Fri, Jan 18, 2013 at 6:59 AM, nagarjuna kanamarlapudi nagarjuna.kanamarlap...@gmail.com wrote:
Hi,

Select col1, myudf(col2, col3) from table1;

In what phase of map reduce is a UDF executed? In the very beginning, I assumed that Hive would be joining the two tables, getting the required columns, and then applying the UDF on the specified columns, i.e., essentially in the reducer phase. But later on I realised that I was wrong. Is there any specific parameter which tells Hive to call the UDF at the reducer phase rather than at the mapper phase?

Regards, Nagarjuna
-- Sent from iPhone

-- *Dean Wampler, Ph.D.* thinkbiganalytics.com +1-312-339-1330

-- Sent from iPhone
Re: Problems
Leo, I downloaded the suggested 1.6.0_32 Java version to my home directory, but I am still experiencing the same problem (See error below). The only thing that I have set in my hadoop-env.sh file is the JAVA_HOME environment variable. I have also tried it with the Java directory added to PATH. export JAVA_HOME=/home/shu/jre1.6.0_32 export PATH=$PATH:/home/shu/jre1.6.0_32 Every other environment variable is defaulted. Just to clarify, I have tried this in Local Standalone mode and also in Pseudo-Distributed Mode with the same result. Frustrating to say the least, Sean Hudson shu@meath-nua:~/hadoop-1.0.4 bin/hadoop jar hadoop-examples-1.0.4.jar grep input output 'dfs[a-z.]+' # # A fatal error has been detected by the Java Runtime Environment: # # SIGFPE (0x8) at pc=0xb7fc51fb, pid=23112, tid=3075554208 # # JRE version: 6.0_32-b05 # Java VM: Java HotSpot(TM) Client VM (20.7-b02 mixed mode, sharing linux-x86 ) # Problematic frame: # C [ld-linux.so.2+0x91fb] double+0xab # # An error report file with more information is saved as: # /home/shu/hadoop-1.0.4/hs_err_pid23112.log # # If you would like to submit a bug report, please visit: # http://java.sun.com/webapps/bugreport/crash.jsp # The crash happened outside the Java Virtual Machine in native code. # See problematic frame for where to report the bug. # Aborted -Original Message- From: Leo Leung Sent: Thursday, January 17, 2013 6:46 PM To: user@hadoop.apache.org Subject: RE: Problems Use Sun/Oracle 1.6.0_32+ Build should be 20.7-b02+ 1.7 causes failure and AFAIK, not supported, but you are free to try the latest version and report back. -Original Message- From: Sean Hudson [mailto:sean.hud...@ostiasolutions.com] Sent: Thursday, January 17, 2013 6:57 AM To: user@hadoop.apache.org Subject: Re: Problems Hi, My Java version is java version 1.6.0_25 Java(TM) SE Runtime Environment (build 1.6.0_25-b06) Java HotSpot(TM) Client VM (build 20.0-b11, mixed mode, sharing) Would you advise obtaining a later Java version? Sean -Original Message- From: Jean-Marc Spaggiari Sent: Thursday, January 17, 2013 2:52 PM To: user@hadoop.apache.org Subject: Re: Problems Hi Sean, This is an issue with your JVM. Not related to hadoop. Which JVM are you using, and can you try with the last from Sun? JM 2013/1/17, Sean Hudson sean.hud...@ostiasolutions.com: Hi, I have recently installed hadoop-1.0.4 on a linux machine. Whilst working through the post-install instructions contained in the “Quick Start” guide, I incurred the following catastrophic Java runtime error (See below). I have attached the error report file “hs_err_pid24928.log”. I have submitted a Java bug report, but perhaps it is a known hadoop-1.0.4 version problem. I am a first time user of Hadoop and would welcome guidance on this problem, Regards, Sean Hudson. shu@meath-nua:~/hadoop-1.0.4 bin/hadoop jar hadoop-examples-1.0.4.jar grep input output 'dfs[a-z.]+' # # A fatal error has been detected by the Java Runtime Environment: # # SIGFPE (0x8) at pc=0xb7f2b1fb, pid=24928, tid=3074923424 # # JRE version: 6.0_25-b06 # Java VM: Java HotSpot(TM) Client VM (20.0-b11 mixed mode, sharing linux-x86 ) # Problematic frame: # C [ld-linux.so.2+0x91fb] double+0xab # # An error report file with more information is saved as: # /home/shu/hadoop-1.0.4/hs_err_pid24928.log # # If you would like to submit a bug report, please visit: # http://java.sun.com/webapps/bugreport/crash.jsp # The crash happened outside the Java Virtual Machine in native code. # See problematic frame for where to report the bug. 
# Aborted
Re: On a lighter note
:) :) On Fri, Jan 18, 2013 at 7:08 PM, shashwat shriparv dwivedishash...@gmail.com wrote: :) ∞ Shashwat Shriparv On Fri, Jan 18, 2013 at 6:43 PM, Fabio Pitzolu fabio.pitz...@gr-ci.comwrote: Someone should made one about unsubscribing from this mailing list ! :D *Fabio Pitzolu* Consultant - BI Infrastructure Mob. +39 3356033776 Telefono 02 87157239 Fax. 02 93664786 *Gruppo Consulenza Innovazione - http://www.gr-ci.com* 2013/1/18 Mohammad Tariq donta...@gmail.com Folks quite often get confused by the name. But this one is just unbeatable :) Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Fri, Jan 18, 2013 at 4:52 PM, Viral Bajaria viral.baja...@gmail.comwrote: LOL just amazing... I remember having a similar conversation with someone who didn't understand meaning of secondary namenode :-) Viral -- From: iwannaplay games Sent: 1/18/2013 1:24 AM To: user@hadoop.apache.org Subject: Re: On a lighter note Awesome :) Regards Prabhjot -- Regards, Varun Kumar.P
Re: Problems
Hi Sean, It's strange. You should not be facing that. I faced the same kind of issues on a desktop with memory errors. Can you install memtest86 and fully test your memory (one pass is enough) to make sure you don't have issues on that side?

2013/1/18, Sean Hudson sean.hud...@ostiasolutions.com:
Leo, I downloaded the suggested 1.6.0_32 Java version to my home directory, but I am still experiencing the same problem (see error below). The only thing that I have set in my hadoop-env.sh file is the JAVA_HOME environment variable. I have also tried it with the Java directory added to PATH.

export JAVA_HOME=/home/shu/jre1.6.0_32
export PATH=$PATH:/home/shu/jre1.6.0_32

Every other environment variable is defaulted. Just to clarify, I have tried this in Local Standalone mode and also in Pseudo-Distributed mode with the same result.

Frustrating to say the least, Sean Hudson

shu@meath-nua:~/hadoop-1.0.4 bin/hadoop jar hadoop-examples-1.0.4.jar grep input output 'dfs[a-z.]+'
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGFPE (0x8) at pc=0xb7fc51fb, pid=23112, tid=3075554208
#
# JRE version: 6.0_32-b05
# Java VM: Java HotSpot(TM) Client VM (20.7-b02 mixed mode, sharing linux-x86 )
# Problematic frame:
# C [ld-linux.so.2+0x91fb] double+0xab
#
# An error report file with more information is saved as:
# /home/shu/hadoop-1.0.4/hs_err_pid23112.log
#
# If you would like to submit a bug report, please visit:
# http://java.sun.com/webapps/bugreport/crash.jsp
# The crash happened outside the Java Virtual Machine in native code.
# See problematic frame for where to report the bug.
# Aborted

-----Original Message-----
From: Leo Leung
Sent: Thursday, January 17, 2013 6:46 PM
To: user@hadoop.apache.org
Subject: RE: Problems

Use Sun/Oracle 1.6.0_32+. Build should be 20.7-b02+. 1.7 causes failure and AFAIK is not supported, but you are free to try the latest version and report back.

-----Original Message-----
From: Sean Hudson [mailto:sean.hud...@ostiasolutions.com]
Sent: Thursday, January 17, 2013 6:57 AM
To: user@hadoop.apache.org
Subject: Re: Problems

Hi, My Java version is

java version 1.6.0_25
Java(TM) SE Runtime Environment (build 1.6.0_25-b06)
Java HotSpot(TM) Client VM (build 20.0-b11, mixed mode, sharing)

Would you advise obtaining a later Java version? Sean

-----Original Message-----
From: Jean-Marc Spaggiari
Sent: Thursday, January 17, 2013 2:52 PM
To: user@hadoop.apache.org
Subject: Re: Problems

Hi Sean, This is an issue with your JVM. Not related to hadoop. Which JVM are you using, and can you try with the latest from Sun? JM

2013/1/17, Sean Hudson sean.hud...@ostiasolutions.com:
Hi, I have recently installed hadoop-1.0.4 on a linux machine. Whilst working through the post-install instructions contained in the “Quick Start” guide, I incurred the following catastrophic Java runtime error (see below). I have attached the error report file “hs_err_pid24928.log”. I have submitted a Java bug report, but perhaps it is a known hadoop-1.0.4 version problem. I am a first time user of Hadoop and would welcome guidance on this problem,

Regards, Sean Hudson.
shu@meath-nua:~/hadoop-1.0.4 bin/hadoop jar hadoop-examples-1.0.4.jar grep input output 'dfs[a-z.]+' # # A fatal error has been detected by the Java Runtime Environment: # # SIGFPE (0x8) at pc=0xb7f2b1fb, pid=24928, tid=3074923424 # # JRE version: 6.0_25-b06 # Java VM: Java HotSpot(TM) Client VM (20.0-b11 mixed mode, sharing linux-x86 ) # Problematic frame: # C [ld-linux.so.2+0x91fb] double+0xab # # An error report file with more information is saved as: # /home/shu/hadoop-1.0.4/hs_err_pid24928.log # # If you would like to submit a bug report, please visit: # http://java.sun.com/webapps/bugreport/crash.jsp # The crash happened outside the Java Virtual Machine in native code. # See problematic frame for where to report the bug. # Aborted
unsubscribe
Please unsubscribe me from this news feed Thank you Cristian Cira Graduate Research Assistant Parallel Architecture and System Laboratory (PASL) Shelby Center 2105 Auburn University, AL 36849 From: yiyu jia [jia.y...@gmail.com] Sent: Friday, January 18, 2013 12:12 AM To: user@hadoop.apache.org Subject: run hadoop in standalone mode Hi, I tried to run hadoop in standalone mode according to the hadoop online documentation. But, I get the error message below. I run the command ./bin/hadoop jar hadoop-examples-1.1.1.jar pi 10 100. 13/01/18 01:07:05 INFO ipc.Client: Retrying connect to server: localhost/127.0.0.1:8020. Already tried 9 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS) java.lang.RuntimeException: java.net.ConnectException: Call to localhost/127.0.0.1:8020 failed on connection exception: java.net.ConnectException: Connection refused I disabled ipV6 and the firewall on my linux machine. But, I still get this error message. localhost is bound to 127.0.0.1. core-site.xml and mapreduce-site.xml are empty as they are not modified. Can anybody give me a hint if I need to do some specific configuration to run hadoop in standalone mode? thanks and regards, Yiyu
How to unsubscribe from the list (Re: unsubscribe)
Search on Google and click on the first link ;) https://www.google.ca/search?q=unsubscribe+hadoop+mailing+list 2013/1/18, Cristian Cira cmc0...@tigermail.auburn.edu: Please unsubscribe me from this news feed Thank you Cristian Cira Graduate Research Assistant Parallel Architecture and System Laboratory (PASL) Shelby Center 2105 Auburn University, AL 36849
Re: How to unsubscribe from the list (Re: unsubscribe)
This was EPIC!! :-D *Fabio Pitzolu* 2013/1/18 Jean-Marc Spaggiari jean-m...@spaggiari.org Search on Google and click on the first link ;) https://www.google.ca/search?q=unsubscribe+hadoop+mailing+list 2013/1/18, Cristian Cira cmc0...@tigermail.auburn.edu: Please unsubscribe me from this news feed Thank you Cristian Cira Graduate Research Assistant Parallel Architecture and System Laboratory (PASL) Shelby Center 2105 Auburn University, AL 36849
Re: Hadoop Scalability
Also, you may have to adjust your algorithms. For instance, the conventional standard algorithm for SVD is a Lanczos iterative algorithm. Iteration in Hadoop is death because of job invocation time ... what you wind up with is an algorithm that will handle big data, but with a slow-down factor that makes a single node perform at the same level as 100 Hadoop nodes or more. Scaling with iterative algorithms like this is irrelevant because of the enormous fixed cost. On the other hand, you can switch to some of the recently developed stochastic projection algorithms, which give a non-iterative algorithm that requires 4-7 map-reduce steps (depending on which outputs you need). With these projection algorithms, Hadoop can out-run other techniques even with quite modest cluster sizes and will scale linearly. On Thu, Jan 17, 2013 at 9:47 PM, Stephen Boesch java...@gmail.com wrote: Hi Thiago, Subjectively: there are a number of items to consider to achieve nearly linear scaling:
- the work must be well balanced among the tasks (no skew)
- no skew in the association of tasks to nodes. Note: this skew actually happens by default if the number of tasks is less than the cluster capacity of slots. You will notice that on a cluster with 20 nodes, with each node set to 20 mapper tasks, if you launch a job with 20 maps it may well have all of them running on one node.
- with a higher number of tasks, the risk of having stragglers affecting overall throughput/performance increases unless speculative execution is set properly
- hadoop configuration settings come under more pressure with more tasks and nodes
- properly tuning the number of mappers and reducers to (a) your node and cluster characteristics and (b) the particular tasks has a large impact on performance. In my experience the settings are often set too conservatively / too low to take advantage of the node and cluster resources
So in summary, hadoop itself is capable of nearly linear scaling to low thousands of nodes, but configuring the cluster to really achieve that requires effort. 2013/1/17 Thiago Vieira tpbvie...@gmail.com Hello! It is common to see this sentence: Hadoop Scales Linearly. But is there any performance evaluation to confirm this? In my evaluations, Hadoop processing capacity scales linearly, but not proportionally to the number of nodes; the processing capacity achieved with 20 nodes is not double the processing capacity achieved with 10 nodes. Is there any evaluation about this? Thank you! -- Thiago Vieira
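As a concrete illustration of the speculative-execution and task-count tuning mentioned above, here is a minimal Hadoop 1.x driver sketch; the class name, the reducer count of 18 and the assumed 20-slot cluster are illustrative placeholders, not values taken from the thread.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class ScalingTuningSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Let Hadoop re-run slow task attempts so a few stragglers do not dominate the job tail.
            conf.setBoolean("mapred.map.tasks.speculative.execution", true);
            conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);

            Job job = new Job(conf, "scaling-tuning-sketch");
            // Commonly cited rule of thumb for Hadoop 1.x: aim for slightly fewer reduce tasks
            // than the cluster's total reduce-slot capacity, so one wave of reducers finishes together.
            job.setNumReduceTasks(18); // illustrative value for a hypothetical 20-slot cluster
            // Mapper/reducer classes and input/output paths would be configured here before submission.
        }
    }

With streaming jobs, the same property keys can be passed as -D options on the command line instead.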
Re: how to restrict the concurrent running map tasks?
General is for product announcements and the like. You really should direct your question to mapreduce-user@. I have bcc'ed general. I am not an expert on this, but I looked and it appears that you have to use a special scheduler in the JobTracker to make this happen. org.apache.hadoop.mapred.LimitTasksPerJobTaskScheduler It looks a lot like the fifo scheduler but with a limit on the number of tasks. I am not sure if this is something that will work for you or not. --Bobby On 1/18/13 4:22 AM, hwang joe.haiw...@gmail.com wrote: Hi all: My hadoop version is 1.0.2. Now I want at most 10 map tasks running at the same time. I have found 2 parameters related to this question. a) mapred.job.map.capacity, but in my hadoop version this parameter seems to have been abandoned. b) mapred.jobtracker.taskScheduler.maxRunningTasksPerJob ( http://grepcode.com/file/repo1.maven.org/maven2/com.ning/metrics.collector/1.0.2/mapred-default.xml ) I set this variable like below: Configuration conf = new Configuration(); conf.set("date", date); conf.set("mapred.job.queue.name", "hadoop"); conf.set("mapred.jobtracker.taskScheduler.maxRunningTasksPerJob", "10"); DistributedCache.createSymlink(conf); Job job = new Job(conf, "ConstructApkDownload_" + date); ... The problem is that it doesn't work. There are still more than 50 maps running as the job starts. I'm not sure whether I set this parameter in the wrong way, or misunderstand it. After looking through the hadoop documentation, I can't find another parameter to limit the concurrent running map tasks. Hope someone can help me. Thanks.
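To make Bobby's point concrete, here is a hedged sketch separating what a client can set per job from what would have to live in the JobTracker's own configuration; the class and job names are placeholders, and the JobTracker-side property listing follows the reply above rather than anything additionally verified in this thread.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class MapLimitSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // Ordinary per-job properties are plain quoted key/value strings:
            conf.set("mapred.job.queue.name", "hadoop");

            // The two properties below configure the scheduler *inside the JobTracker*.
            // Per the reply above, setting them from the client has no effect; they would
            // need to be in the JobTracker's own mapred-site.xml before the daemon starts:
            //   mapred.jobtracker.taskScheduler =
            //       org.apache.hadoop.mapred.LimitTasksPerJobTaskScheduler
            //   mapred.jobtracker.taskScheduler.maxRunningTasksPerJob = 10

            Job job = new Job(conf, "map-limit-sketch");
            // Mapper/reducer classes and input/output paths would be configured here.
        }
    }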
Re: Cohesion of Hadoop team?
On Fri, Jan 18, 2013 at 6:48 AM, Glen Mazza gma...@talend.com wrote: Hi, looking at the derivation of the 0.23.x / 2.0.x branches on one hand, and the 1.x branches on the other, as described here: http://mail-archives.apache.org/mod_mbox/hadoop-user/201301.mbox/%3CCD0CAB8B.1098F%25evans%40yahoo-inc.com%3E One gets the impression the Hadoop committers are split into two teams, with one team working on 0.23.x/2.0.2 and another team working on 1.x, running the risk of increasingly diverging products eventually competing with each other. Is that the case? I am not sure how you came to this conclusion. The way I see it is, all the folks are working on trunk. A subset of this work from trunk is pushed to older releases such as 1.x or 0.23.x. In Apache Hadoop, features always go to trunk first before going to any older release such as 1.x or 0.23.x. That means trunk is a superset of all the features. Is there expected to be a Hadoop 3.0 where the results of the two lines of development will merge or is it increasingly likely the subteams will continue their separate routes? 2.0.3-alpha, the latest release based off of trunk and now in the final stage of completion, should have all the features that all the other releases have. Let me know if there are any exceptions to this that you know of. Thanks, Glen -- Glen Mazza Talend Community Coders - coders.talend.com blog: www.jroller.com/gmazza -- http://hortonworks.com/download/
Re: On a lighter note
This…is….hilarious lol Cheers, Chris Mattmann From: Anand Sharma anand2sha...@gmail.com Reply-To: user@hadoop.apache.org user@hadoop.apache.org Date: Thursday, January 17, 2013 7:09 PM To: user@hadoop.apache.org user@hadoop.apache.org Subject: Re: On a lighter note Awesome one Tariq!! On Fri, Jan 18, 2013 at 6:39 AM, Mohammad Tariq donta...@gmail.com wrote: You are right Michael, as always :) Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Fri, Jan 18, 2013 at 6:33 AM, Michael Segel michael_se...@hotmail.com wrote: I'm thinking 'Downfall' But I could be wrong. On Jan 17, 2013, at 6:56 PM, Yongzhi Wang wang.yongzhi2...@gmail.com wrote: Who can tell me what is the name of the original film? Thanks! Yongzhi On Thu, Jan 17, 2013 at 3:05 PM, Mohammad Tariq donta...@gmail.com wrote: I am sure you will suffer from severe stomach ache after watching this :) http://www.youtube.com/watch?v=hEqQMLSXQlY Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com
Re: Cohesion of Hadoop team?
On 01/18/2013 11:58 AM, Suresh Srinivas wrote: On Fri, Jan 18, 2013 at 6:48 AM, Glen Mazza gma...@talend.com wrote: Hi, looking at the derivation of the 0.23.x / 2.0.x branches on one hand, and the 1.x branches on the other, as described here: http://mail-archives.apache.org/mod_mbox/hadoop-user/201301.mbox/%3CCD0CAB8B.1098F%25evans%40yahoo-inc.com%3E One gets the impression the Hadoop committers are split into two teams, with one team working on 0.23.x/2.0.2 and another team working on 1.x, running the risk of increasingly diverging products eventually competing with each other. Is that the case? I am not sure how you came to this conclusion. The way I see it is, all the folks are working on trunk. A subset of this work from trunk is pushed to older releases such as 1.x or 0.23.x. In Apache Hadoop, features always go to trunk first before going to any older release such as 1.x or 0.23.x. That means trunk is a superset of all the features. Is there expected to be a Hadoop 3.0 where the results of the two lines of development will merge or is it increasingly likely the subteams will continue their separate routes? 2.0.3-alpha, the latest release based off of trunk and now in the final stage of completion, should have all the features that all the other releases have. Let me know if there are any exceptions to this that you know of. I had entered a JIRA here: https://issues.apache.org/jira/browse/HADOOP-9206 . The instructions for single-node setup on 1.1.x are radically different from the instructions for 0.23 and 2.0.2; furthermore, the JARs and folder structure of what you get from the 1.1.x download and what you get with either 0.23.x or 2.0.x-alpha are also considerably different. The deltas here, along with Bobby Evans' explanation of the version histories I linked to above, gave me the impression that 1.x has one team working on it while the other branches have another. If that was the case (as you're now clarifying, it's not) I was then wondering when all committers would be more or less on the same page again. Thanks for the clarification. Glen Thanks, Glen -- Glen Mazza Talend Community Coders - coders.talend.com blog: www.jroller.com/gmazza -- http://hortonworks.com/download/ -- Glen Mazza Talend Community Coders - coders.talend.com blog: www.jroller.com/gmazza
RE: On a lighter note
Now if only we really could change the name of secondary namenode... Against the assault of laughter nothing can stand - Mark Twain Original Message Subject: Re: On a lighter note From: Mattmann, Chris A (388J) chris.a.mattm...@jpl.nasa.gov Date: Fri, January 18, 2013 10:46 am To: user@hadoop.apache.org user@hadoop.apache.org This…is….hilarious lol Cheers, Chris Mattmann From: Anand Sharma anand2sha...@gmail.com Reply-To: user@hadoop.apache.org user@hadoop.apache.org Date: Thursday, January 17, 2013 7:09 PM To: user@hadoop.apache.org user@hadoop.apache.org Subject: Re: On a lighter note Awesome one Tariq!! On Fri, Jan 18, 2013 at 6:39 AM, Mohammad Tariq donta...@gmail.com wrote: You are right Michael, as always :) Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Fri, Jan 18, 2013 at 6:33 AM, Michael Segel michael_se...@hotmail.com wrote: I'm thinking 'Downfall' But I could be wrong. On Jan 17, 2013, at 6:56 PM, Yongzhi Wang wang.yongzhi2...@gmail.com wrote: Who can tell me what is the name of the original film? Thanks! Yongzhi On Thu, Jan 17, 2013 at 3:05 PM, Mohammad Tariq donta...@gmail.com wrote: I am sure you will suffer from severe stomach ache after watching this :) http://www.youtube.com/watch?v=hEqQMLSXQlY Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com
Re: On a lighter note
Inspired by this, I would call it the 'Downfall node' ;) Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Sat, Jan 19, 2013 at 12:14 AM, Chris Folsom jcfol...@pureperfect.com wrote: Now if only we really could change the name of secondary namenode... Against the assault of laughter nothing can stand - Mark Twain
config for high memory jobs does not work, please help.
Dear all, I know it is best to use a small amount of mem in mapper and reduce. However, sometimes it is hard to do so. For example, in machine learning algorithms, it is common to load the model into mem in the mapper step. When the model is big, I have to allocate a lot of mem for the mapper. Here is my question: how can I config hadoop so that it does not fork too many mappers and run out of physical memory? My machines have 24G, and I have 100 of them. Each time, hadoop will fork 6 mappers on each machine, no matter what config I use. I really want to reduce it to whatever number I want, for example, just 1 mapper per machine. Here are the configs I tried. (I use streaming, and I pass the config in the command line) -Dmapred.child.java.opts=-Xmx8000m -- did not bring down the number of mappers -Dmapred.cluster.map.memory.mb=32000 -- did not bring down the number of mappers Am I missing something here? I use Hadoop 0.20.205 Thanks a lot in advance! -Shaojun
Re: config for high memory jobs does not work, please help.
Try: -Dmapred.tasktracker.map.tasks.maximum=1 Although I usually put this parameter in mapred-site.xml. Jeff Dear all, I know it is best to use small amount of mem in mapper and reduce. However, sometimes it is hard to do so. For example, in machine learning algorithms, it is common to load the model into mem in the mapper step. When the model is big, I have to allocate a lot of mem for the mapper. Here is my question: how can I config hadoop so that it does not fork too many mappers and run out of physical memory? My machines have 24G, and I have 100 of them. Each time, hadoop will fork 6 mappers on each machine, no matter what config I used. I really want to reduce it to what ever number I want, for example, just 1 mapper per machine. Here are the config I tried. (I use streaming, and I pass the config in the command line) -Dmapred.child.java.opts=-Xmx8000m -- did not bring down the number of mappers -Dmapred.cluster.map.memory.mb=32000 -- did not bring down the number of mappers Am I missing something here? I use Hadoop 0.20.205 Thanks a lot in advance! -Shaojun
RE: On a lighter note
LOL Original Message Subject: Re: On a lighter note From: Mohammad Tariq donta...@gmail.com Date: Fri, January 18, 2013 2:08 pm To: user@hadoop.apache.org user@hadoop.apache.org Inspired by this, I would call it the 'Downfall node' ;) Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Sat, Jan 19, 2013 at 12:14 AM, Chris Folsom jcfol...@pureperfect.com wrote: Now if only we really could change the name of secondary namenode... Against the assault of laughter nothing can stand - Mark Twain
Re: config for high memory jobs does not work, please help.
Take a look at the CapacityScheduler and 'High RAM' jobs, whereby you can run M map slots per node and request, per-job, that you want N slots per task (where 1 <= N <= M). Some more info: http://hadoop.apache.org/docs/stable/capacity_scheduler.html#Resource+based+scheduling http://hortonworks.com/blog/understanding-apache-hadoops-capacity-scheduler/ hth, Arun On Jan 18, 2013, at 12:05 PM, Shaojun Zhao wrote: Dear all, I know it is best to use small amount of mem in mapper and reduce. However, sometimes it is hard to do so. For example, in machine learning algorithms, it is common to load the model into mem in the mapper step. When the model is big, I have to allocate a lot of mem for the mapper. Here is my question: how can I config hadoop so that it does not fork too many mappers and run out of physical memory? My machines have 24G, and I have 100 of them. Each time, hadoop will fork 6 mappers on each machine, no matter what config I used. I really want to reduce it to what ever number I want, for example, just 1 mapper per machine. Here are the config I tried. (I use streaming, and I pass the config in the command line) -Dmapred.child.java.opts=-Xmx8000m -- did not bring down the number of mappers -Dmapred.cluster.map.memory.mb=32000 -- did not bring down the number of mappers Am I missing something here? I use Hadoop 0.20.205 Thanks a lot in advance! -Shaojun -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/
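To illustrate the 'High RAM' request Arun describes, here is a hedged per-job sketch for Hadoop 1.x / 0.20.20x; it only takes effect when the JobTracker runs the CapacityScheduler with memory-based scheduling enabled as described in the links above, and the class name and the 8192 MB figure are placeholders, not recommendations from the thread.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class HighRamJobSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Ask the scheduler for bigger-than-default slots for this job only.
            conf.set("mapred.job.map.memory.mb", "8192");
            conf.set("mapred.job.reduce.memory.mb", "8192");
            // The task JVM heap should fit inside the requested slot memory.
            conf.set("mapred.child.java.opts", "-Xmx7000m");

            Job job = new Job(conf, "high-ram-sketch");
            // Mapper/reducer classes and input/output paths would be configured here.
        }
    }

With streaming, the same keys can be passed on the command line as -D options, the way the -Xmx setting already is in the question above.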
Re: config for high memory jobs does not work, please help.
I do have this in my command line, and it did not work. -Dmapred.tasktracker.map.tasks.maximum=2 I also tried changing mapred-site.xml and restarting the tasktracker, but it did not work either. I am sure it will work if I restart everything, but I really do not want to lose my data on hdfs. So I have not tried restarting everything. Best regards, -Shaojun On Fri, Jan 18, 2013 at 12:23 PM, Jeffrey Buell jbu...@vmware.com wrote: Try: -Dmapred.tasktracker.map.tasks.maximum=1 Although I usually put this parameter in mapred-site.xml. Jeff
Re: Hadoop Scalability
Obviously the algorithm matters, but here are some very old numbers (things today are much better); you do see the 'linear' scaling with both nodes and datasets: http://developer.yahoo.com/blogs/hadoop/posts/2009/05/hadoop_sorts_a_petabyte_in_162/ 100TB Sort - 97 mins 1000 TB Sort - 975 mins Arun On Jan 17, 2013, at 7:09 PM, Thiago Vieira wrote: Hello! It is common to see this sentence: Hadoop Scales Linearly. But is there any performance evaluation to confirm this? In my evaluations, Hadoop processing capacity scales linearly, but not proportionally to the number of nodes; the processing capacity achieved with 20 nodes is not double the processing capacity achieved with 10 nodes. Is there any evaluation about this? Thank you! -- Thiago Vieira -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/
Re: config for high memory jobs does not work, please help.
Not sure about EMR, but if you install your own cluster on EC2 you can use the configs mentioned here: http://hadoop.apache.org/docs/stable/capacity_scheduler.html Arun On Jan 18, 2013, at 2:50 PM, Shaojun Zhao wrote: I am using Amazon EC2/EMR. jps gives this: 16600 JobTracker 2732 RunJar 2504 StatePusher 31902 instance-controller.jar 23553 Jps 22444 RunJar 2077 NameNode I am not sure how I can impose the CapacityScheduler on EC2/EMR machines. -Shaojun -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/
Re: Estimating disk space requirements
It is usually better to not subdivide nodes into virtual nodes. You will generally get better performance from the original node because you only pay for the OS once and because your disk I/O will be scheduled better. If you look at EC2 pricing, however, the spot market often has arbitrage opportunities where one size node is absurdly cheap relative to others. In that case, it pays to scale the individual nodes up or down. The only reasonable reason to split nodes to very small levels is for testing and training. On Fri, Jan 18, 2013 at 2:30 PM, Panshul Whisper ouchwhis...@gmail.com wrote: Thnx for the reply Ted, You can find 40 GB disks when u make virtual nodes on a cloud like Rackspace ;-) About the os partitions I did not exactly understand what you meant. I have made a server on the cloud.. And I just installed and configured hadoop and hbase in the /usr/local folder. And I am pretty sure it does not have a separate partition for root. Please help me explain what u meant and what else precautions should I take. Thanks, Regards, Ouch Whisper 01010101010 On Jan 18, 2013 11:11 PM, Ted Dunning tdunn...@maprtech.com wrote: Where do you find 40gb disks now a days? Normally your performance is going to be better with more space but your network may be your limiting factor for some computations. That could give you some paradoxical scaling. Hbase will rarely show this behavior. Keep in mind you also want to allow for an os partition. Current standard practice is to reserve as much as 100 GB for that partition, but in your case 10gb is better :-) Note that if you account for this, the node counts don't scale as simply. The overhead of these os partitions goes up with the number of nodes. On Jan 18, 2013, at 8:55 AM, Panshul Whisper ouchwhis...@gmail.com wrote: If we look at it with performance in mind, is it better to have 20 Nodes with 40 GB HDD or is it better to have 10 Nodes with 80 GB HDD? They are connected on a gigabit LAN Thnx On Fri, Jan 18, 2013 at 2:26 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: 20 nodes with 40 GB will do the work. After that you will have to consider performance based on your access pattern. But that's another story. JM 2013/1/18, Panshul Whisper ouchwhis...@gmail.com: Thank you for the replies, So I take it that I should have at least 800 GB of total free space on HDFS (combined free space of all the nodes connected to the cluster). So I can connect 20 nodes having 40 GB of hdd on each node to my cluster. Will this be enough for the storage? Please confirm. Thanking You, Regards, Panshul. On Fri, Jan 18, 2013 at 1:36 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Hi Panshul, If you have 20 GB with a replication factor set to 3, you have only 6.6GB available, not 11GB. You need to divide the total space by the replication factor. Also, if you store your JSon into HBase, you need to add the key size to it. If your key is 4 bytes, or 1024 bytes, it makes a difference. So roughly, 24 000 000 * 5 * 1024 = 114GB. You don't have the space to store it. Without including the key size. Even with a replication factor set to 5 you don't have the space. Now, you can add some compression, but even with a lucky factor of 50% you still don't have the space. You will need something like a 90% compression factor to be able to store this data in your cluster. A 1T drive is now less than $100... So you might think about replacing your 20 GB drives with something bigger. To reply to your last question, for your data here, you will need AT LEAST 350GB overall storage.
But that's a bare minimum. Don't go under 500GB. IMHO JM 2013/1/18, Panshul Whisper ouchwhis...@gmail.com: Hello, I was estimating how much disk space I need for my cluster. I have 24 million JSON documents, approx. 5kb each. The JSON is to be stored into HBASE with some identifying data in columns, and I also want to store the JSON for later retrieval based on the Id data as keys in Hbase. I have my HDFS replication set to 3. Each node has Hadoop, hbase and Ubuntu installed on it, so approx 11 GB is available for use on my 20 GB node. I have no idea whether, if I have not enabled Hbase replication, the HDFS replication is enough to keep the data safe and redundant. How much total disk space will I need for the storage of the data? Please help me estimate this. Thank you so much. -- Regards, Ouch Whisper 010101010101
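For anyone following the arithmetic in Jean-Marc's estimate, here is a small self-contained sketch that reproduces the numbers; the document count, document size and replication factor are taken from the thread, and the class name is just a placeholder.

    public class HdfsStorageEstimate {
        public static void main(String[] args) {
            long documents = 24000000L;          // ~24 million JSON documents (from the question)
            long docSizeBytes = 5L * 1024;       // ~5 KB per document
            int replication = 3;                 // dfs.replication from the question

            double rawGb = documents * docSizeBytes / (1024.0 * 1024 * 1024);
            double replicatedGb = rawGb * replication;

            System.out.printf("raw data:         %.1f GB%n", rawGb);         // ~114 GB
            System.out.printf("with 3x replicas: %.1f GB%n", replicatedGb);  // ~343 GB
        }
    }

That ~343 GB figure is where the "AT LEAST 350GB" advice above comes from; it still excludes the HBase key overhead mentioned in the thread, which is why the recommendation is not to go under 500 GB.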
Re: Estimating disk space requirements
Ah, now I understand what you mean. I will be creating 20 individual servers on the cloud, not creating one big server and making several virtual nodes inside it. I will be paying for 20 different nodes, all configured with hadoop and connected to the cluster. Thanx for the intel :) On Fri, Jan 18, 2013 at 11:59 PM, Ted Dunning tdunn...@maprtech.com wrote: It is usually better to not subdivide nodes into virtual nodes. You will generally get better performance from the original node because you only pay for the OS once and because your disk I/O will be scheduled better. If you look at EC2 pricing, however, the spot market often has arbitrage opportunities where one size node is absurdly cheap relative to others. In that case, it pays to scale the individual nodes up or down. The only reasonable reason to split nodes to very small levels is for testing and training.
Re: Spring for hadoop
Yes, we have used Spring Hadoop for reading data from and writing data to HBase. We used the link below for the implementation in our project. http://static.springsource.org/spring-hadoop/docs/current/reference/html/hbase.html Thank you, Jilani On Sat, Jan 19, 2013 at 4:06 AM, Mohammad Tariq donta...@gmail.com wrote: You might find this link http://www.springsource.org/spring-data/hadoop useful. Warm Regards, Tariq https://mtariq.jux.com/ cloudfront.blogspot.com On Sat, Jan 19, 2013 at 4:04 AM, Panshul Whisper ouchwhis...@gmail.com wrote: Hello, I was wondering if anyone is using Spring for Hadoop to execute map reduce jobs or to perform HBase operations on a hadoop cluster using Spring Data for Hadoop. Please suggest a working example, as I am unable to find any working sample and the Spring Data documentation is of no use for beginners. Thanks Regards, Ouch Whisper 01010101010
Re: Spring for hadoop
Hi, Please find below the URL where you will find the sample code for Spring Hadoop: https://github.com/SpringSource/spring-hadoop-samples Thank you, Jilani
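For readers looking for a concrete starting point, here is a hedged sketch of the HbaseTemplate usage style that the reference link above describes. The table name, column family and row key are invented for illustration, and the exact constructor and callback signatures should be checked against the Spring for Apache Hadoop version in use, since they have shifted between releases.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTableInterface;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.springframework.data.hadoop.hbase.HbaseTemplate;
    import org.springframework.data.hadoop.hbase.TableCallback;

    public class SpringHbaseSketch {
        public static void main(String[] args) {
            // Picks up hbase-site.xml from the classpath; in a Spring application this
            // would normally be wired through the hdp:hbase-configuration namespace instead.
            Configuration conf = HBaseConfiguration.create();
            HbaseTemplate template = new HbaseTemplate(conf);

            // Write one JSON document, then read it back, letting the template manage
            // the table handle and connection cleanup.
            template.execute("json_docs", new TableCallback<Object>() {
                public Object doInTable(HTableInterface table) throws Throwable {
                    Put put = new Put(Bytes.toBytes("doc-1"));
                    put.add(Bytes.toBytes("cf"), Bytes.toBytes("json"), Bytes.toBytes("{\"id\":1}"));
                    table.put(put);
                    return null;
                }
            });

            String json = template.execute("json_docs", new TableCallback<String>() {
                public String doInTable(HTableInterface table) throws Throwable {
                    return Bytes.toString(table.get(new Get(Bytes.toBytes("doc-1")))
                            .getValue(Bytes.toBytes("cf"), Bytes.toBytes("json")));
                }
            });
            System.out.println(json);
        }
    }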