Re: log
This basically happens while running a mapreduce job. When a map reduce job is triggered the job files are put in hdfs with high replication ( replication is controlled by - 'mapred.submit.replication' default value is 10). The job files are cleaned up after the job is completed and hence that could be the reason you are seeing the hdfs file system status as healthy after running the job. On Fri, Apr 19, 2013 at 1:04 PM, Mohit Vadhera project.linux.p...@gmail.com wrote: its one (1). Output is below. ...Status: HEALTHY Total size:903709673179 B Total dirs:2906 Total files: 0 Total blocks (validated): 20906 (avg. block size 43227287 B) Minimally replicated blocks: 20906 (100.0 %) Over-replicated blocks:0 (0.0 %) Under-replicated blocks: 248 (1.1862624 %) Mis-replicated blocks: 0 (0.0 %) Default replication factor:1 Average block replication: 1.0 Corrupt blocks:0 Missing replicas: 2232 (9.646469 %) Number of data-nodes: 1 Number of racks: 1 FSCK ended at Fri Apr 19 03:47:04 EDT 2013 in 2224 milliseconds The filesystem under path '/' is HEALTHY On Fri, Apr 19, 2013 at 12:28 PM, S, Manoj mano...@intel.com wrote: It means that some of your data blocks are not replicated as intended. What is the value of “dfs.replication” in your hadoop-site.xml file? ** ** Can you paste the output of ** ** *bin/hadoop fsck / ** ** -- Manoj ** ** *From:* Mohit Vadhera [mailto:project.linux.p...@gmail.com] *Sent:* Friday, April 19, 2013 12:09 PM *To:* user@hadoop.apache.org *Subject:* log ** ** Can anybody let me know the meaning of the below log plz Target Replicas is 10 but found 1 replica(s). ? /var/lib/hadoop-hdfs/cache/mapred/mapred/staging/test_user/.staging/job_201302180313_0623/job.split: Under replicated BP-2091347308-172.20.3.119-1356632249303:blk_6297333561560198850_70720. Target Replicas is 10 but found 1 replica(s). Thanks,
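For a small cluster like the single-DataNode one in this thread, the warning can be avoided by lowering the submit replication; a minimal sketch for mapred-site.xml (the value 1 simply matches the single DataNode here and is not a general recommendation):

    <property>
      <name>mapred.submit.replication</name>
      <value>1</value>
    </property>

Alternatively, since the staging files are cleaned up when the job finishes, the under-replication warning is transient and can usually be ignored.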
Re: How to change secondary namenode location in Hadoop 1.0.4?
Hi Henry You can change the secondary name node storage location by overriding the property 'fs.checkpoint.dir' in your core-site.xml On Wed, Apr 17, 2013 at 2:35 PM, Henry Hung ythu...@winbond.com wrote: Hi All, What is the property name of Hadoop 1.0.4 to change secondary namenode location? Currently the default in my machine is “/tmp/hadoop-hadoop/dfs/namesecondary”, I would like to change it to “/data/namesecondary” Best regards, Henry
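For reference, a sketch of the corresponding core-site.xml entry (the path is taken from Henry's mail; the SecondaryNameNode has to be restarted for the change to take effect):

    <property>
      <name>fs.checkpoint.dir</name>
      <value>/data/namesecondary</value>
    </property>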
Re: Adjusting tasktracker heap size?
Hi Marcos, You need to size the slots based on the available memory: Available Memory = Total RAM - (memory for OS + memory for Hadoop daemons like DN and TT + memory for any other services running on that node). Now consider the typical MR jobs planned for your cluster. Say your tasks need 1 GB of JVM each to run gracefully; then Possible number of slots = Available Memory / JVM size of each task. Now divide the slots between mappers and reducers. On Mon, Apr 15, 2013 at 11:38 PM, Amal G Jose amalg...@gmail.com wrote: It depends on the type of job that is frequently submitted and the RAM size of the machine. Heap size of tasktracker = (mapslots + reduceslots) * jvm size We can adjust this according to our requirement to fine-tune our cluster. This is my thought. On Mon, Apr 15, 2013 at 4:40 PM, MARCOS MEDRADO RUBINELLI marc...@buscapecompany.com wrote: Hi, I am currently tuning a cluster, and I haven't found much information on what factors to consider while adjusting the heap size of tasktrackers. Is it a direct multiple of the number of map+reduce slots? Is there anything else I should consider? Thank you, Marcos
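A worked example of the arithmetic above, with purely illustrative numbers (none of them come from the thread):

    Total RAM                            = 48 GB
    OS + DN + TT daemons + other         ~  8 GB
    Available Memory                     ~ 40 GB
    JVM size of each task (e.g. -Xmx1g)  =  1 GB
    Possible number of slots             = 40 / 1 = 40, split e.g. into 26 map + 14 reduce slots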
Re: Submitting mapreduce and nothing happens
Hi Amit Are you seeing any errors or warnings in the JT logs? Regards Bejoy KS
Re: VM reuse!
Hi Rahul If you look at larger cluster and jobs that involve larger input data sets. The data would be spread across the whole cluster, and a single node might have various blocks of that entire data set. Imagine you have a cluster with 100 map slots and your job has 500 map tasks, now in that case there should be multiple map tasks in a single task tracker based on slot availability. Here if you enable jvm reuse, all tasks related to a job on a single TaskTracker would use the same jvm. The benefit here is just the time you are saving in spawning and cleaning up jvm for individual tasks. On Tue, Apr 16, 2013 at 2:04 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Hi, I have a question related to VM reuse in Hadoop.I now understand the purpose of VM reuse , but I am wondering how is it useful. Example. for VM reuse to be effective or kicked in , we need more than one mapper task to be submitted to a single node (for the same job).Hadoop would consider spawning mappers into nodes which actually contains the data , it might rarely happen that multiple mappers are allocated to a single task tracker. And even if a single task nodes gets to run multiple mappers then it might as well run in parallel in multiple VM rather than sequentially in a single VM. I am sure I am missing some link here , please help me find that. Thanks, Rahul
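JVM reuse is enabled per job; a sketch of the relevant Hadoop 1.x property (settable in mapred-site.xml or on the JobConf), where the default of 1 means no reuse:

    <property>
      <name>mapred.job.reuse.jvm.num.tasks</name>
      <value>-1</value>
    </property>

A value of -1 lets any number of tasks of the same job reuse a JVM on a TaskTracker, while a positive value caps the number of tasks run per JVM.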
Re: HW infrastructure for Hadoop
+1 for Hadoop Operations On Tue, Apr 16, 2013 at 3:57 PM, MARCOS MEDRADO RUBINELLI marc...@buscapecompany.com wrote: Tadas, Hadoop Operations has pretty useful, up-to-date information. The chapter on hardware selection is available here: http://my.safaribooksonline.com/book/databases/hadoop/9781449327279/4dot-planning-a-hadoop-cluster/id2760689 Regards, Marcos Em 16-04-2013 07:13, Tadas Makčinskas escreveu: We are thinking to distribute like 50 node cluster. And trying to figure out what would be a good HW infrastructure (Disks – I/O‘s, RAM, CPUs, network). I cannot actually come around any examples that people ran and found it working well and cost effectively. ** ** If anybody could share their best considered infrastructure. Would be a tremendous help not trying to figure it out on our own. ** ** Regards, Tadas ** ** ** **
Re: VM reuse!
When you process larger data volumes, this is the case mostly. :) Say you have a job with smaller input size and if you have 2 blocks on a single node and then the JT may schedule two tasks on the same TT if there are available free slots. So those tasks can take advantage of JVM reuse. Which TT the JT would assign tasks is totally dependent on data locality and availability of task slots. On Tue, Apr 16, 2013 at 5:03 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Ok, Thanks Bejoy. Only in some typical scenarios it's possible , like the one that you have mentioned. Much more number of mappers and less number of mappers slots. Regards, Rahul On Tue, Apr 16, 2013 at 2:40 PM, Bejoy Ks bejoy.had...@gmail.com wrote: Hi Rahul If you look at larger cluster and jobs that involve larger input data sets. The data would be spread across the whole cluster, and a single node might have various blocks of that entire data set. Imagine you have a cluster with 100 map slots and your job has 500 map tasks, now in that case there should be multiple map tasks in a single task tracker based on slot availability. Here if you enable jvm reuse, all tasks related to a job on a single TaskTracker would use the same jvm. The benefit here is just the time you are saving in spawning and cleaning up jvm for individual tasks. On Tue, Apr 16, 2013 at 2:04 PM, Rahul Bhattacharjee rahul.rec@gmail.com wrote: Hi, I have a question related to VM reuse in Hadoop.I now understand the purpose of VM reuse , but I am wondering how is it useful. Example. for VM reuse to be effective or kicked in , we need more than one mapper task to be submitted to a single node (for the same job).Hadoop would consider spawning mappers into nodes which actually contains the data , it might rarely happen that multiple mappers are allocated to a single task tracker. And even if a single task nodes gets to run multiple mappers then it might as well run in parallel in multiple VM rather than sequentially in a single VM. I am sure I am missing some link here , please help me find that. Thanks, Rahul
Re: guessing number of reducers.
Hi Sasha In general the number of reduce tasks is chosen mainly based on the data volume to reduce phase. In tools like hive and pig by default for every 1GB of map output there will be a reducer. So if you have 100 gigs of map output then 100 reducers. If your tasks are more CPU intensive then you need lesser volume of data per reducer for better performance results. In general it is better to have the number of reduce tasks slightly less than the number of available reduce slots in the cluster. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: jamal sasha jamalsha...@gmail.com Date: Wed, 21 Nov 2012 11:38:38 To: user@hadoop.apache.orguser@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: guessing number of reducers. By default the number of reducers is set to 1.. Is there a good way to guess optimal number of reducers Or let's say i have tbs worth of data... mappers are of order 5000 or so... But ultimately i am calculating , let's say, some average of whole data... say average transaction occurring... Now the output will be just one line in one part... rest of them will be empty.So i am guessing i need loads of reducers but then most of them will be empty but at the same time one reducer won't suffice.. What's the best way to solve this.. How to guess optimal number of reducers.. Thanks
Re: guessing number of reducers.
Hi Manoj If you intend to calculate the number of reducers based on the input size, then in your driver class you should get the size of the input dir in hdfs and say you intended to give n bytes to a reducer then the number of reducers can be computed as Total input size/ bytes per reducer. You can round this value and use it to set the number of reducers in conf programatically. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Manoj Babu manoj...@gmail.com Date: Wed, 21 Nov 2012 23:28:00 To: user@hadoop.apache.org Cc: bejoy.had...@gmail.combejoy.had...@gmail.com Subject: Re: guessing number of reducers. Hi, How to set no of reducers in job conf dynamically? For example some days i am getting 500GB of data on heavy traffic and some days 100GB only. Thanks in advance! Cheers! Manoj. On Wed, Nov 21, 2012 at 11:19 PM, Kartashov, Andy andy.kartas...@mpac.cawrote: Bejoy, I’ve read somethere about keeping number of mapred.reduce.tasks below the reduce task capcity. Here is what I just tested: Output 25Gb. 8DN cluster with 16 Map and Reduce Task Capacity: 1 Reducer – 22mins 4 Reducers – 11.5mins 8 Reducers – 5mins 10 Reducers – 7mins 12 Reducers – 6:5mins 16 Reducers – 5.5mins 8 Reducers have won the race. But Reducers at the max capacity was very clos. J AK47 *From:* Bejoy KS [mailto:bejoy.had...@gmail.com] *Sent:* Wednesday, November 21, 2012 11:51 AM *To:* user@hadoop.apache.org *Subject:* Re: guessing number of reducers. Hi Sasha In general the number of reduce tasks is chosen mainly based on the data volume to reduce phase. In tools like hive and pig by default for every 1GB of map output there will be a reducer. So if you have 100 gigs of map output then 100 reducers. If your tasks are more CPU intensive then you need lesser volume of data per reducer for better performance results. In general it is better to have the number of reduce tasks slightly less than the number of available reduce slots in the cluster. Regards Bejoy KS Sent from handheld, please excuse typos. -- *From: *jamal sasha jamalsha...@gmail.com *Date: *Wed, 21 Nov 2012 11:38:38 -0500 *To: *user@hadoop.apache.orguser@hadoop.apache.org *ReplyTo: *user@hadoop.apache.org *Subject: *guessing number of reducers. By default the number of reducers is set to 1.. Is there a good way to guess optimal number of reducers Or let's say i have tbs worth of data... mappers are of order 5000 or so... But ultimately i am calculating , let's say, some average of whole data... say average transaction occurring... Now the output will be just one line in one part... rest of them will be empty.So i am guessing i need loads of reducers but then most of them will be empty but at the same time one reducer won't suffice.. What's the best way to solve this.. How to guess optimal number of reducers.. Thanks NOTICE: This e-mail message and any attachments are confidential, subject to copyright and may be privileged. Any unauthorized use, copying or disclosure is prohibited. If you are not the intended recipient, please delete and contact the sender immediately. Please consider the environment before printing this e-mail. AVIS : le présent courriel et toute pièce jointe qui l'accompagne sont confidentiels, protégés par le droit d'auteur et peuvent être couverts par le secret professionnel. Toute utilisation, copie ou divulgation non autorisée est interdite. Si vous n'êtes pas le destinataire prévu de ce courriel, supprimez-le et contactez immédiatement l'expéditeur. 
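A minimal sketch of the driver-side calculation described above (the input path and the 1 GB-per-reducer figure are illustrative assumptions):

    // Derive the reducer count from the input size in HDFS
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    long inputBytes = fs.getContentSummary(new Path("/user/data/input")).getLength();
    long bytesPerReducer = 1024L * 1024 * 1024;               // assumed target: ~1 GB per reducer
    int reducers = (int) Math.max(1, inputBytes / bytesPerReducer);
    Job job = new Job(conf, "daily-aggregation");
    job.setNumReduceTasks(reducers);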
Re: fundamental doubt
Hi Jamal It is performed at a frame work level map emits key value pairs and the framework collects and groups all the values corresponding to a key from all the map tasks. Now the reducer takes the input as a key and a collection of values only. The reduce method signature defines it. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: jamal sasha jamalsha...@gmail.com Date: Wed, 21 Nov 2012 14:50:51 To: user@hadoop.apache.orguser@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: fundamental doubt Hi.. I guess i am asking alot of fundamental questions but i thank you guys for taking out time to explain my doubts. So i am able to write map reduce jobs but here is my mydoubt As of now i am writing mappers which emit key and a value This key value is then captured at reducer end and then i process the key and value there. Let's say i want to calculate the average... Key1 value1 Key2 value 2 Key 1 value 3 So the output is something like Key1 average of value 1 and value 3 Key2 average 2 = value 2 Right now in reducer i have to create a dictionary with key as original keys and value is a list. Data = defaultdict(list) == // python usrr But i thought that Mapper takes in the key value pairs and outputs key: ( v1,v2)and Reducer takes in this key and list of values and returns Key , new value.. So why is the input of reducer the simple output of mapper and not the list of all the values to a particular key or did i understood something. Am i making any sense ??
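To make the grouping concrete, a sketch of an averaging reducer in the new (org.apache.hadoop.mapreduce) API; the framework hands each key together with an Iterable over all values emitted for it by the mappers (the types here are illustrative):

    public static class AvgReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context context)
                throws IOException, InterruptedException {
            double sum = 0;
            long count = 0;
            for (DoubleWritable v : values) {   // all values for this key, already grouped by the framework
                sum += v.get();
                count++;
            }
            context.write(key, new DoubleWritable(sum / count));
        }
    }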
Re: Supplying a jar for a map-reduce job
Hi Pankaj AFAIK You can do the same. Just provide the properties like mapper class, reducer class, input format, output format etc using -D option at run time. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Pankaj Gupta pan...@brightroll.com Date: Tue, 20 Nov 2012 20:49:29 To: user@hadoop.apache.orguser@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: Supplying a jar for a map-reduce job Hi, I am running map-reduce jobs on Hadoop 0.23 cluster. Right now I supply the jar to use for running the map-reduce job using the setJarByClass function on org.apache.hadoop.mapreduce.Job. This makes my code depend on a class in the MR job at compile. What I want is to be able to run an MR job without being dependent on it at compile time. It would be great if I could use a jar that contains the Mapper and Reducer classes and just pass it to run the map reduce job. That would make it easy to choose an MR job to run at runtime. Is that possible? Thanks in Advance, Pankaj
Re: Strange error in Hive
Hi Mark I noticed,there is no 'Select' clause seen in 'Insert Overwrite'. I believe your table is using a HiveHbase Storage Handler. Ensure that the required jars are given in hive --auxpath. You'll require the following jars Hive Hbase Handler jar Hbase jar Zookeeper jar Guava jar Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Mark Kerzner mark.kerz...@shmsoft.com Date: Wed, 14 Nov 2012 17:05:20 To: Hadoop Useruser@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: Strange error in Hive Hi, I am trying to insert a table in hive, and I am getting this strange error. Here is what I do insert overwrite table hivetable struct(lpad(ch, 20, ' '),lpad(start, 10, 0),lpad(strand,10,' '),lpad(ref, 3, ' ')), struct(X,mmm,c_count,t_count,mm) from atable; and here is what I get. Any and all ideas are welcome :) Thank you, Mark java.lang.ClassNotFoundException: org/apache/hadoop/hive/hbase/HBaseSerDe Continuing ... java.lang.ClassNotFoundException: org/apache/hadoop/hive/hbase/HiveHBaseTableInputFormat Continuing ... java.lang.ClassNotFoundException: org/apache/hadoop/hive/hbase/HiveHBaseTableOutputFormat Continuing ... java.lang.NullPointerException Continuing ... java.lang.NullPointerException at org.apache.hadoop.hive.ql.exec.FileSinkOperator.initializeOp(FileSinkOperator.java:280) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433) at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389) at org.apache.hadoop.hive.ql.exec.SelectOperator.initializeOp(SelectOperator.java:62) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:433) at org.apache.hadoop.hive.ql.exec.Operator.initializeChildren(Operator.java:389) at org.apache.hadoop.hive.ql.exec.TableScanOperator.initializeOp(TableScanOperator.java:133) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357) at org.apache.hadoop.hive.ql.exec.MapOperator.initializeOp(MapOperator.java:444) at org.apache.hadoop.hive.ql.exec.Operator.initialize(Operator.java:357) at org.apache.hadoop.hive.ql.exec.ExecMapper.configure(ExecMapper.java:98) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:616) at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:616) at org.apache.hadoop.util.ReflectionUtils.setJobConf(ReflectionUtils.java:88) at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:64) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:117) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:387) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325) at 
org.apache.hadoop.mapred.Child$4.run(Child.java:266) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:416) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278) at org.apache.hadoop.mapred.Child.main(Child.java:260)
Re: Setting up a edge node to submit jobs
Hi Manoj For an edge node, you need to include the hadoop jars and configuration files in that box like any other node(Use the same version your cluster has). But no need to start any hadoop daemons. You need to ensure that this node is able to connect with all machines in the cluster. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Manoj Babu manoj...@gmail.com Date: Thu, 15 Nov 2012 10:03:24 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: Setting up a edge node to submit jobs Hi, How to setup a edge node for a hadoop cluster to submit jobs? Thanks in advance! Cheers! Manoj.
Re: Map-Reduce V/S Hadoop Ecosystem
Hi Yogesh, The development time in Pig and hive are pretty less compared to its equivalent mapreduce code and for generic cases it is very efficient. If your requirement is that complex and you need very low level control of your code mapreduce is better. If you are an expert in mapreduce your code can be efficient as yours would very specific to your app but the MR in hive and pig may be more generic. To just write your custom mapreduce functions, just basic knowledge on java is good. As you are better with java you can understand the internals better. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: yogesh.kuma...@wipro.com Date: Wed, 7 Nov 2012 15:33:07 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: Map-Reduce V/S Hadoop Ecosystem Hello Hadoop Champs, Please give some suggestion.. As Hadoop Ecosystem(Hive, Pig...) internally do Map-Reduce to process. My Question is 1). where Map-Reduce program(written in Java, python etc) are overtaking Hadoop Ecosystem. 2). Limitations of Hadoop Ecosystem comparing with Writing Map-Reduce program. 3) for writing Map-Reduce jobs in java how much we need to have skills in java out of 10 (?/10) Please put some light over it. Thanks Regards Yogesh Kumar The information contained in this electronic message and any attachments to this message are intended for the exclusive use of the addressee(s) and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you should not disseminate, distribute or copy this e-mail. Please notify the sender immediately and destroy all copies of this message and any attachments. WARNING: Computer viruses can be transmitted via email. The recipient should check this email and any attachments for the presence of viruses. The company accepts no liability for any damage caused by any virus transmitted by this email. www.wipro.com
Re: Data locality of map-side join
Hi Sigurd Mapside joins are efficiently implemented in Hive and Pig. I'm talking in terms of how mapside joins are implemented in hive. In map side join, the smaller data set is first loaded into DistributedCache. The larger dataset is streamed as usual and the smaller dataset in memory. For every record in larger data set the look up is made in memory on the smaller set and there by joins are done. In later versions of hive the hive framework itself intelligently determines the smaller data set. In older versions you can specify the smaller data set using some hints in query. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Sigurd Spieckermann sigurd.spieckerm...@gmail.com Date: Mon, 22 Oct 2012 22:29:15 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: Data locality of map-side join Hi guys, I've been trying to figure out whether a map-side join using the join-package does anything clever regarding data locality with respect to at least one of the partitions to join. To be more specific, if I want to join two datasets and some partition of dataset A is larger than the corresponding partition of dataset B, does Hadoop account for this and try to ensure that the map task is executed on the datanode storing the bigger partition thus reducing data transfer (if the other partition does not happen to be located on that same datanode)? I couldn't conclude the one or the other behavior from the source code and I couldn't find any documentation about this detail. Thanks for clarifying! Sigurd
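A hand-rolled sketch of the same idea outside Hive (the tab-delimited layout, the join on the first column, and the cache file are all assumptions): the small data set is loaded into memory in setup() from the DistributedCache, and every streamed record of the large data set is looked up against it.

    // imports: org.apache.hadoop.mapreduce.Mapper, org.apache.hadoop.io.*,
    //          org.apache.hadoop.fs.Path, org.apache.hadoop.filecache.DistributedCache,
    //          java.io.*, java.util.*
    public static class MapJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Map<String, String> smallSet = new HashMap<String, String>();

        @Override
        protected void setup(Context context) throws IOException {
            // Small data set was added earlier with DistributedCache.addCacheFile(...)
            Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
            BufferedReader reader = new BufferedReader(new FileReader(cached[0].toString()));
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("\t");          // assumed layout: key \t value
                smallSet.put(parts[0], parts[1]);
            }
            reader.close();
        }

        @Override
        protected void map(LongWritable offset, Text record, Context context)
                throws IOException, InterruptedException {
            String[] parts = record.toString().split("\t"); // larger data set streamed as usual
            String match = smallSet.get(parts[0]);
            if (match != null) {                            // inner join on the first column
                context.write(new Text(parts[0]), new Text(parts[1] + "\t" + match));
            }
        }
    }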
Re: Old vs New API
Hi alberto The new mapreduce API is coming to shape now. The majority of the classes available in old API has been ported to new API as well. The Old mapred API was marked depreciated in an earlier version of hadoop (0.20.x) but later it was un-depreciated as all the functionality in old API was not available in new mapreduce API at that point. Now mapreduce API is pretty good and you can go ahead with that for development. AFAIK mapreduce API is the future. Let's wait for a commiter to officially comment on this. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Alberto Cordioli cordioli.albe...@gmail.com Date: Mon, 22 Oct 2012 15:22:41 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: Old vs New API Hi all, I am using last stable Hadoop version (1.0.3) and I am implementing right now my first MR jobs. I read about the presence of 2 API: the old and the new one. I read some stuff about them, but I am not able to find quite fresh news. I read that the old api was deprecated, but in my version they do not seem to. Moreover the new api does not have all the features implemented (see for example the package contrib with its classes to do joins). I found this post on the ML: http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201002.mbox/%3ca6906bde1002230730s24d6092av1e57b46bad806...@mail.gmail.com%3E but it is very old (2010) and I think that further changes have been made meanwhile. My question is: does make sense to use the new api, instead of the old one? Does this new version providing other functionalities with respect to the older one? Or, given the slow progress in implementation, is better to use the old api? Thanks.
Re: extracting lzo compressed files
Hi Manoj You can get the file in a readable format using hadoop fs -text fileName Provided you have lzo codec within the property 'io.compression.codecs' in core-site.xml A 'hadoop fs -ls' command would itself display the file size. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Manoj Babu manoj...@gmail.com Date: Sun, 21 Oct 2012 13:10:55 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: extracting lzo compressed files Hi, Is there any option to extract the lzo compressed file in HDFS from command line and any option to find the original size of the compressed file. Thanks in Advance! Cheers! Manoj.
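For reference, a sketch of the codec list in core-site.xml; the two LZO codec class names come from the separately installed hadoop-lzo library, so treat them as an assumption about that setup:

    <property>
      <name>io.compression.codecs</name>
      <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec</value>
    </property>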
Re: Hadoop counter
Hi Jay Counters are reported at the end of a task to JT. So if a task fails the counters from that task are not send to JT and hence won't be included in the final value of counters from that Job. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Jay Vyas jayunit...@gmail.com Date: Fri, 19 Oct 2012 10:18:42 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: Re: Hadoop counter Ah this answers alot about why some of my dynamic counters never show up and i have to bite my nails waiting to see whats going on until the end of the job- thanks. Another question: what happens if a task fails ? What happen to the counters for it ? Do they dissappear into the ether? Or do they get merged in with the counters from other tasks? On Fri, Oct 19, 2012 at 9:50 AM, Bertrand Dechoux decho...@gmail.comwrote: And by default the number of counters is limited to 120 with the mapreduce.job.counters.limit property. They are useful for displaying short statistics about a job but should not be used for results (imho). I know people may misuse them but I haven't tried so I wouldn't be able to list the caveats. Regards Bertrand On Fri, Oct 19, 2012 at 4:35 PM, Michael Segel michael_se...@hotmail.comwrote: As I understand it... each Task has its own counters and are independently updated. As they report back to the JT, they update the counter(s)' status. The JT then will aggregate them. In terms of performance, Counters take up some memory in the JT so while its OK to use them, if you abuse them, you can run in to issues. As to limits... I guess that will depend on the amount of memory on the JT machine, the size of the cluster (Number of TT) and the number of counters. In terms of global accessibility... Maybe. The reason I say maybe is that I'm not sure by what you mean by globally accessible. If a task creates and implements a dynamic counter... I know that it will eventually be reflected in the JT. However, I do not believe that a separate Task could connect with the JT and see if the counter exists or if it could get a value or even an accurate value since the updates are asynchronous. Not to mention that I don't believe that the counters are aggregated until the job ends. It would make sense that the JT maintains a unique counter for each task until the tasks complete. (If a task fails, it would have to delete the counters so that when the task is restarted the correct count is maintained. ) Note, I haven't looked at the source code so I am probably wrong. HTH Mike On Oct 19, 2012, at 5:50 AM, Lin Ma lin...@gmail.com wrote: Hi guys, I have some quick questions regarding to Hadoop counter, - Hadoop counter (customer defined) is global accessible (for both read and write) for all Mappers and Reducers in a job? - What is the performance and best practices of using Hadoop counters? I am not sure if using Hadoop counters too heavy, there will be performance downgrade to the whole job? regards, Lin -- Bertrand Dechoux -- Jay Vyas http://jayunit100.blogspot.com
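For reference, a minimal sketch of a dynamic counter of the kind Jay mentions (the group and counter names are made up):

    // Inside a map() or reduce() method, new API:
    context.getCounter("MyApp", "BAD_RECORDS").increment(1);

    // In the driver, after job.waitForCompletion(true):
    long bad = job.getCounters().findCounter("MyApp", "BAD_RECORDS").getValue();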
Re: Hadoop installation on mac
Hi Suneel You can get the latest stable versions of Hadoop from the following url http://hadoop.apache.org/releases.html#Download To download, choose a mirror and select the stable version (the ones Harsh suggested) you would like to go for. (The 1.0.x releases are the current stable versions.) Regards Bejoy KS
Re: document on hdfs
Hi Murthy Hadoop - The definitive Guide by Tom White has the details on file write anatomy. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: murthy nvvs murthy_n1...@yahoo.com Date: Wed, 10 Oct 2012 04:27:58 To: user@hadoop.apache.orguser@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: document on hdfs Hi All, Iam new to Hadoop, i just want to know the writing of files into datanodes in depth. means the file is divided into blocks again the blocks are divided into packets. i need some detailed doc abt the packets movement by using Datapackets Acknowledge packets. Thanks Regards, Murthy
Re: stable release of hadoop
Hi Nisha The current stable versions are the 1.0.x releases. These are well suited for production environments. The 0.23.x/2.x.x releases are of alpha quality and hence not recommended for production. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: nisha nishakulkarn...@gmail.com Date: Tue, 9 Oct 2012 17:09:52 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: stable release of hadoop 17 September, 2012: Release 0.23.3 is this release a stable one and can it be used in production...
Re: What is the difference between Rack-local map tasks and Data-local map tasks?
Definitely, If data local map tasks are more the performance will be improved much. Ideally if data is uniformly distributed across DNs and if you have enough number of map task slots on colocated TTs then most of your map tasks should be Data Local. You may have just a few non data local map tasks when the number of input splits/map tasks are large which is quite common. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: centerqi hu cente...@gmail.com Date: Sun, 7 Oct 2012 23:28:55 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: Re: What is the difference between Rack-local map tasks and Data-local map tasks? Very good explanation, If there is a way to reduce Rack-local map tasks but can increase the Data-local map tasks , Whether to increase performance? 2012/10/7 Michael Segel michael_se...@hotmail.com Rack local means that while the data isn't local to the node running the task, it is still on the same rack. (Its meaningless unless you've set up rack awareness because all of the machines are on the default rack. ) Data local means that the task is running local to the machine that contains the actual data. HTH -Mike On Oct 7, 2012, at 8:56 AM, centerqi hu cente...@gmail.com wrote: hi all When I run hadoop job -status xxx,Output the following some list. Rack-local map tasks=124 Data-local map tasks=6 What is the difference between Rack-local map tasks and Data-local map tasks? -- cente...@gmail.com|Sam -- cente...@gmail.com|齐忠
Re: Multiple Aggregate functions in map reduce program
Hi It is definitely possible. In your map make the dept name as the output key and salary as the value. In the reducer for every key you can initialize a counter and a sum. Add on to the sum for all values and increment the counter by 1 for each value. Output the dept key and the new aggregated sum and count for each key. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: iwannaplay games funnlearnfork...@gmail.com Date: Fri, 5 Oct 2012 12:32:28 To: useru...@hbase.apache.org; u...@hadoop.apache.org; hdfs-userhdfs-user@hadoop.apache.org Reply-To: u...@hadoop.apache.org Subject: Multiple Aggregate functions in map reduce program Hi All, I have to get the count and sum of data for eg if my table is *employeename salary department* A 1000 testing B 2000 testing C 3000 development D 4000 testing E 1000 development F 5000 management I want result like Department TotalSalary count(employees) testing7000 3 development 4000 2 management 5000 1 Please let me know whether it is possible to write a java map reduce for this.I tried this on hive.It takes time for big data.I heard map reduce java code will b faster.IS it true???Or i should go for pig programming?? Please guide.. Regards Prabhjot
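A sketch of the map and reduce described above (the tab delimiter and the column order employeename, salary, department are assumptions about the input file):

    // imports: org.apache.hadoop.mapreduce.{Mapper, Reducer}, org.apache.hadoop.io.*, java.io.IOException
    public static class DeptMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // input line: employeename <tab> salary <tab> department
            String[] f = value.toString().split("\t");
            context.write(new Text(f[2]), new LongWritable(Long.parseLong(f[1])));
        }
    }

    public static class DeptReducer extends Reducer<Text, LongWritable, Text, Text> {
        @Override
        protected void reduce(Text dept, Iterable<LongWritable> salaries, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            long count = 0;
            for (LongWritable s : salaries) {
                sum += s.get();
                count++;
            }
            context.write(dept, new Text(sum + "\t" + count));   // TotalSalary, count(employees)
        }
    }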
Re: hadoop memory settings
Hi Sadak AFAIK HADOOP_HEAPSIZE determines the jvm size of the daemons like NN,JT,TT,DN etc. mapred.child.java.opts and mapred.child.ulimit is used to set the jvm heap for child jvms launched for each map/reduce task launched. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Visioner Sadak visioner.sa...@gmail.com Date: Fri, 5 Oct 2012 13:47:24 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: Re: hadoop memory settings coz i m getting Error occurred during initialization of VM hadoop java.lang.Throwable: Child Error At org.apache.hadoop.mapred.TaskRunner.run whe running a job.:) On Fri, Oct 5, 2012 at 1:39 PM, Visioner Sadak visioner.sa...@gmail.comwrote: Is ther a relation between HADOOP_HEAPSIZE mapred.child.java.opts and mapred.child.ulimit settings in hadoop-env.sh and mapred-site.xml i have a sinngle machine with 2gb ram and running hadoop on psuedo distr mode my HADOOP_HEAPSIZE is set to 256 wat shud i set mapred.child.java.opts and mapred.child.ulimit and how these settings are calculated if my ram is incresed or machine clusters are increased
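For a 2 GB machine in pseudo-distributed mode, something along these lines in mapred-site.xml keeps the per-task child JVMs small (-Xmx200m is only an illustration, and also happens to be the Hadoop 1.x default):

    <property>
      <name>mapred.child.java.opts</name>
      <value>-Xmx200m</value>
    </property>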
Re: copyFromLocal
Hi Sadak If you are issuing copyFromLocal from a client/edge node you can copy the files available in the client's lfs to hdfs in cluster. The client/edge node could be a box that has all the hadoop jars and config files exactly same as that of the cluster and the cluster nodes should be accessible from this client. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Kartashov, Andy andy.kartas...@mpac.ca Date: Thu, 4 Oct 2012 16:51:35 To: user@hadoop.apache.orguser@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: RE: copyFromLocal I use -put -get commands to bring files in/our of HDFS from/to my home directory on EC2. Then use WinSCP to download files to my laptop. Andy Kartashov MPAC Architecture RD, Co-op 1340 Pickering Parkway, Pickering, L1V 0C4 * Phone : (905) 837 6269 * Mobile: (416) 722 1787 andy.kartas...@mpac.camailto:andy.kartas...@mpac.ca From: Visioner Sadak [mailto:visioner.sa...@gmail.com] Sent: Thursday, October 04, 2012 11:53 AM To: user@hadoop.apache.org Subject: copyFromLocal guys i have hadoop installled in a remote box ... does copyFromLocal method copies data from tht local box only wht if i have to copy data from uses desktop pc(for example E drive) thru my my web application will i have to first copy data to tht remote box using some java code then use copyFromLocal method to copy in to hadoop
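The same copy can also be done programmatically from the client/edge node through the FileSystem API; a sketch with made-up paths (the client's core-site.xml must point at the cluster's NameNode):

    Configuration conf = new Configuration();                  // picks up the cluster configs on the edge node
    FileSystem fs = FileSystem.get(conf);
    fs.copyFromLocalFile(new Path("/home/user/data.log"),      // local file on the client
                         new Path("/user/hadoop/data.log"));   // destination in HDFS
    fs.close();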
Re: How to lower the total number of map tasks
Hi You need to alter the value of mapred.max.split size to a value larger than your block size to have less number of map tasks than the default. On Tue, Oct 2, 2012 at 10:04 PM, Shing Hing Man mat...@yahoo.com wrote: I am running Hadoop 1.0.3 in Pseudo distributed mode. When I submit a map/reduce job to process a file of size about 16 GB, in job.xml, I have the following mapred.map.tasks =242 mapred.min.split.size =0 dfs.block.size = 67108864 I would like to reduce mapred.map.tasks to see if it improves performance. I have tried doubling the size of dfs.block.size. But themapred.map.tasks remains unchanged. Is there a way to reduce mapred.map.tasks ? Thanks in advance for any assistance ! Shing
Re: How to lower the total number of map tasks
Sorry for the typo, the property name is mapred.max.split.size Also just for changing the number of map tasks you don't need to modify the hdfs block size. On Tue, Oct 2, 2012 at 10:31 PM, Bejoy Ks bejoy.had...@gmail.com wrote: Hi You need to alter the value of mapred.max.split size to a value larger than your block size to have less number of map tasks than the default. On Tue, Oct 2, 2012 at 10:04 PM, Shing Hing Man mat...@yahoo.com wrote: I am running Hadoop 1.0.3 in Pseudo distributed mode. When I submit a map/reduce job to process a file of size about 16 GB, in job.xml, I have the following mapred.map.tasks =242 mapred.min.split.size =0 dfs.block.size = 67108864 I would like to reduce mapred.map.tasks to see if it improves performance. I have tried doubling the size of dfs.block.size. But themapred.map.tasks remains unchanged. Is there a way to reduce mapred.map.tasks ? Thanks in advance for any assistance ! Shing
Re: How to lower the total number of map tasks
Hi Shing Is your input a single file or set of small files? If latter you need to use CombineFileInputFormat. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Shing Hing Man mat...@yahoo.com Date: Tue, 2 Oct 2012 10:38:59 To: user@hadoop.apache.orguser@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: Re: How to lower the total number of map tasks I have tried Configuration.setInt(mapred.max.split.size,134217728); and setting mapred.max.split.size in mapred-site.xml. ( dfs.block.size is left unchanged at 67108864). But in the job.xml, I am still getting mapred.map.tasks =242 . Shing From: Bejoy Ks bejoy.had...@gmail.com To: user@hadoop.apache.org; Shing Hing Man mat...@yahoo.com Sent: Tuesday, October 2, 2012 6:03 PM Subject: Re: How to lower the total number of map tasks Sorry for the typo, the property name is mapred.max.split.size Also just for changing the number of map tasks you don't need to modify the hdfs block size. On Tue, Oct 2, 2012 at 10:31 PM, Bejoy Ks bejoy.had...@gmail.com wrote: Hi You need to alter the value of mapred.max.split size to a value larger than your block size to have less number of map tasks than the default. On Tue, Oct 2, 2012 at 10:04 PM, Shing Hing Man mat...@yahoo.com wrote: I am running Hadoop 1.0.3 in Pseudo distributed mode. When I submit a map/reduce job to process a file of size about 16 GB, in job.xml, I have the following mapred.map.tasks =242 mapred.min.split.size =0 dfs.block.size = 67108864 I would like to reduce mapred.map.tasks to see if it improves performance. I have tried doubling the size of dfs.block.size. But themapred.map.tasks remains unchanged. Is there a way to reduce mapred.map.tasks ? Thanks in advance for any assistance ! Shing
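One likely reason the change did not take effect: with FileInputFormat the split size is computed as max(minSplitSize, min(maxSplitSize, blockSize)), so raising only the maximum leaves splits at the block size; to get splits larger than the 64 MB block it is the minimum split size that has to be raised. A sketch of the driver-side setting, offered as a hedged reading of the 1.0.3 behaviour rather than a definitive fix:

    Configuration conf = new Configuration();
    // splitSize = max(minSize, min(maxSize, blockSize)) in FileInputFormat,
    // so a 128 MB split on a 64 MB block needs the *min* split size raised:
    conf.setLong("mapred.min.split.size", 134217728L);
    Job job = new Job(conf, "fewer-maps");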
Re: Add file to distributed cache
Hi Abhishek You can find a simple example of using Distributed Cache here: http://kickstarthadoop.blogspot.co.uk/2011/05/hadoop-for-dependent-data-splits-using.html --Original Message-- From: Abhishek To: common-user@hadoop.apache.org ReplyTo: common-user@hadoop.apache.org Subject: Add file to distributed cache Sent: Oct 2, 2012 05:44 Hi all How do you add a small file to distributed cache in MR program Regards Abhi Sent from my iPhone Regards Bejoy KS Sent from handheld, please excuse typos.
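A minimal sketch of the two sides of the Distributed Cache call (the HDFS path is made up):

    // Driver, before submitting the job:
    DistributedCache.addCacheFile(new URI("/user/abhi/lookup.txt"), job.getConfiguration());

    // Mapper/Reducer setup(): the file is now available locally on every task node
    Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());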
Re: File block size use
Hi Anna If you want to increase the block size of existing files. You can use a Identity Mapper with no reducer. Set the min and max split sizes to your requirement (512Mb). Use SequenceFileInputFormat and SequenceFileOutputFormat for your job. Your job should be done. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Chris Nauroth cnaur...@hortonworks.com Date: Mon, 1 Oct 2012 21:12:58 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: Re: File block size use Hello Anna, If I understand correctly, you have a set of multiple sequence files, each much smaller than the desired block size, and you want to concatenate them into a set of fewer files, each one more closely aligned to your desired block size. Presumably, the goal is to improve throughput of map reduce jobs using those files as input by running fewer map tasks, reading a larger number of input records. Whenever I've had this kind of requirement, I've run a custom map reduce job to implement the file consolidation. In my case, I was typically working with TextInputFormat (not sequence files). I used IdentityMapper and a custom reducer that passed through all values but with key set to NullWritable, because the keys (input file offsets in the case of TextInputFormat) were not valuable data. For my input data, this was sufficient to achieve fairly even distribution of data across the reducer tasks, and I could reasonably predict the input data set size, so I could reasonably set the number of reducers and get decent results. (This may or may not be true for your data set though.) A weakness of this approach is that the keys must pass from the map tasks to the reduce tasks, only to get discarded before writing the final output. Also, the distribution of input records to reduce tasks is not truly random, and therefore the reduce output files may be uneven in size. This could be solved by writing NullWritable keys out of the map task instead of the reduce task and writing a custom implementation of Partitioner to distribute them randomly. To expand on this idea, it could be possible to inspect the FileStatus of each input, sum the values of FileStatus.getLen(), and then use that information to make a decision about how many reducers to run (and therefore approximately set a target output file size). I'm not aware of any built-in or external utilities that do this for you though. Hope this helps, --Chris On Mon, Oct 1, 2012 at 11:30 AM, Anna Lahoud annalah...@gmail.com wrote: I would like to be able to resize a set of inputs, already in SequenceFile format, to be larger. I have tried 'hadoop distcp -Ddfs.block.size=$[64*1024*1024]' and did not get what I expected. The outputs were exactly the same as the inputs. I also tried running a job with an IdentityMapper and IdentityReducer. Although that approaches a better solution, it still requires that I know in advance how many reducers I need to get better file sizes. I was looking at the SequenceFile.Writer constructors and noticed that there are block size parameters that can be used. Using a writer constructed with a 512MB block size, there is nothing that splits the output and I simply get a single file the size of my inputs. What is the current standard for combining sequence files to create larger files for map-reduce jobs? I have seen code that tracks what it writes into the file, but that seems like the long version. I am hoping there is a shorter path. Thank you. Anna
Re: Programming Question / Joining Dataset
Hi Oliver I have scribbled a small post on reduce-side joins; the implementation matches your requirement: http://kickstarthadoop.blogspot.in/2011/09/joins-with-plain-map-reduce.html Regards Bejoy KS
Re: Unit tests for Map and Reduce functions.
Hi Ravi You can take a look at mockito http://books.google.co.in/books?id=Nff49D7vnJcCpg=PA138lpg=PA138dq=mockito+%2B+hadoopsource=blots=IifyVu7yXpsig=Q1LoxqAKO0nqRquus8jOW5CBiWYhl=ensa=Xei=b2pjULHSOIPJrAeGsIHwAgved=0CC0Q6AEwAg#v=onepageq=mockito%20%2B%20hadoopf=false On Thu, Sep 27, 2012 at 2:09 AM, Kai Voigt k...@123.org wrote: I don't know any other unit testing framework. Kai Am 26.09.2012 um 22:37 schrieb Ravi P hadoo...@outlook.com: Thanks Kai, I am exploring MRunit . Are there any other options/ways to write unit tests for Map and Reduce functions. Would like to evaluate all options. - Ravi -- From: hadoo...@outlook.com To: k...@123.org Subject: RE: Unit tests for Map and Reduce functions. Date: Wed, 26 Sep 2012 13:35:57 -0700 Thanks Kai, Which MRUnit jar I should use for Hadoop 0.20 ? https://repository.apache.org/content/repositories/releases/org/apache/mrunit/mrunit/0.9.0-incubating/ - Ravi -- From: k...@123.org Subject: Re: Unit tests for Map and Reduce functions. Date: Wed, 26 Sep 2012 22:21:06 +0200 To: user@hadoop.apache.org Hello, yes, http://mrunit.apache.org is your reference. MRUnit is a framework on top of JUnit which emulates the mapreduce framework to test your mappers and reducers. Kai Am 26.09.2012 um 22:18 schrieb Ravi P hadoo...@outlook.com: Is it possible to write unit test for mapper Map , and reducer Reduce function ? - Ravi -- Kai Voigt k...@123.org -- Kai Voigt k...@123.org
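For completeness, a sketch of what an MRUnit test of a single mapper looks like with the 0.9.0 new-API drivers (MyMapper and the records are made up):

    MapDriver<LongWritable, Text, Text, IntWritable> driver =
        MapDriver.newMapDriver(new MyMapper());              // org.apache.mrunit.mapreduce.MapDriver
    driver.withInput(new LongWritable(0), new Text("some input line"))
          .withOutput(new Text("some"), new IntWritable(1))
          .runTest();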
Re: Help on a Simple program
Hi If you don't want either key or value in the output, just make the corresponding data types as NullWritable. Since you just need to filter out a few records/itemd from your logs, reduce phase is not mandatory just a mappper would suffice your needs. From your mapper just output the records that match your criteria. Also set number of reduce tasks to zero in your driver class to completely avoid the reduce phase. A sample code would look like public static class Map extends MapperLongWritable, Text, Text, NullWritable { private final static IntWritable one = new IntWritable(1); public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { if(-1 != meetConditions(value)) { context.write(value, NullWritable.*get*()); } } } Om your driver class *job.setNumReduceTasks(0);* * * *Alternatively you can specify this st runtime as* hadoop jar xyz.jar com.*.*.* –D mapred.reduce.tasks=0 input/ output/ On Tue, Sep 25, 2012 at 11:38 PM, Matthieu Labour matth...@actionx.comwrote: Hi I am completely new to Hadoop and I am trying to address the following simple application. I apologize if this sounds trivial. I have multiple log files I need to read the log files and collect the entries that meet some conditions and write them back to files for further processing. ( On other words, I need to filter out some events) I am using the WordCount example to get going. public static class Map extends MapperLongWritable, Text, Text, IntWritable { private final static IntWritable one = new IntWritable(1); public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { if(-1 != meetConditions(value)) { context.write(value, one); } } } public static class Reduce extends ReducerText, IntWritable, Text, IntWritable { public void reduce(Text key, IterableIntWritable values, Context context) throws IOException, InterruptedException { context.write(key, new IntWritable(1)); } } The problem is that it prints the value 1 after each entry. Hence my question. What is the best trivial implementation of the map and reduce function to address the use case above ? Thank you greatly for your help
Re: Detect when file is not being written by another process
Hi Peter AFAIK oozie has a mechanism to achieve this. You can trigger your jobs as soon as the files are written to a certain hdfs directory. On Tue, Sep 25, 2012 at 10:23 PM, Peter Sheridan psheri...@millennialmedia.com wrote: These are log files being deposited by other processes, which we may not have control over. We don't want multiple processes to write to the same files — we just don't want to start our jobs until they have been completely written. Sorry for lack of clarity thanks for the response. --Pete From: Bertrand Dechoux decho...@gmail.com Reply-To: user@hadoop.apache.org user@hadoop.apache.org Date: Tuesday, September 25, 2012 12:33 PM To: user@hadoop.apache.org user@hadoop.apache.org Subject: Re: Detect when file is not being written by another process Hi, Multiple files and aggregation or something like hbase? Could you tell use more about your context? What are the volumes? Why do you want multiple processes to write to the same file? Regards Bertrand On Tue, Sep 25, 2012 at 6:28 PM, Peter Sheridan psheri...@millennialmedia.com wrote: Hi all. We're using Hadoop 1.0.3. We need to pick up a set of large (4+GB) files when they've finished being written to HDFS by a different process. There doesn't appear to be an API specifically for this. We had discovered through experimentation that the FileSystem.append() method can be used for this purpose — it will fail if another process is writing to the file. However: when running this on a multi-node cluster, using that API actually corrupts the file. Perhaps this is a known issue? Looking at the bug tracker I see https://issues.apache.org/jira/browse/HDFS-265 and a bunch of similar-sounding things. What's the right way to solve this problem? Thanks. --Pete -- Bertrand Dechoux
Re: Job failed with large volume of small data: java.io.EOFException
Hi Jason Are you seeing any errors in your data node logs? Specifically something like 'xceivers count exceeded'. In that case you may need to bump up the value of dfs.datanode.max.xcievers to a higher value. If not, it is possible that you are crossing the upper limit of open files on the Linux boxes that run the DNs. You can verify the current value using 'ulimit -n' and then try increasing it to a much higher value. Regards Bejoy KS
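The property lives in hdfs-site.xml on the DataNodes and needs a DataNode restart; 4096 is shown purely as an illustrative value:

    <property>
      <name>dfs.datanode.max.xcievers</name>
      <value>4096</value>
    </property>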
Re: How to make the hive external table read from subdirectories
Hi Nataraj Once you have created a partitioned table you need to add the partitions, only then the data in sub dirs will be visible to hive. After creating the table you need to execute a command like below ALTER TABLE some_table ADD PARTITION (year='2012', month='09', dayofmonth='11') LOCATION '/user/myuser/MapReduceOutput/2012/09/11'; Like this you need to register each of the paritions. After this your query should work as desired. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Nataraj Rashmi - rnatar rashmi.nata...@acxiom.com Date: Thu, 13 Sep 2012 03:04:52 To: user@hadoop.apache.orguser@hadoop.apache.org; bejoy.had...@gmail.combejoy.had...@gmail.com Subject: RE: How to make the hive external table read from subdirectories Thanks for your response. Can someone see if this is ok? I am not getting any records when I query the hive table when I use Partitions. This is how I am creating the table. CREATE EXTERNAL TABLE Data (field1 STRING,field2) PARTITIONED BY(year STRING, month STRING, dayofmonth STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n' STORED AS TEXTFILE LOCATION '/user/myuser/MapReduceOutput'; Data dir looks like this. /user/myuser/MapReduceOutput/2012/09/11 When I create the table using '/user/myuser/MapReduceOutput/2012/09/11' as the location, I can query the table and get data back. Please advice, Thanks. From: Bejoy KS [mailto:bejoy.had...@gmail.com] Sent: Wednesday, September 12, 2012 3:09 PM To: user@hadoop.apache.org Subject: Re: How to make the hive external table read from subdirectories Hi Natraj Create a partitioned table and add the sub dirs as partitions. You need to have some logic in place for determining the partitions. Say if the sub dirs denote data based on a date then make date as the partition. Regards Bejoy KS Sent from handheld, please excuse typos. From: Nataraj Rashmi - rnatar rashmi.nata...@acxiom.com Date: Wed, 12 Sep 2012 19:19:19 + To: user@hadoop.apache.orguser@hadoop.apache.org ReplyTo: user@hadoop.apache.org Subject: How to make the hive external table read from subdirectories I have a hive external table created from a hdfs location. How do I make it read the data from all the subdirectories also? Thanks. *** The information contained in this communication is confidential, is intended only for the use of the recipient named above, and may be legally privileged. If the reader of this message is not the intended recipient, you are hereby notified that any dissemination, distribution or copying of this communication is strictly prohibited. If you have received this communication in error, please resend this communication to the sender and delete the original message or any copy of it from your computer system. Thank You.
Re: How to make the hive external table read from subdirectories
Hi Natraj Create a partitioned table and add the sub dirs as partitions. You need to have some logic in place for determining the partitions. Say if the sub dirs denote data based on a date then make date as the partition. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Nataraj Rashmi - rnatar rashmi.nata...@acxiom.com Date: Wed, 12 Sep 2012 19:19:19 To: user@hadoop.apache.orguser@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: How to make the hive external table read from subdirectories I have a hive external table created from a hdfs location. How do I make it read the data from all the subdirectories also? Thanks.
Re: what's the default reducer number?
Hi Lin The default value for number of reducers is 1 <name>mapred.reduce.tasks</name> <value>1</value> It is not determined by data volume. You need to specify the number of reducers for your mapreduce jobs as per your data volume. Regards Bejoy KS On Tue, Sep 11, 2012 at 4:53 PM, Jason Yang lin.yang.ja...@gmail.com wrote: Hi, all I was wondering what's the default number of reducer if I don't set it in configuration? Will it change dynamically according to the output volume of Mapper? -- YANG, Lin
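Since the value is not derived from the data volume, it has to be set per job. A rough sketch of the two usual ways with the new API (the count of 20, the jar and the driver class are only placeholders):

    // in the driver, on the org.apache.hadoop.mapreduce.Job instance
    job.setNumReduceTasks(20);

or, for a driver that goes through ToolRunner/GenericOptionsParser, on the command line:

    hadoop jar myjob.jar com.example.MyDriver -D mapred.reduce.tasks=20 input_dir output_dir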
Re: what's the default reducer number?
Hi Lin The default values for all the properties are in core-default.xml, hdfs-default.xml and mapred-default.xml Regards Bejoy KS On Tue, Sep 11, 2012 at 5:06 PM, Jason Yang lin.yang.ja...@gmail.com wrote: Hi, Bejoy Thanks for your reply. Where could I find the default value of mapred.reduce.tasks? I have checked the core-site.xml, hdfs-site.xml and mapred-site.xml, but I haven't found it. 2012/9/11 Bejoy Ks bejoy.had...@gmail.com Hi Lin The default value for number of reducers is 1 <name>mapred.reduce.tasks</name> <value>1</value> It is not determined by data volume. You need to specify the number of reducers for your mapreduce jobs as per your data volume. Regards Bejoy KS On Tue, Sep 11, 2012 at 4:53 PM, Jason Yang lin.yang.ja...@gmail.com wrote: Hi, all I was wondering what's the default number of reducer if I don't set it in configuration? Will it change dynamically according to the output volume of Mapper? -- YANG, Lin -- YANG, Lin
Re: Some general questions about DBInputFormat
Hi Yaron Sqoop uses a similar implementation. You can get some details there. Replies inline • (more general question) Are there many use-cases for using DBInputFormat? Do most Hadoop jobs take their input from files or DBs? From my small experience Most MR jobs have data in hdfs. It is useful for getting data out of rdbms to hadoop, sqoop implemenation is an example. • Since all mappers open a connection to the same DBS, one cannot use hundreds of mapper. Is there a solution to this problem? Num of mappers shouldn't be more than the permissible number of connections allowed for that db. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Yaron Gonen yaron.go...@gmail.com Date: Tue, 11 Sep 2012 15:41:26 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: Some general questions about DBInputFormat Hi, After reviewing the class's (not very complicated) code, I have some questions I hope someone can answer: - (more general question) Are there many use-cases for using DBInputFormat? Do most Hadoop jobs take their input from files or DBs? - What happens when the database is updated during mappers' data retrieval phase? is there a way to lock the database before the data retrieval phase and release it afterwords? - Since all mappers open a connection to the same DBS, one cannot use hundreds of mapper. Is there a solution to this problem? Thanks, Yaron
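As a reference point, a rough, untested sketch of how DBInputFormat is usually wired up in a driver's run() method with the new API; the driver class, record class, table, columns and JDBC settings below are all made up, and the map-task count is kept low so the number of DB connections stays within the database's limit:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;

    Configuration conf = new Configuration();
    // JDBC driver, URL and credentials are placeholders
    DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
        "jdbc:mysql://dbhost/mydb", "dbuser", "dbpass");

    Job job = new Job(conf, "db import");
    job.setJarByClass(MyDriver.class);
    // OrderRecord is a hypothetical class implementing Writable and DBWritable
    DBInputFormat.setInput(job, OrderRecord.class, "orders",
        null /* conditions */, "order_id" /* orderBy */, "order_id", "amount");
    // few map tasks => few simultaneous connections to the database
    job.getConfiguration().setInt("mapred.map.tasks", 4);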
Re: How to remove datanode from cluster..
Hi Yogesh The detailed steps are available in the hadoop wiki on the FAQ page http://wiki.apache.org/hadoop/FAQ#I_want_to_make_a_large_cluster_smaller_by_taking_out_a_bunch_of_nodes_simultaneously._How_can_this_be_done.3F Regards Bejoy KS On Wed, Sep 12, 2012 at 12:14 AM, yogesh dhari yogeshdh...@live.com wrote: Hello all, I am not getting a clear way to remove a datanode from the cluster. Please explain the decommissioning steps with an example, like how to create the exclude files and the other steps involved in it. Thanks regards Yogesh Kumar
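In short, the HDFS side of those steps looks roughly like the sketch below (the exclude file path and hostname are examples, and dfs.hosts.exclude must already have been set when the NameNode started):

    <!-- hdfs-site.xml on the NameNode -->
    <property>
      <name>dfs.hosts.exclude</name>
      <value>/etc/hadoop/conf/dfs.exclude</value>
    </property>

    # add the node to the exclude file, then ask the NameNode to re-read it
    echo "datanode05.example.com" >> /etc/hadoop/conf/dfs.exclude
    hadoop dfsadmin -refreshNodes
    # wait until 'hadoop dfsadmin -report' shows the node as Decommissioned before stopping it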
Re: Reg: parsing all files file append
Hi Manoj From my limited knowledge on file appends in hdfs , i have seen more recommendations to use sync() in the latest releases than using append(). Let us wait for some commiter to authoritatively comment on 'the production readiness of append()' . :) Regards Bejoy KS On Mon, Sep 10, 2012 at 11:03 AM, Manoj Babu manoj...@gmail.com wrote: Thank you Bejoy. Does file append is production stable? Cheers! Manoj. On Sun, Sep 9, 2012 at 10:19 PM, Bejoy KS bejoy.had...@gmail.com wrote: ** Hi Manoj You can load daily logs into a individual directories in hdfs and process them daily. Keep those results in hdfs or hbase or dbs etc. Every day do the processing, get the results and aggregate the same with the previously aggregated results till date. Regards Bejoy KS Sent from handheld, please excuse typos. -- *From: * Manoj Babu manoj...@gmail.com *Date: *Sun, 9 Sep 2012 21:28:54 +0530 *To: *mapreduce-user@hadoop.apache.org *ReplyTo: * mapreduce-user@hadoop.apache.org *Subject: *Reg: parsing all files file append Hi All, I have two questions, providing info on it will be helpful. 1, I am using hadoop to analyze and to find top n search term metric's from logs. If any new log file is added to HDFS then again we are running the job to find the metrics. Daily we will be getting log files and we are parsing the whole file and getting the metric's. All the log file's are parsed daily to get the latest metric's is there any way is there any way to avoid this? 2, Does file append is production stable? Cheers! Manoj.
Re: Reg: parsing all files file append
Hi Manoj You can load daily logs into a individual directories in hdfs and process them daily. Keep those results in hdfs or hbase or dbs etc. Every day do the processing, get the results and aggregate the same with the previously aggregated results till date. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Manoj Babu manoj...@gmail.com Date: Sun, 9 Sep 2012 21:28:54 To: mapreduce-user@hadoop.apache.org Reply-To: mapreduce-user@hadoop.apache.org Subject: Reg: parsing all files file append Hi All, I have two questions, providing info on it will be helpful. 1, I am using hadoop to analyze and to find top n search term metric's from logs. If any new log file is added to HDFS then again we are running the job to find the metrics. Daily we will be getting log files and we are parsing the whole file and getting the metric's. All the log file's are parsed daily to get the latest metric's is there any way is there any way to avoid this? 2, Does file append is production stable? Cheers! Manoj.
Re: Using hadoop for analytics
Hi Prashant Welcome to Hadoop Community. :) Hadoop is meant for processing large data volumes. Saying that, for your custom requirements you should write your own mapper and reducer that contains your business logic for processing the input data. Also you can have a look at hive and pig, which are tools built on top of map reduce that is highly used for data analysis. Hive supports SQL like queries. If your requirements could be satisfied with Hive or Pig, it is highly recommend to go with those. On Wed, Sep 5, 2012 at 2:12 PM, pgaurav pgauravi...@gmail.com wrote: Hi Guys, I’m 5 days old in hadoop world and trying to analyse this as a long term solution to our client. I could do some rd on Amazon EC2 / EMR: Load the data, text / csv, to S3 Write your mapper / reducer / Jobclient and upload the jar to s3 Start a job flow I tried 2 sample code, word count and csv data process. My question is that to further analyse the data / reporting / search, what should be done? Do I need to implement in Mapper class itself? Do I need to dump the data to the database and then write some custom application? What is the standard way to analysing the data? Thanks Prashant -- View this message in context: http://old.nabble.com/Using-hadoop-for-analytics-tp34391246p34391246.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
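As an illustration of the Hive route, a small sketch over comma-separated files already sitting in HDFS (table name, columns and path are invented):

    CREATE EXTERNAL TABLE page_hits (ts STRING, url STRING, bytes INT)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE
    LOCATION '/user/prashant/input/hits';

    -- reporting-style queries are compiled into MapReduce jobs under the covers
    SELECT url, count(*) AS hits
    FROM page_hits
    GROUP BY url
    ORDER BY hits DESC
    LIMIT 10;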
Re: Replication Factor Modification
Hi You can change the replication factor of an existing directory using '-setrep' http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html#setrep The below command will recursively set the replication factor to 1 for all files within the given directory '/user' hadoop fs -setrep -w 1 -R /user On Wed, Sep 5, 2012 at 11:39 PM, Uddipan Mukherjee uddipan_mukher...@infosys.com wrote: Hi, We have a requirement where we have change our Hadoop Cluster's Replication Factor without restarting the Cluster. We are running our Cluster on Amazon EMR. Can you please suggest the way to achieve this? Any pointer to this will be very helpful. Thanks And Regards Uddipan Mukherjee CAUTION - Disclaimer * This e-mail contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended solely for the use of the addressee(s). If you are not the intended recipient, please notify the sender by e-mail and delete the original message. Further, you are not to copy, disclose, or distribute this e-mail or its contents to any other person and any such actions are unlawful. This e-mail may contain viruses. Infosys has taken every reasonable precaution to minimize this risk, but is not liable for any damage you may sustain as a result of any virus in this e-mail. You should carry out your own virus checks before opening the e-mail or attachment. Infosys reserves the right to monitor and review the content of all messages sent to or from this e-mail address. Messages sent to or from this e-mail address may be stored on the Infosys e-mail system. ***INFOSYS End of Disclaimer INFOSYS***
Re: Replication Factor Modification
Hi Uddipan As Harsh mentioned, replication factor is a client side property . So you need to update the value for 'dfs.replication' in hdfs-site.xml as per your requirement in your edge nodes or from the machines your are copying files to hdfs. If you are using some of the existing DN's for this purpose (as client) you need to update the value in there. No need of restarting the services. On Wed, Sep 5, 2012 at 11:54 PM, Uddipan Mukherjee uddipan_mukher...@infosys.com wrote: Hi, ** ** Thanks for the help. But How I will set the replication factor as desired so that when new files comes in it will automatically take the new value of dfs.replication without a cluster restart. Please note we have a 200 nodes cluster. ** ** Thanks and Regards, Uddipan Mukherjee ** ** *From:* Harsh J [mailto:ha...@cloudera.com] *Sent:* Wednesday, September 05, 2012 7:17 PM *To:* user@hadoop.apache.org *Subject:* Re: Replication Factor Modification ** ** Replication factor is per-file, and is a client-side property. So, this is doable. ** ** 1. Change the replication factor of all existing files (or needed ones):** ** ** ** $ hadoop fs -setrep -R value / ** ** 2. Change the dfs.replication parameter in all client configs to the desired value On Wed, Sep 5, 2012 at 11:39 PM, Uddipan Mukherjee uddipan_mukher...@infosys.com wrote: Hi, We have a requirement where we have change our Hadoop Cluster's Replication Factor without restarting the Cluster. We are running our Cluster on Amazon EMR. Can you please suggest the way to achieve this? Any pointer to this will be very helpful. Thanks And Regards Uddipan Mukherjee CAUTION - Disclaimer * This e-mail contains PRIVILEGED AND CONFIDENTIAL INFORMATION intended solely for the use of the addressee(s). If you are not the intended recipient, please notify the sender by e-mail and delete the original message. Further, you are not to copy, disclose, or distribute this e-mail or its contents to any other person and any such actions are unlawful. This e-mail may contain viruses. Infosys has taken every reasonable precaution to minimize this risk, but is not liable for any damage you may sustain as a result of any virus in this e-mail. You should carry out your own virus checks before opening the e-mail or attachment. Infosys reserves the right to monitor and review the content of all messages sent to or from this e-mail address. Messages sent to or from this e-mail address may be stored on the Infosys e-mail system. ***INFOSYS End of Disclaimer INFOSYS*** ** ** -- Harsh J
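For reference, the client-side setting is just the snippet below in hdfs-site.xml on the machines that write to HDFS (the value 2 is only an example), and it can also be passed per command without touching any file:

    <property>
      <name>dfs.replication</name>
      <value>2</value>
    </property>

    # one-off override from the shell
    hadoop fs -D dfs.replication=2 -put localfile /user/data/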
Re: Integrating hadoop with java UI application deployed on tomcat
Hi You are running tomact on a windows machine and trying to connect to a remote hadoop cluster from there. Your core site has name fs.default.name/name valuehdfs://localhost:9000/value But It is localhost here.( I assume you are not running hadoop on this windows environment for some testing) You need to have the exact configuration files and hadoop jars from the cluster machines on this tomcat environment as well. I mean on the classpath of your application. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Visioner Sadak visioner.sa...@gmail.com Date: Tue, 4 Sep 2012 15:31:25 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: Re: Integrating hadoop with java UI application deployed on tomcat also getting one more error * org.apache.hadoop.ipc.RemoteException*: Server IPC version 5 cannot communicate with client version 4 On Tue, Sep 4, 2012 at 2:44 PM, Visioner Sadak visioner.sa...@gmail.comwrote: Thanks shobha tried adding conf folder to tomcats classpath still getting same error Call to localhost/127.0.0.1:9000 failed on local exception: java.io.IOException: An established connection was aborted by the software in your host machine On Tue, Sep 4, 2012 at 11:18 AM, Mahadevappa, Shobha shobha.mahadeva...@nttdata.com wrote: Hi, Try adding the hadoop/conf directory in the TOMCAT’s classpath ** ** Ex : CLASSPATH=/usr/local/Apps/hbase-0.90.4/conf:/usr/local/Apps/hadoop-0.20.203.0/conf: ** ** ** ** ** ** Regards, *Shobha M * ** ** *From:* Visioner Sadak [mailto:visioner.sa...@gmail.com] *Sent:* 03 September 2012 PM 04:01 *To:* user@hadoop.apache.org *Subject:* Re: Integrating hadoop with java UI application deployed on tomcat ** ** Thanks steve thers nothing in logs and no exceptions as well i found that some file is created in my F:\user with directory name but its not visible inside my hadoop browse filesystem directories i also added the config by using the below method hadoopConf.addResource( F:/hadoop-0.22.0/conf/core-site.xml); when running thru WAR printing out the filesystem i m getting org.apache.hadoop.fs.LocalFileSystem@9cd8db when running an independet jar within hadoop i m getting DFS[DFSClient[clientName=DFSClient_296231340, ugi=dell]] when running an independet jar i m able to do uploads just wanted to know will i have to add something in my classpath of tomcat or is there any other configurations of core-site.xml that i am missing out..thanks for your help. ** ** On Sat, Sep 1, 2012 at 1:38 PM, Steve Loughran ste...@hortonworks.com wrote: ** ** well, it's worked for me in the past outside Hadoop itself: ** ** http://smartfrog.svn.sourceforge.net/viewvc/smartfrog/trunk/core/hadoop-components/hadoop-ops/src/org/smartfrog/services/hadoop/operations/utils/DfsUtils.java?revision=8882view=markup ** ** 1. Turn logging up to DEBUG 2. Make sure that the filesystem you've just loaded is what you expect, by logging its value. 
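Once the cluster's real config files and matching jars are on the Tomcat classpath, the webapp code should boil down to something like the sketch below (the config paths are examples); printing the FileSystem class is a quick sanity check, since seeing LocalFileSystem means the cluster config is still not being picked up:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    Configuration conf = new Configuration();
    // if the cluster's xml files cannot go on the classpath, point at them explicitly
    conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
    conf.addResource(new Path("/etc/hadoop/conf/hdfs-site.xml"));

    FileSystem fs = FileSystem.get(conf);
    System.out.println(fs.getClass().getName()); // expect DistributedFileSystem
    fs.copyFromLocalFile(new Path("E:/test/GANI.jpg"), new Path("/user/TestDir/"));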
It may turn out to be file:///, because the normal Hadoop site-config.xml isn't being picked up ** ** On Fri, Aug 31, 2012 at 1:08 AM, Visioner Sadak visioner.sa...@gmail.com wrote: but the problem is that my code gets executed with the warning but file is not copied to hdfs , actually i m trying to copy a file from local to hdfs Configuration hadoopConf=new Configuration(); //get the default associated file system FileSystem fileSystem=FileSystem.get(hadoopConf); // HarFileSystem harFileSystem= new HarFileSystem(fileSystem); //copy from lfs to hdfs fileSystem.copyFromLocalFile(new Path(E:/test/GANI.jpg),new Path(/user/TestDir/)); ** ** ** ** __ Disclaimer:This email and any attachments are sent in strictest confidence for the sole use of the addressee and may contain legally privileged, confidential, and proprietary data. If you are not the intended recipient, please advise the sender by replying promptly to this email and then delete and destroy this email and any attachments without any further use, copying or forwarding
Re: Exception while running a Hadoop example on a standalone install on Windows 7
Hi Udayani By default hadoop works well for linux and linux based OS. Since you are on Windows you need to install and configure ssh using cygwin before you start hadoop daemons. On Tue, Sep 4, 2012 at 6:16 PM, Udayini Pendyala udayini_pendy...@yahoo.com wrote: Hi, Following is a description of what I am trying to do and the steps I followed. GOAL: a). Install Hadoop 1.0.3 b). Hadoop in a standalone (or local) mode c). OS: Windows 7 STEPS FOLLOWED: 1.1. I followed instructions from: http://www.oreillynet.com/pub/a/other-programming/excerpts/hadoop-tdg/installing-apache-hadoop.html. Listing the steps I did - a. I went to: http://hadoop.apache.org/core/releases.html. b. I installed hadoop-1.0.3 by downloading “hadoop-1.0.3.tar.gz” and unzipping/untarring the file. c. I installed JDK 1.6 and set up JAVA_HOME to point to it. d. I set up HADOOP_INSTALL to point to my Hadoop install location. I updated my PATH variable to have $HADOOP_INSTALL/bin e. After the above steps, I ran the command: “hadoop version” and got the following information: $ hadoop version Hadoop 1.0.3 Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1335192 Compiled by hortonfo on Tue May 8 20:31:25 UTC 2012 From source with checksum e6b0c1e23dcf76907c5fecb4b832f3be 2. 2. The standalone was very easy to install as described above. Then, I tried to run a sample command as given in: http://hadoop.apache.org/common/docs/r0.17.2/quickstart.html#Local Specifically, the steps followed were: a. cd $HADOOP_INSTALL b. mkdir input c. cp conf/*.xml input d. bin/hadoop jar hadoop-examples-1.0.3.jar grep input output ‘dfs[a-z.]+’ and got the following error: $ bin/hadoop jar hadoop-examples-1.0.3.jar grep input output 'dfs[a-z.]+' 12/09/03 15:41:57 WARN util.NativeCodeLoader: Unable to load native-hadoop libra ry for your platform... 
using builtin-java classes where applicable 12/09/03 15:41:57 ERROR security.UserGroupInformation: PriviledgedActionExceptio n as:upendyal cause:java.io.IOException: Failed to set permissions of path: \tmp \hadoop-upendyal\mapred\staging\upendyal-1075683580\.staging to 0700 java.io http://java.io.IO.IOException: Failed to set permissions of path: \tmp\hadoop-upendyal\map red\staging\upendyal-1075683580\.staging to 0700 at org.apache.hadoop.fs.FileUtil.checkReturnValue(FileUtil.java:689) at org.apache.hadoop.fs.FileUtil.setPermission(FileUtil.java:662) at org.apache.hadoop.fs.RawLocalFileSystem.setPermission(RawLocalFileSys tem.java:509) at org.apache.hadoop.fs.RawLocalFileSystem.mkdirs(RawLocalFileSystem.jav a:344) at org.apache.hadoop.fs.FilterFileSystem.mkdirs(FilterFileSystem.java:18 9) at org.apache.hadoop.mapreduce.JobSubmissionFiles.getStagingDir(JobSubmi ssionFiles.java:116) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:856) at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:850) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Unknown Source) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInforma tion.java:1121) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:8 50) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:824) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1261) at org.apache.hadoop.examples.Grep.run(Grep.java:69) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.examples.Grep.main(Grep.java:93) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(Progra mDriver.java:68) at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139) at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) 3.3. I googled the problem and found the following links but none of these suggestions helped. Most people seem to be getting a resolution when they change the version of Hadoop. a. http://mail-archives.apache.org/mod_mbox/hadoop-common-user/201105.mbox/%3cbanlktin-8+z8uybtdmaa4cvxz4jzm14...@mail.gmail.com%3E b. http://comments.gmane.org/gmane.comp.jakarta.lucene.hadoop.user/25837 Is this a problem in the version of Hadoop I selected OR am I doing something wrong? I would appreciate any help with this. Thanks Udayini
Re: reading a binary file
Hi Francesco TextInputFormat reads line by line based on '\n' by default, there the key values is the position offset and the line contents respectively. But in your case it is just a sequence of integers and also it is Binary. Also you require the offset for each integer value and not offset by line. I believe you may have to write your own custom Record Reader to get this done. On Mon, Sep 3, 2012 at 8:38 PM, Francesco Silvestri yuri@gmail.comwrote: Hi Mohammad, SequenceFileInputFormathttp://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/input/SequenceFileInputFormat.html requires the file to be a sequence of key/value stored in binary (i.e., the key is stored in the file). In my case, the key is implicitly given by the position of the value within the file. Thank you, Francesco On Mon, Sep 3, 2012 at 5:01 PM, Mohammad Tariq donta...@gmail.com wrote: Hello Francesco, Have a look at SequenceFileInputFormat : http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/input/SequenceFileInputFormat.html Regards, Mohammad Tariq On Mon, Sep 3, 2012 at 8:26 PM, Francesco Silvestri yuri@gmail.comwrote: Hello, I have a binary file of integers and I would like an input format that generates pairs key,value, where value is an integer in the file and key the position of the integer in the file. Which class should I use? (i.e. I'm looking for a kind of TextinputFormat for binary files) Thank you for your consideration, Francesco
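A rough, untested sketch of such a custom reader for the new API; it assumes the file is a flat sequence of 4-byte big-endian ints, that the file length is a multiple of 4, and that the desired key is the index of the integer within the file:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // InputFormat that hands each split to the reader below
    public class IntInputFormat extends FileInputFormat<LongWritable, IntWritable> {
      @Override
      public RecordReader<LongWritable, IntWritable> createRecordReader(
          InputSplit split, TaskAttemptContext context) {
        return new IntRecordReader();
      }

      // key = index of the int in the file, value = the int itself
      public static class IntRecordReader extends RecordReader<LongWritable, IntWritable> {
        private FSDataInputStream in;
        private long start, end, pos;
        private final LongWritable key = new LongWritable();
        private final IntWritable value = new IntWritable();

        @Override
        public void initialize(InputSplit genericSplit, TaskAttemptContext context)
            throws IOException {
          FileSplit split = (FileSplit) genericSplit;
          Path file = split.getPath();
          FileSystem fs = file.getFileSystem(context.getConfiguration());
          in = fs.open(file);
          // an int belongs to the split that contains its first byte,
          // so round the split start up to the next 4-byte boundary
          start = (split.getStart() + 3) / 4 * 4;
          end = split.getStart() + split.getLength();
          in.seek(start);
          pos = start;
        }

        @Override
        public boolean nextKeyValue() throws IOException {
          if (pos >= end) {
            return false;            // the next int starts in the following split
          }
          key.set(pos / 4);          // position (index) of the integer in the file
          value.set(in.readInt());   // assumes the file length is a multiple of 4
          pos += 4;
          return true;
        }

        @Override public LongWritable getCurrentKey()   { return key; }
        @Override public IntWritable  getCurrentValue() { return value; }
        @Override public void close() throws IOException { if (in != null) in.close(); }

        @Override
        public float getProgress() {
          return end == start ? 0.0f : Math.min(1.0f, (pos - start) / (float) (end - start));
        }
      }
    }

The driver would then simply call job.setInputFormatClass(IntInputFormat.class).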
Re: knowing the nodes on which reduce tasks will run
HI Abhay The TaskTrackers on which the reduce tasks are triggered is chosen in random based on the reduce slot availability. So if you don't need the reduce tasks to be scheduled on some particular nodes you need to set 'mapred.tasktracker.reduce.tasks.maximum' on those nodes to 0. The bottleneck here is that this property is not a job level one you need to set it on a cluster level. A cleaner approach will be to configure each of your nodes with the right number of map and reduce slots based on the resources available on each machine. On Mon, Sep 3, 2012 at 7:49 PM, Abhay Ratnaparkhi abhay.ratnapar...@gmail.com wrote: Hello, How can one get to know the nodes on which reduce tasks will run? One of my job is running and it's completing all the map tasks. My map tasks write lots of intermediate data. The intermediate directory is getting full on all the nodes. If the reduce task take any node from cluster then It'll try to copy the data to same disk and it'll eventually fail due to Disk space related exceptions. I have added few more tasktracker nodes in the cluster and now want to run reducer on new nodes only. Is it possible to choose a node on which the reducer will run? What's the algorithm hadoop uses to get a new node to run reducer? Thanks in advance. Bye Abhay
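For reference, the per-node setting lives in mapred-site.xml of each TaskTracker (the values below are examples), and as noted a TaskTracker restart is needed for a change to take effect:

    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>0</value> <!-- no reduce slots on this node -->
    </property>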
Re: knowing the nodes on which reduce tasks will run
Hi Abhay You need this value to be changed before you submit your job and restart TT. Modifying this value in mid time won't affect the running jobs. On Mon, Sep 3, 2012 at 9:06 PM, Abhay Ratnaparkhi abhay.ratnapar...@gmail.com wrote: How can I set 'mapred.tasktracker.reduce.tasks.maximum' to 0 in a running tasktracker? Seems that I need to restart the tasktracker and in that case I'll loose the output of map tasks by particular tasktracker. Can I change 'mapred.tasktracker.reduce.tasks.maximum' to 0 without restarting tasktracker? ~Abhay On Mon, Sep 3, 2012 at 8:53 PM, Bejoy Ks bejoy.had...@gmail.com wrote: HI Abhay The TaskTrackers on which the reduce tasks are triggered is chosen in random based on the reduce slot availability. So if you don't need the reduce tasks to be scheduled on some particular nodes you need to set 'mapred.tasktracker.reduce.tasks.maximum' on those nodes to 0. The bottleneck here is that this property is not a job level one you need to set it on a cluster level. A cleaner approach will be to configure each of your nodes with the right number of map and reduce slots based on the resources available on each machine. On Mon, Sep 3, 2012 at 7:49 PM, Abhay Ratnaparkhi abhay.ratnapar...@gmail.com wrote: Hello, How can one get to know the nodes on which reduce tasks will run? One of my job is running and it's completing all the map tasks. My map tasks write lots of intermediate data. The intermediate directory is getting full on all the nodes. If the reduce task take any node from cluster then It'll try to copy the data to same disk and it'll eventually fail due to Disk space related exceptions. I have added few more tasktracker nodes in the cluster and now want to run reducer on new nodes only. Is it possible to choose a node on which the reducer will run? What's the algorithm hadoop uses to get a new node to run reducer? Thanks in advance. Bye Abhay
Re: MRBench Maps strange behaviour
Hi Gaurav You can get the information on the num of map tasks in the job from the JT web UI itself. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Gaurav Dasgupta gdsay...@gmail.com Date: Wed, 29 Aug 2012 13:14:11 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: Re: MRBench Maps strange behaviour Hi Hemanth, Thanks for the reply. Can you tell me how can I calculate or ensure from the counters what should be the exact number of Maps? Thanks, Gaurav Dasgupta On Wed, Aug 29, 2012 at 11:26 AM, Hemanth Yamijala yhema...@gmail.comwrote: Hi, The number of maps specified to any map reduce program (including those part of MRBench) is generally only a hint, and the actual number of maps will be influenced in typical cases by the amount of data being processed. You can take a look at this wiki link to understand more: http://wiki.apache.org/hadoop/HowManyMapsAndReduces In the examples below, since the data you've generated is different, the number of mappers are different. To be able to judge your benchmark results, you'd need to benchmark against the same data (or at least same type of type - i.e. size and type). The number of maps printed at the end is straight from the input specified and doesn't reflect what the job actually ran with. The information from the counters is the right one. Thanks Hemanth On Tue, Aug 28, 2012 at 4:02 PM, Gaurav Dasgupta gdsay...@gmail.com wrote: Hi All, I executed the MRBench program from hadoop-test.jar in my 12 node CDH3 cluster. After executing, I had some strange observations regarding the number of Maps it ran. First I ran the command: hadoop jar /usr/lib/hadoop-0.20/hadoop-test.jar mrbench -numRuns 3 -maps 200 -reduces 200 -inputLines 1024 -inputType random And I could see that the actual number of Maps it ran was 201 (for all the 3 runs) instead of 200 (Though the end report displays the launched to be 200). Here is the console report: 12/08/28 04:34:35 INFO mapred.JobClient: Job complete: job_201208230144_0035 12/08/28 04:34:35 INFO mapred.JobClient: Counters: 28 12/08/28 04:34:35 INFO mapred.JobClient: Job Counters 12/08/28 04:34:35 INFO mapred.JobClient: Launched reduce tasks=200 12/08/28 04:34:35 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=617209 12/08/28 04:34:35 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 12/08/28 04:34:35 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 12/08/28 04:34:35 INFO mapred.JobClient: Rack-local map tasks=137 12/08/28 04:34:35 INFO mapred.JobClient: Launched map tasks=201 12/08/28 04:34:35 INFO mapred.JobClient: Data-local map tasks=64 12/08/28 04:34:35 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=1756882 Again, I ran the MRBench for just 10 Maps and 10 Reduces: hadoop jar /usr/lib/hadoop-0.20/hadoop-test.jar mrbench -maps 10 -reduces 10 This time the actual number of Maps were only 2 and again the end report displays Maps Lauched to be 10. 
The console output: 12/08/28 05:05:35 INFO mapred.JobClient: Job complete: job_201208230144_0040 12/08/28 05:05:35 INFO mapred.JobClient: Counters: 27 12/08/28 05:05:35 INFO mapred.JobClient: Job Counters 12/08/28 05:05:35 INFO mapred.JobClient: Launched reduce tasks=20 12/08/28 05:05:35 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=6648 12/08/28 05:05:35 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0 12/08/28 05:05:35 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0 12/08/28 05:05:35 INFO mapred.JobClient: Launched map tasks=2 12/08/28 05:05:35 INFO mapred.JobClient: Data-local map tasks=2 12/08/28 05:05:35 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=163257 12/08/28 05:05:35 INFO mapred.JobClient: FileSystemCounters 12/08/28 05:05:35 INFO mapred.JobClient: FILE_BYTES_READ=407 12/08/28 05:05:35 INFO mapred.JobClient: HDFS_BYTES_READ=258 12/08/28 05:05:35 INFO mapred.JobClient: FILE_BYTES_WRITTEN=1072596 12/08/28 05:05:35 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=3 12/08/28 05:05:35 INFO mapred.JobClient: Map-Reduce Framework 12/08/28 05:05:35 INFO mapred.JobClient: Map input records=1 12/08/28 05:05:35 INFO mapred.JobClient: Reduce shuffle bytes=647 12/08/28 05:05:35 INFO mapred.JobClient: Spilled Records=2 12/08/28 05:05:35 INFO mapred.JobClient: Map output bytes=5 12/08/28 05:05:35 INFO mapred.JobClient: CPU time spent (ms)=17070 12/08/28 05:05:35 INFO mapred.JobClient: Total committed heap usage (bytes)=6218842112 12/08/28 05:05:35 INFO mapred.JobClient: Map input bytes=2 12/08/28 05:05:35 INFO mapred.JobClient: Combine input records=0 12/08/28 05:05:35 INFO mapred.JobClient
Re: one reducer is hanged in reduce- copy phase
Hi Abhay The map outputs are deleted only after the reducer runs to completion. Is it possible to run the same attempt again? Does killing the child java process or tasktracker on the node help? (since hadoop may schedule a reduce attempt on another node). Yes,it is possible to re attempt the task again for that you need to fail the current attempt. Can I copy the map intermediate output required for this single reducer (which is hanged) and rerun only the hang reducer? It is not that easy to accomplish this. Better fail the task explicitly so that the it is re attempted. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Abhay Ratnaparkhi abhay.ratnapar...@gmail.com Date: Tue, 28 Aug 2012 19:40:58 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: one reducer is hanged in reduce- copy phase Hello, I have a MR job which has 4 reducers running. One of the reduce attempt is pending since long time in reduce-copy phase. The job is not able to complete because of this. I have seen that the child java process on tasktracker is running. Is it possible to run the same attempt again? Does killing the child java process or tasktracker on the node help? (since hadoop may schedule a reduce attempt on another node). Can I copy the map intermediate output required for this single reducer (which is hanged) and rerun only the hang reducer? Thank you in advance. ~Abhay ask_201208250623_0005_r_00http://dpep089.innovate.ibm.com:50030/taskdetails.jsp?tipid=task_201208250623_0005_r_00 26.41% reduce copy(103 of 130 at 0.08 MB/s) 28-Aug-2012 03:09:34
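Failing the stuck attempt from the command line looks roughly like this; the attempt and job ids below are made up, the real ones can be copied from the JobTracker web UI:

    # fail (not kill) the hung reduce attempt so a fresh attempt gets scheduled
    hadoop job -fail-task attempt_201208250623_0005_r_000000_0

    # job status if the ids are not handy
    hadoop job -status job_201208250623_0005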
Re: namenode not starting
Hi Abhay What is the value for hadoop.tmp.dir or dfs.name.dir . If it was set to /tmp the contents would be deleted on a OS restart. You need to change this location before you start your NN. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Abhay Ratnaparkhi abhay.ratnapar...@gmail.com Date: Fri, 24 Aug 2012 12:58:41 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: namenode not starting Hello, I had a running hadoop cluster. I restarted it and after that namenode is unable to start. I am getting error saying that it's not formatted. :( Is it possible to recover the data on HDFS? 2012-08-24 03:17:55,378 ERROR org.apache.hadoop.hdfs.server.namenode.FSNamesystem: FSNamesystem initialization failed. java.io.IOException: NameNode is not formatted. at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:434) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:110) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:291) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:270) at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:271) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:303) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:433) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:421) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1359) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1368) 2012-08-24 03:17:55,380 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.io.IOException: NameNode is not formatted. at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:434) at org.apache.hadoop.hdfs.server.namenode.FSDirectory.loadFSImage(FSDirectory.java:110) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:291) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:270) at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:271) at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:303) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:433) at org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:421) at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1359) at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1368) Regards, Abhay
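Moving the metadata off /tmp is a matter of settings like the ones below (the paths are examples); with the old /tmp contents already gone, the namenode will still need a fresh format or a restore after the change:

    <!-- hdfs-site.xml -->
    <property>
      <name>dfs.name.dir</name>
      <value>/data/hadoop/dfs/name</value>
    </property>

    <!-- core-site.xml: keeps the other default storage locations off /tmp as well -->
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/data/hadoop/tmp</value>
    </property>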
Re: Streaming issue ( URGENT )
Hi Siddharth Joins are better implemented in hive and pig. Try checking out those and see whether it fits your requirements. If you are still looking for implementing joins using mapreduce, you can take a look at this example which uses MultipleInputs http://kickstarthadoop.blogspot.in/2011/09/joins-with-plain-map-reduce.html Regards Bejoy KS
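With the Hive route, a reduce-side join collapses into a single statement; a small sketch with invented tables and columns:

    SELECT u.name, o.order_id, o.amount
    FROM users u
    JOIN orders o ON (u.user_id = o.user_id);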
Re: Number of Maps running more than expected
Hi Gaurav How many input files are there for the wordcount map reduce job? Do you have input files lesser than a block size? If you are using the default TextInputFormat there will be one task generated per file for sure, so if you have files less than block size the calculation specified here for number of splits won't hold. If small files are there then definitely the number of maps tasks should be more. Also did you change the split sizes as well along with block size? Regards Bejoy KS
Re: help in distribution of a task with hadoop
Hi Bertrand -libjars option works well with the 'hadoop jar' command. Instead of executing your runnable with the plain java 'jar' command use 'hadoop jar' . When you use hadoop jar you can ship the dependent jars/files etc as 1) include them in the /lib folder in your jar 2) use -libjars / -files to distribute jars or files Regards Bejoy KS
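A typical invocation looks like the sketch below (jar names, class and paths are placeholders); note that -libjars and -files are honoured only when the driver runs through ToolRunner/GenericOptionsParser:

    hadoop jar myjob.jar com.example.MyDriver \
      -libjars /local/lib/dep1.jar,/local/lib/dep2.jar \
      -files /local/data/lookup.txt \
      input_dir output_dir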
Re: how to enhance job start up speed?
Hi Matthais When an mapreduce program is being used there are some extra steps like checking for input and output dir, calclulating input splits, JT assigning TT for executing the task etc. If your file is non splittable , then one map task per file will be generated irrespective of the number of hdfs blocks. Now some blocks will be in a different node than the node where map task is executed so time will be spend here on the network transfer. In your case MR would be a overhead as your file is non splittable hence no parallelism and also there is an overhead of copying blocks to the map task node. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Matthias Kricke matthias.mk.kri...@gmail.com Sender: matthias.zeng...@gmail.com Date: Mon, 13 Aug 2012 16:33:06 To: user@hadoop.apache.org Reply-To: user@hadoop.apache.org Subject: Re: how to enhance job start up speed? Ok, I try to clarify: 1) The worker is the logic inside my mapper and the same for both cases. 2) I have two cases. In the first one I use hadoop to execute my worker and in a second one, I execute my worker without hadoop (simple read of the file). Now I measured, for both cases, the time the worker and the surroundings need (so i have two values for each case). The worker took the same time in both cases for the same input (this is expected). But the surroundings took 17% more time when using hadoop. 3) ~ 3GB. I want to know how to reduce this difference and where they come from. I hope that helped? If not, feel free to ask again :) Greetings, MK P.S. just for your information, I did the same test with hypertable as well. I got: * worker without anything: 15% overhead * worker with hadoop: 32% overhead * worker with hypertable: 53% overhead Remark: overhead was measured in comparison to the worker. e.g. hypertable uses 53% of the whole process time, while worker uses 47%. 2012/8/13 Bertrand Dechoux decho...@gmail.com I am not sure to understand and I guess I am not the only one. 1) What's a worker in your context? Only the logic inside your Mapper or something else? 2) You should clarify your cases. You seem to have two cases but both are in overhead so I am assuming there is a baseline? Hadoop vs sequential, so sequential is not Hadoop? 3) What are the size of the file? Bertrand On Mon, Aug 13, 2012 at 1:51 PM, Matthias Kricke matthias.mk.kri...@gmail.com wrote: Hello all, I'm using CDH3u3. If I want to process one File, set to non splitable hadoop starts one Mapper and no Reducer (thats ok for this test scenario). The Mapper goes through a configuration step where some variables for the worker inside the mapper are initialized. Now the Mapper gives me K,V-pairs, which are lines of an input file. I process the V with the worker. When I compare the run time of hadoop to the run time of the same process in sequentiell manner, I get: worker time -- same in both cases case: mapper -- overhead of ~32% to the worker process (same for bigger chunk size) case: sequentiell -- overhead of ~15% to the worker process It shouldn't be that much slower, because of non splitable, the mapper will be executed where the data is saved by HDFS, won't it? Where did those 17% go? How to reduce this? Did hadoop needs the whole time for reading or streaming the data out of HDFS? I would appreciate your help, Greetings mk -- Bertrand Dechoux
Re: Hbase JDBC API
Hi Sandeep You can have a look at HbaseStorageHandler which maps the hbase tables to hive tables . Once this mapping is done you can use the hive jdbc to query Hbase tables. See whether this hive Hbase Integration suits your requirement. https://cwiki.apache.org/confluence/display/Hive/HBaseIntegration Regards Bejoy KS
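The mapping described on that page looks roughly like the sketch below (the Hive table, HBase table and column family names are invented):

    CREATE EXTERNAL TABLE hbase_orders (key STRING, amount INT, status STRING)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,d:amount,d:status")
    TBLPROPERTIES ("hbase.table.name" = "orders");

    -- from here on, ordinary HiveQL (and hence the Hive JDBC driver) reads the HBase table
    SELECT status, count(*) FROM hbase_orders GROUP BY status;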
Re: fs.local.block.size vs file.blocksize
Hi Rahul Better to start a new thread than hijacking others. :) It helps to keep the mailing list archives clean. For learning Java, you need to get some Java books and start off. If you just want to run the wordcount example, just follow the steps in the url below http://wiki.apache.org/hadoop/WordCount To understand more details on the working, I scribbled something a while back, maybe it can help you start off http://kickstarthadoop.blogspot.in/2011/04/word-count-hadoop-map-reduce-example.html Regards Bejoy KS
Re: Problem with hadoop filesystem after restart cluster
Hi Andy Is your hadoop.tmp.dir or dfs.name.dir configured to /tmp? If so it can happen as /tmp dir gets wiped out on OS restarts Regards Bejoy KS
Re: Reading fields from a Text line
That is a good pointer Harsh. Thanks a lot. But if IdentityMapper is being used shouldn't the job.xml reflect that? But Job.xml always shows mapper as our CustomMapper. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Harsh J ha...@cloudera.com Date: Fri, 3 Aug 2012 13:02:32 To: mapreduce-user@hadoop.apache.org Reply-To: mapreduce-user@hadoop.apache.org Cc: Mohammad Tariqdonta...@gmail.com Subject: Re: Reading fields from a Text line That is not really a bug. Only if you use @Override will you be really asserting that you've overriden the right method (since new API uses inheritance instead of interfaces). Without that kinda check, its easy to make mistakes and add in methods that won't get considered by the framework (and hence the default IdentityMapper comes into play). Always use @Override annotations when inheriting and overriding methods. On Fri, Aug 3, 2012 at 4:41 AM, Bejoy Ks bejoy.had...@gmail.com wrote: Hi Tariq On further analysis I noticed a odd behavior in this context. If we use the default InputFormat (TextInputFormat) but specify the Key type in mapper as IntWritable instead of Long Writable. The framework is supposed throw a class cast exception.Such an exception is thrown only if the key types at class level and method level are the same (IntWritable) in Mapper. But if we provide the Input key type as IntWritable on the class level but LongWritable on the method level (map method), instead of throwing a compile time error, the code compliles fine . In addition to it on execution the framework triggers Identity Mapper instead of the custom mapper provided with the configuration. This seems like a bug to me . Filed a jira to track this issue https://issues.apache.org/jira/browse/MAPREDUCE-4507 Regards Bejoy KS -- Harsh J
Re: Reading fields from a Text line
Ok Got it now. That is a good piece of information. Thank You :) Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Harsh J ha...@cloudera.com Date: Fri, 3 Aug 2012 16:28:27 To: mapreduce-user@hadoop.apache.org; bejoy.had...@gmail.com Cc: Mohammad Tariqdonta...@gmail.com Subject: Re: Reading fields from a Text line Bejoy, In the new API, the default map() function, if not properly overridden, is the identity map function. There is no IdentityMapper class in the new API, the Mapper class itself is identity by default. On Fri, Aug 3, 2012 at 1:07 PM, Bejoy KS bejoy.had...@gmail.com wrote: That is a good pointer Harsh. Thanks a lot. But if IdentityMapper is being used shouldn't the job.xml reflect that? But Job.xml always shows mapper as our CustomMapper. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Harsh J ha...@cloudera.com Date: Fri, 3 Aug 2012 13:02:32 To: mapreduce-user@hadoop.apache.org Reply-To: mapreduce-user@hadoop.apache.org Cc: Mohammad Tariqdonta...@gmail.com Subject: Re: Reading fields from a Text line That is not really a bug. Only if you use @Override will you be really asserting that you've overriden the right method (since new API uses inheritance instead of interfaces). Without that kinda check, its easy to make mistakes and add in methods that won't get considered by the framework (and hence the default IdentityMapper comes into play). Always use @Override annotations when inheriting and overriding methods. On Fri, Aug 3, 2012 at 4:41 AM, Bejoy Ks bejoy.had...@gmail.com wrote: Hi Tariq On further analysis I noticed a odd behavior in this context. If we use the default InputFormat (TextInputFormat) but specify the Key type in mapper as IntWritable instead of Long Writable. The framework is supposed throw a class cast exception.Such an exception is thrown only if the key types at class level and method level are the same (IntWritable) in Mapper. But if we provide the Input key type as IntWritable on the class level but LongWritable on the method level (map method), instead of throwing a compile time error, the code compliles fine . In addition to it on execution the framework triggers Identity Mapper instead of the custom mapper provided with the configuration. This seems like a bug to me . Filed a jira to track this issue https://issues.apache.org/jira/browse/MAPREDUCE-4507 Regards Bejoy KS -- Harsh J -- Harsh J
Re: Reading fields from a Text line
Hi Tariq I assume the mapper being used is IdentityMapper instead of XPTMapper class. Can you share your main class? If you are using TextInputFormat an reading from a file in hdfs, it should have LongWritable Keys as input and your code has IntWritable as the input key type. Have a check on that as well. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Mohammad Tariq donta...@gmail.com Date: Thu, 2 Aug 2012 15:48:42 To: mapreduce-user@hadoop.apache.org Reply-To: mapreduce-user@hadoop.apache.org Subject: Re: Reading fields from a Text line Thanks for the response Harsh n Sri. Actually, I was trying to prepare a template for my application using which I was trying to read one line at a time, extract the first field from it and emit that extracted value from the mapper. I have these few lines of code for that : public static class XPTMapper extends MapperIntWritable, Text, LongWritable, Text{ public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{ Text word = new Text(); String line = value.toString(); if (!line.startsWith(TT)){ context.setStatus(INVALID LINE..SKIPPING); }else{ String stdid = line.substring(0, 7); word.set(stdid); context.write(key, word); } } But the output file contains all the rows of the input file including the lines which I was expecting to get skipped. Also, I was expecting only the fields I am emitting but the file contains entire lines. Could you guys please point out the the mistake I might have made. (Pardon my ignorance, as I am not very good at MapReduce).Many thanks. Regards, Mohammad Tariq On Thu, Aug 2, 2012 at 10:58 AM, Sriram Ramachandrasekaran sri.ram...@gmail.com wrote: Wouldn't it be better if you could skip those unwanted lines upfront(preprocess) and have a file which is ready to be processed by the MR system? In any case, more details are needed. On Thu, Aug 2, 2012 at 8:23 AM, Harsh J ha...@cloudera.com wrote: Mohammad, But it seems I am not doing things in correct way. Need some guidance. What do you mean by the above? What is your written code exactly expected to do and what is it not doing? Perhaps since you ask for a code question here, can you share it with us (pastebin or gists, etc.)? For skipping 8 lines, if you are using splits, you need to detect within the mapper or your record reader if the map task filesplit has an offset of 0 and skip 8 line reads if so (Cause its the first split of some file). On Thu, Aug 2, 2012 at 1:54 AM, Mohammad Tariq donta...@gmail.com wrote: Hello list, I have a flat file in which data is stored as lines of 107 bytes each. I need to skip the first 8 lines(as they don't contain any valuable info). Thereafter, I have to read each line and extract the information from them, but not the line as a whole. Each line is composed of several fields without any delimiter between them. For example, the first field is of 8 bytes, second of 2 bytes and so on. I was trying to reach each line as a Text value, convert it into string and using String.subring() method to extract the value of each field. But it seems I am not doing things in correct way. Need some guidance. Many thanks. Regards, Mohammad Tariq -- Harsh J -- It's just about how deep your longing is!
Re: Reading fields from a Text line
Hi Tariq Again I strongly suspect the IdentityMapper is in play here. The reasoning: when the whole input data turns up unchanged in the output file, it is usually the IdentityMapper at work. Due to the mismatch in input key type at the class level and method level the framework is falling back to IdentityMapper. I have noticed this fall back while using the new mapreduce API. public static class XPTMapper extends Mapper<IntWritable, Text, LongWritable, Text> { public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { When you change the input key type to LongWritable at the class level, it is your custom mapper (XPTMapper) being called. Because of some exceptional cases it is just going into the if branch where you are not writing anything out of the mapper, and hence an empty output file. public static class XPTMapper extends Mapper<LongWritable, Text, LongWritable, Text> { public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { To cross check this, try enabling some logging in your code to see exactly what is happening. By the way, are you getting the output of this line in your logs when you change the input key type to LongWritable? context.setStatus("INVALID LINE..SKIPPING"); If so that confirms my assumption. :) Try adding more logs to trace the flow and see what is going wrong. Or you can use MRUnit to unit test your code as the first step. Hope it helps!.. Regards Bejoy KS
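For completeness, a corrected sketch of the mapper being discussed (still the thread's hypothetical XPTMapper and field layout), with matching key types and @Override so the framework cannot silently fall back to the identity map:

    public static class XPTMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
      private final Text word = new Text();

      @Override
      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String line = value.toString();
        if (!line.startsWith("TT")) {
          context.setStatus("INVALID LINE..SKIPPING");
          return; // skip the record instead of falling through
        }
        word.set(line.substring(0, 7)); // first field, as in the original snippet
        context.write(key, word);
      }
    }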
Re: All reducers are not being utilized
Hi Saurab/Steve From my understanding the schedulers in hadoop consider only data locality (for map tasks) and the availability of slots when scheduling tasks on the various nodes. Say you have 3 TT nodes with 2 reducer slots each (assume all slots are free). If we execute a mapreduce job with 3 reduce tasks there is no guarantee that one task will be scheduled on each node. It can well be 2 on one node and 1 on another. Regards Bejoy KS
Re: DBOutputWriter timing out writing to database
Hi Nathan Alternatively you can have a look at Sqoop , which offers efficient data transfers between rdbms and hdfs. Regards Bejoy KS
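A hedged example of pushing the job output into the database with Sqoop instead of DBOutputFormat (the connection string, table and paths are placeholders):

    sqoop export \
      --connect jdbc:mysql://dbhost/reports \
      --username etl -P \
      --table daily_counts \
      --export-dir /user/nathan/job_output \
      --input-fields-terminated-by '\t'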
Re: Reading fields from a Text line
Hi Tariq On further analysis I noticed an odd behavior in this context. If we use the default InputFormat (TextInputFormat) but specify the key type in the mapper as IntWritable instead of LongWritable, the framework is supposed to throw a class cast exception. Such an exception is thrown only if the key types at the class level and method level are the same (IntWritable) in the Mapper. But if we provide the input key type as IntWritable at the class level but LongWritable at the method level (map method), instead of throwing a compile time error, the code compiles fine. In addition, on execution the framework triggers the IdentityMapper instead of the custom mapper provided with the configuration. This seems like a bug to me. Filed a jira to track this issue https://issues.apache.org/jira/browse/MAPREDUCE-4507 Regards Bejoy KS
Re: java.io.IOException: Type mismatch in key from map: expected org.apache.hadoop.io.LongWritable, recieved org.apache.hadoop.io.Text
Hi Harit You need to set the key type as well. If you are using different data types for the key and value in your map output with respect to the reduce output then you need to specify both.

// setting the map output data type classes
job.setMapOutputKeyClass(LongWritable.class);
job.setMapOutputValueClass(Text.class);
// setting the final reduce output data type classes
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

Regards Bejoy KS
Re: Disable retries
Hi Marco You can disable retries by setting mapred.map.max.attempts and mapred.reduce.max.attempts to 1. Also if you need to disable speculative execution. You can disable it by setting mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution to false. With these two steps you can ensure that a task is attempted only once. These properties to be set in mapred-site.xml or at job level. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Marco Gallotta ma...@gallotta.co.za Date: Thu, 2 Aug 2012 16:52:00 To: common-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: Disable retries Hi there Is there a way to disable retries when a mapper/reducer fails? I'm writing data in my mapper and I'd rather catch the failure, recover from a backup (fairly lightweight in this case, as the output tables aren't big) and restart. -- Marco Gallotta | Mountain View, California Software Engineer, Infrastructure | Loki Studios fb.me/marco.gallotta | twitter.com/marcog ma...@gallotta.co.za | +1 (650) 417-3313 Sent with Sparrow (http://www.sparrowmailapp.com/?sig)
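At the job level, the same settings can be applied in the driver before submission; a sketch assuming 'job' is an already created org.apache.hadoop.mapreduce.Job:

    Configuration conf = job.getConfiguration();
    conf.setInt("mapred.map.max.attempts", 1);
    conf.setInt("mapred.reduce.max.attempts", 1);
    conf.setBoolean("mapred.map.tasks.speculative.execution", false);
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);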
Re: Merge Reducers Output
Hi Why not use 'hadoop fs -getMerge outputFolderInHdfs targetFileNameInLfs' while copying files out of hdfs for the end users to consume. This will merge all the files in 'outputFolderInHdfs' into one file and put it in lfs. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Michael Segel michael_se...@hotmail.com Date: Mon, 30 Jul 2012 21:08:22 To: common-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: Re: Merge Reducers Output Why not use a combiner? On Jul 30, 2012, at 7:59 PM, Mike S wrote: Liked asked several times, I need to merge my reducers output files. Imagine I have many reducers which will generate 200 files. Now to merge them together, I have written another map reduce job where each mapper read a complete file in full in memory, and output that and then only one reducer has to merge them together. To do so, I had to write a custom fileinputreader that reads the complete file into memory and then another custom fileoutputfileformat to append the each reducer item bytes together. this how my mapper and reducers looks like public static class MapClass extends MapperNullWritable, BytesWritable, IntWritable, BytesWritable { @Override public void map(NullWritable key, BytesWritable value, Context context) throws IOException, InterruptedException { context.write(key, value); } } public static class Reduce extends ReducerNullWritable, BytesWritable, NullWritable, BytesWritable { @Override public void reduce(NullWritable key, IterableBytesWritable values, Context context) throws IOException, InterruptedException { for (BytesWritable value : values) { context.write(NullWritable.get(), value); } } } I still have to have one reducers and that is a bottle neck. Please note that I must do this merging as the users of my MR job are outside my hadoop environment and the result as one file. Is there better way to merge reducers output files?
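Note the shell command is all lowercase; usage is roughly as below (paths are examples):

    # merge every part file under the job output dir into one local file
    hadoop fs -getmerge /user/mike/job_output /local/reports/result.txt

    # if the single file must live back in HDFS for the downstream users
    hadoop fs -put /local/reports/result.txt /user/mike/merged/result.txt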
Re: Error reading task output
Hi Ben This error happens when the mapreduce job spawns more processes than the underlying OS allows. You need to increase the nproc value if it is still the default one. You can get the current value on Linux using 'ulimit -u'; the default is 1024 I guess. Check that for the user that runs the mapreduce jobs; for a non security enabled cluster it is mapred. You need to increase this to a large value by raising the nproc limits for that user in /etc/security/limits.conf (see the sketch below). If you are running on a security enabled cluster, this value should be raised for the user who submits the job. Regards Bejoy KS
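The entries go into /etc/security/limits.conf on the nodes running the tasks; a sketch assuming the mapred user and a limit of 32000 (pick a value appropriate for the cluster):

    # /etc/security/limits.conf
    mapred    soft    nproc    32000
    mapred    hard    nproc    32000

    # verify after the user logs in again / the daemons are restarted
    su - mapred -c 'ulimit -u'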
Re: Hadoop 1.0.3 start-daemon.sh doesn't start all the expected daemons
Hi Dinesh Try using $HADOOP_HOME/bin/start-all.sh . It starts all the hadoop daemons including TT and DN. Regards Bejoy KS
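For a single-node setup, a quick check after starting is that jps lists all five daemons (output abbreviated):

    $HADOOP_HOME/bin/start-all.sh
    jps
    # expected, roughly: NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker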
Re: Retrying connect to server: localhost/127.0.0.1:9000.
Hi Keith Your NameNode is not up still. What does the NN logs say? Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: anil gupta anilgupt...@gmail.com Date: Fri, 27 Jul 2012 11:30:57 To: common-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: Re: Retrying connect to server: localhost/127.0.0.1:9000. Hi Keith, Does ping to localhost returns a reply? Try telneting to localhost 9000. Thanks, Anil On Fri, Jul 27, 2012 at 11:22 AM, Keith Wiley kwi...@keithwiley.com wrote: I'm plagued with this error: Retrying connect to server: localhost/127.0.0.1:9000. I'm trying to set up hadoop on a new machine, just a basic pseudo-distributed setup. I've done this quite a few times on other machines, but this time I'm kinda stuck. I formatted the namenode without obvious errors and ran start-all.sh with no errors to stdout. However, the logs are full of that error above and if I attempt to access hdfs (ala hadoop fs -ls /) I get that error again. Obviously, my core-site.xml sets fs.default.name to hdfs://localhost:9000. I assume something is wrong with /etc/hosts, but I'm not sure how to fix it. If hostname returns X and hostname -f returns Y, then what are the corresponding entries in /etc/hosts? Thanks for any help. Keith Wiley kwi...@keithwiley.com keithwiley.com music.keithwiley.com I used to be with it, but then they changed what it was. Now, what I'm with isn't it, and what's it seems weird and scary to me. -- Abe (Grandpa) Simpson -- Thanks Regards, Anil Gupta
Re: KeyValueTextInputFormat absent in hadoop-0.20.205
Hi Tariq KeyValueTextInputFormat is available from the hadoop 1.0.1 version onwards for the new mapreduce API: http://hadoop.apache.org/common/docs/r1.0.1/api/org/apache/hadoop/mapreduce/lib/input/KeyValueTextInputFormat.html Regards Bejoy KS On Wed, Jul 25, 2012 at 8:07 PM, Mohammad Tariq donta...@gmail.com wrote: Hello list, I am trying to run a small MapReduce job that uses KeyValueTextInputFormat with the new API (hadoop-0.20.205.0), but it seems KeyValueTextInputFormat is not included in the new API. Am I correct??? Regards, Mohammad Tariq
Re: Unexpected end of input stream (GZ)
Hi Oleg From the job tracker page, you can get to the failed tasks and see which file split was processed by each task. The split information is available under the status column for each task. The file split information is not available in job history. Regards Bejoy KS On Tue, Jul 24, 2012 at 1:49 PM, Oleg Ruchovets oruchov...@gmail.com wrote: Hi, I got such an exception running a hadoop job:

java.io.EOFException: Unexpected end of input stream
at org.apache.hadoop.io.compress.DecompressorStream.getCompressedData(DecompressorStream.java:99)
at org.apache.hadoop.io.compress.DecompressorStream.decompress(DecompressorStream.java:87)
at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:75)
at java.io.InputStream.read(InputStream.java:85)
at org.apache.hadoop.util.LineReader.readDefaultLine(LineReader.java:205)
at org.apache.hadoop.util.LineReader.readLine(LineReader.java:169)
at org.apache.hadoop.mapreduce.lib.input.LineRecordReader.nextKeyValue(LineRecordReader.java:114)
at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:456)
at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
at org.apache.hadoop.mapred.Child$4.run(Child.

As I understood, some of my files are corrupted (I am working with GZ format). I resolved the issue using conf.set("mapred.max.map.failures.percent", "1"), but I don't know which file caused the problem. Question: How can I get the filename which is corrupted? Thanks in advance Oleg.
Re: fail and kill all tasks without killing job.
Hi Jay Did you try 'hadoop job -kill-task <task-attempt-id>'? And is that not working as desired? Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: jay vyas jayunit...@gmail.com Date: Fri, 20 Jul 2012 17:17:58 To: common-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: fail and kill all tasks without killing job. Hi guys: I want my tasks to end/fail, but I don't want to kill my entire hadoop job. I have a hadoop job that runs 5 hadoop jobs in a row. I'm on the last of those sub-jobs, and want to fail all tasks so that the task tracker stops delegating them, and the hadoop main job can naturally come to a close. However, when I run hadoop job kill-attempt / fail-attempt, the jobtracker seems to simply relaunch the same tasks with new ids. How can I tell the jobtracker to give up on redelegating?
Re: NameNode fails
Hi Yogesh Is your dfs.name.dir pointing to the /tmp dir? If so, try changing that to any other dir. The contents of /tmp may get wiped out on OS restarts. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: yogesh.kuma...@wipro.com Date: Fri, 20 Jul 2012 06:20:02 To: hdfs-user@hadoop.apache.org Reply-To: hdfs-user@hadoop.apache.org Subject: NameNode fails Hello All :-), I am new to Hdfs. I have installed a single node hdfs and started all nodes; every node gets started and works fine. But when I shut down my system or restart it and then try to start all the nodes again, the Namenode doesn't start. To start it I need to format the namenode, and all data gets washed off :-(. Please help me and suggest how I can recover the namenode from the secondary namenode on a single node setup. Thanks Regards Yogesh Kumar Dhari
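For concreteness, the change suggested above is an hdfs-site.xml override along these lines; the path is a placeholder, and dfs.name.dir / dfs.data.dir are the Hadoop 1.x property names. The namenode should be stopped and the contents of the old name directory copied over (or the namenode re-formatted on a throwaway cluster) before restarting with the new location:

<property>
  <name>dfs.name.dir</name>
  <value>/data/hadoop/dfs/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/data/hadoop/dfs/data</value>
</property>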
Re: Hadoop filesystem directories not visible
Hi Saniya In hdfs the directory exists only as meta data in the name node. There is no real hierarchical existence like normal file system. It is the data in the files that is stored as hdfs blocks distributed across data nodes. You see these hdfs blocks arranged in dfs.data.dir . Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Yuvrajsinh Chauhan yuvraj.chau...@elitecore.com Date: Thu, 19 Jul 2012 15:16:24 To: hdfs-user@hadoop.apache.org Reply-To: hdfs-user@hadoop.apache.org Subject: RE: Hadoop filesystem directories not visible Dear Saniya, I Second to you on this. Am also find exactly the same folder on secondary data node. Also, How can I write files from my external application ? Regards, Yuvrajsinh Chauhan || Sr. DBA || CRESTEL-PSG Elitecore Technologies Pvt. Ltd. 904, Silicon Tower || Off C.G.Road Behind Pariseema Building || Ahmedabad || INDIA [GSM]: +91 9727746022 From: Saniya Khalsa [mailto:saniya.kha...@gmail.com] Sent: 19 July 2012 14:58 To: hdfs-user@hadoop.apache.org Subject: Re: Hadoop filesystem directories not visible Hi Mohammad Tariq, Thanks for the reply!! The path to dfs.data.dir is /app/hadoop/tmp/dfs/data when i go there i find only these : BlocksBeingWriiten Current Detach In_use.lock storage tmp I am unable to see the created directories here. Regards Saniya On Thu, Jul 19, 2012 at 2:39 PM, Mohammad Tariq donta...@gmail.com wrote: Hello Saniya, If you are talking about the local FS, then it will be present at the location specified as the value of 'dfs.data.dir' property in hdfs-site.xml file. Regards, Mohammad Tariq On Thu, Jul 19, 2012 at 1:09 PM, Saniya Khalsa saniya.kha...@gmail.com wrote: Hi, I ran these commands $HADOOP_HOME/bin/hadoop fs -mkdir /tmp $HADOOP_HOME/bin/hadoop fs -mkdir /user The directories got created and I can now see the directories using following commands: [hadoop@master bin]$ ./hadoop fs -ls / Found 5 items drwxr-xr-x - hadoop supergroup 0 2012-07-16 14:11 /app drwxr-xr-x - hadoop supergroup 0 2012-07-17 17:41 /hadoop drwxr-xr-x - hadoop supergroup 0 2012-07-18 14:11 /hbase drwxr-xr-x - hadoop supergroup 0 2012-07-19 14:11 /tmp drwxr-xr-x - hadoop supergroup 0 2012-07-19 17:41 /user I can see this data from both the nodes by typing the command ,but i cannot view the directories created in the file path anywhere.Please tell me how to see these directories created in file system. Thanks
Re: Hadoop filesystem directories not visible
This can be a good reference to start with: http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample On Thu, Jul 19, 2012 at 3:42 PM, Mohammad Tariq donta...@gmail.com wrote: Hi Yuvraj, Yes. The starting point for the Hadoop file API is the 'FileSystem' class. Hadoop's FileSystem provides us with the FSDataInputStream and FSDataOutputStream classes for reading and writing files. Regards, Mohammad Tariq On Thu, Jul 19, 2012 at 3:31 PM, Yuvrajsinh Chauhan yuvraj.chau...@elitecore.com wrote: So, I understand that if I want to write a file, I need to change the code of my external application to integrate the Hadoop read/write commands/API. Regards, Yuvrajsinh Chauhan From: Saniya Khalsa [mailto:saniya.kha...@gmail.com] Sent: 19 July 2012 15:27 To: hdfs-user@hadoop.apache.org; bejoy.had...@gmail.com Subject: Re: Hadoop filesystem directories not visible Thanks Bejoy!!
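The wiki page linked above covers exactly this; as a condensed sketch of what writing and reading an HDFS file through the FileSystem API looks like (the path and content are placeholders):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml / hdfs-site.xml from the classpath
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/hadoop/demo.txt");

        // Write a file into HDFS
        FSDataOutputStream out = fs.create(file, true);
        out.writeBytes("hello hdfs\n");
        out.close();

        // Read it back
        FSDataInputStream in = fs.open(file);
        BufferedReader reader = new BufferedReader(new InputStreamReader(in));
        System.out.println(reader.readLine());
        reader.close();
    }
}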
Re: Loading data in hdfs
Hi Prabhjot Yes, just use the filesystem commands: hadoop fs -copyFromLocal <local src path> <destn hdfs path> Regards Bejoy KS On Thu, Jul 19, 2012 at 3:49 PM, iwannaplay games funnlearnfork...@gmail.com wrote: Hi, I am unable to use sqoop and want to load data into hdfs for testing. Is there any way by which I can load my csv or text file into the hadoop file system directly without writing code in java? Regards Prabhjot
Re: Jobs randomly not starting
Hi Robert It could be because there are no free slots available in your cluster during job submission time to launch those tasks. Some other tasks may have already occupied the map/reduce slots. When you experience this random issue please verify whether there are free task slots available. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Robert Dyer psyb...@gmail.com Date: Thu, 12 Jul 2012 23:03:02 To: mapreduce-user@hadoop.apache.org Reply-To: mapreduce-user@hadoop.apache.org Subject: Jobs randomly not starting I'm using Hadoop 1.0.3 on a small cluster (1 namenode, 1 jobtracker, 2 compute nodes). My input size is a sequence file of around 280mb. Generally, my jobs run just fine and all finish in 2-5 minutes. However, quite randomly the jobs refuse to run. They submit and appear when running 'hadoop job -list' but don't appear on the jobtracker's webpage. If I manually type in the job ID on the webpage I can see it is trying to run the setup task - the map tasks haven't even started. I've left them to run and even after several minutes it is still in this state. When I spot this, I kill the job and resubmit it and generally it works. A couple of times I have seen similar problems with reduce tasks that get stuck while 'initializing'. Any ideas?
Re: Error using MultipleInputs
Hi Sanchita Try your code after commenting out the following line of code:

//conf.setInputFormat(TextInputFormat.class);

AFAIK this explicitly sets the input format to TextInputFormat instead of the delegating input format configured by MultipleInputs, and hence the job submission fails with 'No input paths specified in job'. Regards Bejoy KS On Thu, Jul 5, 2012 at 5:19 PM, Sanchita Adhya sad...@infocepts.com wrote: Hi, I am using cloudera's hadoop version - Hadoop 0.20.2-cdh3u3 and trying to use MultipleInputs, incorporating a separate mapper class per input, in the following manner:

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(IntegrateExisting.class);
    conf.setJobName("IntegrateExisting");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);

    Path existingKeysInputPath = new Path(args[0]);
    Path newKeysInputPath = new Path(args[1]);
    Path outputPath = new Path(args[2]);

    MultipleInputs.addInputPath(conf, existingKeysInputPath, TextInputFormat.class, MapExisting.class);
    MultipleInputs.addInputPath(conf, newKeysInputPath, TextInputFormat.class, MapNew.class);

    conf.setCombinerClass(ReduceAndFilterOut.class);
    conf.setReducerClass(ReduceAndFilterOut.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileOutputFormat.setOutputPath(conf, outputPath);
    //FileInputFormat.addInputPath(conf, existingKeysInputPath);
    //FileInputFormat.addInputPath(conf, newKeysInputPath);

    JobClient.runJob(conf);
}

Without the commented lines in the above code, the MR job fails with the following error:

12/07/05 16:59:25 ERROR security.UserGroupInformation: PriviledgedActionException as:root (auth:SIMPLE) cause:java.io.IOException: No input paths specified in job
Exception in thread "main" java.io.IOException: No input paths specified in job
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:153)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:205)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:971)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:963)
at org.apache.hadoop.mapred.JobClient.access$500(JobClient.java:170)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:880)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:833)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:833)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:807)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1242)
at org.myorg.IntegrateExisting.main(IntegrateExisting.java:122)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:197)

Uncommenting the lines leads to the following error in the mappers:

java.lang.ClassCastException: org.apache.hadoop.mapred.FileSplit cannot be cast to org.apache.hadoop.mapred.lib.TaggedInputSplit
at org.apache.hadoop.mapred.lib.DelegatingMapper.map(DelegatingMapper.java:48)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:391)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:325)
at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157)
at org.apache.hadoop.mapred.Child.main(Child.java:264)

I see that MAPREDUCE-1178, which discusses the second error, is included in the CDH3 version. Is there any code missing from the above piece? Thanks for the help. Regards, Sanchita
Re: Hive/Hdfs Connector
Hi Sandeep You can connect to hdfs from a remote machine if that machine is reachable from the cluster and you have the hadoop jars and the right hadoop configuration files. Similarly, you can issue HQL programmatically from your application using the hive jdbc driver. --Original Message-- From: Sandeep Reddy P To: common-user@hadoop.apache.org To: cdh-u...@cloudera.org Cc: t...@cloudwick.com ReplyTo: common-user@hadoop.apache.org Subject: Hive/Hdfs Connector Sent: Jul 5, 2012 20:32 Hi, We have an application which generates SQL queries and connects to an RDBMS using connectors like JDBC for mysql. Now if we generate HQL using our application, is there any way to connect to Hive/Hdfs using connectors? I need help on what connectors I have to use. We don't want to pull data from Hive/Hdfs to the RDBMS; instead we need our application to connect to Hive/Hdfs. -- Thanks, sandeep Regards Bejoy KS Sent from handheld, please excuse typos.
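The JDBC route mentioned above looks much like any other JDBC client. A sketch against the HiveServer1-style driver of that era; the host, port and table are placeholders, and the standalone Hive server is assumed to be running (e.g. started with 'hive --service hiveserver'):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer1 driver class shipped with Hive in this timeframe
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection("jdbc:hive://hiveserver-host:10000/default", "", "");
        Statement stmt = con.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT * FROM some_table LIMIT 10");
        while (rs.next()) {
            System.out.println(rs.getString(1));
        }
        con.close();
    }
}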
Re: change hdfs block size for file existing on HDFS
Hi Anurag, The easiest option would be to set dfs.block.size to 128 MB in your map reduce job. --Original Message-- From: Anurag Tangri To: hdfs-u...@hadoop.apache.org To: common-user@hadoop.apache.org ReplyTo: common-user@hadoop.apache.org Subject: change hdfs block size for file existing on HDFS Sent: Jun 26, 2012 11:07 Hi, We have a situation where all the files that we have are of 64 MB block size. I want to change these files (output of a map job mainly) to 128 MB blocks. What would be a good way to do this migration from 64 MB to 128 MB block files? Thanks, Anurag Tangri Regards Bejoy KS Sent from handheld, please excuse typos.
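Spelled out, the suggestion above is to set dfs.block.size (in bytes) on the job that rewrites the data; a sketch of the relevant driver lines, with the job name as a placeholder and the rest of the copy job elided:

// In the driver of the job that rewrites the data:
Configuration conf = new Configuration();
conf.setLong("dfs.block.size", 134217728L);  // 128 MB in bytes; applies to files this job writes
Job job = new Job(conf, "rewrite with 128MB blocks");
// ... mapper, input/output paths etc. stay as in the original copy job ...

For a plain copy, the same override can also be passed to distcp as a generic option (e.g. hadoop distcp -Ddfs.block.size=134217728 <src> <dst>), since block size is decided by the client writing the file.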
Re: change hdfs block size for file existing on HDFS
Hi Anurag, To add on, you can also change the replication of existing files with 'hadoop fs -setrep': http://hadoop.apache.org/common/docs/r0.20.0/hdfs_shell.html#setrep On Tue, Jun 26, 2012 at 7:42 PM, Bejoy KS bejoy.had...@gmail.com wrote: Hi Anurag, The easiest option would be to set dfs.block.size to 128 MB in your map reduce job. --Original Message-- From: Anurag Tangri To: hdfs-u...@hadoop.apache.org To: common-user@hadoop.apache.org ReplyTo: common-user@hadoop.apache.org Subject: change hdfs block size for file existing on HDFS Sent: Jun 26, 2012 11:07 Hi, We have a situation where all the files that we have are of 64 MB block size. I want to change these files (output of a map job mainly) to 128 MB blocks. What would be a good way to do this migration from 64 MB to 128 MB block files? Thanks, Anurag Tangri Regards Bejoy KS Sent from handheld, please excuse typos.
Re: Streaming in mapreduce
Hi Pedro In simple terms, the Streaming API is used in hadoop when your mapper or reducer is written in a language other than java, say ruby or python. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Pedro Costa psdc1...@gmail.com Date: Sat, 16 Jun 2012 10:23:20 To: mapreduce-user@hadoop.apache.org Reply-To: mapreduce-user@hadoop.apache.org Subject: Re: Streaming in mapreduce I still don't get why hadoop streaming is useful. If I have map and reduce functions defined in shell scripts, like the ones below, why should I use Hadoop? cat someInputFile | shellMapper.sh | shellReducer.sh > someOutputFile On 16/06/2012, at 01:21, Ruslan Al-Fakikh metarus...@gmail.com wrote: Hi Pedro, You can find it here http://wiki.apache.org/hadoop/HadoopStreaming Thanks On Sat, Jun 16, 2012 at 2:46 AM, Pedro Costa psdc1...@gmail.com wrote: Hi, Hadoop mapreduce can be used for streaming. But what is streaming from the point of view of mapreduce? For me, streaming means video and audio data. Why does mapreduce support streaming? Can anyone give me an example of why to use streaming in mapreduce? Thanks, Pedro
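For a concrete picture of what the Streaming jar buys over the plain shell pipeline (the cluster runs many copies of the scripts in parallel on HDFS splits, with the shuffle and sort between them), an invocation looks roughly like this; the input/output paths and script names are placeholders, and the exact jar path and file name vary by release (for a 1.0.x tarball it typically sits under contrib/streaming):

hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input /user/hadoop/input \
    -output /user/hadoop/output \
    -mapper shellMapper.sh \
    -reducer shellReducer.sh \
    -file shellMapper.sh \
    -file shellReducer.sh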
Re: Setting number of mappers according to number of TextInput lines
Hi Ondrej You can use NLineInputFormat with n set to 10. --Original Message-- From: Ondřej Klimpera To: common-user@hadoop.apache.org ReplyTo: common-user@hadoop.apache.org Subject: Setting number of mappers according to number of TextInput lines Sent: Jun 16, 2012 14:31 Hello, I have a very small input size (kB), but processing it to produce some output takes several minutes. Is there a way to say: the file has 100 lines, I need 10 mappers, where each mapper node has to process 10 lines of the input file? Thanks for advice. Ondrej Klimpera Regards Bejoy KS Sent from handheld, please excuse typos.
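With the old (mapred) API this is a two-line change in the driver; a sketch, assuming the rest of the job setup stays unchanged:

// In the driver (old mapred API):
conf.setInputFormat(org.apache.hadoop.mapred.lib.NLineInputFormat.class);
// each map task gets 10 lines of the input file
conf.setInt("mapred.line.input.format.linespermap", 10);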
Re: [Newbie] How to make Multi Node Cluster from Single Node Cluster
You can follow the documents for 0.20.x. It is almost the same for 1.0.x as well. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Alpha Bagus Sunggono bagusa...@gmail.com Date: Thu, 14 Jun 2012 17:15:16 To: common-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: Re: [Newbie] How to make Multi Node Cluster from Single Node Cluster Hello ramon, as newbie as I am. 2012/6/14 ramon@accenture.com At Newbie level just the same. -Original Message- From: Alpha Bagus Sunggono [mailto:bagusa...@gmail.com] Sent: jueves, 14 de junio de 2012 12:01 To: common-user@hadoop.apache.org Subject: [Newbie] How to make Multi Node Cluster from Single Node Cluster Dear All. I've been configuring 3 servers using Hadoop 1.0.x, each as a single node. How do I assemble them into one multi node cluster? When I search for documentation, I've only found configuration for Hadoop 0.20.x. Would you mind assisting me? -- Alpha Bagus Sunggono, CBSP (Certified Brownies Solution Provider)
Re: Map/Reduce | Multiple node configuration
Hi Girish Let me try answering your queries. 1. For multiple nodes I understand I should add the URL of the secondary nodes in the slaves.xml. Am I correct? Bejoy: AFAIK you need to add it in /etc/hosts. 2. What should be installed on the secondary nodes for executing a job/task? Bejoy: In small clusters you have the NameNode and JobTracker on one node, the SecondaryNameNode on another node, and DataNodes and TaskTrackers on all other nodes. 3. I understand I can set the map/reduce classes as a jar to the Job - through the JobConf - so does this mean I need not really install/copy my map/reduce code on all the secondary nodes? Bejoy: There is no difference in submitting jobs as compared to a pseudo node set up. The MapReduce framework distributes the job jar and other required files. It is better to have a client node to launch jobs. 4. How do I route the data to these nodes? Is it required for the Map Reduce to execute on the machines which have the data stored (DFS)? Bejoy: The MR framework takes care of this. Map tasks consider data locality. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Girish Ravi giri...@srmtech.com Date: Tue, 12 Jun 2012 06:55:26 To: mapreduce-user@hadoop.apache.org Reply-To: mapreduce-user@hadoop.apache.org Subject: Map/Reduce | Multiple node configuration Hello Team, I have started to understand Hadoop Mapreduce and was able to set up a single cluster, single node execution environment. I want to now extend this to a multi node environment. I have the following questions and it would be very helpful if somebody can help: 1. For multiple nodes I understand I should add the URL of the secondary nodes in the slaves.xml. Am I correct? 2. What should be installed on the secondary nodes for executing a job/task? 3. I understand I can set the map/reduce classes as a jar to the Job - through the JobConf - so does this mean I need not really install/copy my map/reduce code on all the secondary nodes? 4. How do I route the data to these nodes? Is it required for the Map Reduce to execute on the machines which have the data stored (DFS)? Any samples for doing this would help. Request for suggestions. Regards Girish Ph: +91-9916212114
Re: Need logical help
Hi Girish You can achieve this using reduce-side joins. Use MultipleInputs to parse the two different sets of log files with separate mappers. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Girish Ravi giri...@srmtech.com Date: Tue, 12 Jun 2012 12:59:32 To: mapreduce-user@hadoop.apache.org Reply-To: mapreduce-user@hadoop.apache.org Subject: Need logical help Hi All, I am thinking of a condition where the data in two log files are to be compared; can I use Map-Reduce to do this? I have one log file (LOG1) which has user ID and dept ID, and another log file (LOG2) which has some rows with user ID and dept ID and other data. Can I compare the data where LOG1.userID = LOG2.userID and LOG1.deptID = LOG2.deptID? If so, any suggestion to implement the mapper for this? Regards Girish Ph: +91-9916212114
Re: Need logical help
To add on, have a look at hive and pig. Those are a perfect fit for similar use cases. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Bejoy KS bejoy.had...@gmail.com Date: Tue, 12 Jun 2012 13:04:33 To: mapreduce-user@hadoop.apache.org Reply-To: bejoy.had...@gmail.com Subject: Re: Need logical help Hi Girish You can achieve this using reduce-side joins. Use MultipleInputs to parse the two different sets of log files with separate mappers. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Girish Ravi giri...@srmtech.com Date: Tue, 12 Jun 2012 12:59:32 To: mapreduce-user@hadoop.apache.org Reply-To: mapreduce-user@hadoop.apache.org Subject: Need logical help Hi All, I am thinking of a condition where the data in two log files are to be compared; can I use Map-Reduce to do this? I have one log file (LOG1) which has user ID and dept ID, and another log file (LOG2) which has some rows with user ID and dept ID and other data. Can I compare the data where LOG1.userID = LOG2.userID and LOG1.deptID = LOG2.deptID? If so, any suggestion to implement the mapper for this? Regards Girish Ph: +91-9916212114
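For anyone looking for the shape of the reduce-side join suggested earlier in this thread, here is a minimal sketch using the old (mapred) API. The comma-separated field layout (userID,deptID[,otherData]) and all class names are assumptions made for illustration, not taken from Girish's actual logs:

import java.io.IOException;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.MultipleInputs;

public class LogJoin {

  // Mapper for LOG1 lines of the form "userID,deptID"
  public static class Log1Mapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value, OutputCollector<Text, Text> out, Reporter rep)
        throws IOException {
      String[] f = value.toString().split(",");
      out.collect(new Text(f[0] + "|" + f[1]), new Text("LOG1"));
    }
  }

  // Mapper for LOG2 lines of the form "userID,deptID,otherData"
  public static class Log2Mapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable key, Text value, OutputCollector<Text, Text> out, Reporter rep)
        throws IOException {
      String[] f = value.toString().split(",", 3);
      out.collect(new Text(f[0] + "|" + f[1]), new Text("LOG2\t" + f[2]));
    }
  }

  // A (userID, deptID) key that appears in both logs is a match
  public static class JoinReducer extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values, OutputCollector<Text, Text> out, Reporter rep)
        throws IOException {
      boolean inLog1 = false;
      List<String> log2Rows = new ArrayList<String>();
      while (values.hasNext()) {
        String v = values.next().toString();
        if (v.startsWith("LOG1")) { inLog1 = true; } else { log2Rows.add(v); }
      }
      if (inLog1) {
        for (String row : log2Rows) { out.collect(key, new Text(row)); }
      }
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(LogJoin.class);
    conf.setJobName("log join");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    MultipleInputs.addInputPath(conf, new Path(args[0]), TextInputFormat.class, Log1Mapper.class);
    MultipleInputs.addInputPath(conf, new Path(args[1]), TextInputFormat.class, Log2Mapper.class);
    conf.setReducerClass(JoinReducer.class);
    FileOutputFormat.setOutputPath(conf, new Path(args[2]));
    JobClient.runJob(conf);
  }
}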
Re: set the mapred.map.tasks.speculative.execution=false, but it is not useful.
Hi If your intention is to control the number of attempts each task makes, then the property to be tweaked is mapred.map.max.attempts. The default value is 4; for no map task re-attempts, set it to 1. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: Jagat Singh jagatsi...@gmail.com Date: Tue, 12 Jun 2012 17:13:36 To: common-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: Re: set the mapred.map.tasks.speculative.execution=false, but it is not useful. Besides speculative execution, tasks can be attempted multiple times due to failures. So you can see 3 attempts there. On Tue, Jun 12, 2012 at 5:08 PM, 林育智 mylinyu...@gmail.com wrote: hi all: I set mapred.map.tasks.speculative.execution=false, but in the userlogs you can find logs for 3 attempts of a map task. Have I missed something? Expect your help. thanks.
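As a quick sketch of where those knobs live when building the job configuration (Hadoop 1.x property names):

Configuration conf = new Configuration();
conf.setBoolean("mapred.map.tasks.speculative.execution", false); // no speculative duplicate attempts
conf.setInt("mapred.map.max.attempts", 1);                        // no re-attempts after a task failure

Keep in mind that with max attempts at 1, a single task failure fails the whole job unless the allowed failure percentage (mapred.max.map.failures.percent) is raised.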
Re: Getting filename in case of MultipleInputs
Hi Subbu, The file/split processed by a mapper could be obtained from WebUI as soon as the job is executed. However this detail can't be obtained once the job is moved to JT history. Regards Bejoy On Thu, May 3, 2012 at 6:25 PM, Kasi Subrahmanyam kasisubbu...@gmail.com wrote: Hi, Could anyone suggest how to get the filename in the mapper. I have gone through the JIRA ticket that map.input.file doesnt work in case of multiple inputs,TaggedInputSplit also doesnt work in case of 0.20.2 version as it is not a public class. I tried to find any other approach than this but i could find none in the search Could anyone suggest a solution other tan these Thanks in advance; Subbu.
Re: updating datanode config files on namenode recovery
Hi Sumadhur, The easier approach is to make the hostname of the new NN the same as the old one; otherwise you'll have to update the new one in the config files across the cluster. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: sumadhur sumadhur_i...@yahoo.com Date: Tue, 1 May 2012 16:16:14 To: common-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: updating datanode config files on namenode recovery When a name node goes down and I bring up another machine as the new name node using the backup on a shared folder, do I have to update the config files in each data node to point to the new name node and job tracker manually? Or is there some other way of doing it automatically? Thanks, Sumadhur
Re: reducers and data locality
Hi Mete A custom Partitioner class can control the flow of keys to the desired reducer. It gives you more control over which key goes to which reducer. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: mete efk...@gmail.com Date: Fri, 27 Apr 2012 09:19:21 To: common-user@hadoop.apache.org Reply-To: common-user@hadoop.apache.org Subject: reducers and data locality Hello folks, I have a lot of input splits (10k-50k - 128 mb blocks) which contain text files. I need to process those line by line, then copy the result into a number of roughly equal-sized shards. So I generate a random key (from a range of [0:numberOfShards]) which is used to route the map output to different reducers, and the sizes are more or less equal. I know that this is not really efficient and I was wondering if I could somehow control how keys are routed. For example, could I generate the random keys with hostname prefixes and control which keys are sent to each reducer? What do you think? Kind regards Mete
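A sketch of such a Partitioner with the new API; the key layout (a shard id prefix before a '|' separator) is an assumption made for illustration:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes keys of the form "shardId|rest" to the reduce partition matching shardId.
public class ShardPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        int shardId = Integer.parseInt(key.toString().split("\\|")[0]);
        return shardId % numPartitions;
    }
}

The driver registers it with job.setPartitionerClass(ShardPartitioner.class). Note that this only decides which reduce partition a key lands in, not which physical host runs that reducer; reducer placement is left to the scheduler.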
Re: Reducer not firing
Hi Arko From the naming of the output files, your job has a reduce phase. But the reducer being used is the IdentityReducer instead of your custom reducer. That is the reason you are seeing the same map output in the output files as well. You need to evaluate your code and logs to see why the IdentityReducer is being triggered. Regards Bejoy KS Sent from handheld, please excuse typos. -Original Message- From: kasi subrahmanyam kasisubbu...@gmail.com Date: Tue, 17 Apr 2012 19:10:33 To: mapreduce-user@hadoop.apache.org Reply-To: mapreduce-user@hadoop.apache.org Subject: Re: Reducer not firing Could you comment the property where you are setting the number of reducer tasks and see the behaviour of the program once? If you have already tried that, could you share the output? On Tue, Apr 17, 2012 at 3:00 PM, Devaraj k devara...@huawei.com wrote: Can you check the task attempt logs in your cluster and find out what is happening in the reduce phase? By default task attempt logs are present in $HADOOP_LOG_DIR/userlogs/job-id/. There could be some bug in your reducer which is leading to this output. Thanks Devaraj From: Arko Provo Mukherjee [arkoprovomukher...@gmail.com] Sent: Tuesday, April 17, 2012 2:07 PM To: mapreduce-user@hadoop.apache.org Subject: Re: Reducer not firing Hello, Many thanks for the reply. The 'no_of_reduce_tasks' is set to 2. I have a print statement before the code I pasted below to check that. Also I can find two output files, part-r-0 and part-r-1. But they contain the values that have been output by the Mapper logic. Please let me know what I can check further. Thanks a lot in advance! Warm regards Arko On Tue, Apr 17, 2012 at 12:48 AM, Devaraj k devara...@huawei.com wrote: Hi Arko, What is the value of 'no_of_reduce_tasks'? If the number of reduce tasks is 0, then the map task will directly write the map output into the Job output path. Thanks Devaraj From: Arko Provo Mukherjee [arkoprovomukher...@gmail.com] Sent: Tuesday, April 17, 2012 10:32 AM To: mapreduce-user@hadoop.apache.org Subject: Reducer not firing Dear All, I am porting code from the old API to the new API (Context objects) and running on Hadoop 0.20.203.

Job job_first = new Job();
job_first.setJarByClass(My.class);
job_first.setNumReduceTasks(no_of_reduce_tasks);
job_first.setJobName("My_Job");
FileInputFormat.addInputPath(job_first, new Path(Input_Path));
FileOutputFormat.setOutputPath(job_first, new Path(Output_Path));
job_first.setMapperClass(Map_First.class);
job_first.setReducerClass(Reduce_First.class);
job_first.setMapOutputKeyClass(IntWritable.class);
job_first.setMapOutputValueClass(Text.class);
job_first.setOutputKeyClass(NullWritable.class);
job_first.setOutputValueClass(Text.class);
job_first.waitForCompletion(true);

The problem I am facing is that instead of emitting values to the reducers, the mappers are directly writing their output to the OutputPath and the reducers are not processing anything. As read from the online materials that are available, both my Map and Reduce methods use the context.write method to emit the values. Please help. Thanks a lot in advance!! Warm regards Arko
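For reference, one common cause when porting to the new API (not confirmed from Arko's code, which is not fully shown above) is a reduce() method whose signature does not exactly match the one the framework calls; for instance keeping the old API's Iterator parameter instead of Iterable compiles fine but is never invoked, so the identity reducer runs instead. A sketch of the expected shape, with key/value classes chosen to match the driver quoted above and @Override added so the compiler catches such mismatches:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Reduce_First extends Reducer<IntWritable, Text, NullWritable, Text> {
    @Override   // fails to compile if the signature is wrong, instead of silently falling back to the identity reducer
    protected void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            context.write(NullWritable.get(), value);
        }
    }
}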
Re: map and reduce with different value classes
Hi Bryan You can set different key and value types with the following steps: ensure that the map output key/value types are the reducer input key/value types, and specify them in your Driver class as

//set map output key value types
job.setMapOutputKeyClass(theClass);
job.setMapOutputValueClass(theClass);
//set final/reduce output key value types
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

If both the map output and reduce output key/value types are the same, you just need to specify the final output types. Regards Bejoy KS On Tue, Apr 17, 2012 at 7:14 AM, Bryan Yeung brye...@gmail.com wrote: Hello Everyone, I'm relatively new to hadoop mapreduce and I'm trying to get this simple modification to the WordCount example to work. I'm using hadoop-1.0.2, and I've included both a convenient diff and also attached my new WordCount.java file. The thing I am trying to achieve is to have the value class that is output by the map phase be different than the value class output by the reduce phase. Any help would be greatly appreciated! Thanks, Bryan

diff --git a/WordCount.java.orig b/WordCount.java
index 81a6c21..6a768f7 100644
--- a/WordCount.java.orig
+++ b/WordCount.java
@@ -33,8 +33,8 @@ public class WordCount {
   }

   public static class IntSumReducer
-      extends Reducer<Text,IntWritable,Text,IntWritable> {
-    private IntWritable result = new IntWritable();
+      extends Reducer<Text,IntWritable,Text,Text> {
+    private Text result = new Text();

     public void reduce(Text key, Iterable<IntWritable> values,
                        Context context
@@ -43,7 +43,7 @@ public class WordCount {
       for (IntWritable val : values) {
         sum += val.get();
       }
-      result.set(sum);
+      result.set("" + sum);
       context.write(key, result);
     }
   }
@@ -58,10 +58,11 @@ public class WordCount {
     Job job = new Job(conf, "word count");
     job.setJarByClass(WordCount.class);
     job.setMapperClass(TokenizerMapper.class);
+    job.setMapOutputValueClass(IntWritable.class);
     job.setCombinerClass(IntSumReducer.class);
     job.setReducerClass(IntSumReducer.class);
     job.setOutputKeyClass(Text.class);
-    job.setOutputValueClass(IntWritable.class);
+    job.setOutputValueClass(Text.class);
     FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
     FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
     System.exit(job.waitForCompletion(true) ? 0 : 1);
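One detail worth flagging in the diff above, offered as an observation rather than a confirmed diagnosis of Bryan's failure: the job still registers IntSumReducer as the combiner, and a combiner must consume and produce the map output key/value types. Once the reducer emits Text values while the mapper emits IntWritable values, the same class can no longer serve both roles. A sketch of the relevant driver lines under that assumption:

// Map output and final output value classes now differ:
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);   // what TokenizerMapper emits
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);             // what the modified IntSumReducer emits
// job.setCombinerClass(IntSumReducer.class);    // no longer type-safe after the change;
// either drop the combiner or give it its own Reducer<Text,IntWritable,Text,IntWritable> implementation.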