Long running Yarn Applications on a secured HA cluster?
Hi,

I'm working on a project that uses Apache Flink (stream processing) on top of a secured HA Yarn cluster. The test application just uses HBase: it writes the current time into a column every minute. The problem I have is that after exactly 173.5 hours my application dies. Our best assessment right now is that the Hadoop delegation tokens are expiring. I know for sure the Kerberos tickets are correctly renewed/recreated in the cluster using my keytab file, because I had our IT-ops guys drop the max ticket life to 5 minutes and the max renew to 10 minutes.

Following what we found on these two web sites we set those settings (the current settings that seem relevant are listed below):
http://www.cloudera.com/documentation/enterprise/5-3-x/topics/cm_sg_yarn_long_jobs.html
https://forge.puppetlabs.com/cesnet/hadoop/2.1.0#long-running-applications

Yet this has not changed the situation; the job still dies after 173.5 hours with this exception:

15:47:55,283 INFO org.apache.flink.yarn.YarnJobManager - Status of job 2e4a3516d8e4876b705eaff4a52fc272 (Long running Flink application) changed to FAILING.
java.lang.Exception: Serialized representation of org.apache.hadoop.hbase.client.RetriesExhaustedWithDetailsException: Failed 1 action: FailedServerException: 1 time,
at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.makeException(AsyncProcess.java:224)
at org.apache.hadoop.hbase.client.AsyncProcess$BatchErrors.access$1700(AsyncProcess.java:204)
at org.apache.hadoop.hbase.client.AsyncProcess.waitForAllPreviousOpsAndReset(AsyncProcess.java:1597)
at org.apache.hadoop.hbase.client.HTable.backgroundFlushCommits(HTable.java:1069)
at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:1344)
at org.apache.hadoop.hbase.client.HTable.put(HTable.java:1001)
at nl.basjes.flink.experiments.SetHBaseRowSink.invoke(SetHBaseRowSink.java:58)

My colleagues and I did some searching, and these two reports seem to describe a problem similar to what we see (only these reports are about HDFS instead of HBase):

Failed to Update HDFS Delegation Token for long running application in HA mode
https://issues.apache.org/jira/browse/HDFS-9276
HDFS Delegation Token will be expired when calling "UserGroupInformation.getCurrentUser.addCredentials" in HA mode
https://issues.apache.org/jira/browse/SPARK-11182

My question to you guys is simply put: How do I fix this problem? How do I figure out what the problem really is?

Thanks for any suggestions you have for us.

Current settings:
yarn.resourcemanager.proxy-user-privileges.enabled = true (yarn-site.xml)
dfs.namenode.delegation.token.renew-interval = 8640 (hdfs-default.xml)
dfs.namenode.delegation.key.update-interval = 8640 (hdfs-default.xml)
dfs.namenode.delegation.token.max-lifetime = 60480 (hdfs-default.xml)
yarn.resourcemanager.webapp.delegation-token-auth-filter.enabled = true (yarn-default.xml)
yarn.resourcemanager.delayed.delegation-token.removal-interval-ms = 3 (yarn-default.xml)
hadoop.proxyuser.hbase.hosts = * (core-site.xml)
hadoop.proxyuser.hbase.groups = * (core-site.xml)
hadoop.proxyuser.yarn.hosts = * (core-site.xml)
hadoop.proxyuser.yarn.groups = * (core-site.xml)

-- Best regards / Met vriendelijke groeten, Niels Basjes
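For reference, a minimal sketch of the keytab-based login/relogin a long-running application can perform itself; the principal and keytab path are placeholders, and this only keeps the Kerberos TGT fresh, it does not by itself renew already-issued HDFS/HBase delegation tokens (which is what this thread suspects is expiring):

import java.io.IOException;
import org.apache.hadoop.security.UserGroupInformation;

public class KeytabRelogin {
  // Initial login from the keytab (placeholder principal and path).
  public static void login() throws IOException {
    UserGroupInformation.loginUserFromKeytab(
        "nbasjes@EXAMPLE.COM", "/etc/security/keytabs/nbasjes.keytab");
  }

  // Call this periodically (e.g. from a timer thread) in a long-running process
  // so the TGT is re-obtained before it expires.
  public static void relogin() throws IOException {
    UserGroupInformation.getLoginUser().checkTGTAndReloginFromKeytab();
  }
}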
Flink job on secure Yarn fails after many hours
Hi,

We have a Kerberos secured Yarn cluster here and I'm experimenting with Apache Flink on top of that. A few days ago I started a very simple Flink application (just stream the time as a String into HBase 10 times per second). I (deliberately) asked our IT-ops guys to make my account have a max ticket time of 5 minutes and a max renew time of 10 minutes (yes, ridiculously low timeout values, because I needed to validate this: https://issues.apache.org/jira/browse/FLINK-2977). This job is started with a keytab file and after running for 31 hours it suddenly failed with the exception you see below. I had the same job running for almost 400 hours until that failed too (I was too late to check the logfiles but I suspect the same problem). So in that time span my tickets have expired and new tickets have been obtained several hundred times. The main error I see is that in the process of a ticket expiring and being renewed I see this message: *Not retrying because the invoked method is not idempotent, and unable to determine whether it was invoked*

Yarn on the cluster is 2.6.0 (HDP 2.6.0.2.2.4.2-2), Flink is version 0.10.1.

How do I fix this? Is this a bug (in either Hadoop or Flink) or am I doing something wrong? Would upgrading Yarn to 2.7.1 (i.e. HDP 2.3) fix this?

Niels Basjes

21:30:27,821 WARN org.apache.hadoop.security.UserGroupInformation - PriviledgedActionException as:nbasjes (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): Invalid AMRMToken from appattempt_1443166961758_163901_01
21:30:27,861 WARN org.apache.hadoop.ipc.Client - Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): Invalid AMRMToken from appattempt_1443166961758_163901_01
21:30:27,861 WARN org.apache.hadoop.security.UserGroupInformation - PriviledgedActionException as:nbasjes (auth:SIMPLE) cause:org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.token.SecretManager$InvalidToken): Invalid AMRMToken from appattempt_1443166961758_163901_01
21:30:27,891 WARN org.apache.hadoop.io.retry.RetryInvocationHandler - Exception while invoking class org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate.
Not retrying because the invoked method is not idempotent, and unable to determine whether it was invoked
org.apache.hadoop.security.token.SecretManager$InvalidToken: Invalid AMRMToken from appattempt_1443166961758_163901_01
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:104)
at org.apache.hadoop.yarn.api.impl.pb.client.ApplicationMasterProtocolPBClientImpl.allocate(ApplicationMasterProtocolPBClientImpl.java:79)
at sun.reflect.GeneratedMethodAccessor3.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:186)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy14.allocate(Unknown Source)
at org.apache.hadoop.yarn.client.api.impl.AMRMClientImpl.allocate(AMRMClientImpl.java:245)
at org.apache.flink.yarn.YarnJobManager$$anonfun$handleYarnMessage$1.applyOrElse(YarnJobManager.scala:259)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:162)
at org.apache.flink.runtime.LeaderSessionMessageFilter$$anonfun$receive$1.applyOrElse(LeaderSessionMessageFilter.scala:44)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:33)
at org.apache.flink.runtime.LogMessages$$anon$1.apply(LogMessages.scala:28)
at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
at org.apache.flink.runtime.LogMessages$$anon$1.applyOrElse(LogMessages.scala:28)
at akka.actor.Actor$class.aroundReceive(Actor.scala:465
Re: Not able to run more than one map task
Just curious: what is the input for your job? If it is a single gzipped file then that is the cause of getting exactly 1 mapper. Niels

On Fri, Apr 10, 2015, 09:21 Amit Kumar amiti...@msn.com wrote: Thanks a lot Harsha for replying. This problem has wasted at least the last one week. We tried what you suggested. Could you please take a look at the configuration and suggest if we missed anything?
System RAM : 8GB
CPU : 4 threads each with 2 cores.
# Disks : 1
MR2:
mapreduce.map.memory.mb : 512
mapreduce.tasktracker.map.tasks.maximum : 4
Yarn:
yarn.app.mapreduce.am.resource.mb : 512
yarn.nodemanager.resource.cpu-vcores : 4
yarn.scheduler.minimum-allocation-mb : 512
yarn.nodemanager.resource.memory-mb : 5080
Regards, Amit

From: ha...@cloudera.com Date: Fri, 10 Apr 2015 10:20:24 +0530 Subject: Re: Not able to run more than one map task To: user@hadoop.apache.org You are likely memory/vcore starved in the NM's configs. Increase your yarn.nodemanager.resource.memory-mb and yarn.nodemanager.resource.cpu-vcores configs, or consider lowering the MR job memory request values to gain more parallelism.

On Thu, Apr 9, 2015 at 5:05 PM, Amit Kumar amiti...@msn.com wrote: Hi All, We recently started working on Hadoop. We have set up Hadoop in pseudo-distributed mode along with Oozie. Every developer has set it up on his laptop. The problem is that we are not able to run more than one map task concurrently on our laptops. The Resource Manager is not allowing more than one task on our machine. My task gets completed if I submit it without Oozie. Oozie requires one map task for its own functioning. The actual task that Oozie submits does not start. Here is my configuration:
-- Hadoop setup in pseudo-distributed mode
-- Hadoop Version - 2.6
-- Oozie Version - 4.0.1
Regards, Amit

-- Harsh J
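For what it's worth, the job-side half of Harsh's advice (lowering the MR memory requests) can be sketched as below; the NodeManager limits themselves (yarn.nodemanager.resource.memory-mb / cpu-vcores) live in yarn-site.xml and cannot be changed from the job:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SmallContainerJob {
  public static Job create() throws Exception {
    Configuration conf = new Configuration();
    // Smaller per-task and AM requests mean more containers fit into the
    // NodeManager's memory budget, so more tasks can run in parallel.
    conf.setInt("mapreduce.map.memory.mb", 512);
    conf.setInt("mapreduce.reduce.memory.mb", 512);
    conf.setInt("yarn.app.mapreduce.am.resource.mb", 512);
    return Job.getInstance(conf, "more parallel maps");
  }
}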
Re: way to add custom udf jar in hadoop 2.x version
Hi, These options: - HIVE_HOME/auxlib - http://stackoverflow.com/questions/14032924/how-to-add-serde-jar - ADD JAR commands in your $HOME/.hiverc file either require IT operations to put my JAR on all nodes OR I cannot share it, Only works on the commandline and it won't work in HUE/Beeswax. Now Permanent Functions: - https://issues.apache.org/jira/browse/HIVE-6047 - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PermanentFunctions What these Permanent Functions do is: 1) put the jar on the cluster without IT operations putting the jar on all nodes 2) the jar is used transparently for everyone who want to use the function. I am writing a deserializer [1] (Not finished yet: https://github.com/nielsbasjes/logparser/blob/master/README-Hive.md) that should make existing files query-able as an external table in Hive. Question is: Is there something similar for CREATE EXTERNAL TABLE ?? Something like CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name ... STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] [USING JAR|FILE|ARCHIVE 'file_uri' [, JAR|FILE|ARCHIVE 'file_uri'] ]; Is this something for which there is already a JIRA (couldn't find it)? If not; Should I create one? (I.e. do you think this would make sense for others?) Niels Basjes On Fri, Jan 2, 2015 at 9:00 PM, Yakubovich, Alexey alexey.yakubov...@searshc.com wrote: Try to look hr: http://stackoverflow.com/questions/14032924/how-to-add-serde-jar Another advice: insert your ADD JAR commands in your $HOME/.hiverc file and start hive. ( http://mail-archives.apache.org/mod_mbox/hive-user/201303.mbox/%3ccamgr+0h3smdw4zhtpyo5b1b4iob05bpw8ls+daeh595qzid...@mail.gmail.com%3E ) From: Ted Yu yuzhih...@gmail.com Reply-To: user@hadoop.apache.org user@hadoop.apache.org Date: Wednesday, December 31, 2014 at 8:25 AM To: d...@hive.apache.org d...@hive.apache.org Subject: Fwd: way to add custom udf jar in hadoop 2.x version Forwarding Niels' question to hive mailing list. On Wed, Dec 31, 2014 at 1:24 AM, Niels Basjes ni...@basjes.nl wrote: Thanks for the pointer. This seems to work for functions. Is there something similar for CREATE EXTERNAL TABLE ?? Niels On Dec 31, 2014 8:13 AM, Ted Yu yuzhih...@gmail.com wrote: Have you seen this thread ? http://search-hadoop.com/m/8er9TcALc/Hive+udf+custom+jarsubj=Best+way+to+add+custom+UDF+jar+in+HiveServer2 On Dec 30, 2014, at 10:56 PM, reena upadhyay reena2...@gmail.com wrote: Hi, I am using hadoop 2.4.0 version. I have created custom udf jar. I am trying to execute a simple select udf query using java hive jdbc client program. When hive execute the query using map reduce job, then the query execution get fails because the mapper is not able to locate the udf class. So I wanted to add the udf jar in hadoop environment permanently. Please suggest me a way to add this external jar for single node and multi node hadoop cluster. PS: I am using hive 0.13.1 version and I already have this custom udf jar added in HIVE_HOME/lib directory. Thanks This message, including any attachments, is the property of Sears Holdings Corporation and/or one of its subsidiaries. It is confidential and may contain proprietary or legally privileged information. If you are not the intended recipient, please delete it without reading the contents. Thank you. -- Best regards / Met vriendelijke groeten, Niels Basjes
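As an illustration of the permanent-function route (HIVE-6047), a hedged sketch of registering such a function from a Java Hive JDBC client; the JDBC URL, jar location, function name and class name are made up, and the HiveServer2 JDBC driver is assumed to be on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class RegisterPermanentUdf {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection con = DriverManager.getConnection(
            "jdbc:hive2://hiveserver2.example.com:10000/default", "hive", "");
         Statement stmt = con.createStatement()) {
      // Hive pulls the jar from HDFS; nothing needs to be installed on the
      // individual nodes by IT operations.
      stmt.execute("CREATE FUNCTION parse_loglines AS 'com.example.udf.ParseLogLines' "
          + "USING JAR 'hdfs:///apps/hive/udfs/my-udf.jar'");
    }
  }
}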
Re: way to add custom udf jar in hadoop 2.x version
I created https://issues.apache.org/jira/browse/HIVE-9252 for this improvement. On Sun, Jan 4, 2015 at 5:16 PM, Niels Basjes ni...@basjes.nl wrote: Hi, These options: - HIVE_HOME/auxlib - http://stackoverflow.com/questions/14032924/how-to-add-serde-jar - ADD JAR commands in your $HOME/.hiverc file either require IT operations to put my JAR on all nodes OR I cannot share it, Only works on the commandline and it won't work in HUE/Beeswax. Now Permanent Functions: - https://issues.apache.org/jira/browse/HIVE-6047 - https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-PermanentFunctions What these Permanent Functions do is: 1) put the jar on the cluster without IT operations putting the jar on all nodes 2) the jar is used transparently for everyone who want to use the function. I am writing a deserializer [1] (Not finished yet: https://github.com/nielsbasjes/logparser/blob/master/README-Hive.md) that should make existing files query-able as an external table in Hive. Question is: Is there something similar for CREATE EXTERNAL TABLE ?? Something like CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name ... STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] [USING JAR|FILE|ARCHIVE 'file_uri' [, JAR|FILE|ARCHIVE 'file_uri'] ]; Is this something for which there is already a JIRA (couldn't find it)? If not; Should I create one? (I.e. do you think this would make sense for others?) Niels Basjes On Fri, Jan 2, 2015 at 9:00 PM, Yakubovich, Alexey alexey.yakubov...@searshc.com wrote: Try to look hr: http://stackoverflow.com/questions/14032924/how-to-add-serde-jar Another advice: insert your ADD JAR commands in your $HOME/.hiverc file and start hive. ( http://mail-archives.apache.org/mod_mbox/hive-user/201303.mbox/%3ccamgr+0h3smdw4zhtpyo5b1b4iob05bpw8ls+daeh595qzid...@mail.gmail.com%3E ) From: Ted Yu yuzhih...@gmail.com Reply-To: user@hadoop.apache.org user@hadoop.apache.org Date: Wednesday, December 31, 2014 at 8:25 AM To: d...@hive.apache.org d...@hive.apache.org Subject: Fwd: way to add custom udf jar in hadoop 2.x version Forwarding Niels' question to hive mailing list. On Wed, Dec 31, 2014 at 1:24 AM, Niels Basjes ni...@basjes.nl wrote: Thanks for the pointer. This seems to work for functions. Is there something similar for CREATE EXTERNAL TABLE ?? Niels On Dec 31, 2014 8:13 AM, Ted Yu yuzhih...@gmail.com wrote: Have you seen this thread ? http://search-hadoop.com/m/8er9TcALc/Hive+udf+custom+jarsubj=Best+way+to+add+custom+UDF+jar+in+HiveServer2 On Dec 30, 2014, at 10:56 PM, reena upadhyay reena2...@gmail.com wrote: Hi, I am using hadoop 2.4.0 version. I have created custom udf jar. I am trying to execute a simple select udf query using java hive jdbc client program. When hive execute the query using map reduce job, then the query execution get fails because the mapper is not able to locate the udf class. So I wanted to add the udf jar in hadoop environment permanently. Please suggest me a way to add this external jar for single node and multi node hadoop cluster. PS: I am using hive 0.13.1 version and I already have this custom udf jar added in HIVE_HOME/lib directory. Thanks This message, including any attachments, is the property of Sears Holdings Corporation and/or one of its subsidiaries. It is confidential and may contain proprietary or legally privileged information. If you are not the intended recipient, please delete it without reading the contents. Thank you. 
-- Best regards / Met vriendelijke groeten, Niels Basjes -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: way to add custom udf jar in hadoop 2.x version
Thanks for the pointer. This seems to work for functions. Is there something similar for CREATE EXTERNAL TABLE ?? Niels On Dec 31, 2014 8:13 AM, Ted Yu yuzhih...@gmail.com wrote: Have you seen this thread ? http://search-hadoop.com/m/8er9TcALc/Hive+udf+custom+jarsubj=Best+way+to+add+custom+UDF+jar+in+HiveServer2 On Dec 30, 2014, at 10:56 PM, reena upadhyay reena2...@gmail.com wrote: Hi, I am using hadoop 2.4.0 version. I have created custom udf jar. I am trying to execute a simple select udf query using java hive jdbc client program. When hive execute the query using map reduce job, then the query execution get fails because the mapper is not able to locate the udf class. So I wanted to add the udf jar in hadoop environment permanently. Please suggest me a way to add this external jar for single node and multi node hadoop cluster. PS: I am using hive 0.13.1 version and I already have this custom udf jar added in HIVE_HOME/lib directory. Thanks
Re: to all this unsubscribe sender
Yes, I agree. We should accept people as they are. So perhaps we should increase the hurdle to subscribe in the first place? Something like adding a question like What do you do if you want to unsubscribe from a mailing list? That way the people who are lazy or need hand holding are simply unable to subscribe, thus avoiding these people who send these unsubscribe me messages from ever entering the list. On Fri, Dec 5, 2014 at 2:49 PM, mark charts mcha...@yahoo.com wrote: Trained as an electrical engineer I learned again and again from my instructors: man is inherently lazy. Maybe that's the case. Also, some people have complicated lives and need hand holding. It is best to accept people as they are. On Friday, December 5, 2014 8:40 AM, Chana chana.ole...@gmail.com wrote: Actually, it's capisce. And, obviously, we are all here because we are learning. But, now that instructions have been posted again - I would expect that people would not expect to have their hands held to do something so simple as unsubscribe. They are supposedly technologists. I would think they would be able to do whatever research is necessary to unsubscribe without flooding the list with requests. It's a fairly common operation for anyone who has been on the web for a while. On Fri, Dec 5, 2014 at 7:35 AM, mark charts mcha...@yahoo.com wrote: Not rude at all. We were not born with any knowledge. We learn as we go. We learn by asking. Capiche? Mark Charts On Friday, December 5, 2014 8:13 AM, Chana chana.ole...@gmail.com wrote: THANK YOU for posting this Very rude indeed to not learn *for oneself* - as opposed to expecting to be spoon fed information like a small child. On Fri, Dec 5, 2014 at 7:00 AM, Aleks Laz al-userhad...@none.at wrote: Dear wished unsubscriber Is it really that hard to start brain vX.X? You or anyone which have used your email account, have subscribed to this list, as described here. http://hadoop.apache.org/mailing_lists.html You or this Person must unsubscribe at the same way as the subscription was, as described here. http://hadoop.apache.org/mailing_lists.html In the case that YOU don't know how a mailinglist works, please take a look here. http://en.wikipedia.org/wiki/Electronic_mailing_list Even the month is young there are a lot of unsubscribe in the archive. http://mail-archives.apache.org/mod_mbox/hadoop-user/201412.mbox/thread I'm new to this list but from my point of view it is very disrespectful to the list members and developers that YOU don't invest a little bit of time by your self to search how you can unsubscribe from a list on which YOU have subscribed or anyone which have used your email account. cheers Aleks -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: to all this unsubscribe sender
+1 On Fri, Dec 5, 2014 at 4:28 PM, Ted Yu yuzhih...@gmail.com wrote: +1 On Fri, Dec 5, 2014 at 7:22 AM, Aleks Laz al-userhad...@none.at wrote: +1 Am 05-12-2014 16:12, schrieb mark charts: I concur. Good idea. On Friday, December 5, 2014 10:10 AM, Amjad Syed amjad...@gmail.com wrote: My take on this is simple. The owner/maintainer / administrator of the list can implement a filter where if the subject of the email has word unsubscribe . That email gets blocked and sender gets automated reply stating that if you want to unsubscribe use the unsubscribe list. On 5 Dec 2014 18:05, Niels Basjes ni...@basjes.nl wrote: Yes, I agree. We should accept people as they are. So perhaps we should increase the hurdle to subscribe in the first place? Something like adding a question like What do you do if you want to unsubscribe from a mailing list? That way the people who are lazy or need hand holding are simply unable to subscribe, thus avoiding these people who send these unsubscribe me messages from ever entering the list. On Fri, Dec 5, 2014 at 2:49 PM, mark charts mcha...@yahoo.com wrote: Trained as an electrical engineer I learned again and again from my instructors: man is inherently lazy. Maybe that's the case. Also, some people have complicated lives and need hand holding. It is best to accept people as they are. On Friday, December 5, 2014 8:40 AM, Chana chana.ole...@gmail.com wrote: Actually, it's capisce. And, obviously, we are all here because we are learning. But, now that instructions have been posted again - I would expect that people would not expect to have their hands held to do something so simple as unsubscribe. They are supposedly technologists. I would think they would be able to do whatever research is necessary to unsubscribe without flooding the list with requests. It's a fairly common operation for anyone who has been on the web for a while. On Fri, Dec 5, 2014 at 7:35 AM, mark charts mcha...@yahoo.com wrote: Not rude at all. We were not born with any knowledge. We learn as we go. We learn by asking. Capiche? Mark Charts On Friday, December 5, 2014 8:13 AM, Chana chana.ole...@gmail.com wrote: THANK YOU for posting this Very rude indeed to not learn *for oneself* - as opposed to expecting to be spoon fed information like a small child. On Fri, Dec 5, 2014 at 7:00 AM, Aleks Laz al-userhad...@none.at wrote: Dear wished unsubscriber Is it really that hard to start brain vX.X? You or anyone which have used your email account, have subscribed to this list, as described here. http://hadoop.apache.org/mailing_lists.html You or this Person must unsubscribe at the same way as the subscription was, as described here. http://hadoop.apache.org/mailing_lists.html In the case that YOU don't know how a mailinglist works, please take a look here. http://en.wikipedia.org/wiki/Electronic_mailing_list Even the month is young there are a lot of unsubscribe in the archive. http://mail-archives.apache.org/mod_mbox/hadoop-user/201412.mbox/thread I'm new to this list but from my point of view it is very disrespectful to the list members and developers that YOU don't invest a little bit of time by your self to search how you can unsubscribe from a list on which YOU have subscribed or anyone which have used your email account. cheers Aleks -- Best regards / Met vriendelijke groeten, Niels Basjes -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: Are these configuration parameters deprecated?
A while ago I found a similar problem; I was wondering why do a lot of tools like the hdfs, hbase shell, pig and many other complain at startup about deprecated parameters. It turns out that these deprecated names are still in *-default.xml files and in various other places in the code base. Perhaps an issue indicating that the use of the deprecated parameters should be removed from the main code base is in order here. Niels Basjes On Fri, Nov 14, 2014 at 9:22 PM, Tianyin Xu t...@cs.ucsd.edu wrote: Hi, I'm very confused by some of the MapReduce configuration parameters which appear in the latest version of mapred-default.xml. http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml Take mapreduce.task.tmp.dir as an example, I fail to find its usage in code but /* mapreduce/util/ConfigUtil.java */ 55 new DeprecationDelta(mapred.temp.dir, 56 MRConfig.TEMP_DIR), My interpretation is that it's renamed into mapred.temp.dir. However, when I grep the new name, I still cannot find any code except some testing ones in ./hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/BenchmarkThroughput.java From the semantics, this should be a must-to-have parameter for MR jobs... Also, many parameters are like this. So I'm really confused. Am I missing something? Are these parameters deprecated? Thanks a lot! Tianyin -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: Spark vs Tez
Very interesting! What makes Tez more scalable than Spark? What architectural thing makes the difference? Niels Basjes On Oct 19, 2014 3:07 AM, Jeff Zhang zjf...@gmail.com wrote: Tez has a feature called pre-warm which will launch JVM before you use it and you can reuse the container afterwards. So it is also suitable for interactive queries and is more stable and scalable than spark IMO. On Sat, Oct 18, 2014 at 4:22 PM, Niels Basjes ni...@basjes.nl wrote: It is my understanding that one of the big differences between Tez and Spark is is that a Tez based query still has the startup overhead of starting JVMs on the Yarn cluster. Spark based queries are immediately executed on already running JVMs. So for interactive dashboards Spark seems more suitable. Did I understand correctly? Niels Basjes On Oct 17, 2014 8:30 PM, Gavin Yue yue.yuany...@gmail.com wrote: Spark and tez both make MR faster, this has no doubt. They also provide new features like DAG, which is quite important for interactive query processing. From this perspective, you could view them as a wrapper around MR and try to handle the intermediary buffer(files) more efficiently. It is a big pain in MR. Also they both try to use Memory as the buffer instead of only filesystems. Spark has a concept RDD, which is quite interesting and also limited. On Fri, Oct 17, 2014 at 11:23 AM, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote: It was my understanding that Spark is faster batch processing. Tez is the new execution engine that replaces MapReduce and is also supposed to speed up batch processing. Is that not correct? B. *From:* Shahab Yunus shahab.yu...@gmail.com *Sent:* Friday, October 17, 2014 1:12 PM *To:* user@hadoop.apache.org *Subject:* Re: Spark vs Tez What aspects of Tez and Spark are you comparing? They have different purposes and thus not directly comparable, as far as I understand. Regards, Shahab On Fri, Oct 17, 2014 at 2:06 PM, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote: Does anybody have any performance figures on how Spark stacks up against Tez? If you don’t have figures, does anybody have an opinion? Spark seems so popular but I’m not really seeing why. B. -- Best Regards Jeff Zhang
Re: Spark vs Tez
It is my understanding that one of the big differences between Tez and Spark is that a Tez based query still has the startup overhead of starting JVMs on the Yarn cluster. Spark based queries are immediately executed on already running JVMs. So for interactive dashboards Spark seems more suitable. Did I understand correctly? Niels Basjes On Oct 17, 2014 8:30 PM, Gavin Yue yue.yuany...@gmail.com wrote: Spark and tez both make MR faster, this has no doubt. They also provide new features like DAG, which is quite important for interactive query processing. From this perspective, you could view them as a wrapper around MR and try to handle the intermediary buffer(files) more efficiently. It is a big pain in MR. Also they both try to use Memory as the buffer instead of only filesystems. Spark has a concept RDD, which is quite interesting and also limited. On Fri, Oct 17, 2014 at 11:23 AM, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote: It was my understanding that Spark is faster batch processing. Tez is the new execution engine that replaces MapReduce and is also supposed to speed up batch processing. Is that not correct? B. *From:* Shahab Yunus shahab.yu...@gmail.com *Sent:* Friday, October 17, 2014 1:12 PM *To:* user@hadoop.apache.org *Subject:* Re: Spark vs Tez What aspects of Tez and Spark are you comparing? They have different purposes and thus not directly comparable, as far as I understand. Regards, Shahab On Fri, Oct 17, 2014 at 2:06 PM, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote: Does anybody have any performance figures on how Spark stacks up against Tez? If you don’t have figures, does anybody have an opinion? Spark seems so popular but I’m not really seeing why. B.
Re: Bzip2 files as an input to MR job
Hi, You can use the GZip inside the AVRO files and still have splittable AVRO files. This has the to with the fact that there is a block structure inside the AVRO and these blocks are gzipped. I suggest you simply try it. Niels On Mon, Sep 22, 2014 at 4:40 PM, Georgi Ivanov iva...@vesseltracker.com wrote: Hi guys, I would like to compress the files on HDFS to save some storage. As far as i see bzip2 is the only format which is splitable (and slow). The actual files are Avro. So in my driver class i have : job.setInputFormatClass(AvroKeyInputFormat.class); I have number of jobs running processing Avro files so i would like to keep the code change to a minimum. Is it possible to comrpess these avro files with bzip2 and keep the code of MR jobs the same (or with little change) If it is , please give me some hints as so far i don't seem to find any good resources on the Internet. Georgi -- Best regards / Met vriendelijke groeten, Niels Basjes
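A hedged sketch of what Niels describes, on the output side; the configuration key names are the ones used by avro-mapred as far as I recall, so verify them against the Avro version on your cluster, and the schema and output path are placeholders. Reading the resulting files needs no code change because AvroKeyInputFormat handles the per-block decompression transparently:

import org.apache.avro.Schema;
import org.apache.avro.mapreduce.AvroJob;
import org.apache.avro.mapreduce.AvroKeyOutputFormat;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressedAvroOutputJob {
  public static Job createJob(Configuration conf, Schema schema) throws Exception {
    // Compression happens per Avro block inside the container file,
    // so the output stays splittable.
    conf.set("avro.output.codec", "deflate");      // assumed key; "snappy" is another option
    conf.setInt("avro.mapred.deflate.level", 6);   // assumed key; optional
    Job job = Job.getInstance(conf, "compressed avro output");
    job.setOutputFormatClass(AvroKeyOutputFormat.class);
    AvroJob.setOutputKeySchema(job, schema);
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputPath(job, new Path("/tmp/avro-out")); // placeholder
    return job;
  }
}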
Generating mysql or sqlite datafiles from Hadoop (Java)?
Hi, I remember hearing a while ago that (if I remember correctly) Facebook had an outputformat that wrote the underlying MySQL database files directly from a MapReduce job. For my purpose an sqlite datafile would be good enough too. I've been unable to find any of those two solutions by simply googling. Does anyone know where I can find such a thing? -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: Why LineRecordWriter.write(..) is synchronized
I expect the impact on the IO speed to be almost 0 because waiting for a single disk seek is longer than many thousands of calls to a synchronized method. Niels On Aug 11, 2013 3:00 PM, Harsh J ha...@cloudera.com wrote: Yes, I feel we could discuss this over a JIRA to remove it if it hurts perf. too much, but it would have to be a marked incompatible change, and we have to add a note about the lack of thread safety in the javadoc of base Mapper/Reducer classes. On Sun, Aug 11, 2013 at 1:26 PM, Sathwik B P sathwik...@gmail.com wrote: Hi Harsh, Does it make any sense to keep the method in LRW still synchronized. Isn't it creating unnecessary overhead for non multi threaded implementations. regards, sathwik On Fri, Aug 9, 2013 at 7:16 AM, Harsh J ha...@cloudera.com wrote: I suppose I should have been clearer. There's no problem out of box if people stick to the libraries we offer :) Yes the LRW was marked synchronized at some point over 8 years ago [1] in support for multi-threaded maps, but the framework has changed much since then. The MultithreadedMapper/etc. API we offer now automatically shields the devs away from having to think of output thread safety [2]. I can imagine there can only be a problem if a user writes their own unsafe multi threaded task. I suppose we could document that in the Mapper/MapRunner and Reducer APIs. [1] - http://svn.apache.org/viewvc?view=revisionrevision=171186 - Commit added a synchronized to the write call. [2] - MultiThreadedMapper/etc. synchronize over the collector - http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/map/MultithreadedMapper.java?view=markup On Thu, Aug 8, 2013 at 7:52 PM, Azuryy Yu azury...@gmail.com wrote: sequence writer is also synchronized, I dont think this is bad. if you call HDFS api to write concurrently, then its necessary. On Aug 8, 2013 7:53 PM, Jay Vyas jayunit...@gmail.com wrote: Then is this a bug? Synchronization in absence of any race condition is normally considered bad. In any case id like to know why this writer is synchronized whereas the other one are not.. That is, I think, then point at issue: either other writers should be synchronized or else this one shouldn't be - consistency across the write implementations is probably desirable so that changes to output formats or record writers don't lead to bugs in multithreaded environments . On Aug 8, 2013, at 6:50 AM, Harsh J ha...@cloudera.com wrote: While we don't fork by default, we do provide a MultithreadedMapper implementation that would require such synchronization. But if you are asking is it necessary, then perhaps the answer is no. On Aug 8, 2013 3:43 PM, Azuryy Yu azury...@gmail.com wrote: its not hadoop forked threads, we may create a line record writer, then call this writer concurrently. On Aug 8, 2013 4:00 PM, Sathwik B P sathwik...@gmail.com wrote: Hi, Thanks for your reply. May I know where does hadoop fork multiple threads to use a single RecordWriter. regards, sathwik On Thu, Aug 8, 2013 at 7:06 AM, Azuryy Yu azury...@gmail.com wrote: because we may use multi-threads to write a single file. On Aug 8, 2013 2:54 PM, Sathwik B P sath...@apache.org wrote: Hi, LineRecordWriter.write(..) is synchronized. I did not find any other RecordWriter implementations define the write as synchronized. Any specific reason for this. regards, sathwik -- Harsh J -- Harsh J
Re: Why LineRecordWriter.write(..) is synchronized
I may be nitpicking here but if perhaps the answer is no then I conclude: Perhaps the other implementations of RecordWriter are a race condition/file corruption ready to occur. On Thu, Aug 8, 2013 at 12:50 PM, Harsh J ha...@cloudera.com wrote: While we don't fork by default, we do provide a MultithreadedMapper implementation that would require such synchronization. But if you are asking is it necessary, then perhaps the answer is no. On Aug 8, 2013 3:43 PM, Azuryy Yu azury...@gmail.com wrote: its not hadoop forked threads, we may create a line record writer, then call this writer concurrently. On Aug 8, 2013 4:00 PM, Sathwik B P sathwik...@gmail.com wrote: Hi, Thanks for your reply. May I know where does hadoop fork multiple threads to use a single RecordWriter. regards, sathwik On Thu, Aug 8, 2013 at 7:06 AM, Azuryy Yu azury...@gmail.com wrote: because we may use multi-threads to write a single file. On Aug 8, 2013 2:54 PM, Sathwik B P sath...@apache.org wrote: Hi, LineRecordWriter.write(..) is synchronized. I did not find any other RecordWriter implementations define the write as synchronized. Any specific reason for this. regards, sathwik -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: Why LineRecordWriter.write(..) is synchronized
I would say yes make this a Jira. The actual change can fall (as proposed by Jay) in two directions: Put in synchronization in all implementations OR take it out of all implementations. I think the first thing to determine is why the synchronization was put into the LineRecordWriter in the first place. https://github.com/apache/hadoop-common/blame/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/lib/output/TextOutputFormat.java The oldest I have been able to find is a commit on 2009-05-18 for HADOOP-4687 that is about moving stuff around (i.e. this code is even older than that). Niels On Thu, Aug 8, 2013 at 2:21 PM, Sathwik B P sath...@apache.org wrote: Hi Harsh, Do you want me to raise a Jira on this. regards, sathwik On Thu, Aug 8, 2013 at 5:23 PM, Jay Vyas jayunit...@gmail.com wrote: Then is this a bug? Synchronization in absence of any race condition is normally considered bad. In any case id like to know why this writer is synchronized whereas the other one are not.. That is, I think, then point at issue: either other writers should be synchronized or else this one shouldn't be - consistency across the write implementations is probably desirable so that changes to output formats or record writers don't lead to bugs in multithreaded environments . On Aug 8, 2013, at 6:50 AM, Harsh J ha...@cloudera.com wrote: While we don't fork by default, we do provide a MultithreadedMapper implementation that would require such synchronization. But if you are asking is it necessary, then perhaps the answer is no. On Aug 8, 2013 3:43 PM, Azuryy Yu azury...@gmail.com wrote: its not hadoop forked threads, we may create a line record writer, then call this writer concurrently. On Aug 8, 2013 4:00 PM, Sathwik B P sathwik...@gmail.com wrote: Hi, Thanks for your reply. May I know where does hadoop fork multiple threads to use a single RecordWriter. regards, sathwik On Thu, Aug 8, 2013 at 7:06 AM, Azuryy Yu azury...@gmail.com wrote: because we may use multi-threads to write a single file. On Aug 8, 2013 2:54 PM, Sathwik B P sath...@apache.org wrote: Hi, LineRecordWriter.write(..) is synchronized. I did not find any other RecordWriter implementations define the write as synchronized. Any specific reason for this. regards, sathwik -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: Is there any way to use a hdfs file as a Circular buffer?
A circular file on hdfs is not possible. Some of the ways around this limitation: - Create a series of files and delete the oldest file when you have too much. - Put the data into an hbase table and do something similar. - Use completely different technology like mongodb which has built in support for a circular buffer (capped collection). Niels Hi all, Is there any way to use a hdfs file as a Circular buffer? I mean, if I set a quotas to a directory on hdfs, and writting data to a file in that directory continuously. Once the quotas exceeded, I can redirect the writter and write the data from the beginning of the file automatically .
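A hedged sketch of the first option (a series of files where the oldest is deleted), assuming the data is written as numbered or timestamped segment files into one directory:

import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RollingHdfsBuffer {
  // Keep at most maxFiles segments in the directory by deleting the oldest
  // ones; together with a writer that rolls to a new segment file this
  // approximates a circular buffer on HDFS.
  public static void trim(FileSystem fs, Path dir, int maxFiles) throws IOException {
    FileStatus[] segments = fs.listStatus(dir);
    Arrays.sort(segments, Comparator.comparingLong(FileStatus::getModificationTime));
    for (int i = 0; i < segments.length - maxFiles; i++) {
      fs.delete(segments[i].getPath(), false);
    }
  }
}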
Use a URL for the HADOOP_CONF_DIR?
Hi,

When giving users access to a Hadoop cluster they need a few XML config files (like hadoop-site.xml). They put these somewhere on their PC and start running their jobs on the cluster. Now when you're changing the settings you want those users to use (for example you changed some TCP port) you need them all to update their config files. My question is: Can you set the HADOOP_CONF_DIR to be a URL on a webserver? A while ago I tried this and (back then) it didn't work. Would this be a useful enhancement?

-- Best regards, Niels Basjes
Running a single cluster in multiple datacenters
Hi, Last week we had a discussion at work regarding setting up our new Hadoop cluster(s). One of the things that has changed is that the importance of the Hadoop stack is growing so we want to be more available. One of the points we talked about was setting up the cluster in such a way that the nodes are physically located in two separate datacenters (on opposite sides of the same city) with a big network connection in between. We're currently talking about a cluster in the 50 nodes range, but that will grow over time. The advantages I see: - More CPU power available for jobs. - The data is automatically copied between the datacenters as long as we configure them to be different 'racks'. The disadvantages I see: - If the network goes out then one half is dead and the other half will most likely go to safemode because the recovering of the missing replicas will fill up the disks fast. What things should we consider also? Has anyone any experience with such a setup? Is it a good idea to do this? What are better options for us to consider? Thanks for any input. -- Best regards, Niels Basjes
Re: Inputformat
If you try to hammer in a nail (json file) with a screwdriver (XMLInputReader) then perhaps the reason it won't work may be that you are using the wrong tool?

On Jun 21, 2013 11:38 PM, jamal sasha jamalsha...@gmail.com wrote: Hi, I am using one of the libraries which rely on InputFormat. Right now, it is reading xml files spanning across multiple lines. So currently the input format is like:

public class XMLInputReader extends FileInputFormat<LongWritable, Text> {
  public static final String START_TAG = "<page>";
  public static final String END_TAG = "</page>";

  @Override
  public RecordReader<LongWritable, Text> getRecordReader(InputSplit split, JobConf conf, Reporter reporter) throws IOException {
    conf.set(XMLInputFormat.START_TAG_KEY, START_TAG);
    conf.set(XMLInputFormat.END_TAG_KEY, END_TAG);
    return new XMLRecordReader((FileSplit) split, conf);
  }
}

So, in the above, if the data is like:
<page> something \n something \n </page>
it processes this sort of data. Now, I want to use the same framework but for json files lasting just a single line. So I guess my START_TAG can be "{". Will my END_TAG be "}\n"? It can't be "}" as there can be nested json in this data? Any clues? Thanks
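To make the "right tool" remark concrete: when every JSON record sits on its own line, the stock TextInputFormat already delivers one line per map() call, so no start/end tags are needed at all. A hedged sketch, assuming a JSON parser such as org.json is on the classpath and using a made-up "id" field:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.json.JSONObject;

public class JsonLineMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // One complete JSON object per input line; no custom InputFormat needed.
    JSONObject record = new JSONObject(value.toString());
    context.write(new Text(record.optString("id", "unknown")), NullWritable.get());
  }
}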
Re: gz containing null chars?
My best guess is that at a low level a string is often terminated by having a null byte at the end. Perhaps that's where the difference lies. Perhaps the gz decompressor simply stops at the null byte and the basic record reader that follows simply continues. In this situation your input file contains bytes that should not occur in an ASCII file (like the json file you have) and as such you can expect the unexpected ;) Niels On Jun 10, 2013 7:24 PM, William Oberman ober...@civicscience.com wrote: I posted this to the pig mailing list, but it might be more related to hadoop itself, I'm not sure. Quick recap: I had a file of \n separated lines of JSON. I decided to compress it to save on storage costs. After compression I got a different answer for a pig query that basically == count lines. After a lot of digging, I found an input file that had a line that is a huge block of null characters followed by a \n. I wrote scripts to examine the file directly, and if I stop counting at the weird line, I get the same count as what pig claims for that file. If I count all lines (e.g. don't fail at the corrupt line) I get the uncompressed count pig claims. I don't know how to debug hadoop/pig quite as well, though I'm trying now. But, my working theory is that some combination of pig/hadoop aborts processing the gz stream on a null character (or something like that), but keeps chugging on a non-gz stream. Does that sound familiar or make sense to anyone? will
Re: Reducer to output only json
Have you tried something like this (I do not have a pc here to check this code):

context.write(NullWritable.get(), new Text(jsn.toString()));

On Jun 4, 2013 8:10 PM, Chengi Liu chengi.liu...@gmail.com wrote: Hi, I have the following reducer class:

public static class TokenCounterReducer extends Reducer<Text, Text, Text, Text> {
  public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
    // String[] fields = s.split("\t", -1)
    JSONObject jsn = new JSONObject();
    int sum = 0;
    for (Text value : values) {
      String[] vals = value.toString().split("\t");
      String[] targetNodes = vals[0].toString().split(",", -1);
      jsn.put("source", vals[1]);
      jsn.put("target", targetNodes);
      // sum += value.get();
    }
    // context.write(key, new Text(sum));
  }
}

I want to save that json to hdfs. It was very trivial in hadoop streaming.. but how do i do it in hadoop java? Thanks
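To complete the picture, a hedged driver sketch that goes with the NullWritable suggestion: the job's output key class has to be NullWritable (and the reducer's output key type changed accordingly) so that only the JSON string ends up in the HDFS file. The mapper class name and paths are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JsonOutputDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "json output");
    job.setJarByClass(JsonOutputDriver.class);
    job.setMapperClass(TokenMapper.class);          // hypothetical mapper
    job.setReducerClass(TokenCounterReducer.class); // reducer from the mail above
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
    job.setOutputKeyClass(NullWritable.class);      // nothing written for the key
    job.setOutputValueClass(Text.class);            // only the JSON string
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}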
Re: Experimental Hadoop Cluster - Linux Windows machines
My first suggestion is to go for CentOS as it is free and almost the same as RHEL. Also having a 64 bit OS lets you use a bit more of the installed memory Then if you can simply install CentOS on these machines and have the network running you should be fine. I have been running experiments on something identical to what you are describing here. Niels Basjes On Sat, Jun 1, 2013 at 9:47 PM, Rody BigData rodybigd...@gmail.com wrote: I have some old ( not very old - each of 4GB RAM with a decent processor etc., and working fine till now ) Dell Windows XP machines and want to convert them to a Red Hat Enterprise Linux (RHEL) for a Hadoop cluster for my experimental purposes. Can you give me some suggestions on how to proceed on this plan? My initial plan is to target 4 to 5 machines. Does the hardware on which Windows XP machine is based supports RHEL? Are there any other drivers or software should we install before installing RHEL on this machines? How easy/difficult is this, if some one like a Linux Systems Administrator is involved in this? -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: Experimental Hadoop Cluster - Linux Windows machines
I've installed CentOS on several different types of old (originally Windows XP) Dell desktops for the last 4 years (i.e. desktops as old as 7 years ago) and so far installing CentOS was as easy as booting from the installation CD/DVD and doing next, next, finish. The only thing that you may run into is that the specific cpu really doesn't support 64bit. You'll get an error very early in the installation process. Niels On Jun 1, 2013 10:10 PM, Rody BigData rodybigd...@gmail.com wrote: So what is the procedure for installing CentOS on a Windows XP machines ( want to format XP ) . Is it a complicated procedure? Some kind of drivers and issues, is it common to expect ? On Sat, Jun 1, 2013 at 4:05 PM, Marco Shaw marco.s...@gmail.com wrote: If you're running XP, be aware of the others suggesting 64-bit. That depends on the exact proc you're running. You sort of need to break this down into first determining how to get Linux on your systems. RHEL is pretty costly for a test, unless you've got that covered. Go with CentOS for a proof of concept. Get that covered first, then move on to determining how to get Hadoop on them. Marco On Sat, Jun 1, 2013 at 4:47 PM, Rody BigData rodybigd...@gmail.comwrote: I have some old ( not very old - each of 4GB RAM with a decent processor etc., and working fine till now ) Dell Windows XP machines and want to convert them to a Red Hat Enterprise Linux (RHEL) for a Hadoop cluster for my experimental purposes. Can you give me some suggestions on how to proceed on this plan? My initial plan is to target 4 to 5 machines. Does the hardware on which Windows XP machine is based supports RHEL? Are there any other drivers or software should we install before installing RHEL on this machines? How easy/difficult is this, if some one like a Linux Systems Administrator is involved in this?
Re: Configuring SSH - is it required? for a psedo distriburted mode?
I never configure the ssh feature. Not for running on a single node and not for a full size cluster. I simply start all the required deamons (name/data/job/task) and configure them on which ports each can be reached. Niels Basjes On May 16, 2013 4:55 PM, Raj Hadoop hadoop...@yahoo.com wrote: Hi, I have a dedicated user on Linux server for hadoop. I am installing it in psedo distributed mode on this box. I want to test my programs on this machine. But i see that in installation steps - they were mentioned that SSH needs to be configured. If it is single node, I dont require it ...right? Please advise. I was looking at this site http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ It menionted like this - Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it (which is what we want to do in this short tutorial). For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser user we created in the previous section. Thanks, Raj
Re: how to get the time of a hadoop cluster, v0.20.2
If you make sure that everything uses NTP then this becomes an irrelevant distinction. On Thu, May 16, 2013 at 4:01 PM, Jane Wayne jane.wayne2...@gmail.comwrote: yes, but that gets the current time on the server, not the hadoop cluster. i need to be able to probe the date/time of the hadoop cluster. On Tue, May 14, 2013 at 5:09 PM, Niels Basjes ni...@basjes.nl wrote: I made a typo. I meant API (instead of SPI). Have a look at this for more information: http://stackoverflow.com/questions/833768/java-code-for-getting-current-time If you have a client that is not under NTP then that should be the way to fix your issue. Once you have that getting the current time is easy. Niels Basjes On Tue, May 14, 2013 at 5:46 PM, Jane Wayne jane.wayne2...@gmail.com wrote: niels, i'm not familiar with the native java spi. spi = service provider interface? could you let me know if this spi is part of the hadoop api? if so, which package/class? but yes, all nodes on the cluster are using NTP to synchronize time. however, the server (which is not a part of the hadoop cluster) accessing/interfacing with the hadoop cluster cannot be assumed to be using NTP. will this still make a difference? and actually, this is the primary reason why i need to get the date/time of the hadoop cluster (need to check if the date/time of the hadooop cluster is in sync with the server). On Tue, May 14, 2013 at 11:38 AM, Niels Basjes ni...@basjes.nl wrote: If you have all nodes using NTP then you can simply use the native Java SPI to get the current system time. On Tue, May 14, 2013 at 4:41 PM, Jane Wayne jane.wayne2...@gmail.com wrote: hi all, is there a way to get the current time of a hadoop cluster via the api? in particular, getting the time from the namenode or jobtracker would suffice. i looked at JobClient but didn't see anything helpful. -- Best regards / Met vriendelijke groeten, Niels Basjes -- Best regards / Met vriendelijke groeten, Niels Basjes -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: how to get the time of a hadoop cluster, v0.20.2
If you have all nodes using NTP then you can simply use the native Java SPI to get the current system time. On Tue, May 14, 2013 at 4:41 PM, Jane Wayne jane.wayne2...@gmail.comwrote: hi all, is there a way to get the current time of a hadoop cluster via the api? in particular, getting the time from the namenode or jobtracker would suffice. i looked at JobClient but didn't see anything helpful. -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: how to get the time of a hadoop cluster, v0.20.2
I made a typo. I meant API (instead of SPI). Have a look at this for more information: http://stackoverflow.com/questions/833768/java-code-for-getting-current-time If you have a client that is not under NTP then that should be the way to fix your issue. Once you have that getting the current time is easy. Niels Basjes On Tue, May 14, 2013 at 5:46 PM, Jane Wayne jane.wayne2...@gmail.comwrote: niels, i'm not familiar with the native java spi. spi = service provider interface? could you let me know if this spi is part of the hadoop api? if so, which package/class? but yes, all nodes on the cluster are using NTP to synchronize time. however, the server (which is not a part of the hadoop cluster) accessing/interfacing with the hadoop cluster cannot be assumed to be using NTP. will this still make a difference? and actually, this is the primary reason why i need to get the date/time of the hadoop cluster (need to check if the date/time of the hadooop cluster is in sync with the server). On Tue, May 14, 2013 at 11:38 AM, Niels Basjes ni...@basjes.nl wrote: If you have all nodes using NTP then you can simply use the native Java SPI to get the current system time. On Tue, May 14, 2013 at 4:41 PM, Jane Wayne jane.wayne2...@gmail.com wrote: hi all, is there a way to get the current time of a hadoop cluster via the api? in particular, getting the time from the namenode or jobtracker would suffice. i looked at JobClient but didn't see anything helpful. -- Best regards / Met vriendelijke groeten, Niels Basjes -- Best regards / Met vriendelijke groeten, Niels Basjes
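For completeness, the plain Java call referred to above; it returns the wall-clock time of the machine the code runs on, which only matches the cluster's time when both are NTP synchronized:

public class CurrentTime {
  public static void main(String[] args) {
    long nowMillis = System.currentTimeMillis();   // milliseconds since the epoch
    System.out.println(new java.util.Date(nowMillis));
  }
}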
Re: How to process only input files containing 100% valid rows
How about a different approach: If you use the multiple output option you can process the valid lines in a normal way and put the invalid lines in a special separate output file. On Apr 18, 2013 9:36 PM, Matthias Scherer matthias.sche...@1und1.de wrote: Hi all, ** ** In my mapreduce job, I would like to process only whole input files containing only valid rows. If one map task processing an input split of a file detects an invalid row, the whole file should be “marked” as invalid and not processed at all. This input file will then be cleansed by another process, and taken again as input to the next run of my mapreduce job. ** ** My first idea was to set a counter in the mapper after detecting an invalid line with the name of the file as the counter name (derived from input split). Then additionally put the input filename to the map output value (which is already a MapWritable, so adding the filename is no problem). And in the reducer I could filter out any rows belonging to the counters written in the mapper. ** ** Each job has some thousand input files. So in the worst case there could be as many counters written to mark invalid input files. Is this a feasible approach? Does the framework guarantee that all counters written in the mappers are synchronized (visible) in the reducers? And could this number of counters lead to OOME in the jobtracker? ** ** Are there better approaches? I could also process the files using a non splitable input format. Is there a way to reject the already outputted rows of a the map task processing an input split? ** ** Thanks, Matthias ** **
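A hedged sketch of that MultipleOutputs approach; the named output must also be declared in the driver with MultipleOutputs.addNamedOutput(job, "invalid", ...), and isValid() is a placeholder for the real row validation:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class ValidatingMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
  private MultipleOutputs<NullWritable, Text> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<>(context);
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (isValid(value)) {
      context.write(NullWritable.get(), value);        // normal output
    } else {
      mos.write("invalid", NullWritable.get(), value); // side file for the cleansing step
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    mos.close();
  }

  private boolean isValid(Text row) {
    return !row.toString().isEmpty();                  // placeholder validation
  }
}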
Re: Can I perfrom a MR on my local filesystem
Have a look at this http://stackoverflow.com/questions/3546025/is-it-possible-to-run-hadoop-in-pseudo-distributed-operation-without-hdfs -- Met vriendelijke groet, Niels Basjes (Verstuurd vanaf mobiel ) Op 17 feb. 2013 07:51 schreef Agarwal, Nikhil nikhil.agar...@netapp.com het volgende: Hi, Recently I followed a blog to run Hadoop on a single node cluster. I wanted to ask that in a single node set-up of Hadoop is it necessary to have the data copied into Hadoop’s HDFS before running a MR on it. Can I run MR on my local file system too without copying the data to HDFS? In the Hadoop source code I saw there are implementations of other file systems too like S3, KFS, FTP, etc. so how does exactly a MR happen on S3 data store ? How does JobTracker or Tasktracker run in S3 ? ** ** I would be very thankful to get a reply to this. ** ** Thanks Regards, Nikhil ** **
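The short version of that answer, as a hedged sketch for Hadoop 2.x (on older releases the equivalent keys were fs.default.name and mapred.job.tracker=local):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class LocalModeJob {
  public static Job create() throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.defaultFS", "file:///");            // read input straight from the local filesystem
    conf.set("mapreduce.framework.name", "local");   // run map/reduce tasks in-process, no YARN/HDFS
    return Job.getInstance(conf, "local MR job");
  }
}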
Re: how to find top N values using map-reduce ?
My suggestion is to use secondary sort with a single reducer. That way you can easily extract the top N. If you want to get the top N% you'll need an additional phase to determine how many records this N% really is.

-- Met vriendelijke groet, Niels Basjes (Verstuurd vanaf mobiel )

Op 2 feb. 2013 12:08 schreef praveenesh kumar praveen...@gmail.com het volgende: My actual problem is to rank all values and then run logic 1 to top n% values and logic 2 to rest values. 1st - Ranking ? (need major suggestions here) 2nd - Find top n% out of them. Then rest is covered. Regards Praveenesh

On Sat, Feb 2, 2013 at 1:42 PM, Lake Chang lakech...@gmail.com wrote: there's one thing i want to clarify that you can use multi-reducers to sort the data globally and then cat all the parts to get the top n records. The data in all parts are globally in order. Then you may find the problem is much easier.

On 2013-2-2 at 3:18 PM, praveenesh kumar praveen...@gmail.com wrote: Actually what I am trying to find to top n% of the whole data. This n could be very large if my data is large. Assuming I have uniform rows of equal size and if the total data size is 10 GB, using the above mentioned approach, if I have to take top 10% of the whole data set, I need 10% of 10GB which could be rows worth of 1 GB (roughly) in my mappers. I think that would not be possible given my input splits are of 64/128/512 MB (based on my block size) or am I making wrong assumptions. I can increase the inputsplit size, but is there a better way to find top n%. My whole actual problem is to give ranks to some values and then find out the top 10 ranks. I think this context can give more idea about the problem ? Regards Praveenesh

On Sat, Feb 2, 2013 at 11:53 AM, Eugene Kirpichov ekirpic...@gmail.com wrote: Hi, Can you tell more about: * How big is N * How big is the input dataset * How many mappers you have * Do input splits correlate with the sorting criterion for top N? Depending on the answers, very different strategies will be optimal.

On Fri, Feb 1, 2013 at 9:05 PM, praveenesh kumar praveen...@gmail.com wrote: I am looking for a better solution for this. 1 way to do this would be to find top N values from each mappers and then find out the top N out of them in 1 reducer. I am afraid that this won't work effectively if my N is larger than number of values in my inputsplit (or mapper input). Otherway is to just sort all of them in 1 reducer and then do the cat of top-N. Wondering if there is any better approach to do this ? Regards Praveenesh

-- Eugene Kirpichov http://www.linkedin.com/in/eugenekirpichov http://jkff.info/software/timeplotters - my performance visualization tools
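A hedged sketch of the single-reducer part of this suggestion: if the value being ranked is made the map output key, the job runs with one reduce task, and a descending sort comparator is set (job.setNumReduceTasks(1) and job.setSortComparatorClass(...)), then the reducer sees records from highest to lowest and can simply stop after N. The key type and N are illustrative:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TopNReducer extends Reducer<LongWritable, Text, LongWritable, Text> {
  private static final int N = 10;
  private int emitted = 0;

  @Override
  protected void reduce(LongWritable key, Iterable<Text> values, Context context)
      throws IOException, InterruptedException {
    for (Text value : values) {
      if (emitted >= N) {
        return;               // the global top N has already been written
      }
      context.write(key, value);
      emitted++;
    }
  }
}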
Re: What is the preferred way to pass a small number of configuration parameters to a mapper or reducer
F. put a mongodb replica set on all hadoop workernodes and let the tasks query the mongodb at localhost. (this is what I did recently with a multi GiB dataset) -- Met vriendelijke groet, Niels Basjes (Verstuurd vanaf mobiel ) Op 30 dec. 2012 20:01 schreef Jonathan Bishop jbishop@gmail.com het volgende: E. Store them in hbase... On Sun, Dec 30, 2012 at 12:24 AM, Hemanth Yamijala yhema...@thoughtworks.com wrote: If it is a small number, A seems the best way to me. On Friday, December 28, 2012, Kshiva Kps wrote: Which one is current .. What is the preferred way to pass a small number of configuration parameters to a mapper or reducer? A. As key-value pairs in the jobconf object. B. As a custom input key-value pair passed to each mapper or reducer. C. Using a plain text file via the DistributedCache, which each mapper or reducer reads. D. Through a static variable in the MapReduce driver class (i.e., the class that submits the MapReduce job). Answer: B
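For completeness, option A in code form (a sketch; the property names "myapp.cutoff.date" and "myapp.max.items" are made up for the example): set the values on the Configuration in the driver and read them back in setup() of the mapper or reducer.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Driver side (fragment, inside your main()/run()):
//   Configuration conf = new Configuration();
//   conf.set("myapp.cutoff.date", "2012-12-28");
//   conf.setInt("myapp.max.items", 25);
//   Job job = new Job(conf, "parameter example");

public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private String cutoffDate;
  private int maxItems;

  @Override
  protected void setup(Context context) {
    Configuration conf = context.getConfiguration();
    cutoffDate = conf.get("myapp.cutoff.date");
    maxItems = conf.getInt("myapp.max.items", 10);
  }

  // map() can now use cutoffDate and maxItems for every record it processes.
}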
Re: Doubts on compressed file
Hi, If a zip file(Gzip) is loaded into HDFS will it get split into Blocks and stored in HDFS? Yes. I understand that a single mapper can work with GZip as it reads the entire file from beginning to end... In that case if the GZip file size is larger than 128 MB will it get split into blocks and stored in HDFS? Yes, and then the mapper will read the other parts of the file over the network. So what I do is I upload such files with a bigger HDFS blocksize so the mapper has the entire file locally. -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: Hadoop Real time help
Thanks for the pointers, I have stuff to read now :) On Mon, Aug 20, 2012 at 9:37 AM, Bertrand Dechoux decho...@gmail.com wrote: The terms are * ESP : http://en.wikipedia.org/wiki/Event_stream_processing * CEP : http://en.wikipedia.org/wiki/Complex_event_processing By the way, processing streams in real time tends toward being a pleonasm. MapReduce follows a batch architecture. You keep data until a given time. You then process everything. And at the end you provide all the results. Stream processing has by definition a more 'smooth' throughput. Each event is processed at a time and potentially each processing could lead to a result. I don't know any complete overview of such tools. Esper is well known in that space. FlumeBase was an attempt to do something similar (as far as I can tell). It shows how an ESP engine fits with log collection using a tool such as Flume. Then you also have other solutions which will allow you to scale such as Storm. A few people have already considered using Storm for scalability and Esper to do the real computation. Regards Bertrand On Sun, Aug 19, 2012 at 9:44 PM, Niels Basjes ni...@basj.es wrote: Is there a complete overview of the tools that allow processing streams of data in realtime? Or even better; what are the terms to google for? -- Met vriendelijke groet, Niels Basjes (Verstuurd vanaf mobiel ) Op 19 aug. 2012 18:22 schreef Bertrand Dechoux decho...@gmail.com het volgende: That's a good question. More and more people are talking about Hadoop Real Time. One key aspect of this question is whether we are talking about MapReduce or not. MapReduce greatly improves the response time of any data intensive jobs but it is still a batch framework with a noticeable latency. There is multiple ways to improve the latency : * ESP/CEP solutions (like Esper, FlumeBase, ...) * Big Table clones (like HBase ...) * YARN with a non MapReduce application * ... But it will really depend on the context and the definition of 'real time'. Regards Bertrand On Sun, Aug 19, 2012 at 5:44 PM, mahout user mahoutu...@gmail.com wrote: Hello folks, I am new to hadoop, I just want to get information that how hadoop framework is usefull for real time service.?can any one explain me..? Thanks. -- Bertrand Dechoux -- Bertrand Dechoux -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: output/input ratio 1 for map tasks?
Hi, On Mon, Jul 30, 2012 at 8:47 PM, brisk mylinq...@gmail.com wrote: Does anybody know if there are some cases where the output/input ratio for map tasks is larger than 1? I can just think of for the sort, it's 1 and for the search job it's usually smaller than 1... For a simple example: Have a look at the WordCount example. The input of a single map call is 1 record: "This is a line". The output is 4 records: "This 1", "is 1", "a 1", "line 1". -- Best regards / Met vriendelijke groeten, Niels Basjes
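For reference, the map function of the well-known WordCount example looks roughly like this (a sketch, not copied from a specific Hadoop release); every word in the input line becomes its own output record, so the output/input record ratio of the map is simply the average number of words per line.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      word.set(tokens.nextToken());
      context.write(word, ONE); // "This is a line" -> (This,1) (is,1) (a,1) (line,1)
    }
  }
}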
Making gzip splittable for Hadoop
Hi, In many Hadoop production environments you get gzipped files as the raw input. Usually these are Apache HTTPD logfiles. When putting these gzipped files into Hadoop you are stuck with exactly 1 map task per input file. In many scenarios this is fine. However when doing a lot of work in this very first map task it may very well be advantageous to divide the work over multiple tasks, even if there is a penalty for this scaling out. I've created an add-on for Hadoop that makes this possible. I've reworked the patch I initially created to be included in hadoop (see HADOOP-7076). It can now be used by simply adding a jar file to the classpath of an existing Hadoop installation. I put the code on github ( https://github.com/nielsbasjes/splittablegzip ) and (for now) the description on my homepage: http://niels.basjes.nl/splittable-gzip This feature only works with Hadoop 0.21 and up (I tested it with Cloudera CDH4b1). So for now Hadoop 1.x is not yet supported (waiting for HADOOP-7823). Running mvn package automatically generates an RPM on my CentOS system. Have fun with it and let me know what you think. -- Best regards / Met vriendelijke groeten, Niels Basjes
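Enabling it comes down to putting the jar on the job classpath and registering the codec. The command below is only a sketch; the codec class name, property names and values are my assumptions here, so please check the README on github before copying it.

# Sketch (all names/values to be verified against the project documentation):
hadoop jar myjob.jar nl.example.MyJob \
  -libjars /path/to/splittablegzip.jar \
  -D io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,nl.basjes.hadoop.io.compress.SplittableGzipCodec \
  -D mapred.max.split.size=134217728 \
  input output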
Re: Merge sorting reduce output files
Hi, On Thu, Mar 1, 2012 at 00:07, Robert Evans ev...@yahoo-inc.com wrote: Sorry it has taken me so long to respond. Today has been a very crazy day. No worries. I am just guessing what your algorithm is for auto-complete. What we have has a lot more features. Yet the basic idea of what we have is similar enough to what you describe for this discussion. If we want the keys to come out in sorted order, we need to have a sequence file with the partition keys for the total order partitioner. TeraSort generates a partition file by getting This only really works for Terasort because it assumes that all of the partitions are more or less random already. And that is something I don't have. This is the case for the output of a typical map/reduce job where the reduce does not change the keys passed in and the output of the reducer is less then a block in size. That sure sounds like what wordcount does to me. The only real way to get around that is to do it as part of a map/reduce job, and do some random sampling instead of reading the first N. It should be a map/reduce job because it is going to be reading a lot more data then TeraSort’s partition generation code. In this case you would have a second M/R job that runs after the first and randomly samples words/phrases to work on. It would then generate the increasing long phrases and send them all to a single reducer that would buffer them up, and when the Reducer has no more input it would output every Nth key so that you get the proper number of partitions for the Reducers. You could sort these keys yourself to be sure, but they should come in in sorted order so why bother resorting. If my assumptions are totally wrong here please let me know. I've had a discussion with some coworkers and we came to a possible solution that is very closely related to your idea. Because this is a job that runs periodically we think we can assume the distribution of the dataset will have a similar shape from one run to the next. If this assumption holds we can: 1) Create a job that takes the output of run 1 and create a aggregate that can be used to partition the dataset 2) Use the partitioning dataset from '1)' to distribute the processing for the next run. Thanks for your suggestions. -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: Merge sorting reduce output files
Robert, On Tue, Feb 28, 2012 at 23:28, Robert Evans ev...@yahoo-inc.com wrote: I am not sure I can help with that unless I know better what “a special distribution” means. The thing is that this application is a Auto Complete feature that has a key that is the letters that have been typed so far. Now for several reasons we need this to be sorted by length of the input. So the '1 letter suggestions' first, then the '2 letter suggestions' etc. I've been trying to come up with an automatic partitioning that would split the dataset into something like 30 parts that when concatenated do what you suggest. Unless you are doing a massive amount of processing in your reducer having a partition that is only close to balancing the distribution is a big win over all of the other options that put the data on a single machine and sort it there. Even if you are doing a lot of processing in the reducer, or you need a special grouping to make the reduce work properly having a second map/reduce job to sort the data that is just close to balancing I would suspect would beat out all of the other options. Thanks, this is a useful suggestion. I'll see if there is a pattern in the data and from there simply manual define the partitions based on the pattern we find. -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: Should splittable Gzip be a core hadoop feature?
Hi, On Wed, Feb 29, 2012 at 13:10, Michel Segel michael_se...@hotmail.comwrote: Let's play devil's advocate for a second? I always like that :) Why? Because then datafiles from other systems (like the Apache HTTP webserver) can be processed without preprocessing more efficiently. Snappy exists. Compared to gzip: Snappy is faster, compresses a bit less and is unfortunately not splittable. The only advantage is that you don't have to convert from gzip to snappy and can process gzip files natively. Yes, that and the fact that the files are smaller. Note that I've described some of these considerations in the javadoc. Next question is how large are the gzip files in the first place? I work for the biggest webshop in the Netherlands and I'm facing a set of logfiles that are very often 1 GB each and are gzipped. The first thing we do with then is parse and disect each line in the very first mapper. Then we store the result in (snappy compressed) avro files. I don't disagree, I just want to have a solid argument in favor of it... :) -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: Should splittable Gzip be a core hadoop feature?
Hi, On Wed, Feb 29, 2012 at 16:52, Edward Capriolo edlinuxg...@gmail.com wrote: ... But being able to generate split info for them and processing them would be good as well. I remember that was a hot thing to do with lzo back in the day. The pain of once overing the gz files to generate the split info is detracting but it is nice to know it is there if you want it. Note that the solution I created (HADOOP-7076) does not require any preprocessing. It can split ANY gzipped file as-is. The downside is that this effectively costs some additional performance because the task has to decompress the first part of the file that is to be discarded. The other two ways of splitting gzipped files either require - creating some kind of compression index before actually using the file (HADOOP-6153) - creating a file in a format that is generated in such a way that it is really a set of concatenated gzipped files. (HADOOP-7909) -- Best regards / Met vriendelijke groeten, Niels Basjes
Should splittable Gzip be a core hadoop feature?
Hi, Some time ago I had an idea and implemented it. Normally you can only run a single gzipped input file through a single mapper and thus only on a single CPU core. What I created makes it possible to process a Gzipped file in such a way that it can run on several mappers in parallel. I've put the javadoc I created on my homepage so you can read more about the details. http://howto.basjes.nl/hadoop/javadoc-for-skipseeksplittablegzipcodec Now the question that was raised by one of the people reviewing this code was: Should this implementation be part of the core Hadoop feature set? The main reason that was given is that this needs a bit more understanding on what is happening and as such cannot be enabled by default. I would like to hear from the Hadoop Core/Map reduce users what you think. Should this be - a part of the default Hadoop feature set so that anyone can simply enable it by setting the right configuration? - a separate library? - a nice idea I had fun building but that no one needs? - ... ? -- Best regards / Met vriendelijke groeten, Niels Basjes
Merge sorting reduce output files
Hi, We have a job that outputs a set of files that are several hundred MB of text each. Using the comparators and such we can produce output files that are each sorted by themselves. What we want is to have one giant outputfile (outside of the cluster) that is sorted. Now we see the following options: 1) Run the last job with 1 reducer. This is not really an option because that would put a significant part of the processing time through 1 cpu (this would take too long). 2) Create an additional job that sorts the existing files and has 1 reducer. 3) Download all of the files and run the standard commandline tool sort -m 4) Install HDFS fuse and run the standard commandline tool sort -m 5) Create an hadoop specific tool that can do hadoop fs -text and sort -m in one go. During our discussion we were wondering: What is the best way of doing this? What do you recommend? -- Best regards / Met vriendelijke groeten, Niels Basjes
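To make options 3 and 5 concrete, the commands would look something like this (paths are just examples); note that sort -m only merges, so it relies on every part file already being sorted, and LC_ALL=C keeps the byte-order sorting that Hadoop uses for Text keys.

# Option 3: pull the sorted part files out of HDFS and merge them locally
hadoop fs -get /user/niels/job-output/part-* .
LC_ALL=C sort -m part-* > total-sorted.txt

# Option 5 (bash): stream and merge without storing the parts on local disk
LC_ALL=C sort -m \
  <(hadoop fs -text /user/niels/job-output/part-00000) \
  <(hadoop fs -text /user/niels/job-output/part-00001) > total-sorted.txt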
Re: Merge sorting reduce output files
Hi Robert, On Tue, Feb 28, 2012 at 21:41, Robert Evans ev...@yahoo-inc.com wrote: I would recommend that you do what terrasort does and use a different partitioner, to ensure that all keys within a given range will go to a single reducer. If your partitioner is set up correctly then all you have to do is to concatenate the files together, if you even need to do that. Look at TotalOrderPartitioner. It should do what you want. I know about that partitioner. The trouble I have is comming up with a partitioning that evenly balances the data for this specific problem. Taking a sample and base the partitioning on that (like the one used in terrasort) wouldn't help. The data has a special distribution... Niels Basjes --Bobby Evans On 2/28/12 2:10 PM, Niels Basjes ni...@basjes.nl wrote: Hi, We have a job that outputs a set of files that are several hundred MB of text each. Using the comparators and such we can produce output files that are each sorted by themselves. What we want is to have one giant outputfile (outside of the cluster) that is sorted. Now we see the following options: 1) Run the last job with 1 reducer. This is not really an option because that would put a significant part of the processing time through 1 cpu (this would take too long). 2) Create an additional job that sorts the existing files and has 1 reducer. 3) Download all of the files and run the standard commandline tool sort -m 4) Install HDFS fuse and run the standard commandline tool sort -m 5) Create an hadoop specific tool that can do hadoop fs -text and sort -m in one go. During our discussion we were wondering: What is the best way of doing this? What do you recommend? -- Best regards / Met vriendelijke groeten, Niels Basjes
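For reference, the TotalOrderPartitioner wiring Robert refers to looks roughly like this (a sketch for the new mapreduce API; depending on the Hadoop version these classes live in org.apache.hadoop.mapreduce.lib.partition or org.apache.hadoop.mapred.lib, the sampler parameters are arbitrary examples, and the sampling only makes sense when the map input keys are the same keys you sort on):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

// Fragment: 'job' is assumed to already have its input/output paths and Text key types set.
Path partitionFile = new Path("/tmp/partitions.lst");
TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile);
job.setPartitionerClass(TotalOrderPartitioner.class);

// Sample roughly 1% of the input (at most 10000 samples from at most 10 splits)
// and write reducer boundaries that should balance the load across the reducers.
InputSampler.Sampler<Text, Text> sampler =
    new InputSampler.RandomSampler<Text, Text>(0.01, 10000, 10);
InputSampler.writePartitionFile(job, sampler);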
Re: How to Create an effective chained MapReduce program.
Hi, In the past i've had the same situation where I needed the data for debugging. Back then I chose to create a second job with simply SequenceFileInputFormat, IdentityMapper, IdentityReducer and finally TextOutputFormat. In my situation that worked great for my purpose. -- Met vriendelijke groet, Niels Basjes Op 6 sep. 2011 01:54 schreef ilyal levin nipponil...@gmail.com het volgende: o.k , so now i'm using SequenceFileInputFormat and SequenceFileOutputFormat and it works fine but the output of the reducer is now a binary file (not txt) so i can't understand the data. how can i solve this? i need the data (in txt form ) of the Intermediate stages in the chain. Thanks On Tue, Sep 6, 2011 at 1:33 AM, ilyal levin nipponil...@gmail.com wrote: Thanks for the help. On Mon, Sep 5, 2011 at 10:50 PM, Roger Chen rogc...@ucdavis.edu wrote: The binary file will allow you to pass the output from the first reducer to the second mapper. For example, if you outputed Text, IntWritable from the first one in SequenceFileOutputFormat, then you are able to retrieve Text, IntWritable input at the head of the second mapper. The idea of chaining is that you know what kind of output the first reducer is going to give already, and that you want to perform some secondary operation on it. One last thing on chaining jobs: it's often worth looking to see if you can consolidate all of your separate map and reduce tasks into a single map/reduce operation. There are many situations where it is more intuitive to write a number of map/reduce operations and chain them together, but more efficient to have just a single operation. On Mon, Sep 5, 2011 at 12:21 PM, ilyal levin nipponil...@gmail.com wrote: Thanks for the reply. I tried it but it creates a binary file which i can not understand (i need the result of the first job). The other thing is how can i use this file in the next chained mapper? i.e how can i retrieve the keys and the values in the map function? Ilyal On Mon, Sep 5, 2011 at 7:41 PM, Joey Echeverria j...@cloudera.com wrote: Have you tried SequenceFileOutputFormat and SequenceFileInputFormat? -Joey On Mon, Sep 5, 2011 at 11:49 AM, ilyal levin nipponil...@gmail.com wrote: Hi I'm trying to write a chained mapreduce program. i'm doing so with a simple loop where in each iteration i create a job ,execute it and every time the current job's output is the next job's input. how can i configure the outputFormat of the current job and the inputFormat of the next job so that i will not use the TextInputFormat (TextOutputFormat), because if i do use it, i need to parse the input file in the Map function? i.e if possible i want the next job to consider the input file as key,value and not plain Text. Thanks a lot. -- Joseph Echeverria Cloudera, Inc. 443.305.9434 -- Roger Chen UC Davis Genome Center
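A minimal sketch of such a debug job (class name and key/value types are examples; in the new API the default Mapper and Reducer already pass everything through unchanged, and the output key/value classes must match whatever is stored inside the sequence file):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class DumpSequenceFileAsText {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "dump sequence file as text");
    job.setJarByClass(DumpSequenceFileAsText.class);

    // No mapper/reducer set: the defaults pass every key/value through unchanged.
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(Text.class);         // must match the types inside the file
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}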
Re: Excuting a shell script inside the HDFS
Yes, that way it could work. I'm just wondering ... Why would you want to have a script like this in HDFS? Met vriendelijk groet, Niels Basjes Op 16 aug. 2011 06:49 schreef Friso van Vollenhoven fvanvollenho...@xebia.com het volgende: hadoop fs -cat /path/on/hdfs/script.sh | bash Should work (if you use bash). Does anyone have access to HDFS or do you use HDFS permissions? Executing arbitrary stuff that comes out of a network filesystem is not always a good idea from a security perspective. Friso On 15 aug. 2011, at 23:51, Kadu canGica Eduardo wrote: Hi, i'm trying without success to execute a shell script that is inside the HDFS. Is this possible? If it is, how can i do this? Since ever, thanks. Carlos.
Re: How to select random n records using mapreduce ?
The only solution I can think of is by creating a counter in Hadoop that is incremented each time a mapper lets a record through. As soon as the value reaches a preselected value the mappers simply discard the additional input they receive. Note that this will not at all be random, yet it's the best I can come up with right now. HTH On Mon, Jun 27, 2011 at 09:11, Jeff Zhang zjf...@gmail.com wrote: Hi all, I'd like to select random N records from a large amount of data using hadoop, just wonder how can I achieve this ? Currently my idea is that let each mapper task select N / mapper_number records. Does anyone has such experience ? -- Best Regards Jeff Zhang -- Best regards / Met vriendelijke groeten, Niels Basjes
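A rough sketch of that idea (the configuration property "records.limit" and the types are invented for this example). Keep in mind a running task can only see its own counter value, so this effectively limits the number of records per map task rather than one global total:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LimitingMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
  private long limit;
  private long emitted = 0;

  @Override
  protected void setup(Context context) {
    // Per map task limit, read from the job configuration.
    limit = context.getConfiguration().getLong("records.limit", 1000);
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (emitted >= limit) {
      return; // discard all further input once the limit has been reached
    }
    context.write(key, value);
    emitted++;
    context.getCounter("Limit", "RecordsEmitted").increment(1);
  }
}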
Re: AW: How to split a big file in HDFS by size
Hi, On Tue, Jun 21, 2011 at 16:14, Mapred Learn mapred.le...@gmail.com wrote: The problem is when 1 text file goes on HDFS as 60 GB file, one mapper takes more than an hour to convert it to sequence file and finally fails. I was thinking how to split it from the client box before uploading to HDFS. Have a look at this: http://stackoverflow.com/questions/3960651/splitting-gzipped-logfiles-without-storing-the-ungzipped-splits-on-disk If I read file and split it with filestream. Read() based on size, it takes 2 hours to process 1 60 gb file and upload on HDFS as 120 500 mb files. Sent from my iPhone On Jun 21, 2011, at 2:57 AM, Evert Lammerts evert.lamme...@sara.nl wrote: What we did was on not-hadoop hardware. We streamed the file from a storage cluster to a single machine and cut it up while streaming the pieces back to the storage cluster. That will probably not work for you, unless you have the hardware for it. But then still its inefficient. You should be able to unzip your file in a MR job. If you still want to use compression you can install LZO and rezip the file from within the same job. (LZO uses block-compression, which allows Hadoop to process all blocks in parallel.) Note that you’ll need enough storage capacity. I don’t have example code, but I’m guessing Google can help. From: Mapred Learn [mailto:mapred.le...@gmail.com] Sent: maandag 20 juni 2011 18:09 To: Niels Basjes; Evert Lammerts Subject: Re: AW: How to split a big file in HDFS by size Thanks for sharing. Could you guys share how are you divinding your 2.7 TB into 10 Gb file each on HDFS ? That wud be helpful for me ! On Mon, Jun 20, 2011 at 8:39 AM, Marcos Ortiz mlor...@uci.cu wrote: Evert Lammerts at Sara.nl did something seemed to your problem, spliting a big 2.7 TB file to chunks of 10 GB. This work was presented on the BioAssist Programmers' Day on January of this year and its name was Large-Scale Data Storage and Processing for Scientist in The Netherlands http://www.slideshare.net/evertlammerts P.D: I sent the message with a copy to him El 6/20/2011 10:38 AM, Niels Basjes escribió: Hi, On Mon, Jun 20, 2011 at 16:13, Mapred Learnmapred.le...@gmail.com wrote: But this file is a gzipped text file. In this case, it will only go to 1 mapper than the case if it was split into 60 1 GB files which will make map-red job finish earlier than one 60 GB file as it will Hv 60 mappers running in parallel. Isn't it so ? Yes, that is very true. -- Marcos Luís Ortíz Valmaseda Software Engineer (UCI) http://marcosluis2186.posterous.com http://twitter.com/marcosluis2186 -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: AW: How to split a big file in HDFS by size
Hi, On Mon, Jun 20, 2011 at 16:13, Mapred Learn mapred.le...@gmail.com wrote: But this file is a gzipped text file. In this case, it will only go to 1 mapper than the case if it was split into 60 1 GB files which will make map-red job finish earlier than one 60 GB file as it will Hv 60 mappers running in parallel. Isn't it so ? Yes, that is very true. -- Best regards / Met vriendelijke groeten, Niels Basjes
Re: Using df instead of du to calculate datanode space
Hi, Although I like the thought of doing things smarter I'm never ever going to change core Unix/Linux applications for the sake of a specific application. Linux handles scripts and binaries completely different with regards to security. So how do you know for sure (I mean 100% sure, not just 99.% sure) that you haven't broken any other functionality needed to keep your system sane? Why don't you issue a feature request so this needless disk io can be fixed as part of the base code of Hadoop (instead of breaking the underlying OS)? Niels 2011/5/21 Edward Capriolo edlinuxg...@gmail.com: Good job. I brought this up an another thread, but was told it was not a problem. Good thing I'm not crazy. On Sat, May 21, 2011 at 12:42 AM, Joe Stein charmal...@allthingshadoop.comwrote: I came up with a nice little hack to trick hadoop into calculating disk usage with df instead of du http://allthingshadoop.com/2011/05/20/faster-datanodes-with-less-wait-io-using-df-instead-of-du/ I am running this in production, works like a charm and already seeing benefit, woot! I hope it works well for others too. /* Joe Stein http://www.twitter.com/allthingshadoop */ -- Met vriendelijke groeten, Niels Basjes
Including external libraries in my job.
Hi, I've written my first very simple job that does something with hbase. Now when I try to submit my jar in my cluster I get this: [nbasjes@master ~/src/catalogloader/run]$ hadoop jar catalogloader-1.0-SNAPSHOT.jar nl.basjes.catalogloader.Loader /user/nbasjes/Minicatalog.xml Exception in thread main java.lang.NoClassDefFoundError: org/apache/hadoop/hbase/HBaseConfiguration at nl.basjes.catalogloader.Loader.main(Loader.java:156) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) ... I've found this blog post that promises help http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/ Quote: 1. Include the JAR in the “-libjars” command line option of the `hadoop jar …` command. The jar will be placed in distributed cache and will be made available to all of the job’s task attempts. However one of the comments states: Unfortunately, method 1 only work before 0.18, it doesn’t work in 0.20. Indeed, I can't get it to work this way. I've tried something as simple as: export HADOOP_CLASSPATH=/usr/lib/hbase/hbase-0.90.1-cdh3u0.jar:/usr/lib/zookeeper/zookeeper-3.3.3-cdh3u0.jar and then run the job but that (as expected) simply means the tasks on the processing nodes fail with a similar error: java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.TableOutputFormat at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:996) at org.apache.hadoop.mapreduce.JobContext.getOutputFormatClass(JobContext.java:248) at org.apache.hadoop.mapred.Task.initialize(Task.java:486) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305) ... So what is the correct way of doing this? -- Met vriendelijke groeten, Niels Basjes
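For what it's worth, the pattern that usually works here is a combination of both (a sketch reusing the jar paths from above): HADOOP_CLASSPATH fixes the client-side NoClassDefFoundError, and -libjars ships the jars to the task attempts via the distributed cache, but only if the main class passes its arguments through ToolRunner/GenericOptionsParser.

# Client side: lets the driver JVM load the HBase classes
export HADOOP_CLASSPATH=/usr/lib/hbase/hbase-0.90.1-cdh3u0.jar:/usr/lib/zookeeper/zookeeper-3.3.3-cdh3u0.jar

# Task side: -libjars is only honoured when the main class uses ToolRunner/GenericOptionsParser
hadoop jar catalogloader-1.0-SNAPSHOT.jar nl.basjes.catalogloader.Loader \
  -libjars /usr/lib/hbase/hbase-0.90.1-cdh3u0.jar,/usr/lib/zookeeper/zookeeper-3.3.3-cdh3u0.jar \
  /user/nbasjes/Minicatalog.xml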
Unsplittable files on HDFS
Hi, In some scenarios you have gzipped files as input for your map reduce job (apache logfiles is a common example). Now some of those files are several hundred megabytes and as such will be split by HDFS in several blocks. When looking at a real 116MiB file on HDFS I see this (4 nodes, replication = 2) Total number of blocks: 2 25063947863662497: 10.10.138.62:50010 10.10.138.61:50010 1014249434553595747: 10.10.138.64:50010 10.10.138.63:50010 As you can see the file has been distributed over all 4 nodes. When actually reading those files they are unsplittable due to the nature of the Gzip codec. So a job will (in the above example) ALWAYS need to pull the other half of the file over the network, if a file is bigger and the cluster is bigger then the percentage of the file that goes over the network will probably increase. Now if I can tell HDFS that a .gz file should always be 100% local for the node that will be doing the processing this would reduce the network IO during the job dramatically. Especially if you want to run several jobs against the same input. So my question is: Is there a way to force/tell HDFS to make sure that a datanode that has blocks of this file must always have ALL blocks of this file? -- Best regards, Niels Basjes
Re: Unsplittable files on HDFS
Hi, I did the following with a 1.6GB file hadoop fs -Ddfs.block.size=2147483648 -put /home/nbasjes/access-2010-11-29.log.gz /user/nbasjes and I got Total number of blocks: 1 4189183682512190568:10.10.138.61:50010 10.10.138.62:50010 Yes, that does the trick. Thank you. Niels 2011/4/27 Harsh J ha...@cloudera.com: Hey Niels, The block size is a per-file property. Would putting/creating these gzip files on the DFS with a very high block size (such that it doesn't split across for such files) be a valid solution to your problem here? On Wed, Apr 27, 2011 at 1:25 PM, Niels Basjes ni...@basjes.nl wrote: Hi, In some scenarios you have gzipped files as input for your map reduce job (apache logfiles is a common example). Now some of those files are several hundred megabytes and as such will be split by HDFS in several blocks. When looking at a real 116MiB file on HDFS I see this (4 nodes, replication = 2) Total number of blocks: 2 25063947863662497: 10.10.138.62:50010 10.10.138.61:50010 1014249434553595747: 10.10.138.64:50010 10.10.138.63:50010 As you can see the file has been distributed over all 4 nodes. When actually reading those files they are unsplittable due to the nature of the Gzip codec. So a job will (in the above example) ALWAYS need to pull the other half of the file over the network, if a file is bigger and the cluster is bigger then the percentage of the file that goes over the network will probably increase. Now if I can tell HDFS that a .gz file should always be 100% local for the node that will be doing the processing this would reduce the network IO during the job dramatically. Especially if you want to run several jobs against the same input. So my question is: Is there a way to force/tell HDFS to make sure that a datanode that has blocks of this file must always have ALL blocks of this file? -- Best regards, Niels Basjes -- Harsh J -- Met vriendelijke groeten, Niels Basjes
Re: hadoop mr cluster mode on my laptop?
Hi, You should be doing the setup for what is called Pseudo-distributed mode. Have a look at this: http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html#PseudoDistributed Niels 2011/4/18 Pedro Costa psdc1...@gmail.com: Hi, 1 - I would like to run Hadoop MR in my laptop, but with the cluster mode configuration. I've put a slave file with the following content: [code] 127.0.0.1 localhost 192.168.10.1 mylaptop [/code] the 192.168.10.1 is the IP of the machine and the mylaptop is the logical name of the address. Is this a well configured slaves file to do what I want? 2 - I was also thinking taking advantage of my CPU to put Hadoop MR as cluster mode. This is my CPU characteristics: Essentials Processor Number i5-460M # of Cores 2 # of Threads 4 Clock Speed 2.53 GHz Max Turbo Frequency 2.8 GHz Intel® Smart Cache 3 MB Instruction Set 64-bit I was thinking in running a Tasktracker in each Core, although I don't know how to do it. Any help to install Hadoop MR in cluster mode on my laptop? Thanks, -- Pedro -- Met vriendelijke groeten, Niels Basjes
Re: Small linux distros to run hadoop ?
Hi, 2011/4/15 web service wbs...@gmail.com: what is the smallest linux system/distros to run hadoop ? I would want to run small linux vms and run jobs on them. I usually use a fully stripped CentOS 5 to run cluster nodes. Works perfectly and can be fully automated using the kickstart scripting for anaconda (= Redhat installer). -- Met vriendelijke groeten, Niels Basjes
Re: Re-generate datanode storageID?
Hi, To solve that simply do the following on the problematic nodes: 1) Stop the datanode (probably not running) 2) Remove everything inside the .../cache/hdfs/ 3) Start the datanode again. Note: With cloudera always use service way to stop/start hadoop software! service hadoop-0.20-datanode stop 2011/3/24 Marc Leavitt leavitt...@gmail.com: I am setting up a (very) small Hadoop/CDH3 beta 4 cluster in virtual machines to do some initial feasibility work. I proceeded by progressing through the Cloudera documentation standalone - pseudo-cluster - cluster with a single VM and then, when I had it stable(-ish) I copied the VM to a couple of slave images. The good news is that all three tasktrackers show up. The bad news is that only one datanode shows up. And, after after some research, I am pretty sure I know why - each of the datanodes is trying to claim the same storageID (as defined in .../cache/hdfs/dfs/data/current/VERSION. So, my question is How do I resolve the collision of the storageIDs? Thanks! -mgl -- Met vriendelijke groeten, Niels Basjes
Re: TextInputFormat and Gzip encoding - wordcount displaying binary data
Hi, 2011/3/21 Saptarshi Guha saptarshi.g...@gmail.com: It's frustrating to be dealing with these simple problems (and I know the fault is mine, i'm missing something). I'm running word count (from 0.20-2) on a gzip file (very small), the output has binary characters. When I run the same on the ungzipped file, the output is correct ascii. I'm using the native gzip library. The command is hadoop jar /usr/lib/hadoop-0.20/hadoop-examples-0.20.2-CDH3B4.jar wordcount /user/sguha/tmp/o.zip /user/sguha/tmp/o.wc.zip (zip is gzip) No, .zip is pkzip and .gz is gzip. The applicable hadoop code actually chooses the decompressor based on the extension of the filename. -- Niels Basjes
Re: File formats in Hadoop
And then there is the matter of how you put the data in the file. I've heard that some people write the data as protocolbuffers into the sequence file. 2011/3/19 Harsh J qwertyman...@gmail.com: Hello, On Sat, Mar 19, 2011 at 9:31 PM, Weishung Chung weish...@gmail.com wrote: I am browsing through the hadoop.io package and was wondering what other file formats are available in hadoop other than SequenceFile and TFile? Additionally, on Hadoop, there're MapFiles/SetFiles (Derivative of SequenceFiles, if you need maps/sets), and IFiles (Used by the map-output buffers to produce a key-value file for Reducers to use, internal use only). Apache Hive use RCFiles, which is very interesting too. Apache Avro provides Avro-Datafiles that are designed for use with Hadoop Map/Reduce + Avro-serialized data. I'm not sure of this one, but Pig probably was implementing a table-file-like solution of their own a while ago. Howl? -- Harsh J http://harshj.com -- Met vriendelijke groeten, Niels Basjes
Re: Efficiently partition broadly distributed keys
If I understand your problem correctly you actually need some way of knowing if you need to chop a large set with a specific key into subsets. In mapreduce the map only has information about a single key at a time. So you need something extra. One way of handling this is to start by doing a statistical run first and use the output to determine which keys can be chopped. So first do a MR to determine per key how many records and/or bytes need to be handled. In the reducer you have a lower limit that ensures the reducer only outputs the keys that need to be chopped. In the second pass you do your actual algorithm with one twist: In the mapper you use the output of the first run to determine if the key needs to be rewritten and in how many variants. You then randomly (!) choose a variant. Example of what I mean: Assume you have 100 records with key A and 10,000 records with key B. You determine that you want key groups of an approximate maximum of 1000 records. The first pass outputs that key B must be split into 10 variants (= 10,000/1000). Then the second MAP will transform a key B into one of these randomly: B-1, B-2, B-3 ... A record with key A will remain unchanged. Disadvantage: Each run will show different results. Only works if the set of keys that needs to be chopped is small enough so you can have it in memory in the call to the second map. HTH Niels Basjes 2011/3/10 Luca Aiello alu...@yahoo-inc.com: Dear users, hope this is the right list to submit this one, otherwise I apologize. I'd like to have your opinion about a problem that I'm facing on MapReduce framework. I am writing my code in Java and running on a grid. I have a textual input structured in key, value pairs. My task is to make the cartesian product of all the values that have the same key. I can do so it simply using key as my map key, so that every value with the same key is put in the same reducer, where I can easily process them and obtain the cartesian. However, my keys are not uniformly distributed, the distribution is very broad. As a result of this, the output of my reducers will be very unbalanced and I will have many small files (some KB) and a bunch of huge files (tens of GB). A sub-optimal yet acceptable approximation of my task would be to make the cartesian product of smaller chunks of values for very frequent keys, so that the load is distributed evenly among reducers. I am wondering how can I do this in the most efficient/elegant way. It appears to me that using a customized Partitioner is not the right way to act, since records with the same key have still to be mapped together (am I right?). The only solution that comes into my mind is to split the key space artificially insider the mapper (e.g., for a frequent key ABC, map the values on the reducers using keys like ABC1, ABC2, an so on). This would require an additional post-processing cleanup phase to retrieve the original keys. Do you know a better, possibly automatic way to perform this task? Thank you! Best Luca -- Met vriendelijke groeten, Niels Basjes
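A sketch of the second-pass mapper described above (class name and types are illustrative; loading the output of the statistics job, for instance from the DistributedCache, is left out):

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Random;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class SaltingMapper extends Mapper<Text, Text, Text, Text> {
  // key -> number of variants, filled from the output of the statistics job
  private final Map<String, Integer> variantsPerKey = new HashMap<String, Integer>();
  private final Random random = new Random();

  @Override
  protected void setup(Context context) throws IOException {
    // Load the statistics output here (e.g. from the DistributedCache); omitted in this sketch.
  }

  @Override
  protected void map(Text key, Text value, Context context)
      throws IOException, InterruptedException {
    Integer variants = variantsPerKey.get(key.toString());
    if (variants == null || variants <= 1) {
      context.write(key, value);                  // infrequent key: pass through unchanged
    } else {
      int variant = random.nextInt(variants) + 1; // frequent key: pick a random variant
      context.write(new Text(key.toString() + "-" + variant), value);
    }
  }
}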
Re: Efficiently partition broadly distributed keys
Hi Luca, 2011/3/10 Luca Aiello alu...@yahoo-inc.com: thanks for the quick response. So in your opinion there is nothing like a hadoop embedded tool to do this. This is what I suspected indeed. The mapreduce model simply uses the key as the pivot of the processing. In your application specific situation your looking for a way to break that model in a smart way. So no, Hadoop doesn't do that. Since the key, value pairs in the initial dataset are randomly partitioned in the input files, I suppose that I can avoid the initial statistic step and do statistics on the fly. For example, if I have 3 keys A,B,C with probability 0.7, 0.2, and 0.1, every mapper will receive, on average, a input composed by 70% of A keys, 20% of Bs and 10% of Cs and so it will know that the key,value to be chunked are de As. If you know this in advance you can surely put that into the the mapper. But that simply means you've done the statistics in advance in some way. I fail to see how you could do this on the fly without prior knowledge. As a slightly smarter way of doing something like this; Have a look at the terasort example. There a sample of the data is used to estimate the distribution. This can introduce some errors but it should produce a output which is quite uniformly distributed. Thanks again! You're welcome. Niels On Mar 10, 2011, at 12:23 PM, Niels Basjes wrote: If I understand your problem correctly you actually need some way of knowing if you need to chop a large set with a specific key in to subsets. In mapreduce the map only has information about a single key at a time. So you need something extra. One way of handling this is to start by doing a statistical run first and use the output to determine which keys can be chopped. So first do a MR to determine per key how many records and/or bytes need to be handled. In the reducer you have a lower limit that ensures the reducer only outputs the keys that need to be chopped. In the second pass you do your actual algorithm with one twist: In the mapper you use the output of the first run to determine if the key needs to be rewritten and in how many variants. You then randomly (!) choose a variant. Example of what i mean: Assume you have 100 records with key A and 1 records with key B. You determine that you want key groups of an approximate maximum of 1000 records. The first pass outputs that key B must be split into 10 variants (=1/1000). Then the second MAP will transform a key B into one of these randomly: B-1, B-2, B-3 ... A record with key A will remain unchanged. Disadvantage: Each run will show different results. Only works if the set of keys that needs to be chopped is small enough so you can have it in memory in the call to the second map. HTH Niels Basjes 2011/3/10 Luca Aiello alu...@yahoo-inc.com: Dear users, hope this is the right list to submit this one, otherwise I apologize. I'd like to have your opinion about a problem that I'm facing on MapReduce framework. I am writing my code in Java and running on a grid. I have a textual input structured in key, value pairs. My task is to make the cartesian product of all the values that have the same key. I can do so it simply using key as my map key, so that every value with the same key is put in the same reducer, where I can easily process them and obtain the cartesian. However, my keys are not uniformly distributed, the distribution is very broad. As a result of this, the output of my reducers will be very unbalanced and I will have many small files (some KB) and a bunch of huge files (tens of GB). 
A sub-optimal yet acceptable approximation of my task would be to make the cartesian product of smaller chunks of values for very frequent keys, so that the load is distributed evenly among reducers. I am wondering how can I do this in the most efficient/elegant way. It appears to me that using a customized Partitioner is not the right way to act, since records with the same key have still to be mapped together (am I right?). The only solution that comes into my mind is to split the key space artificially insider the mapper (e.g., for a frequent key ABC, map the values on the reducers using keys like ABC1, ABC2, an so on). This would require an additional post-processing cleanup phase to retrieve the original keys. Do you know a better, possibly automatic way to perform this task? Thank you! Best Luca -- Met vriendelijke groeten, Niels Basjes -- Met vriendelijke groeten, Niels Basjes
Re: Comparison between Gzip and LZO
Question: Are you 100% sure that nothing else was running on that system during the tests? No cron jobs, no makewhatis or updatedb? P.S. There is a permission issue with downloading one of the files. 2011/3/2 José Vinícius Pimenta Coletto jvcole...@gmail.com: Hi, I'm making a comparison between the following compression methods: gzip and lzo provided by Hadoop and gzip from package java.util.zip. The test consists of compression and decompression of approximately 92,000 files with an average size of 2kb, however the decompression time of lzo is twice the decompression time of gzip provided by Hadoop, it does not seem right. The results obtained in the test are:
Method               | Bytes     | Compression: Total Time (with i/o), Time, Speed | Decompression: Total Time (with i/o), Time, Speed
Gzip (Hadoop)        | 200876304 | 121.454s, 43.167s, 4,653,424.079 B/s            | 332.305s, 111.806s, 1,796,635.326 B/s
Lzo                  | 200876304 | 120.564s, 54.072s, 3,714,914.621 B/s            | 509.371s, 184.906s, 1,086,368.904 B/s
Gzip (java.util.zip) | 200876304 | 148.014s, 63.414s, 3,167,647.371 B/s            | 483.148s, 4.528s, 44,360,682.244 B/s
You can see the code I'm using to the test here: http://www.linux.ime.usp.br/~jvcoletto/compression/ Can anyone explain me why am I getting these results? Thanks.
Re: How to make a CGI with HBase?
Hi, Is there a very basic example on such a KISS interface towards a HBase table? Thanks 2011/3/1 Michael Segel michael_se...@hotmail.com: I'd say skip trying to shoe horn JDBC interface on HBase, and just write straight to it. Remember K.I.S.S. -Mike Subject: Re: How to make a CGI with HBase? From: goks...@gmail.com Date: Mon, 28 Feb 2011 21:06:52 -0800 CC: goks...@gmail.com To: common-user@hadoop.apache.org Java servlets for web development with databases is a learning curve. HBase is a learning curve. You might want to learn one at a time :) If you code the site in Java/JSPs, you want a JDBC driver for HBase. You can code calls to HBase directly in a JSP without writing Java. Lance On Feb 28, 2011, at 6:19 PM, edward choi wrote: Thanks for the reply. Didn't know that Thrift was for such purpose. Servlet and JSP is totally new to me. I skimmed through the concept on the internet and they look fascinating. I think I am gonna give servlet and JSP a try. On 2011. 3. 1., at 오전 12:51, Usman Waheed usm...@opera.com wrote: HI, I have been using the Thrift Perl API to connect to Hbase for my web app. At the moment i only perform random reads and scans based on date ranges and some other search criteria. It works and I am still testing performance. -Usman On Mon, Feb 28, 2011 at 2:37 PM, edward choi mp2...@gmail.com wrote: Hi, I am planning to make a search engine for news articles. It will probably have over several billions of news articles so I thought HBase is the way to go. However, I am very new to CGI. All I know is that you use php, python or java script with HTML to make a web site and communicate with the backend database such as MySQL. But I am going to use HBase, not MySQL, and I can't seem to find a script language that provides any form of API to communicate with HBase. So what do I do? Do I have to make a web site with pure Java? Is that even possible? It is possible, if you know things like JSP, Java servelets etc. For people comfortable with PHP or Python, I think Apache Thrift (http://wiki.apache.org/thrift/) is an alternative. -b Or is using Hbase as the backend Database a bad idea in the first place? -- Using Opera's revolutionary email client: http://www.opera.com/mail/ -- Met vriendelijke groeten, Niels Basjes
Re: When use hadoop mapreduce?
Hi, 2011/2/17 Pedro Costa psdc1...@gmail.com: I like to know, depending on my problem, when should I use or not use Hadoop MapReduce? Does exist any list that advices me to use or not to use MapReduce? The summary I usually give goes something like this: IF your computation takes too long on a single system AND you can split the work up into a lot of smaller pieces AND the work is batch oriented AND what you are doing involves a lot of data THEN you should consider trying it out. There are other systems for scalable computing (for example Gridgain) that may be better in other situations. HTH -- Met vriendelijke groeten, Niels Basjes
Re: Hadoop in Real time applications
2011/2/17 Karthik Kumar karthik84ku...@gmail.com: Can Hadoop be used for Real time Applications such as banking solutions... Hadoop consists of several components. Components like HDFS and HBase are quite suitable for interactive solutions (as in: I usually get an answer within 0.x seconds). If you really need realtime (as in: I want a guarantee that I have an answer within 0.x seconds) the answer is: No, HDFS/HBase cannot guarantee that. Other components like MapReduce (and Hive which run on top of MapReduce) are purely batch oriented. -- Met vriendelijke groeten, Niels Basjes
Re: Is a Block compressed (GZIP) SequenceFile splittable in MR operation?
Hi, 2011/1/31 Sean Bigdatafun sean.bigdata...@gmail.com: GZIP is not splittable. Correct, gzip is a stream compression system which effectively means you can only start decompressing at the beginning of the data. Does that mean a GZIP block compressed sequencefile can't take advantage of MR parallelism? AFAIK it should be splittable at the boundaries of the blocks on which the compression was done. How to control the size of block to be compressed in SequenceFile? Can't help you with that one. -- Met vriendelijke groeten, Niels Basjes
Re: Restricting number of records from map output
Hi, I have a sort job consisting of only the Mapper (no Reducer) task. I want my results to contain only the top n records. Is there any way of restricting the number of records that are emitted by the Mappers? Basically I am looking to see if there is an equivalent of achieving the behavior similar to LIMIT in SQL queries. I think I understand your goal. However, the question points toward what I think is the wrong solution. A mapper gets 1 record as input and only knows about that one record. There is no way to limit there. If you implement a simple reducer you can very easily let it stop reading the input iterator after N records and limit the output in that way. Doing it in the reducer also allows you to easily add a concept of Top N by using the Secondary Sort trick to sort the input before it arrives at the reducer. HTH Niels Basjes
Re: TeraSort question.
Raj, Have a look at the graph shown here: http://cs.smith.edu/dftwiki/index.php/Hadoop_Tutorial_1.1_--_Generating_Task_Timelines It should make clear that the number of tasks varies greatly over the lifetime of a job. Depending on the nodes available this may leave node idle. Niels 2011/1/11 Raj V rajv...@yahoo.com: Ted Thanks. I have all the graphs I need that include, map reduce timeline, system activity for all the nodes when the sort was running. I will publish them once I have them in some presentable format., For legal reasons, I really don't want to send the complete job histiory files. My question is still this. When running terasort, would the CPU, disk and network utilization of all the nodes be more or less similar or completely different. Sometime during the day, I will post the system data from 5 nodes and that would probably explain my question better. Raj From: Ted Dunning tdunn...@maprtech.com To: common-user@hadoop.apache.org; Raj V rajv...@yahoo.com Cc: Sent: Tuesday, January 11, 2011 8:22:17 AM Subject: Re: TeraSort question. Raj, Do you have the job history files? That would be very useful. I would be happy to create some swimlane and related graphs for you if you can send me the history files. On Mon, Jan 10, 2011 at 9:06 PM, Raj V rajv...@yahoo.com wrote: All, I have been running terasort on a 480 node hadoop cluster. I have also collected cpu,memory,disk, network statistics during this run. The system stats are quite intersting. I can post it when I have put them together in some presentable format ( if there is interest.). However while looking at the data, I noticed something interesting. I thought, intutively, that the all the systems in the cluster would have more or less similar behaviour ( time translation was possible) but the overall graph would look the same., Just to confirm it I took 5 random nodes and looked at the CPU, disk ,network etc. activity when the sort was running. Strangeley enough, it was not so., Two of the 5 systems were seriously busy, big IO with lots of disk and network activity. The other three systems, CPU was more or less 100% idle, slight network and I/O. Is that normal and/or expected? SHouldn't all the nodes be utilized in more or less manner over the length of the run? I generated the data forf the sort using teragen. ( 128MB bloick size, replication =3). I would also be interested in other people timings of sort. Is there some place where people can post sort numbers ( not just the record.) I will post the actual graphs of the 5 nodes, if there is interest, tomorrow. ( Some logistical issues abt. posting them tonight) I am using CDH3B3, even though I think this is not specific to CDH3B3. Sorry for the cross post. Raj -- Met vriendelijke groeten, Niels Basjes
Re: Help: How to increase amont maptasks per job ?
You said you have a large amount of data. How large is that approximately? Did you compress the intermediate data (with what codec)? Niels 2011/1/7 Tali K ncherr...@hotmail.com: According to the documentation, that parameter is for the number of tasks *per TaskTracker*. I am asking about the number of tasks for the entire job and entire cluster. That parameter is already set to 3, which is one less than the number of cores on each node's CPU, as recommended.In my question I stated that 82 tasks were run for the first job, yet only 4 for the second - both numbers being cluster-wide. Date: Fri, 7 Jan 2011 13:19:42 -0800 Subject: Re: Help: How to increase amont maptasks per job ? From: yuzhih...@gmail.com To: common-user@hadoop.apache.org Set higher values for mapred.tasktracker.map.tasks.maximum (and mapred.tasktracker.reduce.tasks.maximum) in mapred-site.xml On Fri, Jan 7, 2011 at 12:58 PM, Tali K ncherr...@hotmail.com wrote: We have a jobs which runs in several map/reduce stages. In the first job, a large number of map tasks -82 are initiated, as expected. And that cause all nodes to be used. In a later job, where we are still dealing with large amounts of data, only 4 map tasks are initiated, and that caused to use only 4 nodes. This stage is actually the workhorse of the job, and requires much more processing power than the initial stage. We are trying to understand why only a few map tasks are being used, as we are not getting the full advantage of our cluster. -- Met vriendelijke groeten, Niels Basjes
Re: FILE_BYTES_WRITTEN and HDFS_BYTES_WRITTEN
For some parts of a task the framework stores data on the local (non-HDFS) file system of the node that is actually running the task; those bytes are counted as FILE_BYTES_WRITTEN. Bytes written to HDFS are counted as HDFS_BYTES_WRITTEN. HTH

2010/11/30 psdc1...@gmail.com: When a Hadoop MapReduce example is executed, at the end of the run it shows a table with information about the execution, like the number of map and reduce tasks executed and the number of bytes read and written. This information contains two fields, FILE_BYTES_WRITTEN and HDFS_BYTES_WRITTEN. What's the difference between these two fields? Thanks, Pedro

-- Met vriendelijke groeten, Niels Basjes
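[Editor's note] If you want to read these two counters programmatically after a job finishes, rather than from the console summary, something along the lines of the sketch below works with the new mapreduce API. The group name "FileSystemCounters" is the one used by this Hadoop generation; treat it as an assumption for other versions.

    import org.apache.hadoop.mapreduce.Counters;
    import org.apache.hadoop.mapreduce.Job;

    public class FsCounterReport {

      // Prints the local-disk and HDFS write counters of a completed job.
      public static void print(Job job) throws Exception {
        Counters counters = job.getCounters();
        long localBytes = counters.getGroup("FileSystemCounters")
                                  .findCounter("FILE_BYTES_WRITTEN").getValue();
        long hdfsBytes = counters.getGroup("FileSystemCounters")
                                 .findCounter("HDFS_BYTES_WRITTEN").getValue();
        System.out.println("Local (non-HDFS) bytes written: " + localBytes);
        System.out.println("HDFS bytes written:             " + hdfsBytes);
      }
    }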
Re: Control the number of Mappers
Hi,

2010/11/25 Shai Erera ser...@gmail.com: Is there a way to make MapReduce create exactly N Mappers? More specifically, if say my data can be split to 200 Mappers, and I have only 100 cores, how can I ensure only 100 Mappers will be created? The number of cores is not something I know in advance, so writing a special InputFormat might be tricky, unless I can query Hadoop for the available # of cores (in the entire cluster).

You can configure on a node-by-node basis how many map and reduce tasks can be started by the task tracker on that node. This is done via conf/mapred-site.xml using these two settings: mapred.tasktracker.{map|reduce}.tasks.maximum

Have a look at this page for more information: http://hadoop.apache.org/common/docs/current/cluster_setup.html

-- Met vriendelijke groeten, Niels Basjes
Re: Control the number of Mappers
Ah, in that case this should answer your question: http://wiki.apache.org/hadoop/HowManyMapsAndReduces

2010/11/25 Shai Erera ser...@gmail.com: I wasn't talking about how to configure the cluster to not invoke more than a certain # of Mappers simultaneously. Instead, I'd like to configure a (certain) job to invoke exactly N Mappers, where N is the number of cores in the cluster, regardless of the size of the data. This is not critical if it can't be done, but it can improve the performance of my job if it can be done. Thanks, Shai

On Thu, Nov 25, 2010 at 9:55 PM, Niels Basjes ni...@basjes.nl wrote: Hi, 2010/11/25 Shai Erera ser...@gmail.com: Is there a way to make MapReduce create exactly N Mappers? More specifically, if say my data can be split to 200 Mappers, and I have only 100 cores, how can I ensure only 100 Mappers will be created? The number of cores is not something I know in advance, so writing a special InputFormat might be tricky, unless I can query Hadoop for the available # of cores (in the entire cluster). You can configure on a node-by-node basis how many map and reduce tasks can be started by the task tracker on that node. This is done via conf/mapred-site.xml using these two settings: mapred.tasktracker.{map|reduce}.tasks.maximum Have a look at this page for more information: http://hadoop.apache.org/common/docs/current/cluster_setup.html -- Met vriendelijke groeten, Niels Basjes

-- Met vriendelijke groeten, Niels Basjes
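[Editor's note] For what it's worth, with the old mapred API you can ask the JobTracker how many map slots the cluster currently offers and pass that in as a hint. It remains only a hint, because the InputFormat ultimately decides the number of splits. A minimal sketch under those assumptions:

    import org.apache.hadoop.mapred.ClusterStatus;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class MapTaskHint {

      public static void applyHint(JobConf conf) throws Exception {
        // Ask the JobTracker how many map slots the cluster currently has.
        JobClient client = new JobClient(conf);
        ClusterStatus status = client.getClusterStatus();
        int mapSlots = status.getMaxMapTasks();

        // This is only a hint: FileInputFormat still creates (at least) one
        // split per block, so the actual task count can be higher.
        conf.setNumMapTasks(mapSlots);
      }
    }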
Re: Predicting how many values will I see in a call to reduce?
Hi,

2010/11/7 Anthony Urso anthony.u...@gmail.com: Is there any way to know how many values I will see in a call to reduce without first counting through them all with the iterator? Under 0.21? 0.20? 0.19?

I looked for an answer to the same question a while ago and came to the conclusion that you can't. The main limitation is that the Iterator does not have a size or length method.

-- Met vriendelijke groeten, Niels Basjes
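[Editor's note] The usual workaround, assuming the values for a single key fit in memory, is to buffer them while counting and then do the actual work in a second pass over the buffer. A sketch of that pattern with the new (0.20+) API, computing an average as a stand-in for the real logic:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class CountingReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

      @Override
      protected void reduce(Text key, Iterable<LongWritable> values, Context context)
          throws IOException, InterruptedException {
        // First pass: buffer the values to learn how many there are.
        // Hadoop reuses the value object, so copy the contents before storing.
        List<Long> buffered = new ArrayList<Long>();
        for (LongWritable value : values) {
          buffered.add(value.get());
        }
        int count = buffered.size();

        // Second pass: the size is now known up front.
        long sum = 0;
        for (long v : buffered) {
          sum += v;
        }
        context.write(key, new LongWritable(count == 0 ? 0 : sum / count));
      }
    }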
Re: Duplicated entries with map job reading from HBase
Hi, I don't know the answer (there is simply not enough information in your email) but I'm willing to make a guess: are you running on a system with two processing nodes? If so, then try removing the Combiner. The combiner is a performance optimization and the whole processing should work without it. Sometimes there is a design fault in the processing and the combiner disrupts the results. HTH, Niels Basjes

2010/11/5 Adam Phelps a...@opendns.com: I've noticed an odd behavior with a map-reduce job I've written which is reading data out of an HBase table. After a couple of days of poking at this I haven't been able to figure out the cause of the problem, so I figured I'd ask on here. (For reference I'm running with the cdh3b2 release.) The problem is that every line from the HBase table seems to be passed to the mappers twice, resulting in counts that end up as exactly double what they should be. I set up the job like this:

Scan scan = new Scan();
scan.addFamily(Bytes.toBytes(scanFamily));
TableMapReduceUtil.initTableMapperJob(table, scan, mapper, Text.class, LongWritable.class, job);
job.setCombinerClass(LongSumReducer.class);
job.setReducerClass(reducer);

I've set up counters in the mapper to verify what is happening, so I know for certain that the mapper is being called twice with the same bit of data. I've also confirmed (using the hbase shell) that each entry appears only once in the table. Is there a known bug along these lines? If not, does anyone have any thoughts on what might be causing this or where I'd start looking to diagnose? Thanks - Adam

-- Met vriendelijke groeten, Niels Basjes
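[Editor's note] To test that guess, the job can be configured without the combiner; if the doubled counts disappear, the combiner interaction is the place to look. A sketch of the suggested change, reusing the names from Adam's snippet as placeholders:

    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;

    public class CombinerDiagnostics {

      // Same job setup as in the question, with the combiner left out as a test.
      // "table", "scanFamily" and the mapper/reducer classes are placeholders.
      public static void configure(Job job, String table, String scanFamily,
                                   Class mapper, Class reducer) throws Exception {
        Scan scan = new Scan();
        scan.addFamily(Bytes.toBytes(scanFamily));
        TableMapReduceUtil.initTableMapperJob(table, scan, mapper,
            Text.class, LongWritable.class, job);

        // Diagnostic: no job.setCombinerClass(...) here. A combiner is optional
        // and may be applied zero or more times, so the job must produce the
        // same result without it; if the output changes when it is removed,
        // the combiner (or how the framework applies it) needs investigating.
        job.setReducerClass(reducer);
      }
    }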
Understanding FileInputFormat and isSplittable.
Hi,

Over the last few weeks we built an application using Hadoop. Because we're working with special logfiles (line oriented, textual and gzipped) and wanted to extract specific fields from those files before handing them to our mapper, we chose to implement our own derivative of the FileInputFormat class.

All went well until we tried it with big (100MiB and bigger) files. We then noticed that a lot of the values were doubled, and when we put in full size production data for a test run we found single events being counted 36 times. It took a while to figure out what went wrong, but essentially the root cause was the fact that the isSplittable method returned true for gzipped files (which aren't splittable). The implementation that was used was the one in FileInputFormat. The documentation for this method states: "Is the given filename splitable? Usually, true, but if the file is stream compressed, it will not be." This documentation gave us the illusion that the default FileInputFormat implementation would handle compression correctly. Because it all worked with small gzipped files we never expected the real implementation of this method to simply be "return true;". Because of this true value the framework ended up reading each input file in full, once for every split it wanted to create, with really messy effects in our case. The derived TextInputFormat class does have a compression-aware implementation of isSplittable.

Given my current knowledge of Hadoop, I would have chosen to let the default isSplittable implementation (i.e. the one in FileInputFormat) be either safe (always return false) or correct (return whatever is right for the applicable compression). The latter would make the implementation match what I would expect from the documentation. I would like to understand the logic behind the current implementation choice in relation to what I expected (mainly from the documentation). Thanks for explaining.

-- Best regards, Niels Basjes
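[Editor's note] For reference, a compression-aware override along the lines of what TextInputFormat does. This is a sketch against the new mapreduce API (where the method is actually spelled isSplitable); the class name is made up and the LineRecordReader stands in for the custom field-extracting reader described above.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.CompressionCodec;
    import org.apache.hadoop.io.compress.CompressionCodecFactory;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

    public class MyLogInputFormat extends FileInputFormat<LongWritable, Text> {

      @Override
      protected boolean isSplitable(JobContext context, Path file) {
        // Treat any compressed file as not splittable (this matches what
        // TextInputFormat did in this Hadoop generation); uncompressed
        // files remain splittable. A .gz file resolves to GzipCodec and
        // therefore gets exactly one split.
        CompressionCodec codec =
            new CompressionCodecFactory(context.getConfiguration()).getCodec(file);
        return codec == null;
      }

      @Override
      public RecordReader<LongWritable, Text> createRecordReader(
          InputSplit split, TaskAttemptContext context) {
        // Placeholder: a real implementation would return the custom reader
        // that extracts the specific fields; LineRecordReader stands in here.
        return new LineRecordReader();
      }
    }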