Re: tutorial on Hadoop/Hbase utility classes
Thanks for putting this up, it's very useful. I'd encourage you to contribute this as a documentation patch so that you help everyone who comes to hadoop.apache.org, plus you can be a part of the project and a contributor. I can help with the mechanics - here is a link to help you get started: http://wiki.apache.org/hadoop/HowToContribute Arun On Aug 31, 2011, at 4:57 PM, Sujee Maniyam wrote: Here is a tutorial on some handy Hadoop classes - with sample source code. http://sujee.net/tech/articles/hadoop-useful-classes/ Would appreciate any feedback / suggestions. thanks all Sujee Maniyam http://sujee.net
Re: Binary content
On Wed, 31 Aug 2011 08:44:42 -0700 Mohit Anchlia mohitanch...@gmail.com wrote: Does map-reduce work well with binary contents in the file? This binary content is basically some CAD files, and the map-reduce program needs to read these files using some proprietary tool, extract values and do some processing. Wondering if there are others doing similar type of processing. Best practices etc. Yes, it works. You just need to select the right input format. Personally I store all my binary files in a SequenceFile (because my binary files are small) Dieter
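For reference, here is a minimal sketch of Dieter's approach of packing small binary files into a SequenceFile with the Java API; the paths and class name are hypothetical, and the key is simply the original file name with the raw bytes as the value:

import java.io.File;
import java.io.FileInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackBinaryFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/data/cad-files.seq");            // hypothetical HDFS output path
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
    try {
      for (File f : new File("/local/cad").listFiles()) {  // hypothetical local source dir
        // Read the whole file; fine here because the binary files are small.
        byte[] bytes = new byte[(int) f.length()];
        FileInputStream in = new FileInputStream(f);
        try {
          IOUtils.readFully(in, bytes, 0, bytes.length);
        } finally {
          in.close();
        }
        // key = original file name, value = raw file contents
        writer.append(new Text(f.getName()), new BytesWritable(bytes));
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}

A MapReduce job can then read the result with SequenceFileInputFormat, getting one record per original file.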
Timer jobs
Hi I use Hadoop for a MapReduce job in my system. I would like to have the job run every 5th minute. Is there any distributed timer job facility in Hadoop? Of course I could set up a timer in an external timer framework (CRON or something like that) that invokes the MapReduce job. But CRON is only running on one particular machine, so if that machine goes down my job will not be triggered. Then I could set up the timer on all or many machines, but I would not like the job to be run in more than one instance every 5th minute, so then the timer jobs would need to coordinate who is actually starting the job this time and all the rest would just have to do nothing. Guess I could come up with a solution to that - e.g. writing some lock stuff using HDFS files or by using ZooKeeper. But I would really like it if someone had already solved the problem, and provided some kind of a distributed timer framework running in a cluster, so that I could just register a timer job with the cluster, and then be sure that it is invoked every 5th minute, no matter if one or two particular machines in the cluster are down. Any suggestions are very welcome. Regards, Per Steffensen
Re: Timer jobs
Hi Try to use Oozie for job coordination and workflows. On Thu, Sep 1, 2011 at 12:30 PM, Per Steffensen st...@designware.dk wrote: [...] -- * Ronen Itkin* Taykey | www.taykey.com
Re: Hadoop with Netapp
On 25/08/11 08:20, Sagar Shukla wrote: Hi Hakan, Please find my comments inline in blue : -Original Message- From: Hakan Ilter [mailto:hakanil...@gmail.com] Sent: Thursday, August 25, 2011 12:28 PM To: common-user@hadoop.apache.org Subject: Hadoop with Netapp Hi everyone, We are going to create a new Hadoop cluster in our company, and I have to get some advice from you: 1. Has anyone stored all their Hadoop data not on local disks but on Netapp or another storage system? Do we have to store data on local disks, and if so is it because of performance issues? sagar: Yes, we were using SAN LUNs for storing Hadoop data. SAN works faster than NAS in terms of performance while writing the data to the storage. Also SAN LUNs can be auto-mounted while booting up the system. Silly question: why? SANs are SPOFs (Gray & van Ingen, MS, 2005; a SAN was responsible for 11% of TerraServer downtime). Was it because you had the rack and wanted to run Hadoop, or did you want a more agile cluster? Because it's going to increase your cost of storage dramatically, which means you pay more per TB, or end up with fewer TB of storage. I wouldn't go this way for a dedicated Hadoop cluster. For a multi-use cluster, it's a different story. 2. What do you think about running Hadoop nodes in virtual (VMware) servers? sagar: If high-speed computing is not a requirement for you then Hadoop nodes in a VM environment could be a good option, but one other slight drawback is that when a VM crashes the in-memory data is gone. Hadoop takes care of some amount of failover, but there is some amount of risk involved and it requires good HA-building capabilities. I do it for dev and test work, and for isolated clusters in a shared environment. -for CPU-bound stuff, it actually works quite well, as there's no significant overhead -for HDD access, reading from the FS, writing to the FS and storing transient spill data, you take a tangible performance hit. That's OK if you can afford to wait or rent a few extra CPUs -and your block size is such that those extra servers can help out -which may be in the map phase more than the reduce phase Some Hadoop-ish projects -Stratosphere from TuB in particular- are designed for VM infrastructure and so come up with execution plans to use VMs efficiently. -steve
Re: Turn off all Hadoop logs?
On 29/08/11 20:31, Frank Astier wrote: Is it possible to turn off all the Hadoop logs simultaneously? In my unit tests, I don’t want to see the myriad “INFO” logs spewed out by various Hadoop components. I’m using: ((Log4JLogger) DataNode.LOG).getLogger().setLevel(Level.OFF); ((Log4JLogger) LeaseManager.LOG).getLogger().setLevel(Level.OFF); ((Log4JLogger) FSNamesystem.LOG).getLogger().setLevel(Level.OFF); ((Log4JLogger) DFSClient.LOG).getLogger().setLevel(Level.OFF); ((Log4JLogger) Storage.LOG).getLogger().setLevel(Level.OFF); But I’m still missing some loggers... you need a log4j.properties file on the CP that doesn't log so much. I do this by -removing /log4j.properties from the Hadoop jars in our (private) jar repository -having custom log4j.properties files in the test/ source trees You could also start junit with the right log4j properties to point it at a custom log4j file. I forget what that property is.
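If you just want everything silenced from test code without listing every component logger, a minimal sketch using the plain log4j 1.x API (which the Hadoop loggers shown above wrap) is to switch off the root logger before any Hadoop code runs; the JUnit-launch property Steve mentions is log4j.configuration, e.g. -Dlog4j.configuration=file:///path/to/quiet-log4j.properties:

import org.apache.log4j.Level;
import org.apache.log4j.Logger;

public class QuietLogs {
  // Call once, e.g. from a @BeforeClass method, before any Hadoop classes are touched.
  public static void silenceEverything() {
    // Every logger that inherits its level from the root logger is switched off,
    // which covers DataNode, FSNamesystem, DFSClient, Storage, etc.
    Logger.getRootLogger().setLevel(Level.OFF);
  }
}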
Re: Timer jobs
Hi Thanks a lot for pointing me to Oozie. I have looked a little bit into Oozie and it seems like the component triggering jobs is called Coordinator Application. But I really see nowhere that this Coordinator Application doesn't just run on a single machine, and that it will therefore not trigger anything if this machine is down. Can you confirm that the Coordinator Application role is distributed in a distributed Oozie setup, so that jobs get triggered even if one or two machines are down? Regards, Per Steffensen Ronen Itkin wrote: Hi Try to use Oozie for job coordination and workflows. On Thu, Sep 1, 2011 at 12:30 PM, Per Steffensen st...@designware.dk wrote: [...]
Re: Timer jobs
If I get you right you are asking about installing Oozie as a distributed and/or HA cluster?! In that case I am not familiar with an out-of-the-box solution from Oozie. But I think you can make up a solution of your own, for example: install Oozie on two servers on the same partition, synchronized by DRBD. You can trigger a failover using Linux Heartbeat and that way maintain a virtual IP. On Thu, Sep 1, 2011 at 1:59 PM, Per Steffensen st...@designware.dk wrote: [...] -- * Ronen Itkin* Taykey | www.taykey.com
Map output lost problem
Since this week my Hadoop cluster has been hitting this problem, with the following information:
Lost task tracker: tracker_rsync.host01:localhost/127.0.0.1:40759
Map output lost, rescheduling: getMapOutput(attempt_201108021855_6734_m_97_1,2002) failed :
org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_201108021855_6734/attempt_201108021855_6734_m_97_1/output/file.out.index in any of the configured local directories
 at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathToRead(LocalDirAllocator.java:389)
 at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathToRead(LocalDirAllocator.java:138)
 at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:2887)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
 at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
 at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:502)
 at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:363)
 at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
 at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:181)
 at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
 at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:417)
 at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
 at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
 at org.mortbay.jetty.Server.handle(Server.java:324)
 at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:534)
 at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:864)
 at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:533)
 at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:207)
 at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:403)
 at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:409)
 at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:522)
My application has 2 mappers and 2 reducers, and there may be 2000 of these losses for the mappers, so the whole Hadoop job has been delayed by these losses.
Problem with Python + Hadoop: how to link .so outside Python?
Hi, I have successfully installed scipy on my Python 2.7 on my local Linux, and I want to pack my Python 2.7 (with scipy) onto Hadoop and run my Python MapReduce scripts, like this:
${HADOOP_HOME}/bin/hadoop streaming \
 -input ${input} \
 -output ${output} \
 -mapper python27/bin/python27.sh rp_extractMap.py \
 -reducer python27/bin/python27.sh rp_extractReduce.py \
 -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
 -file rp_extractMap.py \
 -file rp_extractReduce.py \
 -file shitu_conf.py \
 -cacheArchive /share/python27.tar.gz#python27 \
 -outputformat org.apache.hadoop.mapred.TextOutputFormat \
 -inputformat org.apache.hadoop.mapred.CombineTextInputFormat \
 -jobconf mapred.max.split.size=51200 \
 -jobconf mapred.job.name=[reserve_price][rp_extract] \
 -jobconf mapred.job.priority=HIGH \
 -jobconf mapred.job.map.capacity=1000 \
 -jobconf mapred.job.reduce.capacity=200 \
 -jobconf mapred.reduce.tasks=200 \
 -jobconf num.key.fields.for.partition=2
I have to do this because the Hadoop servers have their own Python of a very low version which may not support some of my Python scripts, and I do not have the privilege to install the scipy lib on those servers. So I have to use the -cacheArchive option to include my own Python 2.7 with scipy. But I found out that some of the .so files in scipy are linked against other dynamic libs outside Python 2.7. For example:
$ ldd ~/local/python-2.7.2/lib/python2.7/site-packages/scipy/linalg/flapack.so
 liblapack.so => /usr/local/atlas/lib/liblapack.so (0x002a956fd000)
 libatlas.so => /usr/local/atlas/lib/libatlas.so (0x002a95df3000)
 libgfortran.so.3 => /home/work/local/gcc-4.6.1/lib64/libgfortran.so.3 (0x002a9668d000)
 libm.so.6 => /lib64/tls/libm.so.6 (0x002a968b6000)
 libgcc_s.so.1 => /home/work/local/gcc-4.6.1/lib64/libgcc_s.so.1 (0x002a96a3c000)
 libquadmath.so.0 => /home/work/local/gcc-4.6.1/lib64/libquadmath.so.0 (0x002a96b51000)
 libc.so.6 => /lib64/tls/libc.so.6 (0x002a96c87000)
 libpthread.so.0 => /lib64/tls/libpthread.so.0 (0x002a96ebb000)
 /lib64/ld-linux-x86-64.so.2 (0x00552000)
So my question is: how can I include these libs? Should I search for all the linked .so and .a files on my local Linux box and pack them together with Python 2.7? If yes, how can I get a full list of the libs needed, and how can I make the packed Python 2.7 know where to find the new libs? Thanks Xiong
Re: Timer jobs
[moving common-user@ to BCC] Oozie is not HA yet. But it would be relatively easy to make it so. It was designed with that in mind; we even did a prototype. Oozie consists of 2 services, a SQL database to store the Oozie jobs' state and a servlet container where the Oozie app proper runs. The solution for HA for the database, well, is left to the database. This means you'll have to get an HA DB. The solution for HA for the Oozie app is deploying the servlet container with the Oozie app on more than one box (2 or 3) and fronting them with an HTTP load-balancer. The missing part is that the current Oozie lock-service is an in-memory implementation. This should be replaced with a ZooKeeper implementation. ZooKeeper could run externally or internally in all Oozie servers. This is what was prototyped long ago. Thanks. Alejandro On Thu, Sep 1, 2011 at 4:14 AM, Ronen Itkin ro...@taykey.com wrote: [...]
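For what it's worth, the cluster-wide "only one box fires the trigger" part Alejandro describes can be sketched with the plain ZooKeeper client API: every candidate box tries to create the same ephemeral znode, and only the one that succeeds runs the job for that interval. This is a hypothetical sketch, not Oozie code; the connection string and paths are placeholders, and it assumes the /locks parent znode already exists:

import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class TriggerLock {
  public static boolean tryAcquire(ZooKeeper zk, String lockPath) throws Exception {
    try {
      // Ephemeral: the lock disappears automatically if this box dies.
      zk.create(lockPath, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
      return true;                       // we own the trigger for this interval
    } catch (KeeperException.NodeExistsException e) {
      return false;                      // another box already fired it
    }
  }

  public static void main(String[] args) throws Exception {
    final CountDownLatch connected = new CountDownLatch(1);
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 30000, new Watcher() {
      public void process(WatchedEvent event) {
        if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
          connected.countDown();
        }
      }
    });
    connected.await();                   // wait until the session is established
    if (tryAcquire(zk, "/locks/my-5min-job")) {
      // submit the MapReduce job here
    }
    zk.close();
  }
}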
Re: Timer jobs
Thanks for your response. See comments below. Regards, Per Steffensen Alejandro Abdelnur wrote: [moving common-user@ to BCC] Oozie is not HA yet. But it would be relatively easy to make it so. It was designed with that in mind; we even did a prototype. Ok, so if it isn't HA out-of-the-box I believe Oozie is too big a framework for my needs - I don't need all this workflow stuff - just a plain simple job trigger that triggers every 5th minute. I guess I will try out something smaller like Quartz Scheduler. It also only has HA/cluster support through JDBC (JobStore), but I guess I could fairly easily make an HDFS-files JobStore which still holds the properties so that Quartz clustering works. But what I would really like to have is a scheduling framework that is HA out-of-the-box. Guess Oozie is not the solution for me. Anyone know about other frameworks? Oozie consists of 2 services, a SQL database to store the Oozie jobs' state and a servlet container where the Oozie app proper runs. The solution for HA for the database, well, is left to the database. This means you'll have to get an HA DB. I would really like to avoid having to run a relational database. Couldn't I just do the persistence of Oozie job state in files on HDFS? The solution for HA for the Oozie app is deploying the servlet container with the Oozie app on more than one box (2 or 3) and fronting them with an HTTP load-balancer. The missing part is that the current Oozie lock-service is an in-memory implementation. This should be replaced with a ZooKeeper implementation. ZooKeeper could run externally or internally in all Oozie servers. This is what was prototyped long ago. Yes, but if I have to do ZooKeeper stuff I could just do the scheduler myself and make it run on all/many boxes. The only hard part about it is the locking thing that makes sure only one job-triggering happens in the entire cluster when only one job-triggering is supposed to happen, and that the job-triggering happens no matter how many machines might be down. Thanks. Alejandro On Thu, Sep 1, 2011 at 4:14 AM, Ronen Itkin ro...@taykey.com wrote: [...] -- * Ronen Itkin* Taykey | www.taykey.com
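A minimal sketch of the Quartz Scheduler idea mentioned above, assuming the Quartz 2.x API and a hypothetical job class; with the JDBC JobStore and clustering enabled in quartz.properties, only one node in the Quartz cluster fires each 5-minute trigger:

import org.quartz.CronScheduleBuilder;
import org.quartz.Job;
import org.quartz.JobBuilder;
import org.quartz.JobDetail;
import org.quartz.JobExecutionContext;
import org.quartz.Scheduler;
import org.quartz.Trigger;
import org.quartz.TriggerBuilder;
import org.quartz.impl.StdSchedulerFactory;

public class FiveMinuteTrigger {
  // Hypothetical job class: this is where the MapReduce job would be submitted.
  public static class SubmitMapReduceJob implements Job {
    public void execute(JobExecutionContext ctx) {
      // e.g. run a JobClient / ToolRunner invocation here
    }
  }

  public static void main(String[] args) throws Exception {
    Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
    JobDetail job = JobBuilder.newJob(SubmitMapReduceJob.class)
        .withIdentity("mr-job", "timers").build();
    Trigger trigger = TriggerBuilder.newTrigger()
        .withIdentity("every-5-min", "timers")
        // fire at second 0 of every 5th minute
        .withSchedule(CronScheduleBuilder.cronSchedule("0 0/5 * * * ?"))
        .build();
    scheduler.scheduleJob(job, trigger);
    scheduler.start();
  }
}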
Re: Creating a hive table for a custom log
Hi, On Thu, Sep 1, 2011 at 9:08 AM, Raimon Bosch raimon.bo...@gmail.com wrote: Hi, I'm trying to create a table similar to apache_log but I'm trying to avoid writing my own map-reduce task because I don't want to have my HDFS files twice. So if you're working with log lines like this:
186.92.134.151 [31/Aug/2011:00:10:41 +0000] "GET /client/action1/?transaction_id=8002&user_id=87179311248&ts=1314749223525&item1=271&item2=6045&environment=2 HTTP/1.1"
112.201.65.238 [31/Aug/2011:00:10:41 +0000] "GET /client/action1/?transaction_id=9002&ts=1314749223525&user_id=9048871793100&item2=6045&item1=271&environment=2 HTTP/1.1"
90.45.198.251 [31/Aug/2011:00:10:41 +0000] "GET /client/action2/?transaction_id=9022&ts=1314749223525&user_id=9048871793100&item2=6045&item1=271&environment=2 HTTP/1.1"
And bearing in mind that the parameters could come in different orders, which would be the best strategy to create this table? Write my own org.apache.hadoop.hive.contrib.serde2? Is there any resource already implemented that I could use to perform this task? I would use the regex serde to parse them: CREATE EXTERNAL TABLE access_log (ip STRING, dt STRING, request STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ("input.regex" = "([\\d.]+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(.+?)\"") LOCATION '/path/to/file'; That will parse the three fields out and could be modified to separate out the action. Then I think you will need to parse the query string in Hive itself. In the end the objective is to convert all the parameters into fields and use the action as the type. With this big table I will be able to perform my queries, my joins or my views. Any ideas? Thanks in advance, Raimon Bosch.
Re: Timer jobs
On Thu, Sep 1, 2011 at 7:58 PM, Per Steffensen st...@designware.dk wrote: [...] Ok, so if it isn't HA out-of-the-box I believe Oozie is too big a framework for my needs - I don't need all this workflow stuff - just a plain simple job trigger that triggers every 5th minute. I guess I will try out something smaller like Quartz Scheduler. [...] Anyone know about other frameworks? This is similar to my requirement, only that I already have Quartz scheduling my jobs and haven't started using Hadoop yet. I plan to wrap Quartz jobs to internally call Hadoop jobs. I'm still in the design phase though. Hopefully, it will be successful. [...] --
Re: Timer jobs
Well, I am not sure I get you right, but anyway, basically I want a timer framework that triggers my jobs. And the triggering of the jobs needs to work even though one or two particular machines go down. So the timer triggering mechanism has to live in the cluster, so to speak. What I don't want is that the timer framework is driven from one particular machine, so that the triggering of jobs will not happen if this particular machine goes down. Basically if I have e.g. 10 machines in a Hadoop cluster I will be able to run e.g. MapReduce jobs even if 3 of the 10 machines are down. I want my timer framework to also be clustered, distributed and coordinated, so that I will also have my timer jobs triggered even though 3 out of 10 machines are down. Regards, Per Steffensen Ronen Itkin wrote: If I get you right you are asking about installing Oozie as a distributed and/or HA cluster?! [...]
Re: Binary content
On Thu, Sep 1, 2011 at 1:25 AM, Dieter Plaetinck dieter.plaeti...@intec.ugent.be wrote: On Wed, 31 Aug 2011 08:44:42 -0700 Mohit Anchlia mohitanch...@gmail.com wrote: Does map-reduce work well with binary contents in the file? This binary content is basically some CAD files, and the map-reduce program needs to read these files using some proprietary tool, extract values and do some processing. Wondering if there are others doing similar type of processing. Best practices etc. Yes, it works. You just need to select the right input format. Personally I store all my binary files in a SequenceFile (because my binary files are small) Thanks! Is there a specific tutorial I can focus on to see how it could be done? Dieter
Re: Timer jobs
In Hadoop, if the client that triggers the job fails, is there a way to recover and have another client submit the job? On Thu, Sep 1, 2011 at 8:44 PM, Per Steffensen st...@designware.dk wrote: [...] -- Regards, Tharindu
Re: Binary content
On Thu, Sep 1, 2011 at 8:37 AM, Mohit Anchlia mohitanch...@gmail.com wrote: Thanks! Is there a specific tutorial I can focus on to see how it could be done? Take the word count example and change its output format to be SequenceFileOutputFormat. job.setOutputFormatClass(SequenceFileOutputFormat.class); and it will generate SequenceFiles instead of text. There is SequenceFileInputFormat for reading. -- Owen
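A minimal sketch of the driver changes Owen describes, using the new (mapreduce) API; the class name is hypothetical and the usual word count mapper/reducer setup is elided:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class SeqFileWordCount {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "wordcount-seqfile");
    job.setJarByClass(SeqFileWordCount.class);
    // ... set the word count Mapper/Reducer classes here as usual ...
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Write the output as a SequenceFile instead of text.
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    // A downstream job would read it back with:
    //   job.setInputFormatClass(org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}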
Re: Timer jobs
01.09.11 18:14, Per Steffensen wrote: Well, I am not sure I get you right, but anyway, basically I want a timer framework that triggers my jobs. And the triggering of the jobs needs to work even though one or two particular machines go down. [...] Hello. AFAIK now you still have the HDFS NameNode, and as soon as the NameNode is down your cluster is down. So putting scheduling on the same machine as the NameNode won't make your cluster worse in terms of SPOF (at least for HW failures). Best regards, Vitalii Tymchyshyn
cross product of 2 data sets
Hey there, I would like to do the cross product of two data sets; neither of them fits in memory. I've seen Pig has the cross operation. Can someone please explain to me how it implements it?
Re: cross product of 2 data sets
http://ofps.oreilly.com/titles/9781449302641/advanced_pig_latin.html search on cross matches. Alan. On Sep 1, 2011, at 11:44 AM, Marc Sturlese wrote: Hey there, I would like to do the cross product of two data sets; neither of them fits in memory. I've seen Pig has the cross operation. Can someone please explain to me how it implements it?
Re: Timer jobs
Vitalii Tymchyshyn wrote: 01.09.11 18:14, Per Steffensen wrote: [...] Hello. AFAIK now you still have the HDFS NameNode, and as soon as the NameNode is down your cluster is down. So putting scheduling on the same machine as the NameNode won't make your cluster worse in terms of SPOF (at least for HW failures). Best regards, Vitalii Tymchyshyn I believe this is why there is also a secondary namenode. But with two namenodes it is still too centralized in my opinion, but I guess the Hadoop people know that, and the namenode role will be even more distributed in the future. But that does not change the fact that I would like to have a real distributed, clustered scheduler.
MultipleOutputs - Create multiple files during output
Hi all, I was wondering if anyone was familiar with this class. I want to create multiple output files during my reduce. My input files will consist of
name1 action1 date1
name1 action2 date2
name1 action3 date3
name2 action1 date1
name2 action2 date2
name2 action3 date3
My goal is to create files with the following format
Filename: name_Date:CCYYMM
File contents:
action1
action2
action3
I.e. this will store all the actions of one person for any given month in one file. I just don't know how I will decide the file name at run time. Can anyone help? Thanks, Tim
Namenode not starting
Hi all, I am trying to install Hadoop (release 0.20.203) on a machine with CentOS. When I try to start HDFS, I get the following error. machine-name: Unrecognized option: -jvm machine-name: Could not create the Java virtual machine. Any idea what might be the problem? Thanks, Abhishek
Re: Namenode not starting
Hi Hailong, I have installed JDK and set JAVA_HOME correctly (as far as I know). Output of java -version is: java version 1.6.0_04 Java(TM) SE Runtime Environment (build 1.6.0_04-b12) Java HotSpot(TM) Server VM (build 10.0-b19, mixed mode) I also have another version installed, 1.6.0_27, but get the same error with it. Abhishek On Thu, Sep 1, 2011 at 4:00 PM, hailong.yang1115 hailong.yang1...@gmail.com wrote: Hi abhishek, Have you successfully installed a Java virtual machine like Sun JDK before running Hadoop? Or maybe you forgot to configure the environment variable JAVA_HOME? What is the output of the command 'java -version'? Regards Hailong *** * Hailong Yang, PhD. Candidate * Sino-German Joint Software Institute, * School of Computer Science & Engineering, Beihang University * Phone: (86-010)82315908 * Email: hailong.yang1...@gmail.com * Address: G413, New Main Building in Beihang University, * No.37 XueYuan Road, HaiDian District, * Beijing, P.R.China, 100191 *** From: abhishek sharma Date: 2011-09-02 03:51 To: common-user; common-dev Subject: Namenode not starting Hi all, I am trying to install Hadoop (release 0.20.203) on a machine with CentOS. When I try to start HDFS, I get the following error. machine-name: Unrecognized option: -jvm machine-name: Could not create the Java virtual machine. Any idea what might be the problem? Thanks, Abhishek
Re: Namenode not starting
Actually, I found the reason. I am running HDFS as root and there is a bug that has recently been fixed. https://issues.apache.org/jira/browse/HDFS-1943 Thanks, Abhishek On Thu, Sep 1, 2011 at 6:25 PM, Ravi Prakash ravihad...@gmail.com wrote: Hi Abhishek, Try reading through the shell scripts before posting. They are short and simple enough and you should be able to debug them quite easily. I've seen the same error many times. Do you see JAVA_HOME set when you $ssh localhost? Also which command are you using to start the daemons? Fight on, Ravi On Thu, Sep 1, 2011 at 4:35 PM, abhishek sharma absha...@usc.edu wrote: [...]
Re: TestDFSIO failure
Hi Matt, On Jun 20, 2011, at 1:46pm, GOEKE, MATTHEW (AG/1000) wrote: Has anyone else run into issues using output compression (in our case lzo) on TestDFSIO and it failing to be able to read the metrics file? I just assumed that it would use the correct decompression codec after it finishes but it always returns with a 'File not found' exception. Yes, I've run into the same issue on 0.20.2 and CDH3u0. I don't see any Jira issue that covers this problem, so unless I hear otherwise I'll file one. The problem is that the post-job code doesn't handle getting the <path>.deflate or <path>.lzo (for you) file from HDFS and then decompressing it. Is there a simple way around this without spending the time to recompile a cluster/codec specific version? You can use hadoop fs -text <path reported in exception>.lzo This will dump out the file, which looks like: f:rate 171455.11 f:sqrate 2981174.8 l:size 1048576 l:tasks 10 l:time 590537 If you take f:rate/1000/l:tasks, that should give you the average MB/sec. E.g. for the example above, that would be 171455/1000/10 = 17 MB/sec. -- Ken -- Ken Krugler +1 530-210-6378 http://bixolabs.com custom big data solutions & training Hadoop, Cascading, Mahout & Solr
Re: MultipleOutputs - Create multiple files during output
Hi Tim, You could create a custom HashPartitioner so that all key/value pairs denoting the actions of the same user end up in the same reducer; then you need only one output file per reducer. Btw, how large are the output files? Make sure you don't end up creating a lot of small files, i.e., < 64MB. Best, stan On Thu, Sep 1, 2011 at 3:47 PM, modemide modem...@gmail.com wrote: [...]
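To address the "file name at run time" part of Tim's question: a minimal sketch of a reducer using the mapreduce-API MultipleOutputs, which in newer Hadoop versions lets you pass a baseOutputPath per record. The key/value types and the name_CCYYMM naming below are assumptions based on the description above, not tested code:

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class ActionsByMonthReducer extends Reducer<Text, Text, NullWritable, Text> {
  private MultipleOutputs<NullWritable, Text> mos;

  @Override
  protected void setup(Context context) {
    mos = new MultipleOutputs<NullWritable, Text>(context);
  }

  @Override
  protected void reduce(Text nameAndMonth, Iterable<Text> actions, Context context)
      throws IOException, InterruptedException {
    // Assumes the mapper emitted key = "name_CCYYMM" and value = the action line.
    for (Text action : actions) {
      // The third argument becomes the base of the output file name,
      // e.g. name1_201108-r-00000 under the job's output directory.
      mos.write(NullWritable.get(), action, nameAndMonth.toString());
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    mos.close();  // important: flushes and closes the extra output files
  }
}

In the driver you would typically also use LazyOutputFormat (in versions that provide it) to wrap the real output format, so that empty default part-r-xxxxx files are not created alongside the per-user files.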