Is this architecture possible (Hadoop, HBase)?
Hello all, I am relatively new to MapReduce and haven't used HBase at all. Is the following architecture possible? A distributed key-value store (HBase) is used, so along with each value there would be an associated timestamp. MapReduce tasks are executed iteratively. In each iteration, map should take in the values that were added to the store in the previous iteration (perhaps the ones with the latest timestamp?). Reduce should take in map's output, as well as the key-value pairs from the store whose keys match the keys that reduce has to process in the current iteration. The output of reduce goes back to the store. If this is possible, which classes (e.g., InputFormat, run() of Reduce) should be extended so that this operation takes place instead of the regular one? If it is not possible, are there any alternatives to achieve the same? Thank you. PS: I have put the same question on the mapreduce-user Apache mailing list (but haven't got any replies yet). I found many topics on MapReduce in this mailing list as well, so I thought of posting it here too. Regards, Raghava.
Re: Is this architecture possible (Hadoop, HBase)?
Raghava, What you ask should be easily doable using the HBase-specific input and output formats, and I believe they already exist in a form you can use with minimal or no modification. For the specifics of the format classes, check out HBase's API documentation. Also, you are better off posting HBase-specific questions on the HBase user mailing list. Regards -...@nkur
Re: Is this architecture possible (Hadoop, HBase)?
Hi Ankur, Thank you for the reply. I will check out the HBase API. Since my question involved a mix of MapReduce and HBase, I thought of posting it here. If I have any HBase-specific questions, I will post them there. Regards, Raghava. On Mon, Feb 15, 2010 at 3:55 AM, Ankur C. Goel gan...@yahoo-inc.com wrote the reply above.
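The iteration scheme Raghava describes is easy to express once each write is tagged with when it happened. As a language-agnostic sketch (plain Python standing in for HBase; the real thing would use a timestamp-range Scan fed to HBase's table input format), here the "timestamp" is simply the iteration number, and the next iteration's map phase reads only the previous iteration's writes:

```python
# Toy key-value store standing in for HBase: key -> list of (timestamp, value),
# where the "timestamp" is simply the iteration number that wrote the value.
store = {}

def put(key, value, iteration):
    store.setdefault(key, []).append((iteration, value))

def entries_from_iteration(iteration):
    """What the map phase of iteration N+1 would scan: only values written in
    iteration N (HBase can express this with a timestamp-range Scan)."""
    for key, versions in store.items():
        for ts, value in versions:
            if ts == iteration:
                yield key, value

# Iteration 0 seeds the store; iteration 1's map sees only iteration-0 writes.
put("a", 1, iteration=0)
put("b", 2, iteration=0)
put("a", 10, iteration=1)  # written later, invisible to the iteration-0 scan
seen_by_next_map = sorted(entries_from_iteration(0))
```

The reduce side would then combine these mapped values with a point lookup of matching keys in the store, and write its output back with the current iteration's timestamp, closing the loop.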
RE: sort at reduce side
So, after shuffle at the reduce side, are the spills actually stored as map files? Yes. When a reducer receives map output from multiple maps worth fs.inmemorysize.mb in size, it sorts and spills the data to disk. If the number of map output files spilled to disk exceeds io.sort.factor, additional merge step(s) are performed to reduce the file count to less than that number. When all map output data has been spilled to disk, the reduce task starts invoking the reduce function, passing in key-value pairs in sorted order by reading them off the above files. Regards, Sriguru -----Original Message----- From: Gang Luo [mailto:lgpub...@yahoo.com.cn] Sent: Thursday, February 04, 2010 1:58 AM To: common-user@hadoop.apache.org Subject: Re: sort at reduce side Thanks for the reply, Sriguru. The reason I ask these questions is the following observation. On a 16-node cluster, a map-side join takes 3 and a half minutes. When I do a reduce-side join on nearly the same amount of data, it takes 8 minutes before the map phase completes. I am sure the computation (the map function) does not cause so much difference; the extra 4 minutes could only be spent on sorting at the map side for the reduce-side join. I also notice that the sort time at the reduce side is only about 30 seconds (I cannot access the online jobtracker; the 30 seconds is actually the time reduce takes to go from 33% to 66% completeness). The number of reduce tasks is much smaller than the number of map tasks, which means each reduce task sorts more data than each map task (I use the hash partitioner and the data is uniformly distributed). The only reason I can come up with for the big difference between the sort at the map side and at the reduce side is the different behavior of these two sorts. Does anybody have ideas why the map takes so much longer for the reduce-side join compared to the map-side join, and why there is such a big difference between the sorts at the map side and the reduce side? P.S.
I join a 7.5G file with a 100M file. The sort buffer at the reduce side is slightly larger than that at the map side. -Gang ----- Original Message ----- From: Srigurunath Chakravarthi srig...@yahoo-inc.com To: common-user@hadoop.apache.org Sent: 2010/2/3 (Wed) 12:50:08 AM Subject: RE: sort at reduce side Hi Gang, kept in map file. If so, in order to efficiently sort the data, the reducer actually only reads the index part of each spill (which is a map file) and sorts the keys, instead of reading whole records from disk and sorting them. Afaik, no. Reducers always fetch map output data, not indexes (even if the data is from the local node, where an index might be sufficient). Regards, Sriguru -----Original Message----- From: Gang Luo [mailto:lgpub...@yahoo.com.cn] Sent: Wednesday, February 03, 2010 10:40 AM To: common-user@hadoop.apache.org Subject: sort at reduce side Hi all, I want to know some more details about the sorting at the reduce side. The intermediate result generated at the map side is stored as a map file, which actually consists of two sub-files, namely an index file and a data file. The index file stores the keys and points to the corresponding records stored in the data file. What I think is that when the intermediate result (even only part of it, for each mapper) is shuffled to a reducer, it is still kept as a map file. If so, in order to efficiently sort the data, the reducer could read only the index part of each spill (which is a map file) and sort the keys, instead of reading whole records from disk and sorting them. Does the reducer actually do what I expect? -Gang
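Sriguru's description of the merge step can be illustrated with a small sketch (Python here for brevity; the merge scheduling is simplified, so this mirrors the shape of the algorithm rather than Hadoop's exact implementation): sorted spill runs are merged at most io.sort.factor at a time until a single sorted stream remains to feed reduce():

```python
import heapq

# Each "spill" is a sorted run of (key, value) pairs on disk. If the number of
# runs exceeds io.sort.factor, subsets are merged first to bring the count down;
# the final merge is streamed into the reduce function in key order.
IO_SORT_FACTOR = 3  # illustrative value; the real default is configurable

def merge_runs(runs, factor):
    """Repeatedly merge up to `factor` sorted runs at a time until one remains."""
    runs = [list(r) for r in runs]
    while len(runs) > 1:
        batch, runs = runs[:factor], runs[factor:]
        runs.append(list(heapq.merge(*batch)))  # k-way merge of one batch
    return runs[0] if runs else []

spills = [[(1, "a"), (4, "d")], [(2, "b")], [(3, "c"), (5, "e")], [(0, "z")]]
merged = merge_runs(spills, IO_SORT_FACTOR)
```

With four runs and a factor of three, one intermediate merge happens before the final pass, which is exactly the "additional merge step(s)" Sriguru mentions.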
Re: datanode can not connect to the namenode
Marc Sturlese wrote: Hey there, I have a Hadoop cluster built with 2 servers. One node (A) contains the namenode, a datanode, the jobtracker and a tasktracker. The other node (B) just has a datanode and a tasktracker. I set up HDFS correctly with ./start-dfs.sh. When I try to start MapReduce with ./start-mapred.sh, the tasktracker of node (B) cannot connect to the namenode. The log keeps throwing:

INFO org.apache.hadoop.ipc.Client: Retrying connect to server: mynamenode/192.168.0.13:8021. Already tried 8 time(s)

I think maybe something is missing in the /etc/hosts file, or this Hadoop property is not set correctly:

<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:50010</value>
  <description>The address where the datanode server will listen to. If the port is 0 then the server will start on a free port.</description>
</property>

I tried on the namenode: telnet localhost 8021 and telnet 192.168.0.10 8021. In both cases I get: telnet: Unable to connect to remote host: Connection refused

That's the namenode that isn't there, which should be on 192.168.0.13 at 8021. You can see which ports really are open on the namenode by identifying the process (jps -v will do that), then netstat -a -p | grep PID, where PID is the process ID of the namenode. If you can telnet to port 8021 when logged in to the NN, but not remotely, it's your firewall or routing interfering.
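The telnet test in this thread is just checking whether anything is listening on the port. The same check can be scripted; this is a minimal, self-contained sketch (the `port_open` helper is hypothetical, not part of Hadoop), demonstrated against a throwaway local listener:

```python
import socket

def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds (what telnet tests)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Demo: a listening socket is reachable; once it closes, connections are refused,
# which is the "Connection refused" Marc sees when the namenode isn't up on 8021.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
srv.listen(1)
port = srv.getsockname()[1]
reachable_while_listening = port_open("127.0.0.1", port)
srv.close()
reachable_after_close = port_open("127.0.0.1", port)
```

A refused connection (rather than a timeout) usually means the host is reachable but nothing is bound to the port, which points at the daemon or its configured address rather than the network.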
Re: Problem with large .lzo files
On Sun, Feb 14, 2010 at 12:46 PM, Todd Lipcon t...@cloudera.com wrote: Hi Steve, I'm not sure here whether you mean that the DistributedLzoIndexer job is failing, or if the job on the resulting split file is failing. Could you clarify? The DistributedLzoIndexer job did complete successfully. It was one of the jobs on the resulting split file that always failed while the other splits succeeded. By the way, if all files have been indexed, DistributedLzoIndexer does not detect that, and Hadoop throws an exception complaining that the input dir (or file) does not exist. I work around this by catching the exception. - It is possible to sacrifice parallelism by having Hadoop work on each .lzo file without indexing. This worked well until the file size exceeded 30G, when an array indexing exception got thrown. Apparently the code processed the file in chunks and stored references to the chunks in an array; when the number of chunks exceeded a certain number (around 256, by my recollection), the exception was thrown. - My current workaround is to increase the number of reducers to keep the .lzo file size low. I would like to get advice on how people handle large .lzo files. Any pointers on the cause of the stack trace below and the best way to resolve it are greatly appreciated. Is this reproducible every time? If so, is it always at the same point in the LZO file that it occurs? It's at the same point. Do you know how to print out the lzo index for the task? I only print out the input file now. Would it be possible to download that lzo file to your local box and use lzop -d to see if it decompresses successfully? That way we can isolate whether it's a compression bug or a decompression bug. Both the Java LzoDecompressor and lzop -d were able to decompress the file correctly. As a matter of fact, my job does not index .lzo files now but processes each as a whole, and it works.
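For what it's worth, the ~256-chunk ceiling Steve recalls is roughly consistent with fixed-size chunking: assuming (hypothetically) 128 MiB chunks, a file crosses 256 chunks right around 32 GiB, which lines up with failures appearing past ~30G. A quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the reported failure point. The 128 MiB chunk
# size is an assumption for illustration, not something confirmed in the thread.
CHUNK_MIB = 128

def chunk_count(file_gib, chunk_mib=CHUNK_MIB):
    """Number of fixed-size chunks needed to cover a file, via ceiling division."""
    return (file_gib * 1024 + chunk_mib - 1) // chunk_mib

below_limit = chunk_count(30)   # a 30 GiB file
at_boundary = chunk_count(32)   # a 32 GiB file
```

If the chunk-reference array really is capped near 256 entries, this would explain why keeping individual .lzo files small (more reducers, smaller outputs) avoids the exception.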
Why is $JAVA_HOME/lib/tools.jar in the classpath?
Hi, I'm working on the Debian package for Hadoop (the first version is already in the NEW queue for Debian unstable). Now I stumbled over $JAVA_HOME/lib/tools.jar in the classpath. Since Debian supports different Java runtimes, it's not that easy to know which one the user currently uses, and therefore it would make things easier if this jar were not necessary. From searching and inspecting the SVN history I got the impression that this is an ancient legacy that's not necessary (anymore)? Best regards, Thomas Koch, http://www.koch.ro
Trouble with secondary indexing?
Hi, First time posting... I've got an interesting problem with setting up secondary indexing in HBase. We're running with Cloudera's distribution, and we have secondary indexing up and running on our 'sandbox' machines. I'm installing it on our development cloud, and I know I did something totally 'brain dead' but can't figure out what it is... (Sort of like looking for your keys when they are in plain sight on the table in front of you...) We've got Hadoop up and running, and HBase up and running without the secondary indexing. We've added the two properties to hbase-site.xml, and then in our .kshrc (yes, we use ksh :-) I've added $HBASE_HOME/contrib/transactional to the CLASSPATH shell variable. The error we get is:

2010-02-15 08:39:30,263 ERROR org.apache.hadoop.hbase.master.HMaster: Can not start master
java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.hadoop.hbase.master.HMaster.doMain(HMaster.java:1241)
        at org.apache.hadoop.hbase.master.HMaster.main(HMaster.java:1282)
Caused by: java.lang.UnsupportedOperationException: Unable to find region server interface org.apache.hadoop.hbase.ipc.IndexedRegionInterface

When I look at the error, I'm thinking that I'm missing the path to the class, yet a different cluster of machines running Cloudera's +152 release (HBase 0.20.0) doesn't have this problem. So what am I missing? TIA! -Mikey
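Since the error means IndexedRegionInterface simply isn't visible on the master's classpath, one way to narrow it down is to scan the candidate jars for the class entry; jars are zip archives, so any zip reader works. A hedged diagnostic sketch (the in-memory "jars" below are fabricated so the example is self-contained; in practice you would open the real jar files from $HBASE_HOME):

```python
import io
import zipfile

# Class entry we expect one of the contrib/transactional jars to contain.
TARGET = "org/apache/hadoop/hbase/ipc/IndexedRegionInterface.class"

def jars_containing(jar_blobs, entry=TARGET):
    """jar_blobs: mapping of jar name -> jar bytes. Returns names holding entry."""
    hits = []
    for name, blob in jar_blobs.items():
        with zipfile.ZipFile(io.BytesIO(blob)) as jar:
            if entry in jar.namelist():
                hits.append(name)
    return hits

def make_jar(entries):
    """Build a tiny in-memory 'jar' (zip) with empty class-file entries."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as jar:
        for e in entries:
            jar.writestr(e, b"")
    return buf.getvalue()

jars = {
    "hbase.jar": make_jar(["org/apache/hadoop/hbase/master/HMaster.class"]),
    "hbase-transactional.jar": make_jar([TARGET]),
}
found = jars_containing(jars)
```

If the scan finds the class in a jar that is not actually on the CLASSPATH the daemon starts with (shell-variable exports in .kshrc don't necessarily reach the HBase start scripts), that would explain the mismatch between clusters.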
Re: hive-0.4.0 build
The MD5 checksum file for the 0.20.1 tarball contains an MD5 checksum along with various SHA checksums and an RMD160 checksum:

hadoop-0.20.1.tar.gz:MD5 = 71 9E 16 9B 77 60 C1 68 44 1B 49 F4 05 85 5B 72
hadoop-0.20.1.tar.gz: SHA1 = 712A EE9C 279F 1031 1F83 657B 2B82 7ACA 0374 6613
hadoop-0.20.1.tar.gz: RMD160 = 4331 4350 27E9 E16D 055C F23F FFEF 1564 E206 B144
hadoop-0.20.1.tar.gz: SHA224 = A5E4CBE9 EBBE5FE1 2020F3F1 BFBC6D3C C77A8E9B E6C6062C A6484BDB
...

The convention that most tools expect (including Ivy) is that a file named xyz.tar.gz.md5 was generated by running the following command:

% md5sum xyz.tar.gz > xyz.tar.gz.md5

All of the other Hadoop releases on archive.apache.org follow this convention, with the exception of hadoop-0.20.1. Can someone with access to the Apache archive repository please fix the 0.20.1 md5 checksum file? Thanks. Carl On Mon, Oct 19, 2009 at 9:02 AM, Schubert Zhang zson...@gmail.com wrote: 1. When I build hive-0.4.0, Ivy tries to download Hadoop 0.17.2.1, 0.18.3, 0.19.0 and 0.20.0, but always fails for 0.17.2.1. 2. Then I modified shims/ivy.xml and shims/build.xml to remove the dependencies on 0.17.2.1, 0.18.3 and 0.19.0. It works fine to download only hadoop-0.20.0. 3. Now I want to depend on hadoop-0.20.1, so I modified the above two xml files to add 0.20.1, but it failed on the md5 check. The following are the errors.
[ivy:retrieve] :: problems summary ::
[ivy:retrieve] WARNINGS
[ivy:retrieve] [FAILED ] hadoop#core;0.20.1!hadoop.tar.gz(source): invalid md5: expected=hadoop-0.20.1.tar.gz: computed=719e169b7760c168441b49f405855b72 (138662ms)
[ivy:retrieve] hadoop-resolver: tried
[ivy:retrieve] http://archive.apache.org/dist/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz
[ivy:retrieve] :: FAILED DOWNLOADS ::
[ivy:retrieve] :: ^ see resolution messages for details ^ ::
[ivy:retrieve] :: hadoop#core;0.20.1!hadoop.tar.gz(source)
[ivy:retrieve] :: USE VERBOSE OR DEBUG MESSAGE LEVEL FOR MORE DETAILS

I checked http://archive.apache.org/dist/hadoop/core/hadoop-0.20.1/hadoop-0.20.1.tar.gz.md5 and found that the format of this file is different from the other Hadoop releases. Could the Hadoop and Hive release managers please have a look? Schubert
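The mismatch is purely one of formatting: Ivy expects the digest laid out the way `md5sum` prints it, while the 0.20.1 file uses an OpenPGP-style layout with uppercase hex in space-separated pairs. A small sketch showing that both formats encode the same 16-byte digest (the file name and contents here are placeholders, not the real tarball):

```python
import hashlib

def md5sum_line(name, data):
    """The md5sum(1) convention: '<lowercase hex digest>  <filename>'."""
    return f"{hashlib.md5(data).hexdigest()}  {name}"

def pgp_style_line(name, data):
    """The layout seen in the 0.20.1 file: 'name:MD5 = 71 9E 16 ...'."""
    hexdig = hashlib.md5(data).hexdigest().upper()
    pairs = " ".join(hexdig[i:i + 2] for i in range(0, 32, 2))
    return f"{name}:MD5 = {pairs}"

data = b"example tarball contents"
a = md5sum_line("hadoop-0.20.1.tar.gz", data)
b = pgp_style_line("hadoop-0.20.1.tar.gz", data)
# Normalizing the PGP-style hex recovers exactly the md5sum digest, which is why
# re-generating the .md5 file with md5sum fixes Ivy without changing the checksum.
same = a.split()[0] == b.split(" = ")[1].replace(" ", "").lower()
```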
Re: Problem with large .lzo files
On Mon, Feb 15, 2010 at 8:07 AM, Steve Kuo kuosen...@gmail.com wrote:
> By the way, if all files have been indexed, DistributedLzoIndexer does not detect that and hadoop throws an exception complaining that the input dir (or file) does not exist. I work around this by catching the exception.
Just fixed that in my github repo. Thanks for the bug report.
> Do you know how to print out the lzo index for the task? I only print out the input file now.
You should be able to downcast the InputSplit to FileSplit, if you're using the new API. From there you can get the start and length of the split.
> Both the Java LzoDecompressor and lzop -d were able to decompress the file correctly. As a matter of fact, my job does not index .lzo files now but processes each as a whole, and it works.
Interesting. If you can somehow make a reproducible test case I'd be happy to look into this. Thanks -Todd
Check out our new, easy, scalable, real-time Data Platform :) And vote!
Hey guys! As some of you know from my blog (and occasional posts here), Drawn to Scale has been building a complete end-to-end platform that makes dealing with data easy and scalable. You can process, query, serve, store, and search *all* your data. We've made our big public announcement today: http://www.roadtofailure.com/2010/02/15/announcing-the-drawn-to-scale-platform/ Part of our product (storage and processing) runs on Hadoop and HBase. Check it out; we'd love to get feedback. We also have a video up describing our product. We're trying to get a slot at the CloudConnect Launchpad. We would love your vote -- we'll bribe you with beer. Vote here: http://bit.ly/9fgH6p -- http://www.drawntoscalehq.com -- Big Data for all. The Big Data Platform. http://www.roadtofailure.com -- The Fringes of Scalability, Social Media, and Computer Science
Re: Problem with large .lzo files
You should be able to downcast the InputSplit to FileSplit, if you're using the new API. From there you can get the start and length of the split. Cool, let me give it a shot. Interesting. If you can somehow make a reproducible test case I'd be happy to look into this. This sounds great. As the input file is 1G, let me do some work on my side to see if I can pinpoint it, so as not to have to transfer a 1G file around. Thanks.
Sun JVM 1.6.0u18
Hey all, Just a note that you should avoid upgrading your clusters to 1.6.0u18. We've seen a lot of segfaults or bus errors on the DN when running with this JVM - Stack found the same thing on one of his clusters as well. We've found 1.6.0u16 to be very stable. -Todd