Re: Working with MapFiles
Hi Ondrej,

On 02.04.2012 13:00, Ondřej Klimpera wrote:

> Ok, thanks. I missed the setup() method because I'm using an older version of Hadoop, so I suppose the configure() method does the same thing in Hadoop 0.20.203.

Aha, if it's possible, try upgrading. I don't know how good support is for versions older than the Hadoop 0.20 branch.

> Now I'm able to load a map file inside the configure() method into a MapFile.Reader instance held as a private class field, and all works fine. I'm just wondering whether the MapFile is replicated on HDFS and the data are read locally, or whether reading from this file will increase network bandwidth because its data is fetched from another node in the Hadoop cluster.

You could use a method variable instead of a private field if you load the file. If the MapFile is written to HDFS then yes, it is replicated, and you can configure the replication factor at file creation (and maybe later). If you use DistributedCache then the files are not written to HDFS, but to the mapred.local.dir [1] folder on every node. That folder's size is configurable, so it's possible the data will still be available there for the next MR job, but don't rely on this. Please read the docs, I may get things wrong. RTFM will save your life ;).

[1] http://developer.yahoo.com/hadoop/tutorial/module5.html#auxdata
[2] https://forums.aws.amazon.com/message.jspa?messageID=152538

> Hopefully my last question to bother you with: is reading files from the DistributedCache (a normal text file) limited to a particular job? Before running a job I add a file to the DistributedCache. When getting the file in my Reducer implementation, can it access DistributedCache files from other jobs? In other words, what will this list:
>
> // Reducer impl.
> public void configure(JobConf job) {
>     URI[] distCacheFileUris = DistributedCache.getCacheFiles(job);
> }
>
> Will the distCacheFileUris variable contain only URIs for this job, or for any job running on the Hadoop cluster? Hope it's understandable. Thanks.

--
Ioan Eugen Stan
http://ieugen.blogspot.com
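For what it's worth, here is an untested sketch of the pattern discussed above, using the old (0.20.x) "mapred" API: opening a MapFile.Reader in configure() and listing the DistributedCache entries. Note that getCacheFiles() reads from the JobConf, so it only sees files registered on *this* job's configuration, not files cached by other jobs. The path "/cache/lookup.map" is a made-up placeholder.

```java
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class LookupReducer extends MapReduceBase {

  private MapFile.Reader reader;

  @Override
  public void configure(JobConf job) {
    try {
      FileSystem fs = FileSystem.get(job);
      // Path of the MapFile *directory* (it contains "data" and "index").
      reader = new MapFile.Reader(fs, "/cache/lookup.map", job);

      // Returns only the cache files registered on this job's JobConf;
      // files cached by other jobs are not visible here.
      URI[] uris = DistributedCache.getCacheFiles(job);
      if (uris != null) {
        for (URI uri : uris) {
          System.out.println("cached for this job: " + uri);
        }
      }
    } catch (IOException e) {
      throw new RuntimeException("failed to open MapFile", e);
    }
  }
}
```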
Re: Working with MapFiles
Hi Ondrej,

On 30.03.2012 14:30, Ondřej Klimpera wrote:

> And one more question: is it even possible to add a MapFile (as it consists of an index and a data file) to the Distributed cache? Thanks

Should be no problem, they are just two files.

> On 03/30/2012 01:15 PM, Ondřej Klimpera wrote:
>> Hello, I'm not sure what you mean by using map reduce setup()? "If the file is that small you could load it all in memory to avoid network IO. Do that in the setup() method of the map reduce job." Can you please explain a little bit more?

Check the javadocs [1]: setup() is called once per task, so you can read the file from HDFS there or perform other initializations.

[1] http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Mapper.html

Reading 20 MB into RAM should not be a problem, and it is preferred if you need to make many requests against that data. It really depends on your use case, so think carefully or just go ahead and test it.

>> Thanks
>>
>> On 03/30/2012 12:49 PM, Ioan Eugen Stan wrote:
>>> Hello Ondrej,
>>>
>>> On 29.03.2012 18:05, Ondřej Klimpera wrote:
>>>> Hello, I have a MapFile as a product of a MapReduce job, and what I need to do is:
>>>> 1. If MapReduce produced more splits as output, merge them into a single file.
>>>> 2. Copy this merged MapFile to another HDFS location and use it as a Distributed cache file for another MapReduce job.
>>>> I'm wondering if it is even possible to merge MapFiles, given their nature, and use them as Distributed cache files.
>>>
>>> A MapFile is actually two files [1]: one SequenceFile (with sorted keys) and a small index for that file. The map file does a version of binary search to find your key and performs seek() to go to the byte offset in the file.
>>>
>>>> What I'm trying to achieve is repeated fast search in this file during another MapReduce job. If my idea is absolutely wrong, can you give me any tip on how to do it? The file is supposed to be 20 MB large. I'm using Hadoop 0.20.203.
>>>
>>> If the file is that small you could load it all in memory to avoid network IO. Do that in the setup() method of the map reduce job. The distributed cache will also use HDFS [2] and I don't think it will provide you with any benefits.
>>>
>>>> Thanks for your reply :)
>>>> Ondrej Klimpera
>>>
>>> [1] http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html
>>> [2] http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html

--
Ioan Eugen Stan
http://ieugen.blogspot.com
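An untested sketch of the "load it all in memory in setup()" idea, using the new "mapreduce" API. The path and the tab-separated format are placeholders for illustration; the point is that setup() runs once per task, so the map is built a single time and then used for every record the task processes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMemoryLookupMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  private final Map<String, String> lookup = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException {
    FileSystem fs = FileSystem.get(context.getConfiguration());
    // ~20 MB easily fits in task memory; placeholder path.
    Path path = new Path("/data/lookup.tsv");
    BufferedReader in =
        new BufferedReader(new InputStreamReader(fs.open(path)));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        String[] parts = line.split("\t", 2);
        if (parts.length == 2) {
          lookup.put(parts[0], parts[1]);
        }
      }
    } finally {
      in.close();
    }
  }

  // map() can now do lookup.get(key) with no network IO per record.
}
```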
Re: Working with MapFiles
Hello Ondrej,

On 29.03.2012 18:05, Ondřej Klimpera wrote:

> Hello, I have a MapFile as a product of a MapReduce job, and what I need to do is:
>
> 1. If MapReduce produced more splits as output, merge them into a single file.
> 2. Copy this merged MapFile to another HDFS location and use it as a Distributed cache file for another MapReduce job.
>
> I'm wondering if it is even possible to merge MapFiles, given their nature, and use them as Distributed cache files.

A MapFile is actually two files [1]: one SequenceFile (with sorted keys) and a small index for that file. The map file does a version of binary search to find your key and performs seek() to go to the byte offset in the file.

> What I'm trying to achieve is repeated fast search in this file during another MapReduce job. If my idea is absolutely wrong, can you give me any tip on how to do it? The file is supposed to be 20 MB large. I'm using Hadoop 0.20.203.

If the file is that small you could load it all in memory to avoid network IO. Do that in the setup() method of the map reduce job. The distributed cache will also use HDFS [2] and I don't think it will provide you with any benefits.

> Thanks for your reply :)
> Ondrej Klimpera

[1] http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html
[2] http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html

--
Ioan Eugen Stan
http://ieugen.blogspot.com
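A small untested sketch illustrating the two-file structure described above: a MapFile is a directory holding a sorted "data" SequenceFile plus an "index" file, keys must be appended in sorted order, and get() uses the index plus a seek to find a key. The directory path is a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class MapFileDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    String dir = "/tmp/demo.map"; // a directory: "data" + "index" inside

    // Keys must be appended in sorted order or the writer throws.
    MapFile.Writer writer =
        new MapFile.Writer(conf, fs, dir, Text.class, Text.class);
    writer.append(new Text("a"), new Text("1"));
    writer.append(new Text("b"), new Text("2"));
    writer.close();

    // get() binary-searches the in-memory index, then seeks in "data".
    MapFile.Reader reader = new MapFile.Reader(fs, dir, conf);
    Text value = new Text();
    reader.get(new Text("b"), value); // value is now "2"
    reader.close();
  }
}
```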
Re: how to get rid of -libjars ?
On 06.03.2012 17:37, Jane Wayne wrote:

> currently, i have my main jar and then 2 dependent jars. what i do is:
>
> 1. copy dependent-1.jar to $HADOOP/lib
> 2. copy dependent-2.jar to $HADOOP/lib
>
> then, when i need to run my job, MyJob inside main.jar, i do the following:
>
> hadoop jar main.jar demo.MyJob -libjars dependent-1.jar,dependent-2.jar -Dmapred.input.dir=/input/path -Dmapred.output.dir=/output/path
>
> what i want to do is NOT copy the dependent jars to $HADOOP/lib and always specify -libjars. is there any way around this multi-step procedure? i really do not want to clutter $HADOOP/lib or specify a comma-delimited list of jars for -libjars. any help is appreciated.

Hello,

Have you tried specifying the full path to the jars on -libjars? My experience with -libjars is that it didn't work as advertised. Search for an older post on the list about this issue (-libjars not working). I tried adding a lot of jars and some got on the job classpath (2 of them), most of them didn't. I got around this by including all the jars in a lib directory inside the main jar.

Cheers,

--
Ioan Eugen Stan
http://ieugen.blogspot.com
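One thing worth checking when -libjars seems ignored: the generic options (-libjars, -D, -files) are only parsed by GenericOptionsParser, which ToolRunner invokes for you. A sketch of the driver shape that makes -libjars take effect (class and job names are placeholders):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyJob extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    // getConf() already has -libjars / -D options applied by
    // GenericOptionsParser before run() is called.
    Job job = new Job(getConf(), "my job");
    job.setJarByClass(MyJob.class);
    // ... set mapper/reducer, input/output paths and formats ...
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner strips the generic options and passes the rest to run().
    System.exit(ToolRunner.run(new Configuration(), new MyJob(), args));
  }
}
```

If the driver calls `new JobConf()` directly in main() instead of going through ToolRunner, -libjars is silently dropped, which matches the symptoms described in this thread.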
Re: ClassNotFoundException: -libjars not working?
On 28.02.2012 10:58, madhu phatak wrote:

> Hi, -libjars doesn't always work. The better way is to create a runnable jar with all dependencies (if the number of dependencies is small), or you have to keep the jars in the lib folder of Hadoop on all machines.

Thanks for the reply Madhu,

I adopted the second solution, as explained in [1]. From what I found browsing the net, it seems that -libjars is broken in Hadoop versions > 0.18. I didn't get time to check the code yet. Cloudera's released Hadoop sources are packaged a bit oddly and NetBeans doesn't seem to play well with that, which really dampens my will to try to fix the problem.

-libjars is a nice feature that permits the use of skinny jars and would help system admins do better packaging. It also allows better control over the classpath. Too bad it didn't work.

[1] http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/

Cheers,

--
Ioan Eugen Stan
http://ieugen.blogspot.com
Re: LZO with sequenceFile
2012/2/26 Mohit Anchlia:

> Thanks. Does it mean LZO is not installed by default? How can I install LZO?

The LZO library is released under the GPL and I believe it can't be included in most distributions of Hadoop because of this (you can't mix GPL with non-GPL stuff). It should be easily available though.

> On Sat, Feb 25, 2012 at 6:27 PM, Shi Yu wrote:
>
>> Yes, it is supported by Hadoop sequence files. It is splittable by default. If you have installed and specified LZO correctly, use these:
>>
>> org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.setCompressOutput(job, true);
>>
>> org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.setOutputCompressorClass(job, com.hadoop.compression.lzo.LzoCodec.class);
>>
>> org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
>>
>> job.setOutputFormatClass(org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat.class);
>>
>> Shi

--
Ioan Eugen Stan
http://ieugen.blogspot.com/
ClassNotFoundException: -libjars not working?
are/mailbox-convertor/lib/servlet-api-2.5-6.1.14.jar,/usr/share/mailbox-convertor/lib/antisamy-1.4.4.jar,/usr/share/mailbox-convertor/lib/antisamy-sample-configs-1.4.4.jar,/usr/share/mailbox-convertor/lib/jcl-over-slf4j-1.6.1.jar,/usr/share/mailbox-convertor/lib/jul-to-slf4j-1.6.1.jar,/usr/share/mailbox-convertor/lib/slf4j-api-1.6.1.jar,/usr/share/mailbox-convertor/lib/slf4j-log4j12-1.6.1.jar,/usr/share/mailbox-convertor/lib/spring-aop-3.0.5.RELEASE.jar,/usr/share/mailbox-convertor/lib/spring-asm-3.1.0.RELEASE.jar,/usr/share/mailbox-convertor/lib/spring-beans-3.0.5.RELEASE.jar,/usr/share/mailbox-convertor/lib/spring-context-3.0.5.RELEASE.jar,/usr/share/mailbox-convertor/lib/spring-core-3.0.5.RELEASE.jar,/usr/share/mailbox-convertor/lib/spring-expression-3.1.0.RELEASE.jar,/usr/share/mailbox-convertor/lib/uncommons-maths-1.2.2.jar,/usr/share/mailbox-convertor/lib/watchmaker-framework-0.6.2.jar,/usr/share/mailbox-convertor/lib/snappy-java-1.0.3.2.jar,/usr/share/mailbox-convertor/lib/snakeyaml-1.6.jar,/usr/share/mailbox-convertor/lib/oro-2.0.8.jar,/usr/share/mailbox-convertor/lib/stax-api-1.0.1.jar,/usr/share/mailbox-convertor/lib/jasper-compiler-5.5.23.jar,/usr/share/mailbox-convertor/lib/jasper-runtime-5.5.23.jar,/usr/share/mailbox-convertor/lib/wstx-asl-3.2.7.jar,/usr/share/mailbox-convertor/lib/xml-apis-1.3.04.jar,/usr/share/mailbox-convertor/lib/xml-apis-ext-1.3.04.jar,/usr/share/mailbox-convertor/lib/xmlenc-0.52.jar,/usr/share/mailbox-convertor/lib/xpp3_min-1.1.3.4.O.jar -t lucehbase-emails -m 13294028075653

The tmpjars from the generated jobconf looks like this (taken from the web interface):

tmpjars file:/usr/share/mailbox-convertor/lib/zookeeper-3.4.2.jar,file:/usr/lib/hadoop-0.20/hadoop-core-0.20.2-cdh3u1.jar,file:/usr/share/mailbox-convertor/lib/hbase-0.92.0-1and1.jar

and mapred.job.classpath.files is:
/var/tmp/mapred/staging/hbase/.staging/job_201201271031_0027/libjars/zookeeper-3.4.2.jar:/var/tmp/mapred/staging/hbase/.staging/job_201201271031_0027/libjars/hadoop-core-0.20.2-cdh3u1.jar:/var/tmp/mapred/staging/hbase/.staging/job_201201271031_0027/libjars/hbase-0.92.0-1and1.jar

and I get:

Error: java.lang.ClassNotFoundException: org.apache.commons.lang.ArrayUtils
    at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
    at com.unitedinternet.portal.emailorganizer.TableToDictionaryMapper.map(TableToDictionaryMapper.java:31)
    at com.unitedinternet.portal.emailorganizer.TableToDictionaryMapper.map(TableToDictionaryMapper.java:23)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323)
    at org.apache.hadoop.mapred.Child$4.run(Child.java:270)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1127)
    at org.apache.hadoop.mapred.Child.main(Child.java:264)

Regards,

--
Ioan Eugen Stan
http://ieugen.blogspot.com
Re: Best Linux Operating system used for Hadoop
On 27.01.2012 11:15, Sujit Dhamale wrote:

> Hi All, I am new to Hadoop. Can anyone tell me which is the best Linux operating system for installing and running Hadoop? Nowadays I am using Ubuntu 11.4 and installed Hadoop on it, but it crashes a number of times. Can someone please help me out? Kind regards, Sujit Dhamale

I think the most important thing you have to keep in mind is who is going to administer your cluster. It's important that the administrator is comfortable/experienced with the distribution you are going to use. As for which distribution to use, you can safely choose one that has very good support (community and/or vendor). In no particular order:

- Debian
- Ubuntu LTS
- RedHat/CentOS (company/community support)

There are a few Hadoop distributions available from companies that you can check out:

- MapR
- Cloudera's Hadoop distribution
- Hortonworks (not sure about them providing a Hadoop distribution)

They offer installation instructions on different platforms for their products. Maybe you can check them out to see if they are good for you.

Cheers,

--
Ioan Eugen Stan
http://ieugen.blogspot.com
Re: missing job history and strange MR job output
On 13.01.2012 06:00, Harsh J wrote:

> Perhaps you aren't writing it properly? It's hard to tell what your problem may be without looking at some code snippets (sensitive/irrelevant parts may be cut out, or even pseudocode typed up is fine), etc.

Hello Harsh and others,

It's fixed. After resolving a childish bug on my part (with building the Scan object) I still had problems with the setup. It ran everything up until waitForCompletion(), where it hung. I checked the logs and they barely showed any output from the MapReduce mini cluster, just a few lines announcing the start of TaskTrackers and JobTrackers, etc. Removing the local Maven repository finally solved the issue and now I can happily continue with coding. It seems that periodically cleaning the Maven repo is a must these days.

Thanks for the support,

--
Ioan Eugen Stan
http://ieugen.blogspot.com
missing job history and strange MR job output
Hello,

I've been struggling for two days now to figure out what's wrong with my map-reduce job, without success. I'm trying to write a map reduce job that reads data from an HBase table and outputs to a sequence file. I'm using the HBaseTestingUtility with the mini clusters. All went well, but after a big refactoring of my code I no longer get the job history, and the output of my map reduce is a sequence file that contains just the header with the class names. I've hit a brick wall and don't really know where to go right now.

Cheers,

p.s. If this is better suited to the hbase ML please let me know, but the code that reads the HBase table seems OK. I'm doing a small scan beforehand to count some stuff to pass to the job.

--
Ioan Eugen Stan
http://ieugen.blogspot.com/
Re: some guidance needed
I have forwarded this discussion to my mentors so they are informed, and I hope they will provide better input regarding email storage.

> I second what Todd said: even with FuseHDFS, mounting HDFS as a regular file system, it won't give you the immediate response about the file status that you need. I believe Google implemented Gmail with HBase. Here is an example of implementing a mail store with Cassandra:
> http://ewh.ieee.org/r6/scv/computer/nfic/2009/IBM-Jun-Rao.pdf
>
> Mark

Thanks Mark, I will look into that. I am currently watching the Cloudera Hadoop training [1] to get a better view of how things work. I have one question: what is the defining difference between Cassandra and HBase?

Also, Eric, one of my mentors, suggested I use Gora for this, and after a quick look at Gora I saw that it is an ORM for HBase and Cassandra which would allow me to switch between them. The downside is that Gora is still incubating, so a piece of advice about using it or not is welcome. I will also ask on the Gora mailing list to see how things are there.

>> I would encourage you to look at a system like HBase for your mail backend. HDFS doesn't work well with lots of little files, and also doesn't support random update, so existing formats like Maildir wouldn't be a good fit.

I don't think I understand correctly what you mean by random updates. E-mails are immutable, so once written they are not going to be updated. But if you are referring to the fact that lots of (small) files will be written in a directory and that this can be a problem, then I get it. This would also mean that the mailbox format (all emails in one file) is even less appropriate than Maildir. But since e-mails are immutable and adding a mail to the mailbox means appending a small piece of data to the file, this should not be a problem if Hadoop has append.

The presentation on Vimeo stated that HDFS 0.19 did not have append. I don't know yet what the status is on that, but things are a little brighter. You could have a mailbox file that could grow to a very large size. This would put all of a user's emails into one big file that is easy to manage; the only thing missing is fetching the emails. Since emails are appended to the file (inbox) as they come, and you are usually interested in the latest emails received, you could just read the tail of the file and do some indexing based on that.

Should I post this on the HDFS mailing-list also? I'm talking without real experience with Hadoop, so shut me up if I'm wrong.

>> --
>> Todd Lipcon
>> Software Engineer, Cloudera

You are from Cloudera, nice. Answers straight from the source :).

[1] http://vimeo.com/3591321

Thanks,

--
Ioan-Eugen Stan
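To make the append idea concrete, here is an untested sketch of appending one raw mail to a per-user mailbox file via FileSystem.append(). Whether append works depends on the Hadoop version and on it being enabled (the dfs.support.append setting); the paths and method names below other than the FileSystem API are placeholders I made up.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MailboxAppender {

  /** Appends one immutable mail to the user's single mailbox file. */
  public static void appendMail(Configuration conf, String user,
                                byte[] rawMail) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    Path mailbox = new Path("/mail/" + user + "/INBOX"); // placeholder layout
    // Create the mailbox on first delivery, append afterwards.
    FSDataOutputStream out = fs.exists(mailbox)
        ? fs.append(mailbox)
        : fs.create(mailbox);
    try {
      out.write(rawMail);
    } finally {
      out.close();
    }
  }
}
```

Fetching the latest mails would then amount to opening the file, seeking near the end, and scanning forward, which matches the "read the tail and index it" idea above.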
some guidance needed
Hello everybody,

I'm a GSoC student for this year and I will be working on James [1]. My project is to implement email storage over HDFS [2]. I am quite new to Hadoop and associates, and I am looking for some hints to get started on the right track.

I have installed a single-node Hadoop instance on my machine and played around with it (ran some examples), but I am interested in what you (more experienced people) think is the best way to approach my problem. I am a little puzzled by the fact that I read Hadoop is best used for large files, and emails aren't that large from what I know.

Another thing that crossed my mind is that since HDFS is a file system, wouldn't it be possible to set it as a back-end for the (existing) Maildir and mailbox storage formats? (I think this question is more suited to the James mailing list, but if you have some ideas please speak your mind.)

Also, any development resources to get me started are welcome.

[1] http://james.apache.org/mailbox/
[2] https://issues.apache.org/jira/browse/MAILBOX-44

Regards,

--
Ioan Eugen Stan