Re: Indexing on top of Hadoop
Hi, you might find some code at katta.sourceforge.net very helpful. Stefan ~~~ Hadoop training and consulting http://www.scaleunlimited.com http://www.101tec.com On Jun 10, 2009, at 5:49 AM, kartik saxena wrote: Hi, I have a huge LDIF file on the order of GBs, spanning some million user records. I am running the example Grep job on that file. The search results have not really been up to expectations, because it is a basic per-line, brute-force scan. I was thinking of building some indexes inside HDFS for that file, so that the search results could improve. What could I possibly try to achieve this? Secura
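For the archive, here is a toy sketch of the indexing idea in plain Python (not Hadoop/HDFS code — all names below are illustrative): instead of grepping every line per query, a one-time map/reduce-style pass builds an inverted index from terms to record ids, so a lookup becomes a dictionary hit rather than a brute-force scan.

```python
from collections import defaultdict

def map_phase(records):
    # Emit (term, record_id) pairs, the way a Hadoop mapper would.
    for record_id, line in enumerate(records):
        for term in line.lower().split():
            yield term, record_id

def reduce_phase(pairs):
    # Group record ids per term, the way a Hadoop reducer would.
    index = defaultdict(set)
    for term, record_id in pairs:
        index[term].add(record_id)
    return index

# A few made-up LDIF-ish records:
records = [
    "dn: uid=alice,ou=people",
    "dn: uid=bob,ou=people",
    "dn: uid=alice,ou=admins",
]
index = reduce_phase(map_phase(records))
# A query is now an index lookup instead of a scan over all records:
print(sorted(index["uid=alice,ou=people"]))
```

Katta (linked above) packages the real version of this idea: Lucene index shards built with Hadoop and served from many nodes.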
Re: Distributed Lucene Questions
Hi, you might want to check out: http://katta.sourceforge.net/ Stefan ~~~ Hadoop training and consulting http://www.scaleunlimited.com http://www.101tec.com On Jun 1, 2009, at 9:54 AM, Tarandeep Singh wrote: Hi All, I am trying to build a distributed system to build and serve Lucene indexes. I came across the Distributed Lucene project - http://wiki.apache.org/hadoop/DistributedLucene https://issues.apache.org/jira/browse/HADOOP-3394 - and have a couple of questions. It will be really helpful if someone can provide some insights. 1) Is this code production ready? 2) Does someone have performance data for this project? 3) It allows searches and updates/deletes to be performed at the same time. How well will the system perform if there are frequent updates? Will it handle the search and update load easily, or would it be better to rebuild or update the indexes on different machines and then deploy the indexes back to the machines that are serving them? Basically I am trying to choose between 2 approaches: 1) Use Hadoop to build and/or update Lucene indexes and then deploy them on a separate cluster that takes care of load balancing, fault tolerance etc. There is a package in Hadoop contrib that does this, so I can use that code. 2) Use and/or modify the Distributed Lucene code. I am expecting daily updates to our index, so I am not sure if the Distributed Lucene code (which allows searches and updates on the same indexes) will be able to handle the search and update load efficiently. Any suggestions? Thanks, Tarandeep
ScaleCamp: get together the night before Hadoop Summit
Hi All, We are planning a community event the night before the Hadoop Summit. This BarCamp (http://en.wikipedia.org/wiki/BarCamp) event will be held at the same venue as the Summit (Santa Clara Marriott). Refreshments will be served to encourage socializing. To kick off conversations for the social part of the evening, we are offering people the opportunity to present an experience report of their project (within a 15 min presentation). We have 12 slots in 3 parallel tracks max. The focus should be on projects leveraging technologies from the Hadoop ecosystem. Please join us and mingle with the rest of the Hadoop community. To find out more about this event and sign up, please visit: http://www.scaleunlimited.com/events/scale_camp Please submit your presentation here: http://www.scaleunlimited.com/about-us/contact Stefan P.S. Please spread the word! P.P.S. Apologies for the cross posting.
[ANNOUNCE] Katta 0.5 released
(...apologies for the cross posting...) Release 0.5 of Katta is now available. Katta - Lucene in the cloud. http://katta.sourceforge.net This release fixes bugs from 0.4, including one that caused results to be sorted incorrectly under load. 0.5 also upgrades ZooKeeper to version 3.1, Lucene to version 2.4.1, and Hadoop to 0.19.0. The new API supports Lucene Query objects instead of just Strings; the release also adds support for Amazon EC2, switches to Ant and Ivy as the build system, and includes some more minor improvements. Also, we improved our online documentation and added sample code that illustrates how to create a sharded Lucene index with Hadoop. See the changes at http://oss.101tec.com/jira/browse/KATTA?report=com.atlassian.jira.plugin.system.project:changelog-panel The binary distribution is available at https://sourceforge.net/projects/katta/ Stefan ~~~ Hadoop training and consulting http://www.scaleunlimited.com http://www.101tec.com
ec2 ganglia fixing missing graphs
Hi, for the mail archive... I'm using the Hadoop EC2 scripts and noticed that Ganglia actually does not show any graphs. I was able to fix this by adding dejavu-fonts to the packages that are installed via yum in create-hadoop-image-remote.sh. The line now looks like this: yum -y install rsync lynx screen ganglia-gmetad ganglia-gmond ganglia-web dejavu-fonts httpd php Since this affects the Hadoop image, it might be interesting to fix this and create a new public AMI. Stefan ~~~ Hadoop training and consulting http://www.scaleunlimited.com http://www.101tec.com
contrib/ec2 USER_DATA not used
Hi, can someone tell me what the variable USER_DATA in launch-hadoop-master is all about? I can't see that it is reused in that script or any other script. Isn't USER_DATA_FILE the way those parameters are passed to the nodes? The line is: USER_DATA=MASTER_HOST=master,MAX_MAP_TASKS=$MAX_MAP_TASKS,MAX_REDUCE_TASKS=$MAX_REDUCE_TASKS,COMPRESS=$COMPRESS Any hints? Thanks, Stefan ~~~ Hadoop training and consulting http://www.scaleunlimited.com http://www.101tec.com
Re: [video] visualization of the hadoop code history
Very cool stuff, but I don't see a reference anywhere to the author of the visualization, which seems like poor form for a marketing video. I apologize if I missed a reference somewhere. Jeff, you missed it! It is the first text screen at the end of the video. It is actually a cool open source project with quite a few contributors. Stefan ~~~ Hadoop training and consulting http://www.scaleunlimited.com http://www.101tec.com
Re: [video] visualization of the hadoop code history
Owen O'Malley wrote: It is interesting, but it would be more interesting to track the authors of the patches rather than the committers. The two are rarely the same. Indeed. There was a period of over a year where I wrote hardly anything but committed almost everything. So I am vastly overrepresented in commits. Thanks for the feedback. The video was rendered from the svn log file (text version). If someone has a script that cleans this file up and replaces the committer names with the real patch authors, we are happy to render the video again. Cheers, Stefan ~~~ Hadoop training and consulting http://www.scaleunlimited.com http://www.101tec.com
New Orleans - drinks tonight
Hey, anyone else early for the Hadoop boot camp as well? How about meeting for drinks tonight? Send me a mail off-list... Stefan
Re: Hadoop Profiling!
Just run your map/reduce job locally and connect your profiler. I use YourKit. Works great! You can profile your map/reduce job by running it in local mode, just as with any other Java app. However, we have also profiled on a grid. You just need to install the YourKit agent into the JVM of the node you want to profile, and then you connect to the node when the job runs. However, you need to time things well, since the task JVM is shut down as soon as your job is done. Stefan ~~~ 101tec Inc., Menlo Park, California web: http://www.101tec.com blog: http://www.find23.net On Oct 8, 2008, at 11:27 AM, Gerardo Velez wrote: Hi! I've developed a Map/Reduce algorithm to analyze some logs from a web application. We are ready to start the QA test phase, so now I would like to know how efficient my application is from a performance point of view. Is there a procedure I could use to do some profiling? Basically I need basic data, like execution time or code bottlenecks. Thanks in advance. -- Gerardo Velez
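To make the grid-profiling approach above concrete, here is a sketch of what the task JVM options could look like in hadoop-site.xml. The agent path, port, and -Xmx value are assumptions for a typical Linux YourKit install; check the YourKit documentation for your setup.

```xml
<property>
  <name>mapred.child.java.opts</name>
  <!-- heap as before, plus the (assumed) YourKit agent path and listen port -->
  <value>-Xmx512m -agentpath:/opt/yourkit/bin/linux-x86-64/libyjpagent.so=port=10001</value>
</property>
```

With this in place you attach the YourKit GUI to the node's port while the task is running — and, as noted above, you have to be quick, because the task JVM exits when the job finishes.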
Re: nagios to monitor hadoop datanodes!
Try JMX. There should also be a JMX-to-SNMP bridge available somewhere. http://blogs.sun.com/jmxetc/entry/jmx_vs_snmp ~~~ 101tec Inc., Menlo Park, California web: http://www.101tec.com blog: http://www.find23.net On Oct 6, 2008, at 10:05 AM, Gerardo Velez wrote: Hi Everyone! I would like to implement Nagios health monitoring of a Hadoop grid. Do some of you have experience here, or any approach or advice I could use? At this time I've only been playing with the JSP files that Hadoop has integrated into it, so I'm not sure whether it is a good idea for Nagios to request monitoring info from these JSPs. Thanks in advance! -- Gerardo
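For the archive: one way to expose the JMX metrics mentioned above is to enable remote JMX on the daemons via hadoop-env.sh. The port number below is an assumption, and a production setup should enable authentication/SSL rather than disabling them as in this sketch; the com.sun.management.jmxremote.* properties themselves are standard Sun JVM options.

```shell
# hadoop-env.sh -- enable remote JMX on the DataNode (port is an assumption)
export HADOOP_DATANODE_OPTS="-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=8006 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false $HADOOP_DATANODE_OPTS"
```

Nagios can then poll the exposed MBeans (directly via a JMX plugin, or through an SNMP bridge as linked above) instead of scraping the JSPs.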
Re: Searching Lucene Index built using Hadoop
Hi, you might find http://katta.wiki.sourceforge.net/ interesting. If you have any Katta-related questions, please use the Katta mailing list. Stefan ~~~ 101tec Inc., Menlo Park, California web: http://www.101tec.com blog: http://www.find23.net On Oct 6, 2008, at 10:26 AM, Saranath wrote: I'm trying to index a large dataset using Hadoop+Lucene. I used the example under hadoop/trunk/src/contrib/index/ for indexing. I'm unable to find a way to search the index that was successfully built. I tried copying the index over to one machine and merging the shards using IndexWriter.addIndexesNoOptimize(). I would like to hear your input on the best way to index+search large datasets. Thanks, Saranath -- View this message in context: http://www.nabble.com/Searching-Lucene-Index-built-using-Hadoop-tp19842438p19842438.html Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
Re: Katta presentation slides
Hi All, thanks a lot for your interest. Both my Katta and the Hadoop survey slides can be found here: http://find23.net/2008/09/23/hadoop-user-group-slides/ If you have a chance, please give Katta a test drive and give us some feedback. Thanks, Stefan On Sep 23, 2008, at 6:20 PM, Rafael Turk wrote: +1 On Tue, Sep 23, 2008 at 5:14 AM, Naama Kraus [EMAIL PROTECTED] wrote: I'd be interested too. Naama On Mon, Sep 22, 2008 at 11:32 PM, Deepika Khera [EMAIL PROTECTED] wrote: Hi Stefan, Are the slides from the Katta presentation up somewhere? If not, could you please post them? Thanks, Deepika -- If you want your children to be intelligent, read them fairy tales. If you want them to be more intelligent, read them more fairy tales. (Albert Einstein)
[ANN] katta-0.1.0 release - distribute lucene indexes in a grid
After 5 months of work we are happy to announce the first developer preview release of Katta. This release contains all the functionality needed to serve a large, sharded Lucene index on many servers. Katta stands on the shoulders of the giants Lucene, Hadoop and ZooKeeper. Main features:
+ Plays well with Hadoop
+ Apache License, Version 2
+ Node failure tolerance
+ Master failover
+ Shard replication
+ Pluggable network topologies (shard distribution and selection policies)
+ Node load balancing at the client
Please give Katta a test drive and give us some feedback! Download: http://sourceforge.net/project/platformdownload.php?group_id=225750 Website: http://katta.sourceforge.net/ Getting started in less than 3 min: http://katta.wiki.sourceforge.net/Getting+started Installation on a grid: http://katta.wiki.sourceforge.net/Installation Katta presentation today (09/17/08) at the Hadoop user group, Yahoo Mission College: http://upcoming.yahoo.com/event/1075456/ * slides will be available online later Many thanks for the hard work: Johannes Zillmann, Marko Bauhardt, Martin Schaaf (101tec) Apologies for the cross posting. Yours, the Katta Team. ~~~ 101tec Inc., Menlo Park, California http://www.101tec.com
how to LZO
Hi, I would love to use the LZO codec. However, for some reason I always only get ... INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library My hadoop-site.xml looks like:
<property>
  <name>io.compression.codecs</name>
  <value>org.apache.hadoop.io.compress.LzoCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec</value>
  <description>A list of the compression codec classes that can be used for compression/decompression.</description>
</property>
I also think I have lzo installed on all my nodes:
$ yum list | grep lzo
lzo.x86_64 2.02-3.fc8 installed
lzo.i386 2.02-3.fc8 installed
lzo-devel.i386 2.02-3.fc8 fedora
lzo-devel.x86_64 2.02-3.fc8 fedora
lzop.x86_64 1.02-0.5.rc1.fc8 fedora
Anything I might have missed that you can think of? Thanks for any hints! Stefan
Meet Hadoop presentation: the math from page 5
Hi, I tried to better understand slide 5 of "Meet Hadoop": http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/oscon-part-1.pdf The slide says: given:
– 10 MB/s transfer
– 10 ms/seek
– 100 B/entry (10B entries)
– 10 kB/page (1B pages)
updating 1% of entries (100M) takes:
– 1000 days with random B-Tree updates
– 100 days with batched B-Tree updates
– 1 day with sort & merge
I wonder how exactly to calculate the 1000 days and the 100 days. time for seeking = 100,000,000 * lg(1,000,000,000) * 10 ms ≈ 346 days time to read all pages = 100,000,000 * lg(1,000,000,000) * (10 kB / 10 MB/s) ≈ 33.8 days Since we might need to write all pages again, we can add another 33 days; but the result is not 1000 days, so I am doing something fundamentally wrong. :o Thanks for any help... Stefan
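For the archive: the one figure on that slide that does come out cleanly from the given numbers is the sort & merge one, since a sequential pass is transfer-bound rather than seek-bound. This is only my reading of the slide, not an authoritative derivation; the random/batched B-Tree figures look like order-of-magnitude estimates rather than exact arithmetic.

```latex
% Sequential sort & merge pass: stream all the data, no per-entry seeks.
% 10^{10} entries \times 100\,\text{B/entry} = 10^{12}\,\text{B} = 1\,\text{TB}.
t_{\text{stream}}
  = \frac{10^{10} \cdot 100\,\text{B}}{10\,\text{MB/s}}
  = \frac{10^{12}\,\text{B}}{10^{7}\,\text{B/s}}
  = 10^{5}\,\text{s} \approx 1.2\ \text{days}
```

So "1 day with sort merge" is roughly one streaming pass over the full 1 TB dataset at 10 MB/s.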
Re: trouble setting up hadoop
Looks like you have not installed a correct Java. Make sure you have a Sun Java installed on your nodes and that java is in your path; JAVA_HOME should be set as well. gnu.gcj is the GNU Java compiler, not a Java you can use to run Hadoop. Check this on the command line: $ java -version You should see something like this: java version "1.5.0_13" Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_13-b05-237) Java HotSpot(TM) Client VM (build 1.5.0_13-119, mixed mode, sharing) HTH On Jun 23, 2008, at 9:40 PM, Sandy wrote: I apologize for the severe basicness of this error, but I am in the process of getting Hadoop set up. I have been following the instructions in the Hadoop quickstart. I have confirmed that bin/hadoop will give me help usage information. I am now at the stage of standalone operation. I typed in: mkdir input cp conf/*.xml input bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+' at which point I get: Exception in thread "main" java.lang.ClassNotFoundException: java.lang.Iterable not found in gnu.gcj.runtime.SystemClassLoader{urls=[file:/home/sjm/Desktop/hadoop-0.16.4/bin/../conf/,file:/home/sjm/Desktop/hadoop-0.16.4/bin/../,file:/home/sjm/Desktop/hadoop-0.16.4/bin/../hadoop-0.16.4-core.jar,file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/commons-cli-2.0-SNAPSHOT.jar,file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/commons-codec-1.3.jar,file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/commons-httpclient-3.0.1.jar,file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/commons-logging-1.0.4.jar,file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/commons-logging-api-1.0.4.jar,file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/jets3t-0.5.0.jar,file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/jetty-5.1.4.jar,file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/junit-3.8.1.jar,file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/kfs-0.1.jar,file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/log4j-1.2.13.jar,file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/servlet-api.jar,file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/xmlenc-0.52.jar,file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/jetty-ext/commons-el.jar,file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/jetty-ext/jasper-compiler.jar,file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/jetty-ext/jasper-runtime.jar,file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/jetty-ext/jsp-api.jar], parent=gnu.gcj.runtime.ExtensionClassLoader{urls=[], parent=null}} at java.net.URLClassLoader.findClass (libgcj.so.7) at java.lang.ClassLoader.loadClass (libgcj.so.7) at java.lang.ClassLoader.loadClass (libgcj.so.7) at java.lang.VMClassLoader.defineClass (libgcj.so.7) at java.lang.ClassLoader.defineClass (libgcj.so.7) at java.security.SecureClassLoader.defineClass (libgcj.so.7) at java.net.URLClassLoader.findClass (libgcj.so.7) at java.lang.ClassLoader.loadClass (libgcj.so.7) at java.lang.ClassLoader.loadClass (libgcj.so.7) at org.apache.hadoop.util.RunJar.main (RunJar.java:107) I suspect the issue is path related, though I am not certain. Could someone please point me in the right direction? Much thanks, SM ~~~ 101tec Inc. Menlo Park, California, USA http://www.101tec.com
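A sketch of the fix described above, switching the shell from gcj to a Sun JVM. The JDK path below is an assumption for a typical Linux install; adjust it to wherever the Sun JDK landed on your machine.

```shell
# Point Hadoop at a Sun JVM instead of gcj (path is an assumption)
export JAVA_HOME=/usr/java/jdk1.5.0_13
export PATH="$JAVA_HOME/bin:$PATH"
java -version   # should report "Java(TM) ... HotSpot", not gij/gcj
```

Setting the same JAVA_HOME in conf/hadoop-env.sh makes the change stick for the Hadoop scripts as well.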
Re: login error while running hadoop on MacOSX 10.5.*
Which user runs the hadoop? It should be the same you trigger the job with. On Jun 23, 2008, at 10:29 PM, Lev Givon wrote: I recently installed hadoop 0.17.0 in pseudo-distributed mode on a MacOSX 10.5.3 system with software managed by Fink installed in /sw. I configured hadoop to use the stock Java 1.5.0_13 installation in /Library/Java/Home. When I attempted to run a simple map/reduce job off of the dfs after starting up the daemons, the job failed with the following Java error (501 is the ID of the user used to start the hadoop daemons and run the map/reduce job): javax.security.auth.login.LoginException: Login failed: /sw/bin/whoami: cannot find name for user ID 501 What might be causing this to occur? Manually running /sw/bin/whoami as the user in question returns the corresponding username. L.G. ~~~ 101tec Inc. Menlo Park, California, USA http://www.101tec.com
Re: login error while running hadoop on MacOSX 10.5.*
The Fink part and /sw confuse me. When I do a which on my OS X box I get: $ which whoami /usr/bin/whoami Are you using the same whoami on your console as Hadoop is? On Jun 23, 2008, at 10:37 PM, Lev Givon wrote: Both the daemons and the job were started using the same user. L.G. Received from Stefan Groschupf on Mon, Jun 23, 2008 at 04:34:54PM EDT: Which user runs the hadoop? It should be the same you trigger the job with. On Jun 23, 2008, at 10:29 PM, Lev Givon wrote: I recently installed hadoop 0.17.0 in pseudo-distributed mode on a MacOSX 10.5.3 system with software managed by Fink installed in /sw. I configured hadoop to use the stock Java 1.5.0_13 installation in /Library/Java/Home. When I attempted to run a simple map/reduce job off of the dfs after starting up the daemons, the job failed with the following Java error (501 is the ID of the user used to start the hadoop daemons and run the map/reduce job): javax.security.auth.login.LoginException: Login failed: /sw/bin/whoami: cannot find name for user ID 501 What might be causing this to occur? Manually running /sw/bin/whoami as the user in question returns the corresponding username. L.G. ~~~ 101tec Inc. Menlo Park, California, USA http://www.101tec.com ~~~ 101tec Inc. Menlo Park, California, USA http://www.101tec.com
Re: login error while running hadoop on MacOSX 10.5.*
Sorry, I'm not a Unix expert; however, the problem is clearly related to whoami, since this throws an error. I run Hadoop in all kinds of configurations super smoothly on my OS X boxes. Maybe rename or move /sw/bin/whoami for a test. Also make sure you restart the OS X terminal, since changes in .bash_profile are only picked up when you log back in to the command line. Sorry, that is all I know and can guess... :( On Jun 23, 2008, at 10:56 PM, Lev Givon wrote: Yes; I have my PATH configured to list /sw/bin before /usr/bin. Curiously, hadoop tries to invoke /sw/bin/whoami even when I set PATH to /usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin:/usr/X11R6/bin:/usr/local/bin before starting the daemons and attempting to run the job. L.G. Received from Stefan Groschupf on Mon, Jun 23, 2008 at 04:49:23PM EDT: The fink part and /sw confuses me. When I do a which on my os x I get: $ which whoami /usr/bin/whoami Are you using the same whoami on your console as hadoop? On Jun 23, 2008, at 10:37 PM, Lev Givon wrote: Both the daemons and the job were started using the same user. L.G. Received from Stefan Groschupf on Mon, Jun 23, 2008 at 04:34:54PM EDT: Which user runs the hadoop? It should be the same you trigger the job with. On Jun 23, 2008, at 10:29 PM, Lev Givon wrote: I recently installed hadoop 0.17.0 in pseudo-distributed mode on a MacOSX 10.5.3 system with software managed by Fink installed in /sw. I configured hadoop to use the stock Java 1.5.0_13 installation in /Library/Java/Home. When I attempted to run a simple map/reduce job off of the dfs after starting up the daemons, the job failed with the following Java error (501 is the ID of the user used to start the hadoop daemons and run the map/reduce job): javax.security.auth.login.LoginException: Login failed: /sw/bin/whoami: cannot find name for user ID 501 What might be causing this to occur? Manually running /sw/bin/whoami as the user in question returns the corresponding username. L.G. ~~~ 101tec Inc. 
Menlo Park, California, USA http://www.101tec.com
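For the archive, a quick sanity check along the lines discussed above: confirm which whoami binary is first on the PATH, and that its answer agrees with what the passwd database says.

```shell
# Show which whoami Hadoop would pick up, and cross-check it against id
command -v whoami      # e.g. /usr/bin/whoami (or /sw/bin/whoami under Fink)
WHO=$(whoami)
IDU=$(id -un)          # resolves the username from the passwd database
[ "$WHO" = "$IDU" ] && echo "whoami OK: $WHO"
```

If the two disagree, or whoami errors out for your UID, Hadoop's UnixUserGroupInformation login will fail the same way.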
Re: [some bugs] Re: file permission problem
Great - it is even already fixed in 0.16.1! Thanks for the hint! Stefan On Mar 14, 2008, at 2:49 PM, Andy Li wrote: I think this is the same problem related to this mail thread. http://www.mail-archive.com/[EMAIL PROTECTED]/msg02759.html A JIRA has been filed, please see HADOOP-2915. On Fri, Mar 14, 2008 at 2:08 AM, Stefan Groschupf [EMAIL PROTECTED] wrote: Hi, any magic we can do with hadoop.dfs.umask? Or is there any other off switch for the file security? Thanks. Stefan On Mar 13, 2008, at 11:26 PM, Stefan Groschupf wrote: Hi Nicholas, Hi All, I definitely can reproduce the problem Johannes describes. Also, from debugging through the code, it is clearly a bug from my point of view. This is the call stack: SequenceFile.createWriter > FileSystem.create > DFSClient.create > namenode.create In NameNode I found this: namesystem.startFile(src, new PermissionStatus(Server.getUserInfo().getUserName(), null, masked), clientName, clientMachine, overwrite, replication, blockSize); In getUserInfo there is this comment: // This is to support local calls (as opposed to rpc ones) to the name-node. // Currently it is name-node specific and should be placed somewhere else. try { return UnixUserGroupInformation.login(); The login javadoc says: /** * Get current user's name and the names of all its groups from Unix. * It's assumed that there is only one UGI per user. If this user already * has a UGI in the ugi map, return the ugi in the map. * Otherwise get the current user's information from Unix, store it * in the map, and return it. */ Besides that, I had some interesting observations. If I have permissions to write to a folder A, I can delete folder A and a file B that is inside folder A, even if I have no permissions for B. 
Also I noticed the following in my dfs: [EMAIL PROTECTED] hadoop]$ bin/hadoop fs -ls /user/joa23/myApp-1205474968598 Found 1 items /user/joa23/myApp-1205474968598/VOICE_CALL dir 2008-03-13 16:00 rwxr-xr-x hadoop supergroup [EMAIL PROTECTED] hadoop]$ bin/hadoop fs -ls /user/joa23/myApp-1205474968598/VOICE_CALL Found 1 items /user/joa23/myApp-1205474968598/VOICE_CALL/part-0 r 3 27311 2008-03-13 16:00 rw-r--r-- joa23 supergroup Did I miss something, or was I able to write as user joa23 into a folder owned by hadoop where I should have no permissions? :-O Should I open some jira issues? Stefan On Mar 13, 2008, at 10:55 AM, [EMAIL PROTECTED] wrote: Hi Johannes, i'm using the 0.16.0 distribution. I assume you mean the 0.16.0 release (http://hadoop.apache.org/core/releases.html) without any additional patch. I just tried it but cannot reproduce the problem you described. I did the following: 1) start a cluster as user tsz 2) run a job as user nicholas The output directory and files are owned by nicholas. Am I doing the same thing you did? Could you try again? Nicholas - Original Message From: Johannes Zillmann [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Wednesday, March 12, 2008 5:47:27 PM Subject: file permission problem Hi, I have a question regarding the file permissions. I have a kind of workflow where I submit a job from my laptop to a remote Hadoop cluster. After the job finishes I do some file operations on the generated output. The cluster user is different from the laptop user. As output I specify a directory inside the user's home. This output directory, created through the map-reduce job, has cluster-user permissions, so it does not allow me to move or delete the output folder with my laptop user. 
So it looks as follows: /user/jz/ rwxrwxrwx jz supergroup /user/jz/output rwxr-xr-x hadoop supergroup I tried different things to achieve what I want (moving/deleting the output folder): - jobConf.setUser("hadoop") on the client side - System.setProperty("user.name", "hadoop") before jobConf instantiation on the client side - adding a user.name node in the hadoop-site.xml on the client side - setPermission(777) on the home folder on the client side (does not work recursively) - setPermission(777) on the output folder on the client side (permission denied) - creating the output folder before running the job ("Output directory already exists" exception) None of the things I tried worked. Is there a way to achieve what I want? Any ideas appreciated! cheers Johannes -- ~~~ 101tec GmbH Halle (Saale), Saxony-Anhalt, Germany http://www.101tec.com ~~~ 101tec Inc. Menlo Park, California, USA http://www.101tec.com
[memory leak?] Re: MapReduce failure
Hi there, we are seeing the same situation, and browsing the posts there are quite a lot of people running into this OOM problem. We run our own Mapper, and our mapred.child.java.opts is -Xmx3048m; I think that should be more than enough. I also changed io.sort.mb to 10, which likewise had no impact. Any ideas what might cause the OutOfMemoryError? Thanks. Stefan On Mar 9, 2008, at 10:28 PM, Amar Kamat wrote: What is the heap size you are using for your tasks? Check 'mapred.child.java.opts' in your hadoop-default.xml. Try increasing it. This will happen if you try running the random-writer + sort examples with default parameters. The maps are not able to spill the data to the disk. Btw, what version of Hadoop are you using? Amar On Mon, 10 Mar 2008, Ved Prakash wrote: Hi friends, I have made a cluster of 3 machines, one of them the master and the other 2 slaves. I executed a mapreduce job on the master, but after the map phase the execution terminates and the reduce doesn't happen. I have checked dfs and no output folder gets created. 
This is the error I see: 08/03/10 10:35:21 INFO mapred.JobClient: Task Id : task_200803101001_0001_m_64_0, Status : FAILED java.lang.OutOfMemoryError: Java heap space at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:95) at java.io.DataOutputStream.write(DataOutputStream.java:90) at org.apache.hadoop.io.Text.write(Text.java:243) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:347) at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:72) at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:59) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1787) 08/03/10 10:35:22 INFO mapred.JobClient: map 55% reduce 17% 08/03/10 10:35:31 INFO mapred.JobClient: map 56% reduce 17% 08/03/10 10:35:51 INFO mapred.JobClient: map 57% reduce 17% 08/03/10 10:36:04 INFO mapred.JobClient: map 58% reduce 17% 08/03/10 10:36:07 INFO mapred.JobClient: map 57% reduce 17% 08/03/10 10:36:07 INFO mapred.JobClient: Task Id : task_200803101001_0001_m_71_0, Status : FAILED java.lang.OutOfMemoryError: Java heap space at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:95) at java.io.DataOutputStream.write(DataOutputStream.java:90) at org.apache.hadoop.io.Text.write(Text.java:243) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:347) at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:72) at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:59) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192) at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1787) Though it tries to overcome this problem, the mapreduce application doesn't create output. Can anyone tell me why this is happening? Thanks ~~~ 101tec Inc. 
Menlo Park, California, USA http://www.101tec.com
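For reference, the two knobs discussed in this thread live in hadoop-site.xml (overriding hadoop-default.xml); the values below are just the ones mentioned in the thread, shown to illustrate the property names.

```xml
<property>
  <name>mapred.child.java.opts</name>
  <!-- heap size for each map/reduce task JVM -->
  <value>-Xmx3048m</value>
</property>
<property>
  <name>io.sort.mb</name>
  <!-- size of the in-memory buffer used when sorting map output, in MB -->
  <value>10</value>
</property>
```

Note that mapred.child.java.opts applies per task JVM, so the node needs enough physical memory for heap times the number of concurrent tasks.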
[some bugs] Re: file permission problem
Hi Nicholas, Hi All, I definitely can reproduce the problem Johannes describes. Also, from debugging through the code, it is clearly a bug from my point of view. This is the call stack: SequenceFile.createWriter > FileSystem.create > DFSClient.create > namenode.create In NameNode I found this: namesystem.startFile(src, new PermissionStatus(Server.getUserInfo().getUserName(), null, masked), clientName, clientMachine, overwrite, replication, blockSize); In getUserInfo there is this comment: // This is to support local calls (as opposed to rpc ones) to the name-node. // Currently it is name-node specific and should be placed somewhere else. try { return UnixUserGroupInformation.login(); The login javadoc says: /** * Get current user's name and the names of all its groups from Unix. * It's assumed that there is only one UGI per user. If this user already * has a UGI in the ugi map, return the ugi in the map. * Otherwise get the current user's information from Unix, store it * in the map, and return it. */ Besides that, I had some interesting observations. If I have permissions to write to a folder A, I can delete folder A and a file B that is inside folder A, even if I have no permissions for B. Also I noticed the following in my dfs: [EMAIL PROTECTED] hadoop]$ bin/hadoop fs -ls /user/joa23/myApp-1205474968598 Found 1 items /user/joa23/myApp-1205474968598/VOICE_CALL dir 2008-03-13 16:00 rwxr-xr-x hadoop supergroup [EMAIL PROTECTED] hadoop]$ bin/hadoop fs -ls /user/joa23/myApp-1205474968598/VOICE_CALL Found 1 items /user/joa23/myApp-1205474968598/VOICE_CALL/part-0 r 3 27311 2008-03-13 16:00 rw-r--r-- joa23 supergroup Did I miss something, or was I able to write as user joa23 into a folder owned by hadoop where I should have no permissions? :-O Should I open some jira issues? Stefan On Mar 13, 2008, at 10:55 AM, [EMAIL PROTECTED] wrote: Hi Johannes, i'm using the 0.16.0 distribution. 
I assume you mean the 0.16.0 release (http://hadoop.apache.org/core/releases.html) without any additional patch. I just tried it but cannot reproduce the problem you described. I did the following: 1) start a cluster as user tsz 2) run a job as user nicholas The output directory and files are owned by nicholas. Am I doing the same thing you did? Could you try again? Nicholas - Original Message From: Johannes Zillmann [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Wednesday, March 12, 2008 5:47:27 PM Subject: file permission problem Hi, I have a question regarding the file permissions. I have a kind of workflow where I submit a job from my laptop to a remote Hadoop cluster. After the job finishes I do some file operations on the generated output. The cluster user is different from the laptop user. As output I specify a directory inside the user's home. This output directory, created through the map-reduce job, has cluster-user permissions, so it does not allow me to move or delete the output folder with my laptop user. So it looks as follows: /user/jz/ rwxrwxrwx jz supergroup /user/jz/output rwxr-xr-x hadoop supergroup I tried different things to achieve what I want (moving/deleting the output folder): - jobConf.setUser("hadoop") on the client side - System.setProperty("user.name", "hadoop") before jobConf instantiation on the client side - adding a user.name node in the hadoop-site.xml on the client side - setPermission(777) on the home folder on the client side (does not work recursively) - setPermission(777) on the output folder on the client side (permission denied) - creating the output folder before running the job ("Output directory already exists" exception) None of the things I tried worked. Is there a way to achieve what I want? Any ideas appreciated! cheers Johannes -- ~~~ 101tec GmbH Halle (Saale), Saxony-Anhalt, Germany http://www.101tec.com ~~~ 101tec Inc. Menlo Park, California, USA http://www.101tec.com
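For the archive: the delete-a-file-inside-a-writable-folder observation in this thread actually matches POSIX semantics, which the HDFS permissions model was patterned on — deleting a file requires write permission on the containing directory, not on the file itself. A small local (non-HDFS) demonstration:

```python
import os
import stat
import tempfile

# Create a directory we can write to, containing a file we cannot write to.
d = tempfile.mkdtemp()
path = os.path.join(d, "b.txt")
with open(path, "w") as f:
    f.write("data")
os.chmod(path, stat.S_IRUSR)   # file is now read-only (mode 400)

# Deletion still succeeds: the permission check is on the directory, not the file.
os.remove(path)
print(os.path.exists(path))    # -> False
os.rmdir(d)
```

So that part of the behavior is arguably by design; the part about the file ending up owned by the wrong user is the bug tracked as HADOOP-2915.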
Re: [some bugs] Re: file permission problem
Hi,

any magic we can do with hadoop.dfs.umask? Or is there any other off switch for the file security? Thanks.

Stefan

On Mar 13, 2008, at 11:26 PM, Stefan Groschupf wrote: [full quote of the preceding "file permission problem" messages trimmed]

~~~ 101tec Inc. Menlo Park, California, USA http://www.101tec.com
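On the off-switch question: in the 0.16 line the permission system is configured in hadoop-site.xml. A sketch, assuming the 0.16-era property names dfs.permissions (the enable/disable switch) and dfs.umask (check your version's hadoop-default.xml for the exact names and value format):

```xml
<!-- hadoop-site.xml sketch: property names assumed from the 0.16 era -->
<configuration>
  <!-- Off switch for HDFS permission checking entirely -->
  <property>
    <name>dfs.permissions</name>
    <value>false</value>
  </property>
  <!-- Umask applied to newly created files and directories -->
  <property>
    <name>dfs.umask</name>
    <value>0</value>
  </property>
</configuration>
```

Disabling dfs.permissions side-steps the laptop-user vs. cluster-user problem, at the cost of turning off file security for the whole cluster.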
Re: Hadoop summit / workshop at Yahoo!
Puhh, 2 days and it is full? Does Yahoo have no bigger rooms than just for 100 people?

On Feb 20, 2008, at 12:10 PM, Ajay Anand wrote:

The registration page for the Hadoop summit is now up: http://developer.yahoo.com/hadoop/summit/
Space is limited, so please sign up early if you are interested in attending.

About the summit: Yahoo! is hosting the first summit on Apache Hadoop on March 25th in Sunnyvale. The summit is sponsored by the Computing Community Consortium (CCC) and brings together leaders from the Hadoop developer and user communities. The speakers will cover topics in the areas of extensions being developed for Hadoop, case studies of applications being built and deployed on Hadoop, and a discussion on future directions for the platform.

Agenda:
8:30-8:55 Breakfast
8:55-9:00 Welcome to Yahoo! Logistics - Ajay Anand, Yahoo!
9:00-9:30 Hadoop Overview - Doug Cutting / Eric Baldeschwieler, Yahoo!
9:30-10:00 Pig - Chris Olston, Yahoo!
10:00-10:30 JAQL - Kevin Beyer, IBM
10:30-10:45 Break
10:45-11:15 DryadLINQ - Michael Isard, Microsoft
11:15-11:45 Monitoring Hadoop using X-Trace - Andy Konwinski and Matei Zaharia, UC Berkeley
11:45-12:15 Zookeeper - Ben Reed, Yahoo!
12:15-1:15 Lunch
1:15-1:45 HBase - Michael Stack, Powerset
1:45-2:15 HBase App - Bryan Duxbury, Rapleaf
2:15-2:45 Hive - Joydeep Sen Sarma, Facebook
2:45-3:00 Break
3:00-3:20 Building Ground Models of Southern California - Steve Schossler, David O'Hallaron, Intel / CMU
3:20-3:40 Online search for engineering design content - Mike Haley, Autodesk
3:40-4:00 Yahoo - Webmap - Arnab Bhattacharjee, Yahoo!
4:00-4:30 Natural Language Processing - Jimmy Lin, U of Maryland / Christophe Bisciglia, Google
4:30-4:45 Break
4:45-5:30 Panel on future directions
5:30-7:00 Happy hour

Look forward to seeing you there!
Ajay

-Original Message-
From: Bradford Stephens [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, February 20, 2008 9:17 AM
To: core-user@hadoop.apache.org
Subject: Re: Hadoop summit / workshop at Yahoo!
Hrm yes, I'd like to make a visit as well :)

On Feb 20, 2008 8:05 AM, C G [EMAIL PROTECTED] wrote:

Hey All: Is this going forward? I'd like to make plans to attend, and the sooner I can get plane tickets the happier the bean counters will be :-).

Thx,
C G

Ajay Anand wrote: Yahoo plans to host a summit / workshop on Apache Hadoop at our Sunnyvale campus on March 25th. Given the interest we are seeing from developers in a broad range of organizations, this seems like a good time to get together and brief each other on the progress that is being made. We would like to cover topics in the areas of extensions being developed for Hadoop, innovative applications being built and deployed on Hadoop, and future extensions to the platform. Some of the speakers who have already committed to present are from organizations such as IBM, Intel, Carnegie Mellon University, UC Berkeley, Facebook and Yahoo!, and we are actively recruiting other leaders in the space. If you have an innovative application you would like to talk about, please let us know. Although there are limitations on the amount of time we have, we would love to hear from you. You can contact me at [EMAIL PROTECTED]

Thanks and looking forward to hearing about your cool apps,
Ajay

~~~ 101tec Inc. Menlo Park, California, USA http://www.101tec.com
broadcasting: pig user meeting, Friday, February 8, 2008
Hi there,

sorry for cross posting. If everything works out we will video broadcast the event here: http://ustream.tv/channel/apache-pig-user-meeting
But no guarantee - sorry. We are also trying to set up a telephone dial-in number - please write me a private email if you are interested and I will send out the number. See you tomorrow.

Stefan

On Feb 6, 2008, at 3:54 PM, Andrzej Bialecki wrote: Otis Gospodnetic wrote: Sorry about the word-wrapping (original email) - Yahoo Mail problem :( Is anyone going to be capturing the Piglet meeting on video for those of us living in other corners of the planet? Please do! It's too far from Poland to just casually drop by... ;)

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web, Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

~~~ 101tec Inc. Menlo Park, California, USA http://www.101tec.com
Re: pig user meeting, Friday, February 8, 2008
Hi Otis,

can you suggest a technology for how we could do that? Skype? iChat? Something that is free? I'm happy to set up a video conference, however there are no big presentations planned. I was thinking I could give an overview of how we use Pig in our current project, just to reflect our use cases. But beside that I guess it is just pizza and beer.

Cheers,
Stefan

On Feb 6, 2008, at 11:40 AM, Otis Gospodnetic wrote: Sorry about the word-wrapping (original email) - Yahoo Mail problem :( Is anyone going to be capturing the Piglet meeting on video for those of us living in other corners of the planet?

Thank you,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

- Original Message
From: Stefan Groschupf [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Thursday, January 31, 2008 7:09:53 PM
Subject: pig user meeting, Friday, February 8, 2008

Hi there,

a couple of people plan to meet and talk about Apache Pig next Friday in the Mountain View area. (The event location is not yet certain.) If you are interested please RSVP asap, so we can plan what size of location we are looking for. http://upcoming.yahoo.com/event/420958/

Cheers,
Stefan

~~~ 101tec Inc. Menlo Park, California, USA http://www.101tec.com