Re: Indexing on top of Hadoop
Hi, you might find some of the code at katta.sourceforge.net very helpful. Stefan ~~~ Hadoop training and consulting http://www.scaleunlimited.com http://www.101tec.com On Jun 10, 2009, at 5:49 AM, kartik saxena wrote: Hi, I have a huge LDIF file on the order of GBs spanning some million user records. I am running the example "Grep" job on that file. The search results have not really been up to expectations because it is a basic per-line, brute-force scan. I was thinking of building some indexes inside HDFS for that file, so that the search results could improve. What could I possibly try to achieve this? Secura
Re: Distributed Lucene Questions
Hi, you might want to check out: http://katta.sourceforge.net/ Stefan ~~~ Hadoop training and consulting http://www.scaleunlimited.com http://www.101tec.com On Jun 1, 2009, at 9:54 AM, Tarandeep Singh wrote: Hi All, I am trying to build a distributed system to build and serve lucene indexes. I came across the Distributed Lucene project - http://wiki.apache.org/hadoop/DistributedLucene https://issues.apache.org/jira/browse/HADOOP-3394 - and have a couple of questions. It would be really helpful if someone could provide some insights. 1) Is this code production ready? 2) Does someone have performance data for this project? 3) It allows searches and updates/deletes to be performed at the same time. How well will the system perform if there are frequent updates? Will it handle the search and update load easily, or would it be better to rebuild or update the indexes on different machines and then deploy the indexes back to the machines that are serving them? Basically I am trying to choose between 2 approaches: 1) Use Hadoop to build and/or update Lucene indexes and then deploy them on a separate cluster that takes care of load balancing, fault tolerance etc. There is a package in Hadoop contrib that does this, so I can use that code. 2) Use and/or modify the Distributed Lucene code. I am expecting daily updates to our index, so I am not sure if the Distributed Lucene code (which allows searches and updates on the same indexes) will be able to handle the search and update load efficiently. Any suggestions? Thanks, Tarandeep
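For the archive, here is a minimal sketch of approach 1 under the old (0.19-era) mapred API: each reducer builds one Lucene shard on local disk and copies it into HDFS when it closes. This is not the contrib/index code; the /tmp path, field names, and /indexes target directory are illustrative assumptions.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;

public class ShardReducer extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
  private IndexWriter writer;
  private JobConf job;
  private String local;

  public void configure(JobConf job) {
    this.job = job;
    // one shard per reduce task, named after the task id
    local = "/tmp/shard-" + job.get("mapred.task.id");
    try {
      writer = new IndexWriter(local, new StandardAnalyzer(), true);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    while (values.hasNext()) {
      Document doc = new Document();
      doc.add(new Field("id", key.toString(), Field.Store.YES, Field.Index.UN_TOKENIZED));
      doc.add(new Field("body", values.next().toString(), Field.Store.NO, Field.Index.TOKENIZED));
      writer.addDocument(doc);
    }
  }

  public void close() throws IOException {
    writer.optimize();
    writer.close();
    // ship the finished shard into HDFS so the serving cluster can fetch it
    FileSystem.get(job).copyFromLocalFile(new Path(local),
        new Path("/indexes/shard-" + job.get("mapred.task.id")));
  }
}

The serving side then only has to pull read-only shards, which is what keeps the search load separated from the index build.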
ScaleCamp: get together the night before Hadoop Summit
Hi All, We are planning a community event the night before the Hadoop Summit. This "BarCamp" (http://en.wikipedia.org/wiki/BarCamp) event will be held at the same venue as the Summit (Santa Clara Marriott). Refreshments will be served to encourage socializing. To kick off conversations for the social part of the evening, we are offering people the opportunity to present an experience report on their project (within a 15 min presentation). We have 12 slots in 3 parallel tracks max. The focus should be on projects leveraging technologies from the Hadoop ecosystem. Please join us and mingle with the rest of the Hadoop community. To find out more about this event and to sign up, please visit: http://www.scaleunlimited.com/events/scale_camp Please submit your presentation here: http://www.scaleunlimited.com/about-us/contact Stefan P.S. Please spread the word! P.P.S. Apologies for the cross posting.
[ANNOUNCE] Katta 0.5 released
(...apologies for the cross posting...) Release 0.5 of Katta is now available. Katta - Lucene in the cloud. http://katta.sourceforge.net This release fixes bugs from 0.4, including one that sorted results incorrectly under load. 0.5 also upgrades ZooKeeper to version 3.1, Lucene to version 2.4.1, and Hadoop to 0.19.0. The new API supports Lucene Query objects instead of just strings, adds support for Amazon EC2, switches to Ant and Ivy as the build system, and brings some more minor improvements. We also improved our online documentation and added sample code that illustrates how to create a sharded Lucene index with Hadoop. See the changes at http://oss.101tec.com/jira/browse/KATTA?report=com.atlassian.jira.plugin.system.project:changelog-panel The binary distribution is available at https://sourceforge.net/projects/katta/ Stefan ~~~ Hadoop training and consulting http://www.scaleunlimited.com http://www.101tec.com
ec2 ganglia fixing missing graphs
Hi, for the mail archive... I'm using the hadoop ec2 scripts and noticed that ganglia does not show any graphs. I was able to fix this by adding dejavu-fonts to the packages that are installed via yum in create-hadoop-image-remote.sh. The line now looks like this:

yum -y install rsync lynx screen ganglia-gmetad ganglia-gmond ganglia-web dejavu-fonts httpd php

Since this affects the hadoop image, it might be interesting to fix this and create a new public AMI. Stefan ~~~ Hadoop training and consulting http://www.scaleunlimited.com http://www.101tec.com
contrib/ec2 USER_DATA not used
Hi, can someone tell me what the variable USER_DATA in launch-hadoop-master is all about? I can't see that it is reused in that script or any other script. Isn't USER_DATA_FILE the way those parameters are passed to the nodes? The line is:

USER_DATA="MASTER_HOST=master,MAX_MAP_TASKS=$MAX_MAP_TASKS,MAX_REDUCE_TASKS=$MAX_REDUCE_TASKS,COMPRESS=$COMPRESS"

Any hints? Thanks, Stefan ~~~ Hadoop training and consulting http://www.scaleunlimited.com http://www.101tec.com
Re: [video] visualization of the hadoop code history
Very cool stuff, but I don't see a reference anywhere to the author of the visualization, which seems like poor form for a marketing video. I apologize if I missed a reference somewhere. Jeff, you missed it! It is the first text screen at the end of the video. It is actually a cool open source project with quite a few contributors. Stefan ~~~ Hadoop training and consulting http://www.scaleunlimited.com http://www.101tec.com
Re: [video] visualization of the hadoop code history
Owen O'Malley wrote: It is interesting, but it would be more interesting to track the authors of the patches rather than the committers. The two are rarely the same. Indeed. There was a period of over a year where I wrote hardly anything but committed almost everything. So I am vastly overrepresented in the commits. Thanks for the feedback. The video was rendered from the svn log file (text version). If someone has a script that cleans this file up and replaces the committer names with the real patch authors, we are happy to render the video again. Cheers, Stefan ~~~ Hadoop training and consulting http://www.scaleunlimited.com http://www.101tec.com
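For anyone who wants to take a stab at that script, a hypothetical sketch of the rewrite step: it reads a plain-text svn log dump and swaps committer names via a hand-maintained committer-to-author map (which you would have to build yourself, e.g. from CHANGES.txt or JIRA). The file arguments and map contents are assumptions, not part of the actual rendering pipeline.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.HashMap;
import java.util.Map;

public class RewriteSvnLog {
  public static void main(String[] args) throws IOException {
    Map<String, String> authorByCommitter = new HashMap<String, String>();
    authorByCommitter.put("cutting", "Doug Cutting"); // example entry only
    BufferedReader in = new BufferedReader(new FileReader(args[0]));
    PrintWriter out = new PrintWriter(new FileWriter(args[1]));
    String line;
    while ((line = in.readLine()) != null) {
      // svn log revision headers look like: "r123 | cutting | 2008-01-01 ..."
      for (Map.Entry<String, String> e : authorByCommitter.entrySet()) {
        line = line.replace(" | " + e.getKey() + " | ", " | " + e.getValue() + " | ");
      }
      out.println(line);
    }
    in.close();
    out.close();
  }
}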
[video] visualization of the hadoop code history
Hi friends of Hadoop, we at ScaleUnlimited.com put together a video that visualizes the code commit history of the Hadoop core project. It is a neat way of showing who is behind the Hadoop source code and how the project code base grew over the years. Check it out here: http://www.scaleunlimited.com/hadoop-resources.html Best, Stefan ~~~ Hadoop training and consulting http://www.scaleunlimited.com
mbox archive files for hadoop mailing lists.
Hi, where can I find the mbox mailing list archive files for the hadoop user mailing lists? Thanks, Stefan
New Orleans - drinks tonight
Hey, anyone else early for the hadoop bootcamp as well? How about meeting for drinks tonight? Send me a mail off-list... Stefan
Re: Hadoop Profiling!
Just run your map reduce job locally and connect your profiler. I use YourKit. Works great! You can profile your map reduce job by running it in local mode like any other Java app. However, we have also profiled on a grid. You just need to install the YourKit agent into the JVM of the node you want to profile, and then connect to the node while the job runs. You need to time things well, though, since the task JVM is shut down as soon as your job is done. Stefan ~~~ 101tec Inc., Menlo Park, California web: http://www.101tec.com blog: http://www.find23.net On Oct 8, 2008, at 11:27 AM, Gerardo Velez wrote: Hi! I've developed a Map/Reduce algorithm to analyze some logs from a web application. We are ready to start the QA test phase, so now I would like to know how efficient my application is from a performance point of view. Is there any procedure I could use to do some profiling? Basically I need basic data, like execution time or code bottlenecks. Thanks in advance. -- Gerardo Velez
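A hedged sketch of the grid variant: the profiler agent gets loaded into every task JVM via mapred.child.java.opts (old API). The agent path and port are placeholders for wherever YourKit is installed on your nodes, and the rest of the job setup is elided.

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ProfiledJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ProfiledJob.class);
    conf.setJobName("profiled-job");
    // every task JVM loads the agent; attach your profiler while the job runs
    conf.set("mapred.child.java.opts",
        "-Xmx512m -agentpath:/opt/yourkit/libyjpagent.so=port=10001");
    // ... set mapper, reducer, input and output paths as usual ...
    JobClient.runJob(conf);
  }
}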
Re: nagios to monitor hadoop datanodes!
Try JMX. There should also be a JMX-to-SNMP bridge available somewhere. http://blogs.sun.com/jmxetc/entry/jmx_vs_snmp ~~~ 101tec Inc., Menlo Park, California web: http://www.101tec.com blog: http://www.find23.net On Oct 6, 2008, at 10:05 AM, Gerardo Velez wrote: Hi Everyone! I would like to implement Nagios health monitoring of a Hadoop grid. For those of you with some experience here, do you have any approach or advice I could use? So far I've only been playing with the jsp files that hadoop has integrated, so I'm not sure it is a good idea for nagios to request monitoring info from those jsps. Thanks in advance! -- Gerardo
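For the archive, a minimal sketch of a JMX poll that a Nagios check could wrap, assuming you enabled remote JMX on the daemon (e.g. by adding -Dcom.sun.management.jmxremote.port=8004 with authentication disabled to the daemon options in hadoop-env.sh). The host, port, and which MBeans your Hadoop version exposes are assumptions.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxCheck {
  public static void main(String[] args) throws Exception {
    JMXServiceURL url = new JMXServiceURL(
        "service:jmx:rmi:///jndi/rmi://datanode1:8004/jmxrmi");
    JMXConnector connector = JMXConnectorFactory.connect(url);
    MBeanServerConnection mbs = connector.getMBeanServerConnection();
    // dump all registered MBeans first, then pick the attributes to alert on
    for (ObjectName name : mbs.queryNames(null, null)) {
      System.out.println(name);
    }
    connector.close();
  }
}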
Re: Searching Lucene Index built using Hadoop
Hi, you might find http://katta.wiki.sourceforge.net/ interesting. If you have any katta related questions please use the katta mailing list. Stefan ~~~ 101tec Inc., Menlo Park, California web: http://www.101tec.com blog: http://www.find23.net On Oct 6, 2008, at 10:26 AM, Saranath wrote: I'm trying to index a large dataset using Hadoop+Lucene. I used the example under hadoop/trunk/src/contrib/index/ for indexing. I'm unable to find a way to search the index that was successfully built. I tried copying the index over to one machine and merging the shards using IndexWriter.addIndexesNoOptimize(). I would like to hear your input on the best way to index+search large datasets. Thanks, Saranath
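The merge-then-search step Saranath describes looks roughly like this with the Lucene 2.x API; the directory paths are placeholders and the shards are assumed to have already been copied out of HDFS.

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MergeAndSearch {
  public static void main(String[] args) throws Exception {
    // merge the per-reducer shards into one local index
    IndexWriter writer = new IndexWriter("/tmp/merged", new StandardAnalyzer(), true);
    Directory[] shards = new Directory[] {
        FSDirectory.getDirectory("/tmp/shard-0"),
        FSDirectory.getDirectory("/tmp/shard-1") };
    writer.addIndexesNoOptimize(shards);
    writer.optimize();
    writer.close();

    // then search it like any other Lucene index
    IndexSearcher searcher = new IndexSearcher("/tmp/merged");
    Query query = new QueryParser("body", new StandardAnalyzer()).parse("hadoop");
    Hits hits = searcher.search(query);
    System.out.println(hits.length() + " hits");
    searcher.close();
  }
}

For an index that no longer fits on one box, this single-machine merge is exactly what katta replaces by serving the shards where they are.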
Re: Katta presentation slides
Hi All, thanks a lot for your interest. Both my katta slides and the hadoop survey slides can be found here: http://find23.net/2008/09/23/hadoop-user-group-slides/ If you have a chance, please give katta a test drive and give us some feedback. Thanks, Stefan On Sep 23, 2008, at 6:20 PM, Rafael Turk wrote: +1 On Tue, Sep 23, 2008 at 5:14 AM, Naama Kraus <[EMAIL PROTECTED]> wrote: I'd be interested too. Naama On Mon, Sep 22, 2008 at 11:32 PM, Deepika Khera <[EMAIL PROTECTED]> wrote: Hi Stefan, Are the slides from the Katta presentation up somewhere? If not, could you please post them? Thanks, Deepika -- "If you want your children to be intelligent, read them fairy tales. If you want them to be more intelligent, read them more fairy tales." (Albert Einstein)
[ANN] katta-0.1.0 release - distribute lucene indexes in a grid
After 5 months of work we are happy to announce the first developer preview release of katta. This release contains all the functionality to serve a large, sharded lucene index on many servers. Katta stands on the shoulders of the giants lucene, hadoop and zookeeper. Main features:
+ Plays well with Hadoop
+ Apache Version 2 License
+ Node failure tolerance
+ Master failover
+ Shard replication
+ Pluggable network topologies (Shard Distribution and Selection Policies)
+ Node load balancing at the client
Please give katta a test drive and give us some feedback! Download: http://sourceforge.net/project/platformdownload.php?group_id=225750 Website: http://katta.sourceforge.net/ Getting started in less than 3 min: http://katta.wiki.sourceforge.net/Getting+started Installation on a grid: http://katta.wiki.sourceforge.net/Installation Katta presentation today (09/17/08) at the hadoop user group, yahoo mission college: http://upcoming.yahoo.com/event/1075456/ * slides will be available online later Many thanks for the hard work: Johannes Zillmann, Marko Bauhardt, Martin Schaaf (101tec) I apologize for the cross posting. Yours, the Katta Team. ~~~ 101tec Inc., Menlo Park, California http://www.101tec.com
how to LZO
Hi, I would love to use the lzo codec. However, for some reason I only ever get:

INFO org.apache.hadoop.util.NativeCodeLoader: Loaded the native-hadoop library
INFO org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully loaded & initialized native-zlib library

My hadoop-site looks like:

<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.LzoCodec,org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec</value>
<description>A list of the compression codec classes that can be used for compression/decompression.</description>
</property>

I also think I have lzo installed on all my nodes:

yum list | grep lzo
lzo.x86_64 2.02-3.fc8 installed
lzo.i386 2.02-3.fc8 installed
lzo-devel.i386 2.02-3.fc8 fedora
lzo-devel.x86_64 2.02-3.fc8 fedora
lzop.x86_64 1.02-0.5.rc1.fc8 fedora

Anything I missed you could think of? Thanks for any hints! Stefan
Re: login error while running hadoop on MacOSX 10.5.*
Sorry, I'm not a unix expert, but the problem is clearly related to whoami, since that is what throws the error. I run hadoop in all kinds of configurations super smoothly on my os x boxes. Maybe rename or move /sw/bin/whoami for a test. Also make sure you restart the os x console, since changes in .bash_profile are only picked up when you "relogin" to the command line. Sorry, that is all I know and can guess.. :( On Jun 23, 2008, at 10:56 PM, Lev Givon wrote: Yes; I have my PATH configured to list /sw/bin before /usr/bin. Curiously, hadoop tries to invoke /sw/bin/whoami even when I set PATH to /usr/bin:/bin:/usr/sbin:/sbin:/usr/local/bin:/usr/X11/bin:/usr/X11R6/bin:/usr/local/bin before starting the daemons and attempting to run the job. L.G. Received from Stefan Groschupf on Mon, Jun 23, 2008 at 04:49:23PM EDT: The fink part and /sw confuses me. When I do a which on my os x box I get: $ which whoami /usr/bin/whoami Are you using the same whoami on your console as hadoop? On Jun 23, 2008, at 10:37 PM, Lev Givon wrote: Both the daemons and the job were started using the same user. L.G. Received from Stefan Groschupf on Mon, Jun 23, 2008 at 04:34:54PM EDT: Which user runs hadoop? It should be the same one you trigger the job with. On Jun 23, 2008, at 10:29 PM, Lev Givon wrote: I recently installed hadoop 0.17.0 in pseudo-distributed mode on a MacOSX 10.5.3 system with software managed by Fink installed in /sw. I configured hadoop to use the stock Java 1.5.0_13 installation in /Library/Java/Home. When I attempted to run a simple map/reduce job off of the dfs after starting up the daemons, the job failed with the following Java error (501 is the ID of the user used to start the hadoop daemons and run the map/reduce job): javax.security.auth.login.LoginException: Login failed: /sw/bin/whoami: cannot find name for user ID 501 What might be causing this to occur? Manually running /sw/bin/whoami as the user in question returns the corresponding username. L.G. ~~~ 101tec Inc. Menlo Park, California, USA http://www.101tec.com
Re: realtime hadoop
Hadoop might be the wrong technology for you. Map Reduce is a batch processing mechanism. HDFS might also be critical, since to access your data you need to close the file - that means you may end up with many small files, a situation where hdfs is not very strong (the namespace is held in memory). Hbase might be an interesting tool for you, also zookeeper if you want to do something home grown... On Jun 23, 2008, at 11:31 PM, Vadim Zaliva wrote: Hi! I am considering using Hadoop for (almost) realtime data processing. I have data coming in every second and I would like to use a hadoop cluster to process it as fast as possible. I need to be able to maintain some guaranteed max processing time, for example under 3 minutes. Does anybody have experience with using Hadoop in such a manner? I would appreciate it if you could share your experience or give me pointers to some articles or pages on the subject. Vadim ~~~ 101tec Inc. Menlo Park, California, USA http://www.101tec.com
Re: login error while running hadoop on MacOSX 10.5.*
The fink part and /sw confuses me. When I do a which on my os x box I get: $ which whoami /usr/bin/whoami Are you using the same whoami on your console as hadoop? On Jun 23, 2008, at 10:37 PM, Lev Givon wrote: Both the daemons and the job were started using the same user. L.G. Received from Stefan Groschupf on Mon, Jun 23, 2008 at 04:34:54PM EDT: Which user runs hadoop? It should be the same one you trigger the job with. On Jun 23, 2008, at 10:29 PM, Lev Givon wrote: I recently installed hadoop 0.17.0 in pseudo-distributed mode on a MacOSX 10.5.3 system with software managed by Fink installed in /sw. I configured hadoop to use the stock Java 1.5.0_13 installation in /Library/Java/Home. When I attempted to run a simple map/reduce job off of the dfs after starting up the daemons, the job failed with the following Java error (501 is the ID of the user used to start the hadoop daemons and run the map/reduce job): javax.security.auth.login.LoginException: Login failed: /sw/bin/whoami: cannot find name for user ID 501 What might be causing this to occur? Manually running /sw/bin/whoami as the user in question returns the corresponding username. L.G. ~~~ 101tec Inc. Menlo Park, California, USA http://www.101tec.com
Re: login error while running hadoop on MacOSX 10.5.*
Which user runs hadoop? It should be the same one you trigger the job with. On Jun 23, 2008, at 10:29 PM, Lev Givon wrote: I recently installed hadoop 0.17.0 in pseudo-distributed mode on a MacOSX 10.5.3 system with software managed by Fink installed in /sw. I configured hadoop to use the stock Java 1.5.0_13 installation in /Library/Java/Home. When I attempted to run a simple map/reduce job off of the dfs after starting up the daemons, the job failed with the following Java error (501 is the ID of the user used to start the hadoop daemons and run the map/reduce job): javax.security.auth.login.LoginException: Login failed: /sw/bin/whoami: cannot find name for user ID 501 What might be causing this to occur? Manually running /sw/bin/whoami as the user in question returns the corresponding username. L.G. ~~~ 101tec Inc. Menlo Park, California, USA http://www.101tec.com
Re: trouble setting up hadoop
Looks like you have not installed a correct Java. Make sure you have a Sun Java installed on your nodes, that java is in your path, and that JAVA_HOME is set. gnu.gcj is the GNU Java compiler, not a Java you can use to run hadoop. Check this on the command line:

$ java -version

You should see something like this:

java version "1.5.0_13"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_13-b05-237)
Java HotSpot(TM) Client VM (build 1.5.0_13-119, mixed mode, sharing)

HTH On Jun 23, 2008, at 9:40 PM, Sandy wrote: I apologize for the severe basicness of this error, but I am in the process of getting hadoop set up. I have been following the instructions in the Hadoop quickstart. I have confirmed that bin/hadoop will give me help usage information. I am now at the stage of standalone operation. I typed in:

mkdir input
cp conf/*.xml input
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'

at which point I get:

Exception in thread "main" java.lang.ClassNotFoundException: java.lang.Iterable not found in gnu.gcj.runtime.SystemClassLoader{urls=[file:/home/sjm/Desktop/hadoop-0.16.4/bin/../conf/, file:/home/sjm/Desktop/hadoop-0.16.4/bin/../, file:/home/sjm/Desktop/hadoop-0.16.4/bin/../hadoop-0.16.4-core.jar, file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/commons-cli-2.0-SNAPSHOT.jar, file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/commons-codec-1.3.jar, file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/commons-httpclient-3.0.1.jar, file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/commons-logging-1.0.4.jar, file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/commons-logging-api-1.0.4.jar, file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/jets3t-0.5.0.jar, file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/jetty-5.1.4.jar, file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/junit-3.8.1.jar, file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/kfs-0.1.jar, file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/log4j-1.2.13.jar, file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/servlet-api.jar, file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/xmlenc-0.52.jar, file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/jetty-ext/commons-el.jar, file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/jetty-ext/jasper-compiler.jar, file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/jetty-ext/jasper-runtime.jar, file:/home/sjm/Desktop/hadoop-0.16.4/bin/../lib/jetty-ext/jsp-api.jar], parent=gnu.gcj.runtime.ExtensionClassLoader{urls=[], parent=null}}
at java.net.URLClassLoader.findClass (libgcj.so.7)
at java.lang.ClassLoader.loadClass (libgcj.so.7)
at java.lang.ClassLoader.loadClass (libgcj.so.7)
at java.lang.VMClassLoader.defineClass (libgcj.so.7)
at java.lang.ClassLoader.defineClass (libgcj.so.7)
at java.security.SecureClassLoader.defineClass (libgcj.so.7)
at java.net.URLClassLoader.findClass (libgcj.so.7)
at java.lang.ClassLoader.loadClass (libgcj.so.7)
at java.lang.ClassLoader.loadClass (libgcj.so.7)
at org.apache.hadoop.util.RunJar.main (RunJar.java:107)

I suspect the issue is path related, though I am not certain. Could someone please point me in the right direction? Much thanks, SM ~~~ 101tec Inc. Menlo Park, California, USA http://www.101tec.com
Re: Working with XML / XQuery in hadoop
Yep, we do. We have an xml Writable that uses XUM behind the scenes. It has a getDom and a getNode(xquery) method. In readIn we read the byte array and create the XUM dom object from it. Write simply triggers BinaryCodec.serialize and we write the bytes out. The same would work if you de/serialize the xml as text; we found that slower than XUM, but it is pretty stable, whereas XUM has some other issues (you need to use the BinaryCodec as a JVM singleton, etc.). In general, though, this works pretty well. Stefan On Jun 23, 2008, at 9:38 PM, Kayla Jay wrote: Hi Just wondering if anyone out there works with, manipulates and stores XML data using Hadoop? I've seen some threads about XML RecordReaders and people who use the XML StreamXmlRecordReader to do splits. But has anyone implemented a query framework that uses the hadoop layer to query against XML in their map/reduce jobs? I want to know if anyone has executed an XQuery or XPath within a hadoop job to find something within the XML stored in hadoop? I can't find any samples or anyone else out there who uses XML data vs. traditional log text data. Are there any use cases of using hadoop to work with XML and then query against XML in a distributed manner using hadoop? Thanks. ~~~ 101tec Inc. Menlo Park, California, USA http://www.101tec.com
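For readers of the archive, a minimal sketch of such an XML-carrying Writable, assuming plain UTF-8 text serialization rather than XUM's binary codec; the class and accessor names are illustrative, not our actual code.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;

public class XmlWritable implements Writable {
  private Text xml = new Text();

  public void setXml(String document) { xml.set(document); }
  public String getXml() { return xml.toString(); }
  // a getDom()-style accessor would parse getXml() on demand with your
  // DOM/XQuery library of choice and could cache the parsed tree

  public void write(DataOutput out) throws IOException {
    xml.write(out);       // length-prefixed UTF-8 bytes of the document
  }

  public void readFields(DataInput in) throws IOException {
    xml.readFields(in);   // restore the raw document bytes
  }
}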
Meet Hadoop presentation: the math from page 5
Hi, I tried to better understand slide 5 of "meet hadoop": http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/oscon-part-1.pdf The slide says:

given:
- 10MB/s transfer
- 10ms/seek
- 100B/entry (10B entries)
- 10kB/page (1B pages)
updating 1% of entries (100M) takes:
- 1000 days with random B-Tree updates
- 100 days with batched B-Tree updates
- 1 day with sort & merge

I wonder how exactly to calculate the 1000 days and the 100 days.

time for seeking = 100,000,000 * lg(1,000,000,000) * 10 ms = 346.03 days
time to read all pages = 100,000,000 * lg(1,000,000,000) * (10kB / 10MB/s) = 33.79 days

Since we might need to write all the pages again we can add roughly another 34 days, but the result is nowhere near 1000 days, so I must be doing something fundamentally wrong. :o Thanks for any help... Stefan
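One back-of-the-envelope reading that lands near the slide's figure (my own guess, not the slide author's stated derivation): the numbers look like order-of-magnitude estimates, and a random B-Tree update plausibly has to both read and write every page on the root-to-leaf path. With 1B pages the tree is about lg(1,000,000,000) = 30 levels deep, each page access costs one seek plus one page transfer (10 ms + 10kB / 10MB/s = 11 ms), and paying that twice (read + write) per update gives

100,000,000 updates * 30 levels * 11 ms * 2 = 6.6e7 s = roughly 760 days,

i.e. "1000 days" to one significant figure. The batched case amortizes the internal pages across the sorted updates, so mostly the leaf pages get touched, which accounts for roughly the factor of 10 down to "100 days".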
Re: [memory leak?] Re: MapReduce failure
Oops, sorry, I forgot to mention I use 0.16.0. I will try to update to 0.16.1 tomorrow and see if this helps, but I couldn't find a closed issue in jira that might be related. On Mar 15, 2008, at 8:37 PM, Stefan Groschupf wrote: Hi there, we see the same situation, and browsing the posts there are quite a lot of people running into this OOM problem. We run our own Mapper and our mapred.child.java.opts is -Xmx3048m; I think that should be more than enough. I also changed io.sort.mb to 10, which had no impact either. Any ideas what might cause the OutOfMemoryError? Thanks. Stefan On Mar 9, 2008, at 10:28 PM, Amar Kamat wrote: What is the heap size you are using for your tasks? Check 'mapred.child.java.opts' in your hadoop-default.xml. Try increasing it. This will happen if you try running the random-writer + sort examples with default parameters. The maps are not able to spill the data to the disk. Btw, what version of HADOOP are you using? Amar On Mon, 10 Mar 2008, Ved Prakash wrote: Hi friends, I have made a cluster of 3 machines, one of them the master and the other 2 slaves. I executed a mapreduce job on the master, but after Map the execution terminates and Reduce doesn't happen. I have checked dfs and no output folder gets created. This is the error I see:

08/03/10 10:35:21 INFO mapred.JobClient: Task Id : task_200803101001_0001_m_64_0, Status : FAILED
java.lang.OutOfMemoryError: Java heap space
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:95)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at org.apache.hadoop.io.Text.write(Text.java:243)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:347)
at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:72)
at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:59)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1787)
08/03/10 10:35:22 INFO mapred.JobClient: map 55% reduce 17%
08/03/10 10:35:31 INFO mapred.JobClient: map 56% reduce 17%
08/03/10 10:35:51 INFO mapred.JobClient: map 57% reduce 17%
08/03/10 10:36:04 INFO mapred.JobClient: map 58% reduce 17%
08/03/10 10:36:07 INFO mapred.JobClient: map 57% reduce 17%
08/03/10 10:36:07 INFO mapred.JobClient: Task Id : task_200803101001_0001_m_71_0, Status : FAILED
java.lang.OutOfMemoryError: Java heap space
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:95)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at org.apache.hadoop.io.Text.write(Text.java:243)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:347)
at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:72)
at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:59)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1787)

Though it tries to overcome this problem, the mapreduce application doesn't create output. Can anyone tell me why this is happening? Thanks ~~~ 101tec Inc. Menlo Park, California, USA http://www.101tec.com
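Since these two knobs come up in every OOM thread, a hedged sketch of where they live in job code (old API); the values shown are examples, not recommendations:

import org.apache.hadoop.mapred.JobConf;

public class TuneJob {
  public static void main(String[] args) {
    JobConf conf = new JobConf(TuneJob.class);
    conf.set("mapred.child.java.opts", "-Xmx512m"); // heap of each task JVM
    conf.setInt("io.sort.mb", 100);                 // map-side sort buffer
    // ... the rest of the job setup ...
  }
}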
Re: [memory leak?] Re: MapReduce failure
I do not instantiate 3 GB of objects, that is for sure. The wordcount example does not run anymore, so I don't think this is related to my code; besides the wordcount example, many other users report the same problem. See: http://markmail.org/search/?q=org.apache.hadoop.mapred.MapTask%24MapOutputBuffer.collect+order%3Adate-backward Thanks for your help! Stefan On Mar 15, 2008, at 11:02 PM, Devaraj Das wrote: It might have something to do with your application itself. By any chance are you doing a lot of huge object allocation (directly or indirectly) within the map method? Which version of hadoop are you on? -Original Message- From: Stefan Groschupf [mailto:[EMAIL PROTECTED]] Sent: Sunday, March 16, 2008 9:07 AM To: core-user@hadoop.apache.org Subject: [memory leak?] Re: MapReduce failure Hi there, we see the same situation, and browsing the posts there are quite a lot of people running into this OOM problem. We run our own Mapper and our mapred.child.java.opts is -Xmx3048m; I think that should be more than enough. I also changed io.sort.mb to 10, which had no impact either. Any ideas what might cause the OutOfMemoryError? Thanks. Stefan On Mar 9, 2008, at 10:28 PM, Amar Kamat wrote: What is the heap size you are using for your tasks? Check 'mapred.child.java.opts' in your hadoop-default.xml. Try increasing it. This will happen if you try running the random-writer + sort examples with default parameters. The maps are not able to spill the data to the disk. Btw, what version of HADOOP are you using? Amar On Mon, 10 Mar 2008, Ved Prakash wrote: Hi friends, I have made a cluster of 3 machines, one of them the master and the other 2 slaves. I executed a mapreduce job on the master, but after Map the execution terminates and Reduce doesn't happen. I have checked dfs and no output folder gets created. This is the error I see:

08/03/10 10:35:21 INFO mapred.JobClient: Task Id : task_200803101001_0001_m_64_0, Status : FAILED
java.lang.OutOfMemoryError: Java heap space
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:95)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at org.apache.hadoop.io.Text.write(Text.java:243)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:347)
at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:72)
at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:59)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1787)
08/03/10 10:35:22 INFO mapred.JobClient: map 55% reduce 17%
08/03/10 10:35:31 INFO mapred.JobClient: map 56% reduce 17%
08/03/10 10:35:51 INFO mapred.JobClient: map 57% reduce 17%
08/03/10 10:36:04 INFO mapred.JobClient: map 58% reduce 17%
08/03/10 10:36:07 INFO mapred.JobClient: map 57% reduce 17%
08/03/10 10:36:07 INFO mapred.JobClient: Task Id : task_200803101001_0001_m_71_0, Status : FAILED
java.lang.OutOfMemoryError: Java heap space
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:95)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at org.apache.hadoop.io.Text.write(Text.java:243)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:347)
at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:72)
at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:59)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1787)

Though it tries to overcome this problem, the mapreduce application doesn't create output. Can anyone tell me why this is happening? Thanks ~~~ 101tec Inc. Menlo Park, California, USA http://www.101tec.com
[memory leak?] Re: MapReduce failure
Hi there, we see the same situation, and browsing the posts there are quite a lot of people running into this OOM problem. We run our own Mapper and our mapred.child.java.opts is -Xmx3048m; I think that should be more than enough. I also changed io.sort.mb to 10, which had no impact either. Any ideas what might cause the OutOfMemoryError? Thanks. Stefan On Mar 9, 2008, at 10:28 PM, Amar Kamat wrote: What is the heap size you are using for your tasks? Check 'mapred.child.java.opts' in your hadoop-default.xml. Try increasing it. This will happen if you try running the random-writer + sort examples with default parameters. The maps are not able to spill the data to the disk. Btw, what version of HADOOP are you using? Amar On Mon, 10 Mar 2008, Ved Prakash wrote: Hi friends, I have made a cluster of 3 machines, one of them the master and the other 2 slaves. I executed a mapreduce job on the master, but after Map the execution terminates and Reduce doesn't happen. I have checked dfs and no output folder gets created. This is the error I see:

08/03/10 10:35:21 INFO mapred.JobClient: Task Id : task_200803101001_0001_m_64_0, Status : FAILED
java.lang.OutOfMemoryError: Java heap space
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:95)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at org.apache.hadoop.io.Text.write(Text.java:243)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:347)
at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:72)
at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:59)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1787)
08/03/10 10:35:22 INFO mapred.JobClient: map 55% reduce 17%
08/03/10 10:35:31 INFO mapred.JobClient: map 56% reduce 17%
08/03/10 10:35:51 INFO mapred.JobClient: map 57% reduce 17%
08/03/10 10:36:04 INFO mapred.JobClient: map 58% reduce 17%
08/03/10 10:36:07 INFO mapred.JobClient: map 57% reduce 17%
08/03/10 10:36:07 INFO mapred.JobClient: Task Id : task_200803101001_0001_m_71_0, Status : FAILED
java.lang.OutOfMemoryError: Java heap space
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:95)
at java.io.DataOutputStream.write(DataOutputStream.java:90)
at org.apache.hadoop.io.Text.write(Text.java:243)
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:347)
at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:72)
at org.apache.hadoop.examples.WordCount$MapClass.map(WordCount.java:59)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1787)

Though it tries to overcome this problem, the mapreduce application doesn't create output. Can anyone tell me why this is happening? Thanks ~~~ 101tec Inc. Menlo Park, California, USA http://www.101tec.com
Re: [some bugs] Re: file permission problem
Great - it is even already fixed in 0.16.1! Thanks for the hint! Stefan On Mar 14, 2008, at 2:49 PM, Andy Li wrote: I think this is the same problem related to this mail thread. http://www.mail-archive.com/[EMAIL PROTECTED]/msg02759.html A JIRA has been filed, please see HADOOP-2915. On Fri, Mar 14, 2008 at 2:08 AM, Stefan Groschupf <[EMAIL PROTECTED]> wrote: Hi, is there any magic we can do with hadoop.dfs.umask? Or is there any other off switch for the file security? Thanks. Stefan On Mar 13, 2008, at 11:26 PM, Stefan Groschupf wrote: Hi Nicholas, Hi All, I can definitely reproduce the problem Johannes describes. Also, from debugging through the code, it is clearly a bug from my point of view. This is the call stack:

SequenceFile.createWriter
FileSystem.create
DFSClient.create
namenode.create

In NameNode I found this:

namesystem.startFile(src, new PermissionStatus(Server.getUserInfo().getUserName(), null, masked), clientName, clientMachine, overwrite, replication, blockSize);

In getUserInfo there is this comment:

// This is to support local calls (as opposed to rpc ones) to the name-node.
// Currently it is name-node specific and should be placed somewhere else.
try { return UnixUserGroupInformation.login();

The login javaDoc says:

/**
 * Get current user's name and the names of all its groups from Unix.
 * It's assumed that there is only one UGI per user. If this user already
 * has a UGI in the ugi map, return the ugi in the map.
 * Otherwise get the current user's information from Unix, store it
 * in the map, and return it.
 */

Besides that, I had some interesting observations. If I have permission to write to a folder A, I can delete folder A and a file B that is inside of folder A even if I have no permissions on B. Also, I noticed the following in my dfs:

[EMAIL PROTECTED] hadoop]$ bin/hadoop fs -ls /user/joa23/myApp-1205474968598
Found 1 items
/user/joa23/myApp-1205474968598/VOICE_CALL 2008-03-13 16:00 rwxr-xr-x hadoop supergroup
[EMAIL PROTECTED] hadoop]$ bin/hadoop fs -ls /user/joa23/myApp-1205474968598/VOICE_CALL
Found 1 items
/user/joa23/myApp-1205474968598/VOICE_CALL/part-0 27311 2008-03-13 16:00 rw-r--r-- joa23 supergroup

Am I missing something, or was I able to write as user joa23 into a folder owned by hadoop where I should have no permissions? :-O Should I open some jira issues? Stefan On Mar 13, 2008, at 10:55 AM, [EMAIL PROTECTED] wrote: Hi Johannes, regarding "i'm using the 0.16.0 distribution": I assume you mean the 0.16.0 release (http://hadoop.apache.org/core/releases.html) without any additional patch. I just tried it but cannot reproduce the problem you described. I did the following: 1) start a cluster with "tsz" 2) run a job with "nicholas" The output directory and files are owned by "nicholas". Am I doing the same thing you did? Could you try again? Nicholas - Original Message From: Johannes Zillmann <[EMAIL PROTECTED]> To: core-user@hadoop.apache.org Sent: Wednesday, March 12, 2008 5:47:27 PM Subject: file permission problem Hi, I have a question regarding the file permissions. I have a kind of workflow where I submit a job from my laptop to a remote hadoop cluster. After the job finishes I do some file operations on the generated output. The "cluster-user" is different from the "laptop-user". As output I specify a directory inside the user's home. This output directory, created by the map-reduce job, has "cluster-user" permissions, which does not allow me to move or delete the output folder with my "laptop-user". So it looks as follows:

/user/jz/ rwxrwxrwx jz supergroup
/user/jz/output rwxr-xr-x hadoop supergroup

I tried different things to achieve what I want (moving/deleting the output folder):
- jobConf.setUser("hadoop") on the client side
- System.setProperty("user.name","hadoop") before jobConf instantiation on the client side
- adding a user.name node to hadoop-site.xml on the client side
- setPermission(777) on the home folder on the client side (does not work recursively)
- setPermission(777) on the output folder on the client side (permission denied)
- creating the output folder before running the job ("Output directory already exists" exception)
None of the things I tried worked. Is there a way to achieve what I want? Any ideas appreciated! cheers Johannes -- ~~~ 101tec GmbH Halle (Saale), Saxony-Anhalt, Germany http://www.101tec.com ~~~ 101tec Inc. Menlo Park, California, USA http://www.101tec.com
Re: [some bugs] Re: file permission problem
Hi, is there any magic we can do with hadoop.dfs.umask? Or is there any other off switch for the file security? Thanks. Stefan On Mar 13, 2008, at 11:26 PM, Stefan Groschupf wrote: Hi Nicholas, Hi All, I can definitely reproduce the problem Johannes describes. Also, from debugging through the code, it is clearly a bug from my point of view. This is the call stack:

SequenceFile.createWriter
FileSystem.create
DFSClient.create
namenode.create

In NameNode I found this:

namesystem.startFile(src, new PermissionStatus(Server.getUserInfo().getUserName(), null, masked), clientName, clientMachine, overwrite, replication, blockSize);

In getUserInfo there is this comment:

// This is to support local calls (as opposed to rpc ones) to the name-node.
// Currently it is name-node specific and should be placed somewhere else.
try { return UnixUserGroupInformation.login();

The login javaDoc says:

/**
 * Get current user's name and the names of all its groups from Unix.
 * It's assumed that there is only one UGI per user. If this user already
 * has a UGI in the ugi map, return the ugi in the map.
 * Otherwise get the current user's information from Unix, store it
 * in the map, and return it.
 */

Besides that, I had some interesting observations. If I have permission to write to a folder A, I can delete folder A and a file B that is inside of folder A even if I have no permissions on B. Also, I noticed the following in my dfs:

[EMAIL PROTECTED] hadoop]$ bin/hadoop fs -ls /user/joa23/myApp-1205474968598
Found 1 items
/user/joa23/myApp-1205474968598/VOICE_CALL 2008-03-13 16:00 rwxr-xr-x hadoop supergroup
[EMAIL PROTECTED] hadoop]$ bin/hadoop fs -ls /user/joa23/myApp-1205474968598/VOICE_CALL
Found 1 items
/user/joa23/myApp-1205474968598/VOICE_CALL/part-0 27311 2008-03-13 16:00 rw-r--r-- joa23 supergroup

Am I missing something, or was I able to write as user joa23 into a folder owned by hadoop where I should have no permissions? :-O Should I open some jira issues? Stefan On Mar 13, 2008, at 10:55 AM, [EMAIL PROTECTED] wrote: Hi Johannes, regarding "i'm using the 0.16.0 distribution": I assume you mean the 0.16.0 release (http://hadoop.apache.org/core/releases.html) without any additional patch. I just tried it but cannot reproduce the problem you described. I did the following: 1) start a cluster with "tsz" 2) run a job with "nicholas" The output directory and files are owned by "nicholas". Am I doing the same thing you did? Could you try again? Nicholas - Original Message From: Johannes Zillmann <[EMAIL PROTECTED]> To: core-user@hadoop.apache.org Sent: Wednesday, March 12, 2008 5:47:27 PM Subject: file permission problem Hi, I have a question regarding the file permissions. I have a kind of workflow where I submit a job from my laptop to a remote hadoop cluster. After the job finishes I do some file operations on the generated output. The "cluster-user" is different from the "laptop-user". As output I specify a directory inside the user's home. This output directory, created by the map-reduce job, has "cluster-user" permissions, which does not allow me to move or delete the output folder with my "laptop-user". So it looks as follows:

/user/jz/ rwxrwxrwx jz supergroup
/user/jz/output rwxr-xr-x hadoop supergroup

I tried different things to achieve what I want (moving/deleting the output folder):
- jobConf.setUser("hadoop") on the client side
- System.setProperty("user.name","hadoop") before jobConf instantiation on the client side
- adding a user.name node to hadoop-site.xml on the client side
- setPermission(777) on the home folder on the client side (does not work recursively)
- setPermission(777) on the output folder on the client side (permission denied)
- creating the output folder before running the job ("Output directory already exists" exception)
None of the things I tried worked. Is there a way to achieve what I want? Any ideas appreciated! cheers Johannes -- ~~~ 101tec GmbH Halle (Saale), Saxony-Anhalt, Germany http://www.101tec.com ~~~ 101tec Inc. Menlo Park, California, USA http://www.101tec.com
[some bugs] Re: file permission problem
Hi Nicholas, Hi All, I can definitely reproduce the problem Johannes describes. Also, from debugging through the code, it is clearly a bug from my point of view. This is the call stack:

SequenceFile.createWriter
FileSystem.create
DFSClient.create
namenode.create

In NameNode I found this:

namesystem.startFile(src, new PermissionStatus(Server.getUserInfo().getUserName(), null, masked), clientName, clientMachine, overwrite, replication, blockSize);

In getUserInfo there is this comment:

// This is to support local calls (as opposed to rpc ones) to the name-node.
// Currently it is name-node specific and should be placed somewhere else.
try { return UnixUserGroupInformation.login();

The login javaDoc says:

/**
 * Get current user's name and the names of all its groups from Unix.
 * It's assumed that there is only one UGI per user. If this user already
 * has a UGI in the ugi map, return the ugi in the map.
 * Otherwise get the current user's information from Unix, store it
 * in the map, and return it.
 */

Besides that, I had some interesting observations. If I have permission to write to a folder A, I can delete folder A and a file B that is inside of folder A even if I have no permissions on B. Also, I noticed the following in my dfs:

[EMAIL PROTECTED] hadoop]$ bin/hadoop fs -ls /user/joa23/myApp-1205474968598
Found 1 items
/user/joa23/myApp-1205474968598/VOICE_CALL 2008-03-13 16:00 rwxr-xr-x hadoop supergroup
[EMAIL PROTECTED] hadoop]$ bin/hadoop fs -ls /user/joa23/myApp-1205474968598/VOICE_CALL
Found 1 items
/user/joa23/myApp-1205474968598/VOICE_CALL/part-0 27311 2008-03-13 16:00 rw-r--r-- joa23 supergroup

Am I missing something, or was I able to write as user joa23 into a folder owned by hadoop where I should have no permissions? :-O Should I open some jira issues? Stefan On Mar 13, 2008, at 10:55 AM, [EMAIL PROTECTED] wrote: Hi Johannes, regarding "i'm using the 0.16.0 distribution": I assume you mean the 0.16.0 release (http://hadoop.apache.org/core/releases.html) without any additional patch. I just tried it but cannot reproduce the problem you described. I did the following: 1) start a cluster with "tsz" 2) run a job with "nicholas" The output directory and files are owned by "nicholas". Am I doing the same thing you did? Could you try again? Nicholas - Original Message From: Johannes Zillmann <[EMAIL PROTECTED]> To: core-user@hadoop.apache.org Sent: Wednesday, March 12, 2008 5:47:27 PM Subject: file permission problem Hi, I have a question regarding the file permissions. I have a kind of workflow where I submit a job from my laptop to a remote hadoop cluster. After the job finishes I do some file operations on the generated output. The "cluster-user" is different from the "laptop-user". As output I specify a directory inside the user's home. This output directory, created by the map-reduce job, has "cluster-user" permissions, which does not allow me to move or delete the output folder with my "laptop-user". So it looks as follows:

/user/jz/ rwxrwxrwx jz supergroup
/user/jz/output rwxr-xr-x hadoop supergroup

I tried different things to achieve what I want (moving/deleting the output folder):
- jobConf.setUser("hadoop") on the client side
- System.setProperty("user.name","hadoop") before jobConf instantiation on the client side
- adding a user.name node to hadoop-site.xml on the client side
- setPermission(777) on the home folder on the client side (does not work recursively)
- setPermission(777) on the output folder on the client side (permission denied)
- creating the output folder before running the job ("Output directory already exists" exception)
None of the things I tried worked. Is there a way to achieve what I want? Any ideas appreciated! cheers Johannes -- ~~~ 101tec GmbH Halle (Saale), Saxony-Anhalt, Germany http://www.101tec.com ~~~ 101tec Inc. Menlo Park, California, USA http://www.101tec.com
Re: Hadoop summit / workshop at Yahoo!
Puhh, 2 days and it is full? Does Yahoo have no bigger rooms than just for 100 people? On Feb 20, 2008, at 12:10 PM, Ajay Anand wrote: The registration page for the Hadoop summit is now up: http://developer.yahoo.com/hadoop/summit/ Space is limited, so please sign up early if you are interested in attending. About the summit: Yahoo! is hosting the first summit on Apache Hadoop on March 25th in Sunnyvale. The summit is sponsored by the Computing Community Consortium (CCC) and brings together leaders from the Hadoop developer and user communities. The speakers will cover topics in the areas of extensions being developed for Hadoop, case studies of applications being built and deployed on Hadoop, and a discussion on future directions for the platform. Agenda:

8:30-8:55 Breakfast
8:55-9:00 Welcome to Yahoo! & Logistics - Ajay Anand, Yahoo!
9:00-9:30 Hadoop Overview - Doug Cutting / Eric Baldeschwieler, Yahoo!
9:30-10:00 Pig - Chris Olston, Yahoo!
10:00-10:30 JAQL - Kevin Beyer, IBM
10:30-10:45 Break
10:45-11:15 DryadLINQ - Michael Isard, Microsoft
11:15-11:45 Monitoring Hadoop using X-Trace - Andy Konwinski and Matei Zaharia, UC Berkeley
11:45-12:15 Zookeeper - Ben Reed, Yahoo!
12:15-1:15 Lunch
1:15-1:45 Hbase - Michael Stack, Powerset
1:45-2:15 Hbase App - Bryan Duxbury, Rapleaf
2:15-2:45 Hive - Joydeep Sen Sarma, Facebook
2:45-3:00 Break
3:00-3:20 Building Ground Models of Southern California - Steve Schossler, David O'Hallaron, Intel / CMU
3:20-3:40 Online search for engineering design content - Mike Haley, Autodesk
3:40-4:00 Yahoo - Webmap - Arnab Bhattacharjee, Yahoo!
4:00-4:30 Natural language Processing - Jimmy Lin, U of Maryland / Christophe Bisciglia, Google
4:30-4:45 Break
4:45-5:30 Panel on future directions
5:30-7:00 Happy hour

Look forward to seeing you there! Ajay -Original Message- From: Bradford Stephens [mailto:[EMAIL PROTECTED]] Sent: Wednesday, February 20, 2008 9:17 AM To: core-user@hadoop.apache.org Subject: Re: Hadoop summit / workshop at Yahoo! Hrm yes, I'd like to make a visit as well :) On Feb 20, 2008 8:05 AM, C G <[EMAIL PROTECTED]> wrote: Hey All: Is this going forward? I'd like to make plans to attend, and the sooner I can get plane tickets the happier the bean counters will be :-). Thx, C G Ajay Anand wrote: Yahoo plans to host a summit / workshop on Apache Hadoop at our Sunnyvale campus on March 25th. Given the interest we are seeing from developers in a broad range of organizations, this seems like a good time to get together and brief each other on the progress that is being made. We would like to cover topics in the areas of extensions being developed for Hadoop, innovative applications being built and deployed on Hadoop, and future extensions to the platform. Some of the speakers who have already committed to present are from organizations such as IBM, Intel, Carnegie Mellon University, UC Berkeley, Facebook and Yahoo!, and we are actively recruiting other leaders in the space. If you have an innovative application you would like to talk about, please let us know. Although there are limitations on the amount of time we have, we would love to hear from you. You can contact me at [EMAIL PROTECTED] Thanks and looking forward to hearing about your cool apps, Ajay ~~~ 101tec Inc. Menlo Park, California, USA http://www.101tec.com
broadcasting: pig user meeting, Friday, February 8, 2008
Hi there, sorry for the cross posting. If everything works out, we will video broadcast the event here: http://ustream.tv/channel/apache-pig-user-meeting But no guarantee - sorry. We are also trying to set up a telephone call-in number - please write me a private email if you are interested and I will send out the number. See you tomorrow. Stefan On Feb 6, 2008, at 3:54 PM, Andrzej Bialecki wrote: Otis Gospodnetic wrote: Sorry about the word-wrapping (original email) - Yahoo Mail problem :( Is anyone going to be capturing the Piglet meeting on video for those of us living in other corners of the planet? Please do! It's too far from Poland to just casually drop by .. ;) -- Best regards, Andrzej Bialecki - Information Retrieval, Semantic Web, Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com ~~~ 101tec Inc. Menlo Park, California, USA http://www.101tec.com
Re: pig user meeting, Friday, February 8, 2008
Hi Otis, can you suggest a technology for how we could do that? Skype? iChat? Something that is free? I'm happy to set up a video conference, however there are no big presentations planned. I was thinking I could give an overview of how we use pig for our current project, just to reflect our use cases. But besides that, I guess it is just pizza and beer. Cheers, Stefan On Feb 6, 2008, at 11:40 AM, Otis Gospodnetic wrote: Sorry about the word-wrapping (original email) - Yahoo Mail problem :( Is anyone going to be capturing the Piglet meeting on video for those of us living in other corners of the planet? Thank you, Otis -- Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch - Original Message From: Stefan Groschupf <[EMAIL PROTECTED]> To: core-user@hadoop.apache.org Sent: Thursday, January 31, 2008 7:09:53 PM Subject: pig user meeting, Friday, February 8, 2008 Hi there, a couple of people plan to meet and talk about apache pig next Friday in the Mountain View area. (The event location is not yet settled.) If you are interested please RSVP asap, so we can plan what size of location we are looking for. http://upcoming.yahoo.com/event/420958/ Cheers, Stefan ~~~ 101tec Inc. Menlo Park, California, USA http://www.101tec.com
pig user meeting, Friday, February 8, 2008
Hi there, a couple of people plan to meet and talk about apache pig next Friday in the Mountain View area. (The event location is not yet settled.) If you are interested please RSVP asap, so we can plan what size of location we are looking for. http://upcoming.yahoo.com/event/420958/ Cheers, Stefan ~~~ 101tec Inc. Menlo Park, California, USA http://www.101tec.com
Re: Reduce hangs 2
Hi, not sure if this is the same source of the problem, but I also ran into problems with a hanging reduce. It is reproducible for me, though I have not found the source of the problem yet. I run a series of jobs, and in my last job the last reduce task hangs for about 15 to 20 minutes doing nothing, but then resumes. I am running hadoop 15.1. Below are the log entries during the hang, so I think it is not the copy problem mentioned before. I also checked that our dfs is healthy.

2008-01-22 21:22:09,327 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 Need 2 map output(s)
2008-01-22 21:22:09,327 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
2008-01-22 21:22:09,327 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 Got 2 known map output location(s); scheduling...
2008-01-22 21:22:09,327 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 Scheduled 2 of 2 known outputs (0 slow hosts and 0 dup hosts)
2008-01-22 21:22:09,327 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 Copying task_200801221313_0003_m_35_0 output from hadoop5.dev.company.com.
2008-01-22 21:22:09,328 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 Copying task_200801221313_0003_m_40_0 output from hadoop1.dev.company.com.
2008-01-22 21:22:11,243 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 done copying task_200801221313_0003_m_40_0 output from hadoop1.dev.company.com.
2008-01-22 21:22:11,610 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 done copying task_200801221313_0003_m_35_0 output from hadoop5.dev.company.com.
2008-01-22 21:22:11,611 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 Copying of all map outputs complete. Initiating the last merge on the remaining files in ramfs://mapoutput169937755
2008-01-22 21:22:11,635 INFO org.apache.hadoop.mapred.ReduceTask: task_200801221313_0003_r_46_1 Merge of the 1 files in InMemoryFileSystem complete. Local file is /home/hadoop/data/hadoop-hadoop/mapred/local/task_200801221313_0003_r_46_1/map_34.out

Any ideas? Thanks! Stefan
setting # of maps for a job
Hi, I am having trouble setting the number of maps for a job with version 15.1. As far as I understand, I can configure the number of maps that a job will do in hadoop-site.xml on the box where I submit the job (which is not the jobtracker box). However, my configuration is always ignored. Changing the value in hadoop-site.xml on the jobtracker box and restarting the nodes does not help either. I do not set the number via the API. Any ideas where I might be overlooking something? Thanks for any hints, Stefan ~~~ 101tec Inc. Menlo Park, California, USA http://www.101tec.com
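For the archive: one thing worth checking is that the map count is only ever a hint. The old API exposes it in code too, but the InputFormat's split computation (dfs block size, mapred.min.split.size, number of input files) ultimately decides how many map tasks run, which could explain why a value set in hadoop-site.xml appears to be ignored. A small sketch, with example values:

import org.apache.hadoop.mapred.JobConf;

public class MapCount {
  public static void main(String[] args) {
    JobConf conf = new JobConf(MapCount.class);
    conf.setNumMapTasks(20);    // a hint only; splits decide the real count
    conf.setNumReduceTasks(4);  // reduces, by contrast, are honored exactly
    // ... rest of the job setup ...
  }
}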