Re: using StreamInputFormat, StreamXmlRecordReader with your custom Jobs
Uh, do I have to copy the jar file manually into HDFS before I invoke the hadoop jar command starting my own job?

Utkarsh Agarwal wrote: I think you can use DistributedCache to specify the location of the jar after you have it in HDFS.

On Wed, Mar 10, 2010 at 6:11 AM, Reik Schatz reik.sch...@bwin.org wrote: Hi, I am playing around with version 0.20.2 of Hadoop. I have written and packaged a job using a custom Mapper and Reducer. The input format in my job is set to StreamInputFormat, and the property stream.recordreader.class is set to org.apache.hadoop.streaming.StreamXmlRecordReader. This is how I want to start my job: hadoop jar custom-1.0-SNAPSHOT.jar EmailCountingJob /input /output The problem is that in this case all classes from hadoop-0.20.2-streaming.jar are missing (ClassNotFoundException). I tried using -libjars without luck: hadoop jar -libjars PATH/hadoop-0.20.2-streaming.jar custom-1.0-SNAPSHOT.jar EmailCountingJob /input /output Is there any way to use the streaming classes with your own jobs without copying these classes into your project and packaging them into your own jar? /Reik
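A likely cause of the -libjars failure above: -libjars is a generic option handled by GenericOptionsParser, so it must appear after the main class name, and it is only parsed at all if the driver runs through ToolRunner. A minimal sketch of that wiring, assuming a driver class like the poster's (the run() body here is a placeholder, not the poster's actual code):

```java
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class EmailCountingJob extends Configured implements Tool {
  public int run(String[] args) throws Exception {
    // Build the JobConf from getConf() -- that Configuration already
    // carries whatever -libjars specified -- then submit the job here.
    return 0;
  }
  public static void main(String[] args) throws Exception {
    // ToolRunner invokes GenericOptionsParser, which strips -libjars
    // out of args and ships the listed jars with the job.
    System.exit(ToolRunner.run(new EmailCountingJob(), args));
  }
}
```

With that in place, the invocation would be hadoop jar custom-1.0-SNAPSHOT.jar EmailCountingJob -libjars PATH/hadoop-0.20.2-streaming.jar /input /output -- note the option coming after the class name, unlike the command tried above.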
Re: Passing whole text file to a single map
I have the same problem: I need to assign a whole file per map but I don't know how to do that. I've tried to create a new WholeFileFormat.class and override the method isSplitable(), but it doesn't seem to work. Have you managed to do this? I'm using hadoop 0.20.2

stolikp wrote: I've got some text files in my input directory and I want to pass each single text file (the whole file, not just a line) to a map (one file per map). How can I do this? TextInputFormat splits text into lines and I do not want this to happen. I tried http://hadoop.apache.org/common/docs/r0.20./streaming.html#How+do+I+process+files%2C+one+per+map%3F but it doesn't work for me; the compiler doesn't know what NonSplitableTextInputFormat.class is. I'm using hadoop 0.20.1 -- View this message in context: http://old.nabble.com/Passing-whole-text-file-to-a-single-map-tp27287649p27860526.html Sent from the Hadoop core-user mailing list archive at Nabble.com.
Re: Passing whole text file to a single map
Do you guys have a copy of Tom White's Hadoop book available? There is an excellent example of a WholeFileInputFormat which definitely works with Hadoop-0.20.0. What do you mean by 'it doesn't seem to work'? Exceptions, unexpected output, ...? Regards, Thomas

On 11.03.2010 10:24, HypOo wrote: I have the same problem: I need to assign a whole file per map but I don't know how to do that. I've tried to create a new WholeFileFormat.class and override the method isSplitable(), but it doesn't seem to work. Have you managed to do this? I'm using hadoop 0.20.2 stolikp wrote: I've got some text files in my input directory and I want to pass each single text file (the whole file, not just a line) to a map (one file per map). How can I do this? TextInputFormat splits text into lines and I do not want this to happen. I tried http://hadoop.apache.org/common/docs/r0.20./streaming.html#How+do+I+process+files%2C+one+per+map%3F but it doesn't work for me; the compiler doesn't know what NonSplitableTextInputFormat.class is. I'm using hadoop 0.20.1
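For anyone without the book to hand, a sketch of the idea, modeled on Tom White's WholeFileInputFormat and written against the old (org.apache.hadoop.mapred) API used in 0.20.x. Class names are illustrative, not a drop-in copy of the book's code:

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.*;

public class WholeFileInputFormat
    extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;  // one split -- and therefore one map task -- per file
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> getRecordReader(
      InputSplit split, final JobConf job, Reporter reporter)
      throws IOException {
    final FileSplit fileSplit = (FileSplit) split;
    return new RecordReader<NullWritable, BytesWritable>() {
      private boolean processed = false;

      public boolean next(NullWritable key, BytesWritable value)
          throws IOException {
        if (processed) return false;
        // The entire file becomes a single record.
        byte[] contents = new byte[(int) fileSplit.getLength()];
        Path file = fileSplit.getPath();
        FSDataInputStream in = file.getFileSystem(job).open(file);
        try {
          in.readFully(0, contents);
        } finally {
          in.close();
        }
        value.set(contents, 0, contents.length);
        processed = true;
        return true;
      }

      public NullWritable createKey() { return NullWritable.get(); }
      public BytesWritable createValue() { return new BytesWritable(); }
      public long getPos() { return processed ? fileSplit.getLength() : 0; }
      public float getProgress() { return processed ? 1.0f : 0.0f; }
      public void close() { }
    };
  }
}
```

It then has to be wired into the job with conf.setInputFormat(WholeFileInputFormat.class). Two common reasons for "it doesn't seem to work": the method name is isSplitable (one 't'), so a misspelled override silently overrides nothing -- the @Override annotation catches this at compile time -- and overriding it in a class that the job never actually uses as its input format.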
Call for presentations - Berlin Buzzwords - Summer 2010
Call for Presentations Berlin Buzzwords http://buzzwordsberlin.de

Berlin Buzzwords 2010 - Search, Store, Scale
7/8 June 2010

This is to announce Berlin Buzzwords 2010, the first conference on scalable and open search, data processing and data storage in Germany, taking place in Berlin. The event will comprise presentations on scalable data processing. We invite you to submit talks on the topics:

Information retrieval / Search - Lucene, Solr, katta or comparable solutions
NoSQL - like CouchDB, MongoDB, Jackrabbit, HBase and others
Hadoop - Hadoop itself, MapReduce, Cascading or Pig and relatives

Closely related topics not explicitly listed above are welcome. We are looking for presentations on the implementation of the systems themselves, real world applications and case studies.

Important dates (all dates in GMT+2):
Submission deadline: April 17th 2010, 23:59
Notification of accepted speakers: May 1st, 2010
Publication of final schedule: May 9th, 2010
Conference: June 7/8, 2010

High quality, technical submissions are called for, ranging from principles to practice. We are looking for real world use cases, background on the architecture of specific projects and a deep dive into architectures built on top of e.g. Hadoop clusters. Proposals should be submitted at http://berlinbuzzwords.de/content/cfp no later than April 17th, 2010. Acceptance notifications will be sent out on May 1st. Please include your name, bio and email, the title of the talk, and a brief abstract in English. Please indicate whether you want to give a short (30 min) or long (45 min) presentation, and indicate the level of experience with the topic your audience should have (e.g. whether your talk will be suitable for newbies or is targeted at experienced users). The presentation format is short: either 30 or 45 minutes including questions. We will be enforcing the schedule rigorously.

If you are interested in sponsoring the event (e.g. we would be happy to provide videos after the event, free drinks for attendees as well as an after-show party), please contact us. Follow @hadoopberlin on Twitter for updates. News on the conference will be published on our website at http://berlinbuzzwords.de

Program chairs: Isabel Drost, Jan Lehnardt, and Simon Willnauer. Schedule and further updates on the event will be published on http://berlinbuzzwords.de Please re-distribute this CfP to people who might be interested.

Contact us at: newthinking communications GmbH, Schönhauser Allee 6/7, 10119 Berlin, Germany. Andreas Gebhard a...@newthinking.de Isabel Drost i...@newthinking.de +49(0)30-9210 596
Re: Call for presentations - Berlin Buzzwords - Summer 2010
On 11.03.2010 Isabel Drost wrote: Call for Presentations Berlin Buzzwords http://buzzwordsberlin.de http://berlinbuzzwords.de of course... Isabel
How do I upgrade my hadoop cluster using hadoop?
I thought there was a utility to do the upgrade for you that you run from one node, and that it would do a copy to every other node?
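As far as I know, Hadoop does not copy the new release to the other nodes for you; distributing the binaries is your job (rsync, parallel ssh, configuration management, etc.). The built-in part is the HDFS metadata upgrade. A rough sketch of the procedure on 0.20.x, under the assumption of a stock layout (double-check against the upgrade documentation for your version before running anything):

```shell
# 1. Stop the cluster, then install the new Hadoop release on EVERY node
#    yourself -- Hadoop does not push binaries around for you.
bin/stop-all.sh

# 2. On the namenode, start HDFS with -upgrade so the filesystem
#    metadata is converted to the new on-disk layout:
bin/start-dfs.sh -upgrade

# 3. Watch the upgrade progress:
bin/hadoop dfsadmin -upgradeProgress status

# 4. Only once you are satisfied the cluster is healthy (rollback is
#    impossible afterwards):
bin/hadoop dfsadmin -finalizeUpgrade
```

Until the upgrade is finalized, the previous layout is kept on disk, so `start-dfs.sh -rollback` can revert to the old version if something goes wrong.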
TaskTracker: Java heap space error
Dear All, I am running a hadoop job processing data. The output of map is really large, and it spills 15 times. So I was trying to set io.sort.mb = 256 instead of 100, leaving everything else at the default. I am using the 0.20.2 version. When I run the job, I get the following errors:

2010-03-11 11:09:37,581 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2010-03-11 11:09:38,073 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 1
2010-03-11 11:09:38,086 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 256
2010-03-11 11:09:38,326 FATAL org.apache.hadoop.mapred.TaskTracker: Error running child : java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:781)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:350)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)

I can't figure out why; could anyone please give me a hint? Any help will be appreciated! Thanks a lot! Sincerely, Boyu
Re: TaskTracker: Java heap space error
Moving to mapreduce-user@, bcc: common-user. Have you tried bumping up the heap for the map task? Since you are setting io.sort.mb to 256M, please set the heap size to at least 512M, if not more: mapred.child.java.opts = -Xmx512M or -Xmx1024m. Arun

On Mar 11, 2010, at 8:24 AM, Boyu Zhang wrote: Dear All, I am running a hadoop job processing data. The output of map is really large, and it spills 15 times. So I was trying to set io.sort.mb = 256 instead of 100, leaving everything else at the default. I am using the 0.20.2 version. When I run the job, I get the following errors:

2010-03-11 11:09:37,581 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2010-03-11 11:09:38,073 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 1
2010-03-11 11:09:38,086 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 256
2010-03-11 11:09:38,326 FATAL org.apache.hadoop.mapred.TaskTracker: Error running child : java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:781)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:350)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)

I can't figure out why; could anyone please give me a hint? Any help will be appreciated! Thanks a lot! Sincerely, Boyu
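To expand on why the original setting failed: the map-side sort buffer is allocated up front, inside the child task's JVM, and in 0.20.x the default child heap is only -Xmx200m. An io.sort.mb of 256 therefore cannot fit, which is why the OutOfMemoryError appears immediately in MapOutputBuffer's constructor rather than partway through the job. A config fragment pairing the two settings (the values are examples; size them for your own jobs, keeping io.sort.mb well below the child heap):

```xml
<!-- mapred-site.xml (or hadoop-site.xml on older layouts) -->
<property>
  <name>io.sort.mb</name>
  <value>256</value>  <!-- map-side sort buffer, allocated inside the child heap -->
</property>
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx1024m</value>  <!-- must comfortably exceed io.sort.mb -->
</property>
```

A larger buffer reduces the number of spills (the 15 spills mentioned above), but only the heap increase fixes the OOM; raising io.sort.mb without raising the heap makes the problem strictly worse.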
Re: region server appearing twice on HBase Master page
Bringing the discussion to hbase-user. That usually happens after a DNS hiccup. There's a fix for that in https://issues.apache.org/jira/browse/HBASE-2174 J-D

On Wed, Mar 10, 2010 at 1:41 PM, Ted Yu yuzhih...@gmail.com wrote: I noticed two lines for the same region server on the HBase Master page:

X.com:60030 1268160765854 requests=0, regions=16, usedHeap=1068, maxHeap=6127
X.com:60030 1268250726442 requests=21, regions=9, usedHeap=1258, maxHeap=6127

I checked that there is only one org.apache.hadoop.hbase.regionserver.HRegionServer instance running on that machine. This is from the region server log:

2010-03-10 13:25:38,157 ERROR [IPC Server handler 43 on 60020] regionserver.HRegionServer(844): org.apache.hadoop.hbase.NotServingRegionException: ruletable,,1268083966723
at org.apache.hadoop.hbase.regionserver.HRegionServer.getRegion(HRegionServer.java:2307)
at org.apache.hadoop.hbase.regionserver.HRegionServer.get(HRegionServer.java:1784)
at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)
2010-03-10 13:25:38,189 ERROR [IPC Server handler 0 on 60020] regionserver.HRegionServer(844): org.apache.hadoop.hbase.NotServingRegionException: ruletable,,1268083966723 (same stack trace as above)

If you know how to troubleshoot, please share.
Re: TaskTracker: Java heap space error
Hi Arun, I did what you said, and it seems to work. Thanks a lot! I guess I do not completely understand how the tuning parameters affect each other. Thanks! Boyu

On Thu, Mar 11, 2010 at 12:27 PM, Arun C Murthy a...@yahoo-inc.com wrote: Moving to mapreduce-user@, bcc: common-user. Have you tried bumping up the heap for the map task? Since you are setting io.sort.mb to 256M, please set the heap size to at least 512M, if not more: mapred.child.java.opts = -Xmx512M or -Xmx1024m. Arun

On Mar 11, 2010, at 8:24 AM, Boyu Zhang wrote: Dear All, I am running a hadoop job processing data. The output of map is really large, and it spills 15 times. So I was trying to set io.sort.mb = 256 instead of 100, leaving everything else at the default. I am using the 0.20.2 version. When I run the job, I get the following errors:

2010-03-11 11:09:37,581 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId=
2010-03-11 11:09:38,073 INFO org.apache.hadoop.mapred.MapTask: numReduceTasks: 1
2010-03-11 11:09:38,086 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 256
2010-03-11 11:09:38,326 FATAL org.apache.hadoop.mapred.TaskTracker: Error running child : java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:781)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:350)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)

I can't figure out why; could anyone please give me a hint? Any help will be appreciated! Thanks a lot! Sincerely, Boyu