Re: NLineInputFormat and very high number of maptasks
Saptarshi Guha wrote:
> Sorry, I see - every line is now a map task: one split, one task (in this
> case N=1 line per split). Is that correct? Saptarshi

You are right. NLineInputFormat splits N lines of input as one split, and each split is given to a map task. By default, N is 1. N can be configured through the property "mapred.line.input.format.linespermap".

On Jan 20, 2009, at 11:39 AM, Saptarshi Guha wrote:
> Hello, When I use NLineInputFormat and I output:
> System.out.println("mapred.map.tasks:" + jobConf.get("mapred.map.tasks"));

Where are you printing this statement? It looks like the JobConf that you are looking at is not yet set with the correct number of map tasks.

> I see 51, but on the jobtracker site the number is 18114. Yet with
> TextInputFormat it shows 51. I'm using Hadoop 0.19. Any ideas why?
> Regards Saptarshi
> --
> Saptarshi Guha | saptarshi.g...@gmail.com | http://www.stat.purdue.edu/~sguha

-Amareshwari
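For reference, a minimal sketch of setting this on the 0.19 API. The class name is a placeholder; only the input format and the property name come from this thread:

    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.NLineInputFormat;

    public class NLineSketch {
        // With the default N=1, an 18114-line input yields 18114 splits and
        // hence 18114 map tasks; raising N shrinks the task count
        // proportionally (here to roughly 19).
        public static void configure(JobConf conf) {
            conf.setInputFormat(NLineInputFormat.class);
            conf.setInt("mapred.line.input.format.linespermap", 1000);
        }
    }

The actual map-task count is computed from the splits at job submission, which is why the value read from the JobConf before submission can disagree with what the jobtracker shows.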
RE: streaming split sizes
Well, the database is specifically designed to fit into memory, and if it doesn't, things slow down hundreds of times. One simple hack I came up with is to replace the map tasks with /bin/cat and then run 150 reducers that keep the database constantly in memory. Parallelism is also not a problem, since we're running a very small cluster (15 nodes, 120 cores) built specifically for the task.

---
Dmitry Pushkarev
+1-650-644-8988

-----Original Message-----
From: Delip Rao [mailto:delip...@gmail.com]
Sent: Tuesday, January 20, 2009 6:19 PM
To: core-user@hadoop.apache.org
Subject: Re: streaming split sizes

Hi Dmitry,

Not a direct answer to your question, but I think the right approach would be to not load your database into memory during config() and instead look it up from map() via HBase or something similar. That way you don't have to worry about the split sizes. In fact, using fewer splits would limit the parallelism you can achieve, given that your maps are so fast.

- delip

On Tue, Jan 20, 2009 at 8:25 PM, Dmitry Pushkarev wrote:
> Hi,
>
> I'm running streaming on a relatively big (2 TB) dataset, which is being
> split by Hadoop into 64 MB pieces. One of the problems I have with that is
> that my map tasks take a very long time to initialize (they need to load a
> 3 GB database into RAM), and then they finish each 64 MB piece in 10 seconds.
>
> So I'm wondering if there is any way to make Hadoop give larger datasets to
> map jobs? (The trivial way, of course, would be to split the dataset into N
> files and feed one file at a time, but is there any standard solution for
> that?)
>
> Thanks.
>
> ---
> Dmitry Pushkarev
> +1-650-644-8988
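A sketch of the setup Dmitry describes, against the 0.19-era streaming jar; the paths and the lookup.py reducer are placeholders:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.19.0-streaming.jar \
      -D mapred.reduce.tasks=150 \
      -input /data/in \
      -output /data/hits \
      -mapper /bin/cat \
      -reducer ./lookup.py \
      -file lookup.py

One side effect of pushing everything through the shuffle is that records arrive at the reducers grouped and sorted by whatever precedes the first tab, which may or may not matter for this workload.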
Re: streaming split sizes
Hi Dmitry,

Not a direct answer to your question, but I think the right approach would be to not load your database into memory during config() and instead look it up from map() via HBase or something similar. That way you don't have to worry about the split sizes. In fact, using fewer splits would limit the parallelism you can achieve, given that your maps are so fast.

- delip

On Tue, Jan 20, 2009 at 8:25 PM, Dmitry Pushkarev wrote:
> Hi,
>
> I'm running streaming on a relatively big (2 TB) dataset, which is being
> split by Hadoop into 64 MB pieces. One of the problems I have with that is
> that my map tasks take a very long time to initialize (they need to load a
> 3 GB database into RAM), and then they finish each 64 MB piece in 10 seconds.
>
> So I'm wondering if there is any way to make Hadoop give larger datasets to
> map jobs? (The trivial way, of course, would be to split the dataset into N
> files and feed one file at a time, but is there any standard solution for
> that?)
>
> Thanks.
>
> ---
> Dmitry Pushkarev
> +1-650-644-8988
streaming split sizes
Hi,

I'm running streaming on a relatively big (2 TB) dataset, which is being split by Hadoop into 64 MB pieces. One of the problems I have with that is that my map tasks take a very long time to initialize (they need to load a 3 GB database into RAM), and then they finish each 64 MB piece in 10 seconds.

So I'm wondering if there is any way to make Hadoop give larger datasets to map jobs? (The trivial way, of course, would be to split the dataset into N files and feed one file at a time, but is there any standard solution for that?)

Thanks.

---
Dmitry Pushkarev
+1-650-644-8988
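One standard knob, not mentioned in the thread: FileInputFormat computes each split as max(minSize, min(goalSize, blockSize)), so raising mapred.min.split.size above the 64 MB block size yields proportionally larger splits, at the cost of map-side data locality. A sketch with placeholder paths and mapper:

    # Ask for ~1 GB splits instead of one split per 64 MB block.
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.19.0-streaming.jar \
      -D mapred.min.split.size=1073741824 \
      -input /data/in \
      -output /data/out \
      -mapper ./mapper.py \
      -file mapper.py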
Null Pointer with Pattern file
Hi,

I was trying to run the Hadoop WordCount version 2 example under Cygwin. Without the pattern.txt file it works fine. When I use a pattern.txt file to skip some patterns, I get a NullPointerException as follows:

09/01/20 12:56:16 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
09/01/20 12:56:17 WARN mapred.JobClient: No job jar file set. User classes may not be found. See JobConf(Class) or JobConf#setJar(String).
09/01/20 12:56:17 INFO mapred.FileInputFormat: Total input paths to process : 4
09/01/20 12:56:17 INFO mapred.JobClient: Running job: job_local_0001
09/01/20 12:56:17 INFO mapred.FileInputFormat: Total input paths to process : 4
09/01/20 12:56:17 INFO mapred.MapTask: numReduceTasks: 1
09/01/20 12:56:17 INFO mapred.MapTask: io.sort.mb = 100
09/01/20 12:56:17 INFO mapred.MapTask: data buffer = 79691776/99614720
09/01/20 12:56:17 INFO mapred.MapTask: record buffer = 262144/327680
09/01/20 12:56:17 WARN mapred.LocalJobRunner: job_local_0001
java.lang.NullPointerException
    at org.myorg.WordCount$Map.configure(WordCount.java:39)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:328)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
    at org.myorg.WordCount.run(WordCount.java:114)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.myorg.WordCount.main(WordCount.java:119)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
    at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

Please tell me what I should do.

Thanks,
shyam.s.sar...@gmail.com
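The trace points at line 39 of the tutorial's configure(), where DistributedCache.getLocalCacheFiles(job) can return null if no pattern file was actually registered (for example, if the -skip argument never reached DistributedCache.addCacheFile); note the log also shows the job running under the LocalJobRunner (job_local_0001), where cache files may not get localized. A defensive sketch of that method, assuming the tutorial's field names and its parseSkipFile() helper:

    // Assumed imports: java.io.IOException, org.apache.hadoop.fs.Path,
    // org.apache.hadoop.filecache.DistributedCache,
    // org.apache.hadoop.mapred.JobConf, org.apache.hadoop.util.StringUtils.
    public void configure(JobConf job) {
        caseSensitive = job.getBoolean("wordcount.case.sensitive", true);
        if (job.getBoolean("wordcount.skip.patterns", false)) {
            Path[] patternsFiles;
            try {
                patternsFiles = DistributedCache.getLocalCacheFiles(job);
            } catch (IOException ioe) {
                System.err.println("Caught exception while getting cached files: "
                    + StringUtils.stringifyException(ioe));
                return;
            }
            if (patternsFiles == null) {
                // Null here is the likely source of the NPE at WordCount.java:39.
                System.err.println("No cached pattern files found; skipping disabled.");
                return;
            }
            for (Path patternsFile : patternsFiles) {
                parseSkipFile(patternsFile);
            }
        }
    }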
Class Not found error
Hi,

I am following the instructions for the WordCount version 2 example on Hadoop installed under Cygwin. The quick-start example worked fine, but WordCount version 2 gives the following error:

java.lang.ClassNotFoundException: org.myorg.WordCount
    at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
    at java.security.AccessController.doPrivileged(Native Method)
    at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
    at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:247)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:158)
    at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

I checked all the files in HDFS many times. They are all there. Please help.

shyam_sar...@yahoo.com
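A ClassNotFoundException for org.myorg.WordCount usually means the class is not inside the jar handed to "hadoop jar", often because the org/myorg/ package directories were lost during packaging. A sketch of the tutorial-style build and run, with placeholder jar names and paths:

    mkdir -p wordcount_classes
    javac -classpath $HADOOP_HOME/hadoop-0.19.0-core.jar -d wordcount_classes WordCount.java
    jar -cvf wordcount.jar -C wordcount_classes/ .
    # Verify before submitting; expect org/myorg/WordCount.class in the listing.
    jar -tf wordcount.jar | grep WordCount
    bin/hadoop jar wordcount.jar org.myorg.WordCount input output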
Re: NLineInputFormat and very high number of maptasks
Sorry, I see - every line is now a map task: one split, one task (in this case N=1 line per split). Is that correct?

Saptarshi

On Jan 20, 2009, at 11:39 AM, Saptarshi Guha wrote:
> Hello, When I use NLineInputFormat and I output:
> System.out.println("mapred.map.tasks:" + jobConf.get("mapred.map.tasks"));
> I see 51, but on the jobtracker site the number is 18114. Yet with
> TextInputFormat it shows 51. I'm using Hadoop 0.19. Any ideas why?
> Regards Saptarshi

--
Saptarshi Guha | saptarshi.g...@gmail.com | http://www.stat.purdue.edu/~sguha
If the church put in half the time on covetousness that it does on lust, this would be a better world. -- Garrison Keillor, "Lake Wobegon Days"
NLineInputFormat and very high number of maptasks
Hello,

When I use NLineInputFormat and I output:

System.out.println("mapred.map.tasks:" + jobConf.get("mapred.map.tasks"));

I see 51, but on the jobtracker site the number is 18114. Yet with TextInputFormat it shows 51. I'm using Hadoop 0.19. Any ideas why?

Regards
Saptarshi

--
Saptarshi Guha - saptarshi.g...@gmail.com
Re: Java RMI and Hadoop RecordIO
David Alves wrote:
> Hi, I've been testing some different serialization techniques to go along
> with a research project. I know the motivation behind the Hadoop
> serialization mechanism (e.g. Writable), and that the enhancement of this
> feature through Record I/O is not only about performance but also about
> control of the input/output. Still, I've been running some simple tests
> and I've found that plain RMI beats Hadoop Record I/O almost every time
> (14-16% faster). In my test I have a simple Java class that has 14 int
> fields and 1 long field, and I'm serializing around 35,000 instances.
> Am I doing anything wrong? Are there ways to improve performance in
> Record I/O? Have I got the use case wrong?
> Regards, David Alves

Any speedups are welcome; people are looking at ProtocolBuffers and Thrift.

- Are you also measuring packet size and deserialization costs?
- Add a string or two - and references to other instances - then try pushing
  a few million round the network using the same serialization stream instance.

I do use RMI a lot at work. Once you come up with a plan to deal with its brittleness against change (we keep the code in the cluster up to date and make no guarantees about compatibility across versions), it is easy to use. But it has many, many problems, and if you hit one, since the code is deep in the JVM, it is very hard to deal with. One example: RMI tries to send a graph over, and it likes to make sure it hasn't pushed a copy over earlier. The longer you keep a serialization stream up, the slower it gets.

-steve
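Steve's last point is standard java.io.ObjectOutputStream behaviour: the stream keeps a handle table of every object already written, so repeated references serialize as small back-references, but the table grows for the life of the stream. A minimal self-contained sketch of the usual mitigation, periodic reset() - not code from this thread, and the batch size of 1000 is arbitrary:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;

    public class StreamResetSketch {
        static class Record implements Serializable {
            private static final long serialVersionUID = 1L;
            final int[] fields = new int[14]; // mirrors David's 14-int, 1-long test class
            final long id;
            Record(long id) { this.id = id; }
        }

        public static void main(String[] args) throws IOException {
            ByteArrayOutputStream sink = new ByteArrayOutputStream();
            ObjectOutputStream oos = new ObjectOutputStream(sink);
            for (long i = 0; i < 35000; i++) {
                oos.writeObject(new Record(i));
                if (i % 1000 == 999) {
                    // Without this, the handle table retains every Record ever
                    // written: memory grows and a long-lived stream slows down.
                    // reset() clears the table on both ends, giving up
                    // back-reference sharing across batches.
                    oos.reset();
                }
            }
            oos.close();
            System.out.println("bytes written: " + sink.size());
        }
    }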