Re: Multi-threaded map task
Never mind, it depends on the platform; in my case it would work fine. Thanks guys! Mark On Mon, Jan 14, 2013 at 12:23 PM, Mark Olimpiati markq2...@gmail.com wrote: Thanks Bertrand, I shall try it and hope to gain some speed. One last question though: do you think the threads used in MultithreadedMapper are user-level or kernel-level threads? Mark On Mon, Jan 14, 2013 at 12:06 AM, Bertrand Dechoux decho...@gmail.com wrote: Bertrand
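For reference, a minimal sketch (new API) of how MultithreadedMapper is typically wired up; MyMapper is a placeholder for your own mapper and must be thread-safe, since all threads share one task JVM. On a standard Linux JVM these are ordinary java.lang.Thread instances, i.e. native, kernel-scheduled threads.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

  Configuration conf = new Configuration();
  Job job = new Job(conf, "multithreaded-map");
  // Run MyMapper inside the multithreaded wrapper: one map task, several threads.
  job.setMapperClass(MultithreadedMapper.class);
  MultithreadedMapper.setMapperClass(job, MyMapper.class);
  MultithreadedMapper.setNumberOfThreads(job, 6);   // threads per map task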
Re: Multi-threaded map task
Thanks for the reply Nitin, but I don't see why distribution rules out multi-threaded maps. I see your point that each map task processes a different split, but my question is: if each map task had 2 threads, multiplexing or running in parallel when there are enough cores, to process the same split, wouldn't that be faster given enough cores? Mark On Sun, Jan 13, 2013 at 10:34 PM, Nitin Pawar nitinpawar...@gmail.com wrote: That's because it's a distributed processing framework over a network On Jan 14, 2013 11:27 AM, Mark Olimpiati markq2...@gmail.com wrote: Hi, this is a simple question, but why weren't map or reduce tasks programmed to be multi-threaded? I.e. instead of spawning 6 map tasks for 6 cores, run one map task with 6 parallel threads. In fact I tried this myself, but it turns out that threading does not help here the way it would in regular Java programs, for some reason .. any feedback on this topic? Thanks, Mark
Re: Maps split size
Well, when I said I found a solution, this link was one of them :). Even though I set dfs.block.size = mapred.min.split.size = mapred.max.split.size = 14MB, the job is still running maps with 64MB! I don't see what else I can change :( Thanks, Mark On Fri, Oct 26, 2012 at 2:23 PM, Bertrand Dechoux decho...@gmail.com wrote: Hi Mark, I think http://wiki.apache.org/hadoop/HowManyMapsAndReduces might interest you. If you require more information, feel free to ask after reading it. Regards Bertrand On Fri, Oct 26, 2012 at 10:47 PM, Mark Olimpiati markq2...@gmail.com wrote: Hi, I've found that the way to control the split size per mapper is to modify the following configurations: mapred.min.split.size and mapred.max.split.size, but when I set them both to 14MB with dfs.block.size = 64MB, the splits are still 64MB. So, is there a relation between them that I should consider? Thank you, Mark -- Bertrand Dechoux
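As a point of reference, a sketch of how this is usually pinned down with the new-API FileInputFormat (the exact formula differs between the old and new API and between versions, so treat it as an approximation; the job name and sizes are illustrative):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

  Job job = new Job(new Configuration(), "small-splits");
  // The new-API FileInputFormat computes roughly:
  //   splitSize = max(minSplitSize, min(maxSplitSize, blockSize))
  // so capping the max split below the block size is what shrinks the splits.
  long fourteenMB = 14L * 1024 * 1024;
  FileInputFormat.setMinInputSplitSize(job, 1L);
  FileInputFormat.setMaxInputSplitSize(job, fourteenMB);

Also note that dfs.block.size only affects files written after it is changed; files already in HDFS keep the block size they were created with, which may be why the maps still see 64MB.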
Re: Hadoop and Cuda , JCuda (CPU+GPU architecture)
Oleg, I, on the other hand, have a project that might benefit, but no implementation as yet. http://frd.org/ is very CPU intensive. So please share your notes. Mark On Mon, Sep 24, 2012 at 10:30 AM, Oleg Ruchovets oruchov...@gmail.com wrote: Hi I am going to process video analytics using hadoop. I am very interested in the CPU+GPU architecture, especially using CUDA ( http://www.nvidia.com/object/cuda_home_new.html) and JCUDA ( http://jcuda.org/) Does using HADOOP with a CPU+GPU architecture bring significant performance improvement, and has someone succeeded in implementing it at production quality? I didn't find any projects / examples using such technology. If someone could give me a link to best practices and an example using CUDA/JCUDA + hadoop that would be great. Thanks in advance Oleg.
Re: Metrics ..
Hi David, I enabled the jvm.class of the hadoop-metrics.properties, you're output seems to be from something else (dfs.class or mapred.class) which reports hadoop deamons performace. For example your output shows processName=TaskTracker which I'm not looking for. How can I report jvm statistics for individual jvms (maps/reducers) ?? Thank you, Mark On Wed, Aug 29, 2012 at 1:28 PM, Wong, David (DMITS) dav...@hp.com wrote: Here's a snippet of tasktracker metrics using Metrics2. (I think there were (more) gaps in the pre-metrics2 versions.) Note that you'll need to have hadoop-env.sh and hadoop-metrics2.properties setup on all the nodes you want reports from. 1345570905436 ugi.ugi: context=ugi, hostName=sqws31.caclab.cac.cpqcorp.net, loginSuccess_num_ops=0, loginSuccess_avg_time=0.0, loginFailure_num_ops=0, loginFailure_avg_time=0.0 1345570905436 jvm.metrics: context=jvm, processName=TaskTracker, sessionId=, hostName=sqws31.caclab.cac.cpqcorp.net, memNonHeapUsedM=11.540627, memNonHeapCommittedM=18.25, memHeapUsedM=12.972412, memHeapCommittedM=61.375, gcCount=1, gcTimeMillis=6, threadsNew=0, threadsRunnable=9, threadsBlocked=0, threadsWaiting=9, threadsTimedWaiting=1, threadsTerminated=0, logFatal=0, logError=0, logWarn=0, logInfo=1 1345570905436 mapred.tasktracker: context=mapred, sessionId=, hostName= sqws31.caclab.cac.cpqcorp.net, maps_running=0, reduces_running=0, mapTaskSlots=2, reduceTaskSlots=2, tasks_completed=0, tasks_failed_timeout=0, tasks_failed_ping=0 1345570905436 rpcdetailed.rpcdetailed: context=rpcdetailed, port=33997, hostName=sqws31.caclab.cac.cpqcorp.net 1345570905436 rpc.rpc: context=rpc, port=33997, hostName= sqws31.caclab.cac.cpqcorp.net, rpcAuthenticationSuccesses=0, rpcAuthenticationFailures=0, rpcAuthorizationSuccesses=0, rpcAuthorizationFailures=0, ReceivedBytes=0, SentBytes=0, RpcQueueTime_num_ops=0, RpcQueueTime_avg_time=0.0, RpcProcessingTime_num_ops=0, RpcProcessingTime_avg_time=0.0, NumOpenConnections=0, callQueueLen=0 1345570905436 metricssystem.MetricsSystem: context=metricssystem, hostName= sqws31.caclab.cac.cpqcorp.net, num_sources=5, num_sinks=1, sink.file.latency_num_ops=0, sink.file.latency_avg_time=0.0, sink.file.dropped=0, sink.file.qsize=0, snapshot_num_ops=5, snapshot_avg_time=0.2, snapshot_stdev_time=0.447213595499958, snapshot_imin_time=0.0, snapshot_imax_time=1.0, snapshot_min_time=0.0, snapshot_max_time=1.0, publish_num_ops=0, publish_avg_time=0.0, publish_stdev_time=0.0, publish_imin_time=3.4028234663852886E38, publish_imax_time=1.401298464324817E-45, publish_min_time=3.4028234663852886E38, publish_max_time=1.401298464324817E-45, dropped_pub_all=0 1345570915435 ugi.ugi: context=ugi, hostName=sqws31.caclab.cac.cpqcorp.net 1345570915435 jvm.metrics: context=jvm, processName=TaskTracker, sessionId=, hostName=sqws31.caclab.cac.cpqcorp.net, memNonHeapUsedM=11.549316, memNonHeapCommittedM=18.25, memHeapUsedM=13.136337, memHeapCommittedM=61.375, gcCount=1, gcTimeMillis=6, threadsNew=0, threadsRunnable=9, threadsBlocked=0, threadsWaiting=9, threadsTimedWaiting=1, threadsTerminated=0, logFatal=0, logError=0, logWarn=0, logInfo=1 1345570915435 mapred.tasktracker: context=mapred, sessionId=, hostName= sqws31.caclab.cac.cpqcorp.net, maps_running=0, reduces_running=0, mapTaskSlots=2, reduceTaskSlots=2 1345570915435 rpcdetailed.rpcdetailed: context=rpcdetailed, port=33997, hostName=sqws31.caclab.cac.cpqcorp.net 1345570915435 rpc.rpc: context=rpc, port=33997, hostName= sqws31.caclab.cac.cpqcorp.net 1345570915435 metricssystem.MetricsSystem: 
context=metricssystem, hostName= sqws31.caclab.cac.cpqcorp.net, num_sources=5, num_sinks=1, sink.file.latency_num_ops=1, sink.file.latency_avg_time=4.0, snapshot_num_ops=11, snapshot_avg_time=0.16669, snapshot_stdev_time=0.408248290463863, snapshot_imin_time=0.0, snapshot_imax_time=1.0, snapshot_min_time=0.0, snapshot_max_time=1.0, publish_num_ops=1, publish_avg_time=0.0, publish_stdev_time=0.0, publish_imin_time=0.0, publish_imax_time=1.401298464324817E-45, publish_min_time=0.0, publish_max_time=1.401298464324817E-45, dropped_pub_all=0 1345570925435 ugi.ugi: context=ugi, hostName=sqws31.caclab.cac.cpqcorp.net 1345570925435 jvm.metrics: context=jvm, processName=TaskTracker, sessionId=, hostName=sqws31.caclab.cac.cpqcorp.net, memNonHeapUsedM=13.002403, memNonHeapCommittedM=18.25, memHeapUsedM=11.503555, memHeapCommittedM=61.375, gcCount=2, gcTimeMillis=12, threadsNew=0, threadsRunnable=9, threadsBlocked=0, threadsWaiting=13, threadsTimedWaiting=7, threadsTerminated=0, logFatal=0, logError=0, logWarn=0, logInfo=3 1345570925435 mapred.tasktracker: context=mapred, sessionId=, hostName= sqws31.caclab.cac.cpqcorp.net, maps_running=0, reduces_running=0, mapTaskSlots=2, reduceTaskSlots=2 1345570925435 rpcdetailed.rpcdetailed: context=rpcdetailed, port=33997, hostName=sqws31.caclab.cac.cpqcorp.net 1345570925435 rpc.rpc: context=rpc, port=33997
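For reference, the metrics1 configuration Mark mentions is a few lines in conf/hadoop-metrics.properties; a sketch follows, with the period and file path purely illustrative. Task child JVMs also initialize JvmMetrics (with processName=MAP or REDUCE), so if they pick up the same properties file, their jvm-context records should land in this file on whichever node ran the task, which is one reason to point it at a per-host local path.

  # conf/hadoop-metrics.properties (metrics1) -- jvm context written to a local file
  jvm.class=org.apache.hadoop.metrics.file.FileContext
  jvm.period=10
  jvm.fileName=/tmp/jvm_metrics.log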
Past meeting: July Houston Hadoop Meetup - Genomic data analysis with hadoop
Hi, all, that's what it was about: July Houston Hadoop Meetup - Genomic data analysis with Hadoop, http://shmsoft.blogspot.com/2012/07/july-houston-hadoop-meetup-genomic-data.html Dianhui (Dennis) Zhu presented Genomic data analysis with Hadoop. He talked about using the Hadoop framework to do pattern search in genomic sequence datasets. This is based on his three-year project at Baylor, which started using Hadoop a year ago. Dennis is Senior Scientific Programmer at HGSC. Dianhui told us about the following: 1. Setup of a Hadoop test cluster with 4 nodes. 2. Code walkthrough and unit testing with Mockito and MRUnit. 3. Live demo: running the Hadoop application on the 4-node cluster. The interesting technical problem that Dennis showed was breaking the sequence into chunks before it gets to the Mapper - which is usually trivial in regular applications, but is quite hard with the unlimited unstructured data of the genome. The audience analyzed the actual code, asked many questions, and wanted to compare it to the existing open source projects. Indeed, there is an article on the Cloudera blog, http://www.cloudera.com/blog/2009/10/analyzing-human-genomes-with-hadoop/, and it refers to the Crossbow open source project, http://bowtie-bio.sourceforge.net/crossbow/index.shtml. It will be interesting to see how that compares to Dennis's work.
Do I have to sort?
Hi, it may be a stupid question, but in my application I could do without sorting by keys. If only reducers could be told to start their work on the first map outputs that they see, my processing would begin to show results much earlier, before all the mappers are done. Now, eventually, all mappers will have to finish, so I am not gaining on the total task duration, but only on first results appearing faster. Then, of course, I could obtain some intermediate statistics with counters or with some additional NoSQL database. I am also concerned about the millions of map outputs that my mappers are emitting - is that OK? Am I putting too much of a burden on the shuffle stage? Thank you, Mark
Re: Do I have to sort?
John, that sounds very interesting, and I may implement such a workflow, but can I write back to HDFS in the mapper? In the reducer it is a standard context.write(), but it is a different context. Thank you, Mark On Mon, Jun 18, 2012 at 9:24 AM, John Armstrong j...@ccri.com wrote: On 06/18/2012 10:19 AM, Mark Kerzner wrote: If only reducers could be told to start their work on the first maps that they see, my processing would begin to show results much earlier, before all the mappers are done. The sort/shuffle phase isn't just about ordering the keys, it's about collecting all the results of the map phase that share a key together for the reducers to work on. If your reducer can operate on mapper outputs independently of each other, then it sounds like it's really another mapper and should be either factored into the mapper or rewritten as a mapper on its own and both mappers thrown into the ChainMapper (if you're using the older API).
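A rough sketch of the arrangement John describes, using the old (mapred) API's ChainMapper; ParseMapper and PostProcessMapper are made-up names for the two map stages, and the key/value types are illustrative:

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.JobConf;
  import org.apache.hadoop.mapred.lib.ChainMapper;

  JobConf conf = new JobConf(MyDriver.class);
  conf.setJobName("chained-mappers");
  conf.setNumReduceTasks(0);   // no reduce phase at all
  // First stage: the original mapper.
  ChainMapper.addMapper(conf, ParseMapper.class,
      LongWritable.class, Text.class, Text.class, Text.class, true, new JobConf(false));
  // Second stage: the per-record "reduce-like" step, run inside the same map task.
  ChainMapper.addMapper(conf, PostProcessMapper.class,
      Text.class, Text.class, Text.class, Text.class, true, new JobConf(false));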
Re: Do I have to sort?
Thank you for the great instructions! Mark On Mon, Jun 18, 2012 at 9:53 AM, John Armstrong j...@ccri.com wrote: On 06/18/2012 10:40 AM, Mark Kerzner wrote: that sounds very interesting, and I may implement such a workflow, but can I write back to HDFS in the mapper? In the reducer it is a standard context.write(), but it is a different context. Both Mapper.Context and Reducer.Context descend from TaskInputOutputContext, which is where the write() method is defined, so they're both outputting their data in the same way. If you don't have a Reducer -- only Mappers and fully parallel data processing -- then when you configure your job you set the number of reducers to zero. Then the mapper context knows that mapper output is the last step, so it uses the specified OutputFormat to write out the data, just like your reducer context currently does with reducer output.
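A minimal sketch of the map-only setup John describes (new API); MyMapper and the paths are placeholders:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
  import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

  Job job = new Job(new Configuration(), "map-only");
  job.setMapperClass(MyMapper.class);
  job.setNumReduceTasks(0);               // no reducers: mapper output is final
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(Text.class);
  job.setOutputFormatClass(TextOutputFormat.class);
  FileInputFormat.addInputPath(job, new Path("/user/mark/in"));
  FileOutputFormat.setOutputPath(job, new Path("/user/mark/out"));
  // With zero reducers, context.write(key, value) in the mapper goes straight
  // to HDFS through the configured OutputFormat.
  job.waitForCompletion(true);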
Re: only ouput values, no keys, no reduce
You can use Hadoop NullWritable http://hadoop.apache.org/common/docs/r1.0.1/api/org/apache/hadoop/io/NullWritable.Comparator.html Mark On Mon, Jun 11, 2012 at 8:10 AM, huanchen.zhang huanchen.zh...@ipinyou.com wrote: Hi, I am developing a map-reduce program which has no reduce. I just want the maps to output all the values which meet some requirements (no keys output). What should I do in this case? I tried context.write(Text, Text), but it outputs both keys and values. Thank you! Best, Huanchen 2012-06-11 huanchen.zhang
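A sketch of what that looks like (new API), assuming TextOutputFormat, which writes only the value when the key is a NullWritable; the filter condition here is a stand-in for the real requirement:

  import java.io.IOException;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  public class ValuesOnlyMapper extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      if (line.getLength() > 0) {                  // stand-in for the real filter
        context.write(NullWritable.get(), line);   // no key in the output file
      }
    }
  }

In the driver, set job.setNumReduceTasks(0), job.setOutputKeyClass(NullWritable.class) and job.setOutputValueClass(Text.class).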
Re: different input/output formats
Thanks for the reply but I already tried this option, and is the error: java.io.IOException: wrong key class: org.apache.hadoop.io.LongWritable is not class org.apache.hadoop.io.FloatWritable at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:998) at org.apache.hadoop.mapred.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:75) at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:705) at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:508) at filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:60) at filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:1) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:397) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330) at org.apache.hadoop.mapred.Child$4.run(Child.java:217) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.Use Mark On Tue, May 29, 2012 at 1:05 PM, samir das mohapatra samir.help...@gmail.com wrote: Hi Mark public void map(LongWritable offset, Text val,OutputCollector FloatWritable,Text output, Reporter reporter) throws IOException { output.collect(new FloatWritable(*1*), val); *//chanage 1 to 1.0f then it will work.* } let me know the status after the change On Wed, May 30, 2012 at 1:27 AM, Mark question markq2...@gmail.com wrote: Hi guys, this is a very simple program, trying to use TextInputFormat and SequenceFileoutputFormat. Should be easy but I get the same error. Here is my configurations: conf.setMapperClass(myMapper.class); conf.setMapOutputKeyClass(FloatWritable.class); conf.setMapOutputValueClass(Text.class); conf.setNumReduceTasks(0); conf.setOutputKeyClass(FloatWritable.class); conf.setOutputValueClass(Text.class); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(SequenceFileOutputFormat.class); TextInputFormat.addInputPath(conf, new Path(args[0])); SequenceFileOutputFormat.setOutputPath(conf, new Path(args[1])); myMapper class is: public class myMapper extends MapReduceBase implements MapperLongWritable,Text,FloatWritable,Text { public void map(LongWritable offset, Text val,OutputCollectorFloatWritable,Text output, Reporter reporter) throws IOException { output.collect(new FloatWritable(1), val); } } But I get the following error: 12/05/29 12:54:31 INFO mapreduce.Job: Task Id : attempt_201205260045_0032_m_00_0, Status : FAILED java.io.IOException: wrong key class: org.apache.hadoop.io.LongWritable is not class org.apache.hadoop.io.FloatWritable at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:998) at org.apache.hadoop.mapred.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:75) at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:705) at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:508) at filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:59) at filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:1) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:397) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330) at org.apache.hadoop.mapred.Child$4.run(Child.java:217) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at 
org.apache.hadoop.security.Use Where is the writing of LongWritable coming from ?? Thank you, Mark
Re: different input/output formats
Hi Samir, can you email me your main class.. or if you can check mine, it is as follows: public class SortByNorm1 extends Configured implements Tool { @Override public int run(String[] args) throws Exception { if (args.length != 2) { System.err.printf(Usage:bin/hadoop jar norm1.jar inputDir outputDir\n); ToolRunner.printGenericCommandUsage(System.err); return -1; } JobConf conf = new JobConf(new Configuration(),SortByNorm1.class); conf.setJobName(SortDocByNorm1); conf.setMapperClass(Norm1Mapper.class); conf.setMapOutputKeyClass(FloatWritable.class); conf.setMapOutputValueClass(Text.class); conf.setNumReduceTasks(0); conf.setReducerClass(Norm1Reducer.class); conf.setOutputKeyClass(FloatWritable.class); conf.setOutputValueClass(Text.class); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(SequenceFileOutputFormat.class); TextInputFormat.addInputPath(conf, new Path(args[0])); SequenceFileOutputFormat.setOutputPath(conf, new Path(args[1])); JobClient.runJob(conf); return 0; } public static void main(String[] args) throws Exception { int exitCode = ToolRunner.run(new SortByNorm1(), args); System.exit(exitCode); } On Tue, May 29, 2012 at 1:55 PM, samir das mohapatra samir.help...@gmail.com wrote: Hi Mark See the out put for that same Application . I am not getting any error. On Wed, May 30, 2012 at 1:27 AM, Mark question markq2...@gmail.comwrote: Hi guys, this is a very simple program, trying to use TextInputFormat and SequenceFileoutputFormat. Should be easy but I get the same error. Here is my configurations: conf.setMapperClass(myMapper.class); conf.setMapOutputKeyClass(FloatWritable.class); conf.setMapOutputValueClass(Text.class); conf.setNumReduceTasks(0); conf.setOutputKeyClass(FloatWritable.class); conf.setOutputValueClass(Text.class); conf.setInputFormat(TextInputFormat.class); conf.setOutputFormat(SequenceFileOutputFormat.class); TextInputFormat.addInputPath(conf, new Path(args[0])); SequenceFileOutputFormat.setOutputPath(conf, new Path(args[1])); myMapper class is: public class myMapper extends MapReduceBase implements MapperLongWritable,Text,FloatWritable,Text { public void map(LongWritable offset, Text val,OutputCollectorFloatWritable,Text output, Reporter reporter) throws IOException { output.collect(new FloatWritable(1), val); } } But I get the following error: 12/05/29 12:54:31 INFO mapreduce.Job: Task Id : attempt_201205260045_0032_m_00_0, Status : FAILED java.io.IOException: wrong key class: org.apache.hadoop.io.LongWritable is not class org.apache.hadoop.io.FloatWritable at org.apache.hadoop.io.SequenceFile$Writer.append(SequenceFile.java:998) at org.apache.hadoop.mapred.SequenceFileOutputFormat$1.write(SequenceFileOutputFormat.java:75) at org.apache.hadoop.mapred.MapTask$DirectMapOutputCollector.collect(MapTask.java:705) at org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:508) at filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:59) at filter.stat.cosine.preprocess.SortByNorm1$Norm1Mapper.map(SortByNorm1.java:1) at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54) at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:397) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:330) at org.apache.hadoop.mapred.Child$4.run(Child.java:217) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.Use Where is the writing of LongWritable coming from ?? Thank you, Mark
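For readability: the archive has stripped the angle brackets from the generics in the code quoted above. The mapper as posted would have read roughly as follows (a reconstruction of the posted code, not a fix for the LongWritable-vs-FloatWritable error itself):

  import java.io.IOException;
  import org.apache.hadoop.io.FloatWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapred.MapReduceBase;
  import org.apache.hadoop.mapred.Mapper;
  import org.apache.hadoop.mapred.OutputCollector;
  import org.apache.hadoop.mapred.Reporter;

  public class myMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, FloatWritable, Text> {
    public void map(LongWritable offset, Text val,
                    OutputCollector<FloatWritable, Text> output, Reporter reporter)
        throws IOException {
      output.collect(new FloatWritable(1), val);
    }
  }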
Memory exception in the mapper
Hi, all, I got the exception below in the mapper. I already have my global Hadoop heap at 5 GB, but is there a specific other setting? Or maybe I should troubleshoot for memory? But the same application works in the IDE. Thank you! Mark *stderr logs* Exception in thread Thread for syncLogs java.lang.OutOfMemoryError: Java heap space at java.io.BufferedOutputStream.init(BufferedOutputStream.java:76) at java.io.BufferedOutputStream.init(BufferedOutputStream.java:59) at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:292) at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:365) at org.apache.hadoop.mapred.Child$3.run(Child.java:157) Exception in thread communication thread java.lang.OutOfMemoryError: Java heap space Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread communication thread
Re: Memory exception in the mapper
Joey, my errors closely resembles this onehttp://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201006.mbox/%3caanlktikr3df4ce-tgiphv9_-evfoed_5-t684nf4y...@mail.gmail.com%3Ein the archives. I can now be much more specific with the errors message, and it is quoted below. I tried -Xmx3096. But I got the same error. Thank you, Mark syslog logs 2012-05-23 20:04:52,349 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 2012-05-23 20:04:52,519 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId= 2012-05-23 20:04:52,695 INFO org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0 2012-05-23 20:04:52,699 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@d56b37 2012-05-23 20:04:52,813 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100 2012-05-23 20:04:52,998 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720 2012-05-23 20:04:52,998 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680 2012-05-23 20:04:53,010 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library not loaded 2012-05-23 20:12:29,120 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true 2012-05-23 20:12:29,134 INFO org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 79542629; bufvoid = 99614720 2012-05-23 20:12:29,134 INFO org.apache.hadoop.mapred.MapTask: kvstart = 0; kvend = 228; length = 327680 2012-05-23 20:12:31,248 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0 2012-05-23 20:13:01,862 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true 2012-05-23 20:13:01,862 INFO org.apache.hadoop.mapred.MapTask: bufstart = 79542629; bufend = 53863940; bufvoid = 99614720 2012-05-23 20:13:01,862 INFO org.apache.hadoop.mapred.MapTask: kvstart = 228; kvend = 431; length = 327680 2012-05-23 20:13:03,294 INFO org.apache.hadoop.mapred.MapTask: Finished spill 1 2012-05-23 20:13:48,121 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true 2012-05-23 20:13:48,122 INFO org.apache.hadoop.mapred.MapTask: bufstart = 53863940; bufend = 31696780; bufvoid = 99614720 2012-05-23 20:13:48,122 INFO org.apache.hadoop.mapred.MapTask: kvstart = 431; kvend = 861; length = 327680 2012-05-23 20:13:49,818 INFO org.apache.hadoop.mapred.MapTask: Finished spill 2 2012-05-23 20:15:25,618 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true 2012-05-23 20:15:25,618 INFO org.apache.hadoop.mapred.MapTask: bufstart = 31696780; bufend = 10267329; bufvoid = 99614720 2012-05-23 20:15:25,618 INFO org.apache.hadoop.mapred.MapTask: kvstart = 861; kvend = 1462; length = 327680 2012-05-23 20:15:27,068 INFO org.apache.hadoop.mapred.MapTask: Finished spill 3 2012-05-23 20:15:53,519 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true 2012-05-23 20:15:53,519 INFO org.apache.hadoop.mapred.MapTask: bufstart = 10267329; bufend = 85241086; bufvoid = 99614720 2012-05-23 20:15:53,519 INFO org.apache.hadoop.mapred.MapTask: kvstart = 1462; kvend = 1642; length = 327680 2012-05-23 20:15:54,760 INFO org.apache.hadoop.mapred.MapTask: Finished spill 4 2012-05-23 20:16:26,284 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true 2012-05-23 20:16:26,284 INFO org.apache.hadoop.mapred.MapTask: bufstart = 85241086; bufend = 51305930; bufvoid = 99614720 
2012-05-23 20:16:26,284 INFO org.apache.hadoop.mapred.MapTask: kvstart = 1642; kvend = 1946; length = 327680 2012-05-23 20:16:27,566 INFO org.apache.hadoop.mapred.MapTask: Finished spill 5 2012-05-23 20:16:57,046 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true 2012-05-23 20:16:57,046 INFO org.apache.hadoop.mapred.MapTask: bufstart = 51305930; bufend = 31353466; bufvoid = 99614720 2012-05-23 20:16:57,046 INFO org.apache.hadoop.mapred.MapTask: kvstart = 1946; kvend = 2263; length = 327680 2012-05-23 20:16:58,076 INFO org.apache.hadoop.mapred.MapTask: Finished spill 6 2012-05-23 20:17:52,820 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true 2012-05-23 20:17:52,820 INFO org.apache.hadoop.mapred.MapTask: bufstart = 31353466; bufend = 10945750; bufvoid = 99614720 2012-05-23 20:17:52,820 INFO org.apache.hadoop.mapred.MapTask: kvstart = 2263; kvend = 2755; length = 327680 2012-05-23 20:17:53,939 INFO org.apache.hadoop.mapred.MapTask: Finished spill 7 2012-05-23 20:18:19,528 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true 2012-05-23 20:18:19,528 INFO org.apache.hadoop.mapred.MapTask: bufstart = 10945750; bufend = 81838103; bufvoid = 99614720 2012-05-23 20:18:19,528 INFO org.apache.hadoop.mapred.MapTask: kvstart = 2755; kvend = 2967; length = 327680 2012-05-23 20:18:21,145 INFO org.apache.hadoop.mapred.MapTask: Finished spill 8 2012-05-23
Re: Memory exception in the mapper
Arun, I am running the latest CDH3, which I re-installed yesterday, so I believe it is Hadoop 0.21. I have about 6000 maps emitted, and 16 spills, and then I see Mapper cleanup() being called, after which I get this error 2012-05-23 20:22:58,108 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:355) Thank you, Mark On Wed, May 23, 2012 at 9:29 PM, Arun C Murthy a...@hortonworks.com wrote: What version of hadoop are you running? On May 23, 2012, at 12:16 PM, Mark Kerzner wrote: Hi, all, I got the exception below in the mapper. I already have my global Hadoop heap at 5 GB, but is there a specific other setting? Or maybe I should troubleshoot for memory? But the same application works in the IDE. Thank you! Mark *stderr logs* Exception in thread Thread for syncLogs java.lang.OutOfMemoryError: Java heap space at java.io.BufferedOutputStream.init(BufferedOutputStream.java:76) at java.io.BufferedOutputStream.init(BufferedOutputStream.java:59) at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:292) at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:365) at org.apache.hadoop.mapred.Child$3.run(Child.java:157) Exception in thread communication thread java.lang.OutOfMemoryError: Java heap space Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread communication thread -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/
Re: Memory exception in the mapper
Arun, Actually CDH3 is Hadoop 0.20, but with .21 backported, so I am using 0.21 API whenever I can. Mark On Wed, May 23, 2012 at 9:40 PM, Mark Kerzner mark.kerz...@shmsoft.comwrote: Arun, I am running the latest CDH3, which I re-installed yesterday, so I believe it is Hadoop 0.21. I have about 6000 maps emitted, and 16 spills, and then I see Mapper cleanup() being called, after which I get this error 2012-05-23 20:22:58,108 FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap space at org.apache.hadoop.mapred.IFile$Reader.readNextBlock(IFile.java:355) Thank you, Mark On Wed, May 23, 2012 at 9:29 PM, Arun C Murthy a...@hortonworks.comwrote: What version of hadoop are you running? On May 23, 2012, at 12:16 PM, Mark Kerzner wrote: Hi, all, I got the exception below in the mapper. I already have my global Hadoop heap at 5 GB, but is there a specific other setting? Or maybe I should troubleshoot for memory? But the same application works in the IDE. Thank you! Mark *stderr logs* Exception in thread Thread for syncLogs java.lang.OutOfMemoryError: Java heap space at java.io.BufferedOutputStream.init(BufferedOutputStream.java:76) at java.io.BufferedOutputStream.init(BufferedOutputStream.java:59) at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:292) at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:365) at org.apache.hadoop.mapred.Child$3.run(Child.java:157) Exception in thread communication thread java.lang.OutOfMemoryError: Java heap space Exception: java.lang.OutOfMemoryError thrown from the UncaughtExceptionHandler in thread communication thread -- Arun C. Murthy Hortonworks Inc. http://hortonworks.com/
Re: Memory exception in the mapper
Thanks, Joey, we are in beta, and I kinda need these for debugging. But as soon as we go to production, your word is well taken. (I hope we will replace the current primitive logging with good one (log4j is I think preferred with Hadoop), and then we can change the log level. Mark On Wed, May 23, 2012 at 10:39 PM, Joey Krabacher jkrabac...@gmail.comwrote: No problem, glad I could help. In our test environment I have lots of output and logging turned on, but as soon as it is on production all output and logging is reduced to the bare minimum. Basically, in production we only log caught exceptions. I would take it out unless you absolutely need it. IMHO. If your jobs are not mission critical and do not need to run as smooth as possible then it's not as important to remove those. /* Joey */ On Wed, May 23, 2012 at 10:21 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Joey, that did the trick! Actually, I am writing to the log with System.out.println() statements, and I write about 12,000 lines, would that be a problem? I don't really need this output, so if you think it's inadvisable, I will remove that. Also, I hope that if I have not 6,000 maps but 12,000 or even 30,000, it will still work. Well, I will see pretty soon, I guess, with more data. Again, thank you. Sincerely, Mark On Wed, May 23, 2012 at 9:43 PM, Joey Krabacher jkrabac...@gmail.com wrote: Mark, Have you tried tweaking the mapred.child.java.opts property in your mapred-site.xml? property namemapred.child.java.opts/name value-Xmx2048m/value /property This might help. It looks like the fatal error came right after the log truncater fired off. Are you outputting anything to the logs manually, or have you looked at the user logs to see if there is anything taking up lots of room? / * Joey */ On Wed, May 23, 2012 at 9:35 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Joey, my errors closely resembles this one http://mail-archives.apache.org/mod_mbox/hadoop-mapreduce-user/201006.mbox/%3caanlktikr3df4ce-tgiphv9_-evfoed_5-t684nf4y...@mail.gmail.com%3E in the archives. I can now be much more specific with the errors message, and it is quoted below. I tried -Xmx3096. But I got the same error. Thank you, Mark syslog logs 2012-05-23 20:04:52,349 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 2012-05-23 20:04:52,519 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM Metrics with processName=MAP, sessionId= 2012-05-23 20:04:52,695 INFO org.apache.hadoop.util.ProcessTree: setsid exited with exit code 0 2012-05-23 20:04:52,699 INFO org.apache.hadoop.mapred.Task: Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@d56b37 2012-05-23 20:04:52,813 INFO org.apache.hadoop.mapred.MapTask: io.sort.mb = 100 2012-05-23 20:04:52,998 INFO org.apache.hadoop.mapred.MapTask: data buffer = 79691776/99614720 2012-05-23 20:04:52,998 INFO org.apache.hadoop.mapred.MapTask: record buffer = 262144/327680 2012-05-23 20:04:53,010 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy: Snappy native library not loaded 2012-05-23 20:12:29,120 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true 2012-05-23 20:12:29,134 INFO org.apache.hadoop.mapred.MapTask: bufstart = 0; bufend = 79542629; bufvoid = 99614720 2012-05-23 20:12:29,134 INFO org.apache.hadoop.mapred.MapTask: kvstart = 0; kvend = 228; length = 327680 2012-05-23 20:12:31,248 INFO org.apache.hadoop.mapred.MapTask: Finished spill 0 2012-05-23 20:13:01,862 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true 2012-05-23 20:13:01,862 INFO org.apache.hadoop.mapred.MapTask: bufstart = 79542629; bufend = 53863940; bufvoid = 99614720 2012-05-23 20:13:01,862 INFO org.apache.hadoop.mapred.MapTask: kvstart = 228; kvend = 431; length = 327680 2012-05-23 20:13:03,294 INFO org.apache.hadoop.mapred.MapTask: Finished spill 1 2012-05-23 20:13:48,121 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true 2012-05-23 20:13:48,122 INFO org.apache.hadoop.mapred.MapTask: bufstart = 53863940; bufend = 31696780; bufvoid = 99614720 2012-05-23 20:13:48,122 INFO org.apache.hadoop.mapred.MapTask: kvstart = 431; kvend = 861; length = 327680 2012-05-23 20:13:49,818 INFO org.apache.hadoop.mapred.MapTask: Finished spill 2 2012-05-23 20:15:25,618 INFO org.apache.hadoop.mapred.MapTask: Spilling map output: buffer full= true 2012-05-23 20:15:25,618 INFO org.apache.hadoop.mapred.MapTask: bufstart = 31696780
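The property Joey suggests, with the XML tags the archive dropped restored (the heap value is his example, not a recommendation):

  <!-- mapred-site.xml: heap for each spawned map/reduce child JVM -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx2048m</value>
  </property>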
Where does Hadoop store its maps?
Hi, I am using a Hadoop cluster of my own construction on EC2, and I am running out of hard drive space with maps. If I knew which directories are used by Hadoop for map spill, I could use the large ephemeral drive on EC2 machines for that. Otherwise, I would have to keep increasing my available hard drive on root, and that's not very smart. Thank you. The error I get is below. Sincerely, Mark org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/file.out at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:376) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127) at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1495) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1180) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:582) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:649) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) at org.apache.hadoop.mapred.Child$4.run(Child.java:270) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs java.io.IOException: Spill failed at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:886) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:574) at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80) at org.frd.main.ZipFileProcessor.emitAsMap(ZipFileProcessor.java:279) at org.frd.main.ZipFileProcessor.processWithTrueZip(ZipFileProcessor.java:107) at org.frd.main.ZipFileProcessor.process(ZipFileProcessor.java:55) at org.frd.main.Map.map(Map.java:70) at org.frd.main.Map.map(Map.java:24) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) at org.apache.hadoop.mapred.Child$4.run(Child.java:270) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(User java.io.IOException: Spill failed at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:886) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:574) at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80) at org.frd.main.ZipFileProcessor.emitAsMap(ZipFileProcessor.java:279) at org.frd.main.ZipFileProcessor.processWithTrueZip(ZipFileProcessor.java:107) at org.frd.main.ZipFileProcessor.process(ZipFileProcessor.java:55) at org.frd.main.Map.map(Map.java:70) at org.frd.main.Map.map(Map.java:24) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) at org.apache.hadoop.mapred.Child$4.run(Child.java:270) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(User org.apache.hadoop.io.SecureIOUtils$AlreadyExistsException: EEXIST: File exists at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:178) at 
org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:292) at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:365) at org.apache.hadoop.mapred.Child$4.run(Child.java:272) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157) at org.apache.hadoop.mapred.Child.main(Child.java:264) Caused by: EEXIST: File exists at org.apache.hadoop.io.nativeio.NativeIO.open(Native Method) at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:172) ... 7 more
Re: Where does Hadoop store its maps?
Thank you, Harsh and Madhu, that is exactly what I was looking for. Mark On Tue, May 22, 2012 at 8:36 AM, madhu phatak phatak@gmail.com wrote: Hi, Set mapred.local.dir in mapred-site.xml to point a directory on /mnt so that it will not use ec2 instance EBS. On Tue, May 22, 2012 at 6:58 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Hi, I am using a Hadoop cluster of my own construction on EC2, and I am running out of hard drive space with maps. If I knew which directories are used by Hadoop for map spill, I could use the large ephemeral drive on EC2 machines for that. Otherwise, I would have to keep increasing my available hard drive on root, and that's not very smart. Thank you. The error I get is below. Sincerely, Mark org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for output/file.out at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:376) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:146) at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:127) at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:69) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1495) at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1180) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.close(MapTask.java:582) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:649) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) at org.apache.hadoop.mapred.Child$4.run(Child.java:270) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs java.io.IOException: Spill failed at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:886) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:574) at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80) at org.frd.main.ZipFileProcessor.emitAsMap(ZipFileProcessor.java:279) at org.frd.main.ZipFileProcessor.processWithTrueZip(ZipFileProcessor.java:107) at org.frd.main.ZipFileProcessor.process(ZipFileProcessor.java:55) at org.frd.main.Map.map(Map.java:70) at org.frd.main.Map.map(Map.java:24) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) at org.apache.hadoop.mapred.Child$4.run(Child.java:270) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(User java.io.IOException: Spill failed at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:886) at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:574) at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80) at org.frd.main.ZipFileProcessor.emitAsMap(ZipFileProcessor.java:279) at org.frd.main.ZipFileProcessor.processWithTrueZip(ZipFileProcessor.java:107) at org.frd.main.ZipFileProcessor.process(ZipFileProcessor.java:55) at org.frd.main.Map.map(Map.java:70) at org.frd.main.Map.map(Map.java:24) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:647) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:323) at org.apache.hadoop.mapred.Child$4.run(Child.java:270) at 
java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(User org.apache.hadoop.io.SecureIOUtils$AlreadyExistsException: EEXIST: File exists at org.apache.hadoop.io.SecureIOUtils.createForWrite(SecureIOUtils.java:178) at org.apache.hadoop.mapred.TaskLog.writeToIndexFile(TaskLog.java:292) at org.apache.hadoop.mapred.TaskLog.syncLogs(TaskLog.java:365) at org.apache.hadoop.mapred.Child$4.run(Child.java:272) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1157) at org.apache.hadoop.mapred.Child.main(Child.java:264) Caused by: EEXIST: File exists
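Madhu's suggestion spelled out as a mapred-site.xml entry; the /mnt path is the EC2 ephemeral mount and is illustrative (a comma-separated list of directories also works):

  <!-- mapred-site.xml: where map spill and other intermediate task files go -->
  <property>
    <name>mapred.local.dir</name>
    <value>/mnt/hadoop/mapred/local</value>
  </property>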
Re: How to add debugging to map-red code
I'm interested in this too, but could you tell me where to apply the patch, and is the following the right command to apply it: patch MAPREDUCE-336_0_20090818.patch (https://issues.apache.org/jira/secure/attachment/12416955/MAPREDUCE-336_0_20090818.patch) Thank you, Mark On Fri, Apr 20, 2012 at 8:28 AM, Harsh J ha...@cloudera.com wrote: Yes this is possible, and there are two ways to do this. 1. Use a distro/release that carries the https://issues.apache.org/jira/browse/MAPREDUCE-336 fix. This will let you avoid work (see 2, which is the same as your idea). 2. Configure your implementation's logger object's level in the setup/setConf methods of the task, by looking at some conf prop to decide the level. This will work just as well - and will also avoid changing Hadoop's own Child log levels, unlike method (1). On Fri, Apr 20, 2012 at 8:47 PM, Mapred Learn mapred.le...@gmail.com wrote: Hi, I'm trying to find out the best way to add debugging to map-red code. I have System.out.println() statements that I keep commenting and uncommenting so as not to increase the stdout size. But the problem is that any time I need to debug, I have to re-compile. Is there a way I can define log levels using log4j in map-red code and define the log level as a conf option? Thanks, JJ Sent from my iPhone -- Harsh J
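A sketch of Harsh's option 2: read a job property in setup() and adjust the mapper's own log4j logger. The property name my.map.log.level and the mapper class are made up for illustration:

  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.log4j.Level;
  import org.apache.log4j.Logger;

  public class DebuggableMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final Logger LOG = Logger.getLogger(DebuggableMapper.class);

    @Override
    protected void setup(Context context) {
      // e.g. pass -Dmy.map.log.level=DEBUG when submitting the job
      String level = context.getConfiguration().get("my.map.log.level", "INFO");
      LOG.setLevel(Level.toLevel(level));
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) {
      LOG.debug("processing record at offset " + key);   // emitted only at DEBUG
    }
  }

This avoids recompiling: the extra output is switched on per job run, and Hadoop's own Child log level is left alone.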
Has anyone installed HCE and built it successfully?
Hey guys, I've been stuck with the HCE installation for two days now and can't figure out the problem. The error I get from running (sh build.sh) is "cannot execute binary file". I tried setting my JAVA_HOME and ANT_HOME manually and using the build.sh script, but no luck. So please, if you've used HCE, could you share your knowledge with me. Thank you, Mark
Re: Hadoop streaming or pipes ..
Thanks all, and Charles you guided me to Baidu slides titled: Introduction to *Hadoop C++ Extension*http://hic2010.hadooper.cn/dct/attach/Y2xiOmNsYjpwZGY6ODI5 which is their experience and the sixth-slide shows exactly what I was looking for. It is still hard to manage memory with pipes besides the no performance gains, hence the advancement of HCE. Thanks, Mark On Thu, Apr 5, 2012 at 2:23 PM, Charles Earl charles.ce...@gmail.comwrote: Also bear in mind that there is a kind of detour involved, in the sense that a pipes map must send key,value data back to the Java process and then to reduce (more or less). I think that the Hadoop C Extension (HCE, there is a patch) is supposed to be faster. Would be interested to know if the community has any experience with HCE performance. C On Apr 5, 2012, at 3:49 PM, Robert Evans ev...@yahoo-inc.com wrote: Both streaming and pipes do very similar things. They will fork/exec a separate process that is running whatever you want it to run. The JVM that is running hadoop then communicates with this process to send the data over and get the processing results back. The difference between streaming and pipes is that streaming uses stdin/stdout for this communication so preexisting processing like grep, sed and awk can be used here. Pipes uses a custom protocol with a C++ library to communicate. The C++ library is tagged with SWIG compatible data so that it can be wrapped to have APIs in other languages like python or perl. I am not sure what the performance difference is between the two, but in my own work I have seen a significant performance penalty from using either of them, because there is a somewhat large overhead of sending all of the data out to a separate process just to read it back in again. --Bobby Evans On 4/5/12 1:54 PM, Mark question markq2...@gmail.com wrote: Hi guys, quick question: Are there any performance gains from hadoop streaming or pipes over Java? From what I've read, it's only to ease testing by using your favorite language. So I guess it is eventually translated to bytecode then executed. Is that true? Thank you, Mark
Hadoop pipes and streaming ..
Hi guys, Two quick questions: 1. Are there any performance gains from hadoop streaming or pipes? As far as I read, they exist to ease testing using your favorite language, which I think implies that everything is eventually translated to bytecode and executed.
Hadoop streaming or pipes ..
Hi guys, quick question: Are there any performance gains from hadoop streaming or pipes over Java? From what I've read, it's only to ease testing by using your favorite language. So I guess it is eventually translated to bytecode then executed. Is that true? Thank you, Mark
Re: Hadoop streaming or pipes ..
Thanks for the response Robert .. so the overhead will be in read/write and communication. But is the new process spawned a JVM or a regular process? Thanks, Mark On Thu, Apr 5, 2012 at 12:49 PM, Robert Evans ev...@yahoo-inc.com wrote: Both streaming and pipes do very similar things. They will fork/exec a separate process that is running whatever you want it to run. The JVM that is running hadoop then communicates with this process to send the data over and get the processing results back. The difference between streaming and pipes is that streaming uses stdin/stdout for this communication so preexisting processing like grep, sed and awk can be used here. Pipes uses a custom protocol with a C++ library to communicate. The C++ library is tagged with SWIG compatible data so that it can be wrapped to have APIs in other languages like python or perl. I am not sure what the performance difference is between the two, but in my own work I have seen a significant performance penalty from using either of them, because there is a somewhat large overhead of sending all of the data out to a separate process just to read it back in again. --Bobby Evans On 4/5/12 1:54 PM, Mark question markq2...@gmail.com wrote: Hi guys, quick question: Are there any performance gains from hadoop streaming or pipes over Java? From what I've read, it's only to ease testing by using your favorite language. So I guess it is eventually translated to bytecode then executed. Is that true? Thank you, Mark
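For completeness, a typical streaming invocation looks like the sketch below (the jar path varies by version and distribution; the input/output paths are placeholders). The forked mapper and reducer are ordinary Unix processes - whatever executables you name - not JVMs; with pipes it is likewise your compiled C++ binary that gets forked.

  hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -input  /user/mark/in \
    -output /user/mark/out \
    -mapper  /bin/cat \
    -reducer /usr/bin/wc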
Re: Yahoo Hadoop Tutorial with new APIs?
Hi, any interest in joining with this effort of mine? http://hadoopilluminated.com/ - I am also doing only for community benefit. I have more chapters that I am putting out. But, I want to keep the fun, informal style. Thanks, Mark On Wed, Apr 4, 2012 at 4:29 PM, Robert Evans ev...@yahoo-inc.com wrote: I am dropping the cross posts and leaving this on common-user with the others BCCed. Marcos, That is a great idea to be able to update the tutorial, especially if the community is interested in helping to do so. We are looking into the best way to do this. The idea right now is to donate this to the Hadoop project so that the community can keep it up to date, but we need some time to jump through all of the corporate hoops to get this to happen. We have a lot going on right now, so if you don't see any progress on this please feel free to ping me and bug me about it. -- Bobby Evans On 4/4/12 8:15 AM, Jagat Singh jagatsi...@gmail.com wrote: Hello Marcos Yes , Yahoo tutorials are pretty old but still they explain the concepts of Map Reduce , HDFS beautifully. The way in which tutorials have been defined into sub sections , each builing on previous one is awesome. I remember when i started i was digged in there for many days. The tutorials are lagging now from new API point of view. Lets have some documentation session one day , I would love to Volunteer to update those tutorials if people at Yahoo take input from outside world :) Regards, Jagat - Original Message - From: Marcos Ortiz Sent: 04/04/12 08:32 AM To: common-user@hadoop.apache.org, 'hdfs-u...@hadoop.apache.org', mapreduce-u...@hadoop.apache.org Subject: Yahoo Hadoop Tutorial with new APIs? Regards to all the list. There are many people that use the Hadoop Tutorial released by Yahoo at http://developer.yahoo.com/hadoop/tutorial/ http://developer.yahoo.com/hadoop/tutorial/module4.html#chaining The main issue here is that, this tutorial is written with the old APIs? (Hadoop 0.18 I think). Is there a project for update this tutorial to the new APIs? to Hadoop 1.0.2 or YARN (Hadoop 0.23) Best wishes -- Marcos Luis Ortíz Valmaseda (@marcosluis2186) Data Engineer at UCI http://marcosluis2186.posterous.com http://www.uci.cu/
Getting different results every time I run the same job on the cluster
Hi, I have to admit, I am lost. My code http://frd.org/ is stable on a pseudo-distributed cluster, but every time I run it on a 4-slave cluster, I get different results, ranging from 100 output lines to 4,000 output lines, whereas the real answer on my standalone setup is about 2,000. I look at the logs and see no exceptions, so I am totally lost. Where should I look? Thank you, Mark
Re: Custom Seq File Loader: ClassNotFoundException
Hi Madhu, it has the following line: TermDocFreqArrayWritable () {} but I'll try it with public access in case it's being called from outside of my package. Thank you, Mark On Sun, Mar 4, 2012 at 9:55 PM, madhu phatak phatak@gmail.com wrote: Hi, Please make sure that your CustomWritable has a default constructor. On Sat, Mar 3, 2012 at 4:56 AM, Mark question markq2...@gmail.com wrote: Hello, I'm trying to debug my code through Eclipse, which worked fine with the given Hadoop applications (e.g. wordcount), but as soon as I run it on my application with my custom sequence input file/types, I get: java.lang.RuntimeException: java.io.IOException: WritableName can't load class, at SequenceFile$Reader.getValueClass(SequenceFile.java), because my value class is custom. In other words, how can I add/build my CustomWritable class to sit alongside the Hadoop LongWritable, IntWritable etc.? Has anyone used Eclipse for this? Mark -- Join me at http://hadoopworkshop.eventbrite.com/
Re: Custom Seq File Loader: ClassNotFoundException
Unfortunately, public didn't change my error ... Any other ideas? Has anyone ran Hadoop on eclipse with custom sequence inputs ? Thank you, Mark On Mon, Mar 5, 2012 at 9:58 AM, Mark question markq2...@gmail.com wrote: Hi Madhu, it has the following line: TermDocFreqArrayWritable () {} but I'll try it with public access in case it's been called outside of my package. Thank you, Mark On Sun, Mar 4, 2012 at 9:55 PM, madhu phatak phatak@gmail.com wrote: Hi, Please make sure that your CustomWritable has a default constructor. On Sat, Mar 3, 2012 at 4:56 AM, Mark question markq2...@gmail.com wrote: Hello, I'm trying to debug my code through eclipse, which worked fine with given Hadoop applications (eg. wordcount), but as soon as I run it on my application with my custom sequence input file/types, I get: Java.lang.runtimeException.java.ioException (Writable name can't load class) SequenceFile$Reader.getValeClass(Sequence File.class) because my valueClass is customed. In other words, how can I add/build my CustomWritable class to be with hadoop LongWritable,IntegerWritable etc. Did anyone used eclipse? Mark -- Join me at http://hadoopworkshop.eventbrite.com/
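For reference, a minimal sketch of a custom value Writable of the kind discussed here; the field is made up, but the shape is what SequenceFile expects. The important part is the public no-argument constructor, since SequenceFile.Reader instantiates the value class reflectively from the name stored in the file header. Beyond that, the class also has to be on the classpath of the JVM doing the reading, which in an Eclipse run usually means adding your project's classes/jar to the run configuration.

  import java.io.DataInput;
  import java.io.DataOutput;
  import java.io.IOException;
  import org.apache.hadoop.io.Writable;

  public class TermDocFreqArrayWritable implements Writable {
    private int[] freqs = new int[0];

    public TermDocFreqArrayWritable() {}      // must be public: created via reflection

    public void write(DataOutput out) throws IOException {
      out.writeInt(freqs.length);
      for (int f : freqs) out.writeInt(f);
    }

    public void readFields(DataInput in) throws IOException {
      freqs = new int[in.readInt()];
      for (int i = 0; i < freqs.length; i++) freqs[i] = in.readInt();
    }
  }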
Re: better partitioning strategy in hive
Sorry about the dealyed response, RK. Here is what I think: 1) first of all why hive is not able to even submit the job? Is it taking for ever to query the list pf partitions from the meta store? getting 43K recs should not be big deal at all?? -- Hive is possibly taking a long time to figure out what partitions it needs to query. I experienced the same problem when I had a lot of partitions (with relatively small sized files). I reverted back to having less number of partitions with larger file sizes, that fixed the problem. Finding the balance between how many partitions you want and how big you want each partition to be is tricky, but, in general, it's better to have lesser number of partitions. You want to be aware of the small files problem. It has been discussed at many places. Some links are: http://blog.rapleaf.com/dev/2008/11/20/give-me-liberty-or-give-me-death-but-dont-give-me-small-files/ http://www.cloudera.com/blog/2009/02/the-small-files-problem/ http://arunxjacob.blogspot.com/2011/04/hdfs-file-size-vs-allocation-other.html 2) So in order to improve my situation, what are my options? I can think of changing the partition strategy to daily partition instead of hourly. What should be the ideal partitioning strategy? -- I would say that's a good step forward. 3) if we have one partition per day and 24 files under it (i.e less partitions but same number of files), will it improve anything or i will have same issue ? -- You probably wouldn't have the same issue; if you still do, it wouldn't be as bad. Since the number of partitions have been reduced by a factor of 24, hive doesn't have to go through as many number of partitions. However, your queries that look for data in a particular hour on a given day would be slower now that you don't have hour as a partition. 4)Are there any special input formats or tricks to handle this? -- This is a separate question. What format, SerDe and compression you use for your data, is a part of the design but isn't necessarily linked to the problem in question. 5) When i tried to insert into a different table by selecting from whole days data, hive generate 164mappers with map-only jobs, hence creating many output files. How can force hive to create one output file instead of many. Setting mapred.reduce.tasks=1 is not even generating reduce tasks. What i can do to achieve this? -- mapred.reduce.tasks wouldn't help because the job is map-only and has no reduce tasks. You should look into hive.merge.* properties. Setting them in your hive-site.xml would do the trick. You can see refer to this template (https://svn.apache.org/repos/asf/hive/trunk/conf/hive-default.xml.template) to see what properties exist. Good luck! Mark Mark Grover, Business Intelligence Analyst OANDA Corporation www: oanda.com www: fxtrade.com e: mgro...@oanda.com Best Trading Platform - World Finance's Forex Awards 2009. The One to Watch - Treasury Today's Adam Smith Awards 2009. - Original Message - From: rk vishu talk2had...@gmail.com To: cdh-u...@cloudera.org, common-user@hadoop.apache.org, u...@hive.apache.org Sent: Saturday, February 18, 2012 4:39:48 AM Subject: Re: better partitioning strategy in hive Hello All, We have a hive table partitioned by date and hour(330 columns). We have 5 years worth of data for the table. Each hourly partition have around 800MB. So total 43,800 partitions with one file per partition. When we run select count(*) from table, hive is taking for ever to submit the job. I waited for 20 min and killed it. 
If i run for a month it takes little time to submit the job, but at least hive is able to get the work done?. Questions: 1) first of all why hive is not able to even submit the job? Is it taking for ever to query the list pf partitions from the meta store? getting 43K recs should not be big deal at all?? 2) So in order to improve my situation, what are my options? I can think of changing the partition strategy to daily partition instead of hourly. What should be the ideal partitioning strategy? 3) if we have one partition per day and 24 files under it (i.e less partitions but same number of files), will it improve anything or i will have same issue ? 4)Are there any special input formats or tricks to handle this? 5) When i tried to insert into a different table by selecting from whole days data, hive generate 164mappers with map-only jobs, hence creating many output files. How can force hive to create one output file instead of many. Setting mapred.reduce.tasks=1 is not even generating reduce tasks. What i can do to achieve this? -RK
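To make the hive.merge.* suggestion above concrete, here is a hedged hive-site.xml sketch; the values are examples only, and the exact set of merge properties varies between Hive versions, so check the hive-default template linked above for your release.

<property>
  <name>hive.merge.mapfiles</name>
  <value>true</value> <!-- merge the small files produced by map-only jobs -->
</property>
<property>
  <name>hive.merge.mapredfiles</name>
  <value>true</value> <!-- also merge the output of jobs that have reducers -->
</property>
<property>
  <name>hive.merge.size.per.task</name>
  <value>256000000</value> <!-- rough target size, in bytes, of each merged file -->
</property>
<property>
  <name>hive.merge.smallfiles.avgsize</name>
  <value>16000000</value> <!-- trigger the merge pass when the average output file is smaller than this -->
</property>

With these enabled, Hive schedules an extra merge stage after the map-only insert, which is what would collapse the 164 small part files into a handful of larger ones.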
Re: Streaming Hadoop using C
Starfish worked great for wordcount .. I didn't run it on my application because I have only map tasks. Mark On Thu, Mar 1, 2012 at 4:34 AM, Charles Earl charles.ce...@gmail.comwrote: How was your experience of starfish? C On Mar 1, 2012, at 12:35 AM, Mark question wrote: Thank you for your time and suggestions, I've already tried starfish, but not jmap. I'll check it out. Thanks again, Mark On Wed, Feb 29, 2012 at 1:17 PM, Charles Earl charles.ce...@gmail.com wrote: I assume you have also just tried running locally and using the jdk performance tools (e.g. jmap) to gain insight by configuring hadoop to run absolute minimum number of tasks? Perhaps the discussion http://grokbase.com/t/hadoop/common-user/11ahm67z47/how-do-i-connect-java-visual-vm-to-a-remote-task might be relevant? On Feb 29, 2012, at 3:53 PM, Mark question wrote: I've used hadoop profiling (.prof) to show the stack trace but it was hard to follow. jConsole locally since I couldn't find a way to set a port number to child processes when running them remotely. Linux commands (top,/proc), showed me that the virtual memory is almost twice as my physical which means swapping is happening which is what I'm trying to avoid. So basically, is there a way to assign a port to child processes to monitor them remotely (asked before by Xun) or would you recommend another monitoring tool? Thank you, Mark On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl charles.ce...@gmail.com wrote: Mark, So if I understand, it is more the memory management that you are interested in, rather than a need to run an existing C or C++ application in MapReduce platform? Have you done profiling of the application? C On Feb 29, 2012, at 2:19 PM, Mark question wrote: Thanks Charles .. I'm running Hadoop for research to perform duplicate detection methods. To go deeper, I need to understand what's slowing my program, which usually starts with analyzing memory to predict best input size for map task. So you're saying piping can help me control memory even though it's running on VM eventually? Thanks, Mark On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl charles.ce...@gmail.com wrote: Mark, Both streaming and pipes allow this, perhaps more so pipes at the level of the mapreduce task. Can you provide more details on the application? On Feb 29, 2012, at 1:56 PM, Mark question wrote: Hi guys, thought I should ask this before I use it ... will using C over Hadoop give me the usual C memory management? For example, malloc() , sizeof() ? My guess is no since this all will eventually be turned into bytecode, but I need more control on memory which obviously is hard for me to do with Java. Let me know of any advantages you know about streaming in C over hadoop. Thank you, Mark
Streaming Hadoop using C
Hi guys, thought I should ask this before I use it ... will using C over Hadoop give me the usual C memory management? For example, malloc() , sizeof() ? My guess is no since this all will eventually be turned into bytecode, but I need more control on memory which obviously is hard for me to do with Java. Let me know of any advantages you know about streaming in C over hadoop. Thank you, Mark
Re: Streaming Hadoop using C
Thanks Charles .. I'm running Hadoop for research to perform duplicate detection methods. To go deeper, I need to understand what's slowing my program, which usually starts with analyzing memory to predict best input size for map task. So you're saying piping can help me control memory even though it's running on VM eventually? Thanks, Mark On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl charles.ce...@gmail.comwrote: Mark, Both streaming and pipes allow this, perhaps more so pipes at the level of the mapreduce task. Can you provide more details on the application? On Feb 29, 2012, at 1:56 PM, Mark question wrote: Hi guys, thought I should ask this before I use it ... will using C over Hadoop give me the usual C memory management? For example, malloc() , sizeof() ? My guess is no since this all will eventually be turned into bytecode, but I need more control on memory which obviously is hard for me to do with Java. Let me know of any advantages you know about streaming in C over hadoop. Thank you, Mark
Re: Streaming Hadoop using C
I've used hadoop profiling (.prof) to show the stack trace but it was hard to follow. jConsole locally since I couldn't find a way to set a port number to child processes when running them remotely. Linux commands (top,/proc), showed me that the virtual memory is almost twice as my physical which means swapping is happening which is what I'm trying to avoid. So basically, is there a way to assign a port to child processes to monitor them remotely (asked before by Xun) or would you recommend another monitoring tool? Thank you, Mark On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl charles.ce...@gmail.comwrote: Mark, So if I understand, it is more the memory management that you are interested in, rather than a need to run an existing C or C++ application in MapReduce platform? Have you done profiling of the application? C On Feb 29, 2012, at 2:19 PM, Mark question wrote: Thanks Charles .. I'm running Hadoop for research to perform duplicate detection methods. To go deeper, I need to understand what's slowing my program, which usually starts with analyzing memory to predict best input size for map task. So you're saying piping can help me control memory even though it's running on VM eventually? Thanks, Mark On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl charles.ce...@gmail.com wrote: Mark, Both streaming and pipes allow this, perhaps more so pipes at the level of the mapreduce task. Can you provide more details on the application? On Feb 29, 2012, at 1:56 PM, Mark question wrote: Hi guys, thought I should ask this before I use it ... will using C over Hadoop give me the usual C memory management? For example, malloc() , sizeof() ? My guess is no since this all will eventually be turned into bytecode, but I need more control on memory which obviously is hard for me to do with Java. Let me know of any advantages you know about streaming in C over hadoop. Thank you, Mark
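On the specific question of giving each child JVM a port that jconsole or VisualVM can attach to: one commonly used approach (not from this thread, and only workable when at most one child task runs per node at a time, since a fixed port would otherwise clash) is to pass the standard JMX system properties through mapred.child.java.opts. A hedged sketch, with an arbitrary port:

import org.apache.hadoop.conf.Configuration;

public class JmxChildOpts {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Opens an unauthenticated JMX endpoint on every child JVM so that
        // jconsole/VisualVM can attach to <task-node>:8004. Illustrative only:
        // do not leave authentication off on a shared cluster, and cap the
        // task slots to one per node so the port is not reused concurrently.
        conf.set("mapred.child.java.opts",
                "-Xmx512m"
                + " -Dcom.sun.management.jmxremote"
                + " -Dcom.sun.management.jmxremote.port=8004"
                + " -Dcom.sun.management.jmxremote.authenticate=false"
                + " -Dcom.sun.management.jmxremote.ssl=false");
        System.out.println(conf.get("mapred.child.java.opts"));
    }
}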
Re: Streaming Hadoop using C
Thank you for your time and suggestions, I've already tried starfish, but not jmap. I'll check it out. Thanks again, Mark On Wed, Feb 29, 2012 at 1:17 PM, Charles Earl charles.ce...@gmail.comwrote: I assume you have also just tried running locally and using the jdk performance tools (e.g. jmap) to gain insight by configuring hadoop to run absolute minimum number of tasks? Perhaps the discussion http://grokbase.com/t/hadoop/common-user/11ahm67z47/how-do-i-connect-java-visual-vm-to-a-remote-task might be relevant? On Feb 29, 2012, at 3:53 PM, Mark question wrote: I've used hadoop profiling (.prof) to show the stack trace but it was hard to follow. jConsole locally since I couldn't find a way to set a port number to child processes when running them remotely. Linux commands (top,/proc), showed me that the virtual memory is almost twice as my physical which means swapping is happening which is what I'm trying to avoid. So basically, is there a way to assign a port to child processes to monitor them remotely (asked before by Xun) or would you recommend another monitoring tool? Thank you, Mark On Wed, Feb 29, 2012 at 11:35 AM, Charles Earl charles.ce...@gmail.com wrote: Mark, So if I understand, it is more the memory management that you are interested in, rather than a need to run an existing C or C++ application in MapReduce platform? Have you done profiling of the application? C On Feb 29, 2012, at 2:19 PM, Mark question wrote: Thanks Charles .. I'm running Hadoop for research to perform duplicate detection methods. To go deeper, I need to understand what's slowing my program, which usually starts with analyzing memory to predict best input size for map task. So you're saying piping can help me control memory even though it's running on VM eventually? Thanks, Mark On Wed, Feb 29, 2012 at 11:03 AM, Charles Earl charles.ce...@gmail.com wrote: Mark, Both streaming and pipes allow this, perhaps more so pipes at the level of the mapreduce task. Can you provide more details on the application? On Feb 29, 2012, at 1:56 PM, Mark question wrote: Hi guys, thought I should ask this before I use it ... will using C over Hadoop give me the usual C memory management? For example, malloc() , sizeof() ? My guess is no since this all will eventually be turned into bytecode, but I need more control on memory which obviously is hard for me to do with Java. Let me know of any advantages you know about streaming in C over hadoop. Thank you, Mark
Re: Clickstream and video Analysis
http://www.wibidata.com/ Only it's not open source :) You can research the story by looking at http://www.youtube.com/watch?v=pUogubA9CEA to start Mark On Wed, Feb 22, 2012 at 11:30 PM, shreya@cognizant.com wrote: Hi, Could someone provide some links on Clickstream and video Analysis in Hadoop. Thanks and Regards, Shreya Pal
Is default number of reducers = 1?
Hi, I used to do job.setNumReduceTasks(1); but I realized that this is bad and commented out this line //job.setNumReduceTasks(1); I still see the number of reduce tasks as 1 when my mappers number 4. Why could this be? Thank you, Mark
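A likely explanation, for the record: the framework default for mapred.reduce.tasks is 1, so commenting out an explicit setNumReduceTasks(1) still leaves exactly one reducer regardless of how many map tasks run; the count has to be raised explicitly. A hedged sketch using the new API, with 4 as a purely illustrative number:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCount {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "reducer-count-example");
        // The default is a single reduce task; it does not scale with the
        // number of map tasks, so ask for more when the cluster has the slots.
        job.setNumReduceTasks(4);
        System.out.println("reduce tasks: " + job.getNumReduceTasks());
    }
}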
Re: memory of mappers and reducers
Great! thanks a lot Srinivas ! Mark On Thu, Feb 16, 2012 at 7:02 AM, Srinivas Surasani vas...@gmail.com wrote: 1) Yes option 2 is enough. 2) Configuration variable mapred.child.ulimit can be used to control the maximum virtual memory of the child (map/reduce) processes. ** value of mapred.child.ulimit value of mapred.child.java.opts On Thu, Feb 16, 2012 at 12:38 AM, Mark question markq2...@gmail.com wrote: Thanks for the reply Srinivas, so option 2 will be enough, however, when I tried setting it to 512MB, I see through the system monitor that the map task is given 275MB of real memory!! Is that normal in hadoop to go over the upper bound of memory given by the property mapred.child.java.opts. Mark On Wed, Feb 15, 2012 at 4:00 PM, Srinivas Surasani vas...@gmail.com wrote: Hey Mark, Yes, you can limit the memory for each task with mapred.child.java.opts property. Set this to final if no developer has to change it . Little intro to mapred.task.default.maxvmem This property has to be set on both the JobTracker for making scheduling decisions and on the TaskTracker nodes for the sake of memory management. If a job doesn't specify its virtual memory requirement by setting mapred.task.maxvmem to -1, tasks are assured a memory limit set to this property. This property is set to -1 by default. This value should in general be less than the cluster-wide configuration mapred.task.limit.maxvmem. If not or if it is not set, TaskTracker's memory management will be disabled and a scheduler's memory based scheduling decisions may be affected. On Wed, Feb 15, 2012 at 5:57 PM, Mark question markq2...@gmail.com wrote: Hi, My question is what's the difference between the following two settings: 1. mapred.task.default.maxvmem 2. mapred.child.java.opts The first one is used by the TT to monitor the memory usage of tasks, while the second one is the maximum heap space assigned for each task. I want to limit each task to use upto say 100MB of memory. Can I use only #2 ?? Thank you, Mark -- -- Srinivas srini...@cloudwick.com -- -- Srinivas srini...@cloudwick.com
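To make the two knobs in this thread concrete, a hedged driver-side sketch. Note that -Xmx bounds only the Java heap, so the process's total memory as seen by the OS will sit somewhat above it (JVM code, thread stacks, native buffers), and mapred.child.ulimit is expressed in kilobytes of virtual memory, so it must be comfortably larger than the heap or tasks will die at startup. The 1,000,000 KB figure below is only an example:

import org.apache.hadoop.conf.Configuration;

public class TaskMemorySettings {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Cap each child task's Java heap at roughly 100 MB ...
        conf.set("mapred.child.java.opts", "-Xmx100m");
        // ... and optionally cap the whole child process's virtual memory, in KB.
        conf.set("mapred.child.ulimit", "1000000");
        System.out.println(conf.get("mapred.child.java.opts") + " / ulimit(KB)="
                + conf.get("mapred.child.ulimit"));
    }
}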
memory of mappers and reducers
Hi, My question is what's the difference between the following two settings: 1. mapred.task.default.maxvmem 2. mapred.child.java.opts The first one is used by the TT to monitor the memory usage of tasks, while the second one is the maximum heap space assigned for each task. I want to limit each task to use upto say 100MB of memory. Can I use only #2 ?? Thank you, Mark
Re: memory of mappers and reducers
Thanks for the reply Srinivas, so option 2 will be enough, however, when I tried setting it to 512MB, I see through the system monitor that the map task is given 275MB of real memory!! Is that normal in hadoop to go over the upper bound of memory given by the property mapred.child.java.opts. Mark On Wed, Feb 15, 2012 at 4:00 PM, Srinivas Surasani vas...@gmail.com wrote: Hey Mark, Yes, you can limit the memory for each task with mapred.child.java.opts property. Set this to final if no developer has to change it . Little intro to mapred.task.default.maxvmem This property has to be set on both the JobTracker for making scheduling decisions and on the TaskTracker nodes for the sake of memory management. If a job doesn't specify its virtual memory requirement by setting mapred.task.maxvmem to -1, tasks are assured a memory limit set to this property. This property is set to -1 by default. This value should in general be less than the cluster-wide configuration mapred.task.limit.maxvmem. If not or if it is not set, TaskTracker's memory management will be disabled and a scheduler's memory based scheduling decisions may be affected. On Wed, Feb 15, 2012 at 5:57 PM, Mark question markq2...@gmail.com wrote: Hi, My question is what's the difference between the following two settings: 1. mapred.task.default.maxvmem 2. mapred.child.java.opts The first one is used by the TT to monitor the memory usage of tasks, while the second one is the maximum heap space assigned for each task. I want to limit each task to use upto say 100MB of memory. Can I use only #2 ?? Thank you, Mark -- -- Srinivas srini...@cloudwick.com
Namenode no lease exception ... what does it mean?
Hi guys, Even though there is enough space on HDFS as shown by -report ... I get the following 2 error shown first in the log of a datanode and the second on Namenode log: 1)2012-02-09 10:18:37,519 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addToInvalidates: blk_8448117986822173955 is added to invalidSet of 10.0.40.33:50010 2) 2012-02-09 10:18:41,788 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: addStoredBlock request received for blk_132544693472320409_2778 on 10.0.40.12:50010 size 67108864 But it does not belong to any file. 2012-02-09 10:18:41,789 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 12123, call addBlock(/user/mark/output33/_temporary/_attempt_201202090811_0005_m_000247_0/part-00247, DFSClient_attempt_201202090811_0005_m_000247_0) from 10.0.40.12:34103: error: org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/mark/output33/_temporary/_attempt_201202090811_0005_m_000247_0/part-00247 File does not exist. Holder DFSClient_attempt_201202090811_0005_m_000247_0 does not have any open files. org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException: No lease on /user/mark/output33/_temporary/_attempt_201202090811_0005_m_000247_0/part-00247 File does not exist. Holder DFSClient_attempt_201202090811_0005_m_000247_0 does not have any open files. at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1332) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:1323) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1251) at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422) at sun.reflect.GeneratedMethodAccessor13.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955) at java.security.AccessController.doPrivileged(Native Method) Any other ways to debug this? Thanks, Mark
Re: How to set up output field separator?
Harsh, I think it worked in Hadoop 0.20, but it does not work with the new mapreduce API, and even this key, mapreduce.output.textoutputformat.separator, does not help. Maybe I should switch back to 0.20 for the time being. Mark On Tue, Feb 7, 2012 at 10:27 AM, Harsh J ha...@cloudera.com wrote: That property is probably just for streaming, used with KeyFieldBasedComparator/Partitioner. You may instead set mapred.textoutputformat.separator for the TextOutputFormat in regular jobs. Let us know if that works. On Tue, Feb 7, 2012 at 7:57 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Hi, all, I've tried this configuration.set(map.output.key.field.separator, ,); but it did not work. How do I set the separator to another field, from its default tab? Thank you, Mark -- Harsh J Customer Ops. Engineer Cloudera | http://tiny.cloudera.com/about
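For reference, a hedged sketch of both separator keys: mapred.textoutputformat.separator is the key TextOutputFormat reads in the 0.20/1.x line, and mapreduce.output.textoutputformat.separator appears to be the renamed key in later releases, so setting both is a harmless belt-and-braces move while versions are in flux (the comma is just an example):

import org.apache.hadoop.conf.Configuration;

public class OutputSeparator {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Older key name, honoured by TextOutputFormat in 0.20.x / 1.x:
        conf.set("mapred.textoutputformat.separator", ",");
        // Newer key name introduced by the configuration renaming:
        conf.set("mapreduce.output.textoutputformat.separator", ",");
        System.out.println(conf.get("mapred.textoutputformat.separator"));
    }
}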
Re: Can't achieve load distribution
Praveen, this seems just like the right thing, but it's API 0.21 (I googled about the problems with it), so I have to use either the next Cloudera release, or Hortonworks, or something, am I right? Mark On Thu, Feb 2, 2012 at 7:39 AM, Praveen Sripati praveensrip...@gmail.comwrote: I have a simple MR job, and I want each Mapper to get one line from my input file (which contains further instructions for lengthy processing). Use the NLineInputFormat class. http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/input/NLineInputFormat.html Praveen On Thu, Feb 2, 2012 at 9:43 AM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Thanks! Mark On Wed, Feb 1, 2012 at 7:44 PM, Anil Gupta anilgupt...@gmail.com wrote: Yes, if ur block size is 64mb. Btw, block size is configurable in Hadoop. Best Regards, Anil On Feb 1, 2012, at 5:06 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Anil, do you mean one block of HDFS, like 64MB? Mark On Wed, Feb 1, 2012 at 7:03 PM, Anil Gupta anilgupt...@gmail.com wrote: Do u have enough data to start more than one mapper? If entire data is less than a block size then only 1 mapper will run. Best Regards, Anil On Feb 1, 2012, at 4:21 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Hi, I have a simple MR job, and I want each Mapper to get one line from my input file (which contains further instructions for lengthy processing). Each line is 100 characters long, and I tell Hadoop to read only 100 bytes, job.getConfiguration().setInt(mapreduce.input.linerecordreader.line.maxlength, 100); I see that this part works - it reads only one line at a time, and if I change this parameter, it listens. However, on a cluster only one node receives all the map tasks. Only one map tasks is started. The others never get anything, they just wait. I've added 100 seconds wait to the mapper - no change! Any advice? Thank you. Sincerely, Mark
Re: Can't achieve load distribution
And that is exactly what I found. I have a hack for now - give all files on the command line - and I will wait for the next release in some distribution. Thank you, Mark On Thu, Feb 2, 2012 at 9:55 PM, Harsh J ha...@cloudera.com wrote: New API NLineInputFormat is only available from 1.0.1, and not in any of the earlier 1 (1.0.0) or 0.20 (0.20.x, 0.20.xxx) vanilla Apache releases. On Fri, Feb 3, 2012 at 7:08 AM, Praveen Sripati praveensrip...@gmail.com wrote: Mark, NLineInputFormat was not something which was introduced in 0.21, I have just sent the reference to the 0.21 url FYI. It's in 0.20.205, 1.0.0 and 0.23 releases also. Praveen On Fri, Feb 3, 2012 at 1:25 AM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Praveen, this seems just like the right thing, but it's API 0.21 (I googled about the problems with it), so I have to use either the next Cloudera release, or Hortonworks, or something, am I right? Mark On Thu, Feb 2, 2012 at 7:39 AM, Praveen Sripati praveensrip...@gmail.com wrote: I have a simple MR job, and I want each Mapper to get one line from my input file (which contains further instructions for lengthy processing). Use the NLineInputFormat class. http://hadoop.apache.org/mapreduce/docs/r0.21.0/api/org/apache/hadoop/mapreduce/lib/input/NLineInputFormat.html Praveen On Thu, Feb 2, 2012 at 9:43 AM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Thanks! Mark On Wed, Feb 1, 2012 at 7:44 PM, Anil Gupta anilgupt...@gmail.com wrote: Yes, if ur block size is 64mb. Btw, block size is configurable in Hadoop. Best Regards, Anil On Feb 1, 2012, at 5:06 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Anil, do you mean one block of HDFS, like 64MB? Mark On Wed, Feb 1, 2012 at 7:03 PM, Anil Gupta anilgupt...@gmail.com wrote: Do u have enough data to start more than one mapper? If entire data is less than a block size then only 1 mapper will run. Best Regards, Anil On Feb 1, 2012, at 4:21 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Hi, I have a simple MR job, and I want each Mapper to get one line from my input file (which contains further instructions for lengthy processing). Each line is 100 characters long, and I tell Hadoop to read only 100 bytes, job.getConfiguration().setInt(mapreduce.input.linerecordreader.line.maxlength, 100); I see that this part works - it reads only one line at a time, and if I change this parameter, it listens. However, on a cluster only one node receives all the map tasks. Only one map tasks is started. The others never get anything, they just wait. I've added 100 seconds wait to the mapper - no change! Any advice? Thank you. Sincerely, Mark -- Harsh J Customer Ops. Engineer Cloudera | http://tiny.cloudera.com/about
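For completeness, a hedged sketch of the NLineInputFormat usage discussed in this thread, written against the new-API class (which, per Harsh, only ships from 1.0.1 onwards); the driver skeleton and paths are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OneLinePerMapper {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "one-line-per-mapper");
        job.setJarByClass(OneLinePerMapper.class);
        // Each split (and therefore each map task) gets exactly one input line,
        // independent of the HDFS block size, so the work spreads across nodes.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1);
        NLineInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Mapper/reducer classes omitted; the identity defaults are enough to
        // check how many map tasks get scheduled.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}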
Can't achieve load distribution
Hi, I have a simple MR job, and I want each Mapper to get one line from my input file (which contains further instructions for lengthy processing). Each line is 100 characters long, and I tell Hadoop to read only 100 bytes, job.getConfiguration().setInt(mapreduce.input.linerecordreader.line.maxlength, 100); I see that this part works - it reads only one line at a time, and if I change this parameter, it listens. However, on a cluster only one node receives all the map tasks. Only one map tasks is started. The others never get anything, they just wait. I've added 100 seconds wait to the mapper - no change! Any advice? Thank you. Sincerely, Mark
Re: Can't achieve load distribution
Anil, do you mean one block of HDFS, like 64MB? Mark On Wed, Feb 1, 2012 at 7:03 PM, Anil Gupta anilgupt...@gmail.com wrote: Do u have enough data to start more than one mapper? If entire data is less than a block size then only 1 mapper will run. Best Regards, Anil On Feb 1, 2012, at 4:21 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Hi, I have a simple MR job, and I want each Mapper to get one line from my input file (which contains further instructions for lengthy processing). Each line is 100 characters long, and I tell Hadoop to read only 100 bytes, job.getConfiguration().setInt(mapreduce.input.linerecordreader.line.maxlength, 100); I see that this part works - it reads only one line at a time, and if I change this parameter, it listens. However, on a cluster only one node receives all the map tasks. Only one map tasks is started. The others never get anything, they just wait. I've added 100 seconds wait to the mapper - no change! Any advice? Thank you. Sincerely, Mark
Re: Can't achieve load distribution
Thanks! Mark On Wed, Feb 1, 2012 at 7:44 PM, Anil Gupta anilgupt...@gmail.com wrote: Yes, if ur block size is 64mb. Btw, block size is configurable in Hadoop. Best Regards, Anil On Feb 1, 2012, at 5:06 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Anil, do you mean one block of HDFS, like 64MB? Mark On Wed, Feb 1, 2012 at 7:03 PM, Anil Gupta anilgupt...@gmail.com wrote: Do u have enough data to start more than one mapper? If entire data is less than a block size then only 1 mapper will run. Best Regards, Anil On Feb 1, 2012, at 4:21 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Hi, I have a simple MR job, and I want each Mapper to get one line from my input file (which contains further instructions for lengthy processing). Each line is 100 characters long, and I tell Hadoop to read only 100 bytes, job.getConfiguration().setInt(mapreduce.input.linerecordreader.line.maxlength, 100); I see that this part works - it reads only one line at a time, and if I change this parameter, it listens. However, on a cluster only one node receives all the map tasks. Only one map tasks is started. The others never get anything, they just wait. I've added 100 seconds wait to the mapper - no change! Any advice? Thank you. Sincerely, Mark
Re: Too many open files Error
Hi Harsh and Idris ... so the only drawback for increasing the value of xcievers is memory issue? In that case then I'll set it to a value that doesn't fill the memory ... Thanks, Mark On Thu, Jan 26, 2012 at 10:37 PM, Idris Ali psychid...@gmail.com wrote: Hi Mark, As Harsh pointed out it is not good idea to increase the Xceiver count to arbitrarily higher value, I suggested to increase the xceiver count just to unblock execution of your program temporarily. Thanks, -Idris On Fri, Jan 27, 2012 at 10:39 AM, Harsh J ha...@cloudera.com wrote: You are technically allowing DN to run 1 million block transfer (in/out) threads by doing that. It does not take up resources by default sure, but now it can be abused with requests to make your DN run out of memory and crash cause its not bound to proper limits now. On Fri, Jan 27, 2012 at 5:49 AM, Mark question markq2...@gmail.com wrote: Harsh, could you explain briefly why is 1M setting for xceiver is bad? the job is working now ... about the ulimit -u it shows 200703, so is that why connection is reset by peer? How come it's working with the xceiver modification? Thanks, Mark On Thu, Jan 26, 2012 at 12:21 PM, Harsh J ha...@cloudera.com wrote: Agree with Raj V here - Your problem should not be the # of transfer threads nor the number of open files given that stacktrace. And the values you've set for the transfer threads are far beyond recommendations of 4k/8k - I would not recommend doing that. Default in 1.0.0 is 256 but set it to 2048/4096, which are good value to have when noticing increased HDFS load, or when running services like HBase. You should instead focus on why its this particular job (or even particular task, which is important to notice) that fails, and not other jobs (or other task attempts). On Fri, Jan 27, 2012 at 1:10 AM, Raj V rajv...@yahoo.com wrote: Mark You have this Connection reset by peer. Why do you think this problem is related to too many open files? Raj From: Mark question markq2...@gmail.com To: common-user@hadoop.apache.org Sent: Thursday, January 26, 2012 11:10 AM Subject: Re: Too many open files Error Hi again, I've tried : property namedfs.datanode.max.xcievers/name value1048576/value /property but I'm still getting the same error ... how high can I go?? Thanks, Mark On Thu, Jan 26, 2012 at 9:29 AM, Mark question markq2...@gmail.com wrote: Thanks for the reply I have nothing about dfs.datanode.max.xceivers on my hdfs-site.xml so hopefully this would solve the problem and about the ulimit -n , I'm running on an NFS cluster, so usually I just start Hadoop with a single bin/start-all.sh ... Do you think I can add it by bin/Datanode -ulimit n ? Mark On Thu, Jan 26, 2012 at 7:33 AM, Mapred Learn mapred.le...@gmail.com wrote: U need to set ulimit -n bigger value on datanode and restart datanodes. Sent from my iPhone On Jan 26, 2012, at 6:06 AM, Idris Ali psychid...@gmail.com wrote: Hi Mark, On a lighter note what is the count of xceivers? dfs.datanode.max.xceivers property in hdfs-site.xml? Thanks, -idris On Thu, Jan 26, 2012 at 5:28 PM, Michel Segel michael_se...@hotmail.comwrote: Sorry going from memory... As user Hadoop or mapred or hdfs what do you see when you do a ulimit -a? That should give you the number of open files allowed by a single user... Sent from a remote device. Please excuse any typos... Mike Segel On Jan 26, 2012, at 5:13 AM, Mark question markq2...@gmail.com wrote: Hi guys, I get this error from a job trying to process 3Million records. 
java.io.IOException: Bad connect ack with firstBadLink 192.168.1.20:50010 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2903) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2826) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288) When I checked the logfile of the datanode-20, I see : 2012-01-26 03:00:11,827 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration( 192.168.1.20:50010, storageID=DS-97608578-192.168.1.20-50010-1327575205369, infoPort=50075, ipcPort=50020):DataXceiver java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcher.read0(Native
Re: Too many open files Error
Thanks for the reply I have nothing about dfs.datanode.max.xceivers on my hdfs-site.xml so hopefully this would solve the problem and about the ulimit -n , I'm running on an NFS cluster, so usually I just start Hadoop with a single bin/start-all.sh ... Do you think I can add it by bin/Datanode -ulimit n ? Mark On Thu, Jan 26, 2012 at 7:33 AM, Mapred Learn mapred.le...@gmail.comwrote: U need to set ulimit -n bigger value on datanode and restart datanodes. Sent from my iPhone On Jan 26, 2012, at 6:06 AM, Idris Ali psychid...@gmail.com wrote: Hi Mark, On a lighter note what is the count of xceivers? dfs.datanode.max.xceivers property in hdfs-site.xml? Thanks, -idris On Thu, Jan 26, 2012 at 5:28 PM, Michel Segel michael_se...@hotmail.com wrote: Sorry going from memory... As user Hadoop or mapred or hdfs what do you see when you do a ulimit -a? That should give you the number of open files allowed by a single user... Sent from a remote device. Please excuse any typos... Mike Segel On Jan 26, 2012, at 5:13 AM, Mark question markq2...@gmail.com wrote: Hi guys, I get this error from a job trying to process 3Million records. java.io.IOException: Bad connect ack with firstBadLink 192.168.1.20:50010 at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2903) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2826) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288) When I checked the logfile of the datanode-20, I see : 2012-01-26 03:00:11,827 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration( 192.168.1.20:50010, storageID=DS-97608578-192.168.1.20-50010-1327575205369, infoPort=50075, ipcPort=50020):DataXceiver java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcher.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202) at sun.nio.ch.IOUtil.read(IOUtil.java:175) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243) at org.apache.hadoop.net.SocketInputStream$Reader.performIO(SocketInputStream.java:55) at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:142) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155) at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128) at java.io.BufferedInputStream.read1(BufferedInputStream.java:256) at java.io.BufferedInputStream.read(BufferedInputStream.java:317) at java.io.DataInputStream.read(DataInputStream.java:132) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readToBuf(BlockReceiver.java:262) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.readNextPacket(BlockReceiver.java:309) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receivePacket(BlockReceiver.java:373) at org.apache.hadoop.hdfs.server.datanode.BlockReceiver.receiveBlock(BlockReceiver.java:525) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.writeBlock(DataXceiver.java:357) at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:103) at java.lang.Thread.run(Thread.java:662) Which is because I'm running 10 maps per taskTracker on a 20 node cluster, each map opens about 300 files so that should give 6000 opened files at the same time ... why is this a problem? 
the maximum # of files per process on one machine is: cat /proc/sys/fs/file-max --- 2403545 Any suggestions? Thanks, Mark
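Pulling the advice in this thread together as a hedged hdfs-site.xml sketch: 4096 follows the 2048/4096 guidance above rather than the 1M value (the property name really is spelled "xcievers" in these releases), and note that the ulimit -n side of it is an operating-system setting on each datanode (e.g. /etc/security/limits.conf), not a Hadoop property.

<property>
  <name>dfs.datanode.max.xcievers</name>
  <value>4096</value> <!-- upper bound on concurrent block transfer threads per datanode -->
</property>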
Re: Using S3 instead of HDFS
It worked, thank you, Harsh. Mark On Wed, Jan 18, 2012 at 1:16 AM, Harsh J ha...@cloudera.com wrote: Ah sorry about missing that. Settings would go in core-site.xml (hdfs-site.xml will no longer be relevant anymore, once you switch to using S3). On 18-Jan-2012, at 12:36 PM, Mark Kerzner wrote: That wiki page mentiones hadoop-site.xml, but this is old, now you have core-site.xml and hdfs-site.xml, so which one do you put it in? Thank you (and good night Central Time:) mark On Wed, Jan 18, 2012 at 12:52 AM, Harsh J ha...@cloudera.com wrote: When using S3 you do not need to run any component of HDFS at all. It is meant to be an alternate FS choice. You need to run only MR. The wiki page at http://wiki.apache.org/hadoop/AmazonS3 mentions on how to go about specifying your auth details to S3, either directly via the fs.default.name URI or via the additional properties fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey. Does this not work for you? On Wed, Jan 18, 2012 at 12:14 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Well, here is my error message Starting Hadoop namenode daemon: starting namenode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-ip-10-126-11-26.out ERROR. Could not start Hadoop namenode daemon Starting Hadoop secondarynamenode daemon: starting secondarynamenode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-secondarynamenode-ip-10-126-11-26.out Exception in thread main java.lang.IllegalArgumentException: Invalid URI for NameNode address (check fs.default.name): s3n://myname.testdata is not of scheme 'hdfs'. at org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:224) at org.apache.hadoop.hdfs.server.namenode.NameNode.getServiceAddress(NameNode.java:209) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:182) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.init(SecondaryNameNode.java:150) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNameNode.java:624) ERROR. Could not start Hadoop secondarynamenode daemon And, if I don't need to start the NameNode, then where do I give the S3 credentials? Thank you, Mark On Wed, Jan 18, 2012 at 12:36 AM, Harsh J ha...@cloudera.com wrote: Hey Mark, What is the exact trouble you run into? What do the error messages indicate? This should be definitive enough I think: http://wiki.apache.org/hadoop/AmazonS3 On Wed, Jan 18, 2012 at 11:55 AM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Hi, whatever I do, I can't make it work, that is, I cannot use s3://host or s3n://host as a replacement for HDFS while runnings EC2 cluster. I change the settings in the core-file.xml, in hdfs-site.xml, and start hadoop services, and it fails with error messages. Is there a place where this is clearly described? Thank you so much. Mark -- Harsh J Customer Ops. Engineer, Cloudera -- Harsh J Customer Ops. Engineer, Cloudera -- Harsh J Customer Ops. Engineer, Cloudera
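For reference, a hedged sketch of the core-site.xml pieces this thread converges on; the bucket name and keys are placeholders, and with the s3n:// scheme the credential properties carry the s3n prefix (the fs.s3.* pair mentioned on the wiki is the equivalent for the s3:// block-store scheme):

<property>
  <name>fs.default.name</name>
  <value>s3n://my-bucket</value> <!-- S3 takes the place of HDFS; no namenode or datanodes are started -->
</property>
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>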
Re: Using S3 instead of HDFS
Awesome important, Matt, thank you so much! Mark On Wed, Jan 18, 2012 at 10:53 AM, Matt Pouttu-Clarke matt.pouttu-cla...@icrossing.com wrote: I would strongly suggest using this method to read S3 only. I have had problems with writing large volumes of data to S3 from Hadoop using native s3fs. Supposedly a fix is on the way from Amazon (it is an undocumented internal error being thrown). However, this fix is already 2 months later than we expected it and we currently have no ETA. If you want to write data to S3 reliably, you should use the S3 API directly and stream data from HDFS into S3. Just remember that S3 requires the final size of the data before you start writing so it is not true streaming in that sense. After you have completed writing your part files in your job (writing to HDFS), you can write a map-only job to stream the data up into S3 using the S3 API directly. In no way, shape, or form should S3 be currently considered as a replacement for HDFS when it come to writes. Your jobs will need to be modified and customized to write to S3 reliably, there are files size limits on writes, and the multi-part upload option does not work correctly and randomly throws an internal Amazon error. You have been warned! -Matt On 1/18/12 9:37 AM, Mark Kerzner mark.kerz...@shmsoft.com wrote: It worked, thank you, Harsh. Mark On Wed, Jan 18, 2012 at 1:16 AM, Harsh J ha...@cloudera.com wrote: Ah sorry about missing that. Settings would go in core-site.xml (hdfs-site.xml will no longer be relevant anymore, once you switch to using S3). On 18-Jan-2012, at 12:36 PM, Mark Kerzner wrote: That wiki page mentiones hadoop-site.xml, but this is old, now you have core-site.xml and hdfs-site.xml, so which one do you put it in? Thank you (and good night Central Time:) mark On Wed, Jan 18, 2012 at 12:52 AM, Harsh J ha...@cloudera.com wrote: When using S3 you do not need to run any component of HDFS at all. It is meant to be an alternate FS choice. You need to run only MR. The wiki page at http://wiki.apache.org/hadoop/AmazonS3 mentions on how to go about specifying your auth details to S3, either directly via the fs.default.name URI or via the additional properties fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey. Does this not work for you? On Wed, Jan 18, 2012 at 12:14 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Well, here is my error message Starting Hadoop namenode daemon: starting namenode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-ip-10-126-11-26.out ERROR. Could not start Hadoop namenode daemon Starting Hadoop secondarynamenode daemon: starting secondarynamenode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-secondarynamenode-ip-10-126-11-26 .out Exception in thread main java.lang.IllegalArgumentException: Invalid URI for NameNode address (check fs.default.name): s3n://myname.testdata is not of scheme 'hdfs'. at org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java: 224) at org.apache.hadoop.hdfs.server.namenode.NameNode.getServiceAddress(NameNod e.java:209) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(Secon daryNameNode.java:182) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.init(Secondary NameNode.java:150) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNa meNode.java:624) ERROR. Could not start Hadoop secondarynamenode daemon And, if I don't need to start the NameNode, then where do I give the S3 credentials? 
Thank you, Mark On Wed, Jan 18, 2012 at 12:36 AM, Harsh J ha...@cloudera.com wrote: Hey Mark, What is the exact trouble you run into? What do the error messages indicate? This should be definitive enough I think: http://wiki.apache.org/hadoop/AmazonS3 On Wed, Jan 18, 2012 at 11:55 AM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Hi, whatever I do, I can't make it work, that is, I cannot use s3://host or s3n://host as a replacement for HDFS while runnings EC2 cluster. I change the settings in the core-file.xml, in hdfs-site.xml, and start hadoop services, and it fails with error messages. Is there a place where this is clearly described? Thank you so much. Mark -- Harsh J Customer Ops. Engineer, Cloudera -- Harsh J Customer Ops. Engineer, Cloudera -- Harsh J Customer Ops. Engineer, Cloudera
Using S3 instead of HDFS
Hi, whatever I do, I can't make it work, that is, I cannot use s3://host or s3n://host as a replacement for HDFS while runnings EC2 cluster. I change the settings in the core-file.xml, in hdfs-site.xml, and start hadoop services, and it fails with error messages. Is there a place where this is clearly described? Thank you so much. Mark
Re: Using S3 instead of HDFS
Well, here is my error message Starting Hadoop namenode daemon: starting namenode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-ip-10-126-11-26.out ERROR. Could not start Hadoop namenode daemon Starting Hadoop secondarynamenode daemon: starting secondarynamenode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-secondarynamenode-ip-10-126-11-26.out Exception in thread main java.lang.IllegalArgumentException: Invalid URI for NameNode address (check fs.default.name): s3n://myname.testdata is not of scheme 'hdfs'. at org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:224) at org.apache.hadoop.hdfs.server.namenode.NameNode.getServiceAddress(NameNode.java:209) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:182) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.init(SecondaryNameNode.java:150) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNameNode.java:624) ERROR. Could not start Hadoop secondarynamenode daemon And, if I don't need to start the NameNode, then where do I give the S3 credentials? Thank you, Mark On Wed, Jan 18, 2012 at 12:36 AM, Harsh J ha...@cloudera.com wrote: Hey Mark, What is the exact trouble you run into? What do the error messages indicate? This should be definitive enough I think: http://wiki.apache.org/hadoop/AmazonS3 On Wed, Jan 18, 2012 at 11:55 AM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Hi, whatever I do, I can't make it work, that is, I cannot use s3://host or s3n://host as a replacement for HDFS while runnings EC2 cluster. I change the settings in the core-file.xml, in hdfs-site.xml, and start hadoop services, and it fails with error messages. Is there a place where this is clearly described? Thank you so much. Mark -- Harsh J Customer Ops. Engineer, Cloudera
Re: Using S3 instead of HDFS
That wiki page mentiones hadoop-site.xml, but this is old, now you have core-site.xml and hdfs-site.xml, so which one do you put it in? Thank you (and good night Central Time:) mark On Wed, Jan 18, 2012 at 12:52 AM, Harsh J ha...@cloudera.com wrote: When using S3 you do not need to run any component of HDFS at all. It is meant to be an alternate FS choice. You need to run only MR. The wiki page at http://wiki.apache.org/hadoop/AmazonS3 mentions on how to go about specifying your auth details to S3, either directly via the fs.default.name URI or via the additional properties fs.s3.awsAccessKeyId and fs.s3.awsSecretAccessKey. Does this not work for you? On Wed, Jan 18, 2012 at 12:14 PM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Well, here is my error message Starting Hadoop namenode daemon: starting namenode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-ip-10-126-11-26.out ERROR. Could not start Hadoop namenode daemon Starting Hadoop secondarynamenode daemon: starting secondarynamenode, logging to /usr/lib/hadoop-0.20/logs/hadoop-hadoop-secondarynamenode-ip-10-126-11-26.out Exception in thread main java.lang.IllegalArgumentException: Invalid URI for NameNode address (check fs.default.name): s3n://myname.testdata is not of scheme 'hdfs'. at org.apache.hadoop.hdfs.server.namenode.NameNode.getAddress(NameNode.java:224) at org.apache.hadoop.hdfs.server.namenode.NameNode.getServiceAddress(NameNode.java:209) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.initialize(SecondaryNameNode.java:182) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.init(SecondaryNameNode.java:150) at org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode.main(SecondaryNameNode.java:624) ERROR. Could not start Hadoop secondarynamenode daemon And, if I don't need to start the NameNode, then where do I give the S3 credentials? Thank you, Mark On Wed, Jan 18, 2012 at 12:36 AM, Harsh J ha...@cloudera.com wrote: Hey Mark, What is the exact trouble you run into? What do the error messages indicate? This should be definitive enough I think: http://wiki.apache.org/hadoop/AmazonS3 On Wed, Jan 18, 2012 at 11:55 AM, Mark Kerzner mark.kerz...@shmsoft.com wrote: Hi, whatever I do, I can't make it work, that is, I cannot use s3://host or s3n://host as a replacement for HDFS while runnings EC2 cluster. I change the settings in the core-file.xml, in hdfs-site.xml, and start hadoop services, and it fails with error messages. Is there a place where this is clearly described? Thank you so much. Mark -- Harsh J Customer Ops. Engineer, Cloudera -- Harsh J Customer Ops. Engineer, Cloudera
Re: connection between slaves and master
exactly right. Thanks Praveen. Mark On Tue, Jan 10, 2012 at 1:54 AM, Praveen Sripati praveensrip...@gmail.comwrote: Mark, [mark@node67 ~]$ telnet node77 You need to specify the port number along with the server name like `telnet node77 1234`. 2012-01-09 10:04:03,436 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 0 time(s). Slaves are not able to connect to the master. The configurations ` fs.default.name` and `mapred.job.tracker` should point to the master and not to localhost when the master and slaves are on different machines. Praveen On Mon, Jan 9, 2012 at 11:41 PM, Mark question markq2...@gmail.com wrote: Hello guys, I'm requesting from a PBS scheduler a number of machines to run Hadoop and even though all hadoop daemons start normally on the master and slaves, the slaves don't have worker tasks in them. Digging into that, there seems to be some blocking between nodes (?) don't know how to describe it except that on slave if I telnet master-node it should be able to connect, but I get this error: [mark@node67 ~]$ telnet node77 Trying 192.168.1.77... telnet: connect to address 192.168.1.77: Connection refused telnet: Unable to connect to remote host: Connection refused The log at the slave nodes shows the same thing, even though it has datanode and tasktracker started from the maste (?): 2012-01-09 10:04:03,436 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 0 time(s). 2012-01-09 10:04:04,439 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 1 time(s). 2012-01-09 10:04:05,442 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 2 time(s). 2012-01-09 10:04:06,444 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 3 time(s). 2012-01-09 10:04:07,446 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 4 time(s). 2012-01-09 10:04:08,448 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 5 time(s). 2012-01-09 10:04:09,450 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 6 time(s). 2012-01-09 10:04:10,452 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 7 time(s). 2012-01-09 10:04:11,454 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 8 time(s). 2012-01-09 10:04:12,456 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 9 time(s). 2012-01-09 10:04:12,456 INFO org.apache.hadoop.ipc.RPC: Server at localhost/ 127.0.0.1:12123 not available yet, Z... Any suggestions of what I can do? Thanks, Mark
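To spell out Praveen's point, a hedged sketch of the two properties involved, using node77 and the 12123 port from the logs above purely as placeholders (the jobtracker port is arbitrary here); the essential part is that both values name the master, not localhost, and that the same files are shipped to every slave.

core-site.xml:
<property>
  <name>fs.default.name</name>
  <value>hdfs://node77:12123</value> <!-- namenode on the master, reachable from all slaves -->
</property>

mapred-site.xml:
<property>
  <name>mapred.job.tracker</name>
  <value>node77:10001</value> <!-- jobtracker on the master, again not localhost -->
</property>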
connection between slaves and master
Hello guys, I'm requesting from a PBS scheduler a number of machines to run Hadoop and even though all hadoop daemons start normally on the master and slaves, the slaves don't have worker tasks in them. Digging into that, there seems to be some blocking between nodes (?) don't know how to describe it except that on slave if I telnet master-node it should be able to connect, but I get this error: [mark@node67 ~]$ telnet node77 Trying 192.168.1.77... telnet: connect to address 192.168.1.77: Connection refused telnet: Unable to connect to remote host: Connection refused The log at the slave nodes shows the same thing, even though it has datanode and tasktracker started from the maste (?): 2012-01-09 10:04:03,436 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 0 time(s). 2012-01-09 10:04:04,439 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 1 time(s). 2012-01-09 10:04:05,442 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 2 time(s). 2012-01-09 10:04:06,444 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 3 time(s). 2012-01-09 10:04:07,446 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 4 time(s). 2012-01-09 10:04:08,448 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 5 time(s). 2012-01-09 10:04:09,450 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 6 time(s). 2012-01-09 10:04:10,452 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 7 time(s). 2012-01-09 10:04:11,454 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 8 time(s). 2012-01-09 10:04:12,456 INFO org.apache.hadoop.ipc.Client: Retrying connect to server: localhost/127.0.0.1:12123. Already tried 9 time(s). 2012-01-09 10:04:12,456 INFO org.apache.hadoop.ipc.RPC: Server at localhost/ 127.0.0.1:12123 not available yet, Z... Any suggestions of what I can do? Thanks, Mark
Re: Expected file://// error
mapred-site.xml:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:10001</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1024m</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>10</value>
  </property>
</configuration>
The command runs a script which runs a java program that submits two jobs consecutively, waiting for the first job to finish (this works on my laptop but not on the cluster). On the cluster I get: hdfs://localhost:12123/tmp/hadoop-mark/mapred/system/job_201201061404_0003/job.jar, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310) at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:357) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1189) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1165) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1137) at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:657) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249) at Main.run(Main.java:304) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at Main.main(Main.java:53) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) The first job's output is: folder/_logs folder/part-0 I set folder as the input path to the next job; could it be from the _logs ... ? But again, it worked on my laptop under hadoop-0.21.0. The cluster has hadoop-0.20.2. Thanks, Mark
Re: Expected file://// error
It's already in there ... don't worry about it, I'm submitting the first job then the second job manually for now. export CLASSPATH=/home/mark/hadoop-0.20.2/conf:$CLASSPATH export CLASSPATH=/home/mark/hadoop-0.20.2/lib:$CLASSPATH export CLASSPATH=/home/mark/hadoop-0.20.2/hadoop-0.20.2-core.jar:/home/mark/hadoop-0.20.2/lib/commons-cli-1.2.jar:$CLASSPATH Thank you for your time, Mark On Sun, Jan 8, 2012 at 12:22 PM, Joey Echeverria j...@cloudera.com wrote: What's the classpath of the java program submitting the job? It has to have the configuration directory (e.g. /opt/hadoop/conf) in there or it won't pick up the correct configs. -Joey On Sun, Jan 8, 2012 at 12:59 PM, Mark question markq2...@gmail.com wrote: mapred-site.xml: configuration property namemapred.job.tracker/name valuelocalhost:10001/value /property property namemapred.child.java.opts/name value-Xmx1024m/value /property property namemapred.tasktracker.map.tasks.maximum/name value10/value /property /configuration Command is running a script which runs a java program that submit two jobs consecutively insuring waiting for the first job ( is working on my laptop but on the cluster). On the cluster I get: hdfs://localhost:12123/tmp/hadoop-mark/mapred/system/job_201201061404_0003/job.jar, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310) at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:357) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1189) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1165) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1137) at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:657) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249) at Main.run(Main.java:304) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at Main.main(Main.java:53) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) The first job output is : folder_logs folderpart-0 I'm set folder as input path to the next job, could it be from the _logs ... ? but again it worked on my laptop under hadoop-0.21.0. The cluster has hadoop-0.20.2. Thanks, Mark -- Joseph Echeverria Cloudera, Inc. 443.305.9434
Expected file://// error
Hello, I'm running two jobs on Hadoop-0.20.2 consecutively, such that the second one reads the output of the first which would look like: outputPath/part-0 outputPath/_logs But I get the error: 12/01/06 03:29:34 WARN fs.FileSystem: localhost:12123 is a deprecated filesystem name. Use hdfs://localhost:12123/ instead. java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:12123/tmp/hadoop-mark/mapred/system/job_201201060323_0005/job.jar, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310) at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:357) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1189) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1165) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1137) at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:657) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249) at Main.run(Main.java:301) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at Main.main(Main.java:53) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) This looks similar to the problem described here but for older versions than mine: https://issues.apache.org/jira/browse/HADOOP-5259 I tried applying that patch, but probably due to different versions didn't work. Can anyone help? Thank you, Mark
Re: Expected file://// error
Hi Harsh, thanks for the reply, you were right, I didn't have hdfs://, but even after inserting it I still get the error. java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:12123/tmp/hadoop-mark/mapred/system/job_201201061404_0003/job.jar, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310) at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:357) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1189) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1165) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1137) at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:657) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249) at Main.run(Main.java:304) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at Main.main(Main.java:53) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) Mark On Fri, Jan 6, 2012 at 6:02 AM, Harsh J ha...@cloudera.com wrote: What is your fs.default.name set to? It should be set to hdfs://host:port and not just host:port. Can you ensure this and retry? On 06-Jan-2012, at 5:45 PM, Mark question wrote: Hello, I'm running two jobs on Hadoop-0.20.2 consecutively, such that the second one reads the output of the first which would look like: outputPath/part-0 outputPath/_logs But I get the error: 12/01/06 03:29:34 WARN fs.FileSystem: localhost:12123 is a deprecated filesystem name. Use hdfs://localhost:12123/ instead. 
java.lang.IllegalArgumentException: Wrong FS: hdfs://localhost:12123/tmp/hadoop-mark/mapred/system/job_201201060323_0005/job.jar, expected: file:/// at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310) at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:47) at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:357) at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:192) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1189) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1165) at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1137) at org.apache.hadoop.mapred.JobClient.configureCommandLineOptions(JobClient.java:657) at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:761) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249) at Main.run(Main.java:301) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at Main.main(Main.java:53) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) This looks similar to the problem described here but for older versions than mine: https://issues.apache.org/jira/browse/HADOOP-5259 I tried applying that patch, but probably due to different versions didn't work. Can anyone help? Thank you, Mark
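For readers hitting the same trace: the fix Harsh points to is making sure the client-side configuration carries the hdfs:// scheme, since without it JobClient falls back to the local file system and checkPath() reports expected: file:///. A minimal sketch, assuming the host and ports that appear in the messages above (localhost:12123 for HDFS, localhost:10001 for the JobTracker); it only demonstrates the setting, it is not Mark's actual driver:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.JobConf;

public class FsConfigSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The scheme matters: a bare "localhost:12123" leaves the client on the local FS.
    conf.set("fs.default.name", "hdfs://localhost:12123");
    conf.set("mapred.job.tracker", "localhost:10001");
    JobConf job = new JobConf(conf);
    // Should print hdfs://localhost:12123 rather than file:///
    System.out.println(FileSystem.get(job).getUri());
  }
}

Printing FileSystem.get(conf).getUri() from the submitting program is also a quick way to confirm which configuration directory actually ended up on the classpath, which ties back to Joey's question above.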
Re: Where do i see Sysout statements after building example ?
For me, they go two levels deeper - under 'userlogs' in logs, then in the directory that stores the run logs. Here is what I see: root@ip-10-84-123-125:/var/log/hadoop/userlogs/job_201112120200_0010/attempt_201112120200_0010_r_02_0# ls log.index stderr stdout syslog and there, in stdout, I see my write statements. Mark On Tue, Dec 13, 2011 at 11:00 AM, Harsh J ha...@cloudera.com wrote: JobTracker sysouts would go to logs/*-jobtracker*.out On 13-Dec-2011, at 8:08 PM, ArunKumar wrote: HI guys ! I have a single node set up as per http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/ (1) I have put some sysout statements in the Jobtracker and wordcount (src/examples/org/..) code, (2) ran ant build, (3) ran the example jar with wordcount. Where do I find the sysout statements? I have looked in the logs/ datanode, tasktracker *.out files. Can anyone help me out ? Arun -- View this message in context: http://lucene.472066.n3.nabble.com/Where-do-i-see-Sysout-statements-after-building-example-tp3582467p3582467.html Sent from the Hadoop lucene-users mailing list archive at Nabble.com.
Best ways to look-up information?
Hi, I am planning a system to process information with Hadoop, and I will have a few look-up tables that each processing node will need to query. There are perhaps 20-50 such tables, and each has on the order of one million entries. Which is the best mechanism for this look-up? Memcache, HBase, JavaSpace, Lucene index, anything else? Thank you, Mark
Jetty exception while running Hadoop
Hi, I keep getting the exception below. I've rebuilt my EC2 cluster completely, and verified it on small jobs, but I still get it once I run anything sizable. The job runs, but I only get one part-0 file, even though I have 4 nodes and would expect four output files. Any help please? Thank you, Mark 112120200_0004_m_06_0, duration: 629002475 2011-12-12 02:24:43,557 WARN org.apache.hadoop.mapred.TaskTracker: getMapOutput(attempt_201112120200_0004_m_07_0,0) failed : org.mortbay.jetty.EofException at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:791) at org.mortbay.jetty.AbstractGenerator$Output.blockForOutput(AbstractGenerator.java:551) at org.mortbay.jetty.AbstractGenerator$Output.flush(AbstractGenerator.java:572) at org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:1012) at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:651) at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:580) at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3788) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707) at javax.servlet.http.HttpServlet.service(HttpServlet.java:820) at org.mortbay.jetty.servlet.ServletHolder.handle(ServletHolder.java:511) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1221) at org.apache.hadoop.http.HttpServer$QuotingInputFilter.doFilter(HttpServer.java:829) at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212) at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399) at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216) at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182) at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766) at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450) at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230) at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152) at org.mortbay.jetty.Server.handle(Server.java:326) at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542) at org.mortbay.jetty.HttpConnection$RequestHandler.headerComplete(HttpConnection.java:928) at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:549) at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:212) at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404) at org.mortbay.io.nio.SelectChannelEndPoint.run(SelectChannelEndPoint.java:410) at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582) Caused by: java.io.IOException: Broken pipe at sun.nio.ch.FileDispatcherImpl.write0(Native Method) at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:89) at sun.nio.ch.IOUtil.write(IOUtil.java:60) at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:450) at org.mortbay.io.nio.ChannelEndPoint.flush(ChannelEndPoint.java:171) at org.mortbay.io.nio.SelectChannelEndPoint.flush(SelectChannelEndPoint.java:221) at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:725) ...
27 more 2011-12-12 02:24:43,557 WARN org.mortbay.log: Committed before 410 getMapOutput(attempt_201112120200_0004_m_07_0,0) failed : org.mortbay.jetty.EofException at org.mortbay.jetty.HttpGenerator.flush(HttpGenerator.java:791) at org.mortbay.jetty.AbstractGenerator$Output.blockForOutput(AbstractGenerator.java:551) at org.mortbay.jetty.AbstractGenerator$Output.flush(AbstractGenerator.java:572) at org.mortbay.jetty.HttpConnection$Output.flush(HttpConnection.java:1012) at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:651) at org.mortbay.jetty.AbstractGenerator$Output.write(AbstractGenerator.java:580) at org.apache.hadoop.mapred.TaskTracker$MapOutputServlet.doGet(TaskTracker.java:3788) at javax.servlet.http.HttpServlet.service(HttpServlet.java:707)
Connection reset by peer Error
Hi, I've been getting this error multiple times now, the namenode mentions something about peer resetting connection, but I don't know why this is happening, because I'm running on a single machine with 12 cores any ideas? The job starting running normally, which contains about 200 mappers each opens 200 files (one file at a time inside mapper code) then: .. . ... 11/11/20 06:27:52 INFO mapred.JobClient: map 55% reduce 0% 11/11/20 06:28:38 INFO mapred.JobClient: map 56% reduce 0% 11/11/20 06:29:18 INFO mapred.JobClient: Task Id : attempt_20200450_0001_m_ 000219_0, Status : FAILED org.apache.hadoop.ipc.RemoteException: java.io.IOException: File /user/mark/output/_temporary/_attempt_20200450_0001_m_000219_0/part-00219 could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1271) at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422) at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953) at org.apache.hadoop.ipc.Client.call(Client.java:740) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at $Proxy1.addBlock(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at $Proxy1.addBlock(Unknown Source) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:2937) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2819) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2102) at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2288) ... ... 
Namenode Log: 2011-11-20 06:29:51,964 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=mark,ucsb ip=/127.0.0.1 cmd=open src=/user/mark/input/G14_10_al dst=null perm=null 2011-11-20 06:29:52,039 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=mark,ucsb ip=/127.0.0.1 cmd=open src=/user/mark/input/G13_12_aq dst=null perm=null 2011-11-20 06:29:52,178 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=mark,ucsb ip=/127.0.0.1 cmd=open src=/user/mark/input/G14_10_an dst=null perm=null 2011-11-20 06:29:52,348 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:50010 is added to blk_-2308051162058662821_1643 size 20024660 2011-11-20 06:29:52,348 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: file /user/mark/output/_temporary/_attempt_20200450_0001_m_000222_0/part-00222 is closed by DFSClient_attempt_20200450_0001_m_000222_0 2011-11-20 06:29:52,351 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:50010 is added to blk_9206172750679206987_1639 size 51330092 2011-11-20 06:29:52,352 INFO org.apache.hadoop.hdfs.StateChange: DIR* NameSystem.completeFile: file /user/mark/output/_temporary/_attempt_20200450_0001_m_000226_0/part-00226 is closed by DFSClient_attempt_20200450_0001_m_000226_0 2011-11-20 06:29:52,416 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: ugi=mark,ucsb ip=/127.0.0.1 cmd=create src=/user/mark/output/_temporary/_attempt_20200450_0001_m_000223_2/part-00223 dst=null perm=mark:supergroup:rw-r--r-- 2011-11-20 06:29:52,430 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 12123: readAndProcess threw exception java.io.IOException: Connection reset by peer. Count of bytes read: 0 java.io.IOException: Connection reset by peer at sun.nio.ch.FileDispatcher.read0(Native Method) at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:21) at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:202) at sun.nio.ch.IOUtil.read(IOUtil.java:175) at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:243) at org.apache.hadoop.ipc.Server.channelRead(Server.java:1211
Upgrading master hardware
We will be adding more memory to our master node in the near future. We generally don't mind if our map/reduce jobs are unable to run for a short period, but we are more concerned about the impact this may have on our HBase cluster. Will HBase continue to work while Hadoop's name-node and/or HMaster is down? If not, what are some ways we could minimize our downtime? Thanks
reading Hadoop output messages
Hi all, I'm wondering if there is a way to get the output messages that are printed from the main class of a Hadoop job. Usually redirecting with 2>&1 > out.log would work, but in this case it only saves the output messages printed in the main class before starting the job. What I want is the output messages that are printed in the main class after the job is done. For example, in my main class: try { JobClient.runJob(conf); } catch (Exception e) { e.printStackTrace(); } // submit job to JT sLogger.info("\n Job Finished in " + (System.currentTimeMillis() - startTime) / 6.0 + " Minutes."); I can't see that last message unless I watch the screen. Any ideas? Thank you, Mark
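One possible workaround (a sketch under assumptions, not a confirmed answer from the list): attach a log4j FileAppender in the driver, so that anything logged after JobClient.runJob() returns lands in a file regardless of how the console is redirected. The log4j classes below ship with Hadoop 0.20; the log file path is made up for the example:

import org.apache.log4j.FileAppender;
import org.apache.log4j.Logger;
import org.apache.log4j.PatternLayout;

public class DriverLogSketch {
  public static void main(String[] args) throws Exception {
    Logger log = Logger.getLogger(DriverLogSketch.class);
    // Route the driver's own messages to a file, independent of stdout/stderr redirection.
    log.addAppender(new FileAppender(new PatternLayout("%d %-5p %m%n"), "/tmp/driver-out.log", true));
    long startTime = System.currentTimeMillis();
    // ... JobClient.runJob(conf) would go here ...
    log.info("Job Finished in " + (System.currentTimeMillis() - startTime) / 60000.0 + " Minutes.");
  }
}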
setGroupingComparatorClass
Hi, Hadoop experts, I've written my custom GroupComparator, and I want to tell Hadoop about it. Now, there is a call job.setGroupingComparatorClass(), but I only find it in the mapreduce package of version 0.21. In prior versions, I see a similar call, conf.setOutputValueGroupingComparator(GroupComparator.class); but it does not cause my GroupComparator to be used. So my question is, should I change the code to use the mapreduce package (not a problem, since Cloudera has it backported to the current distribution), or is there a different, simpler way? Thank you. Sincerely, Mark
Re: setGroupingComparatorClass
Here is my GroupComparator. With it, I want to use just one part of my composite key, in order to say that all the keys that match in that part should go to the same reducer and be presented to the reducer with their values. So: public class GroupComparator extends WritableComparator { public GroupComparator() { super(KeyTuple.class, true); } @Override public int compare(WritableComparable K1, WritableComparable K2) { KeyTuple t1 = (KeyTuple) K1; KeyTuple t2 = (KeyTuple) K2; return t1.getSku().compareTo(t2.getSku()); } } Then in the reducer I would expect many values, for all keys that I declared equal in my GroupComparator: public void reduce(KeyTuple key, Iterator<Text> values, OutputCollector<Text, Text> output, Reporter reporter) throws IOException { System.out.println("Reducer key = " + key); while (values.hasNext()) { Text value = values.next(); System.out.println("Reducer value = " + value); } } Instead, I still get individual full keys with one value, and the debugger does not step into my GroupComparator. Thanks a bunch! Mark On Tue, Nov 1, 2011 at 1:32 PM, Harsh J ha...@cloudera.com wrote: Hey Mark, What problem do you see when you use JobConf#setOutputValueGroupingComparator(…) when writing jobs with the stable API? I've used it many times and it does get applied. On Tue, Nov 1, 2011 at 10:38 PM, Mark Kerzner markkerz...@gmail.com wrote: Hi, Hadoop experts, I've written my custom GroupComparator, and I want to tell Hadoop about it. Now, there is a call job.setGroupingComparatorClass(), but I only find it in the mapreduce package of version 0.21. In prior versions, I see a similar call, conf.setOutputValueGroupingComparator(GroupComparator.class); but it does not cause my GroupComparator to be used. So my question is, should I change the code to use the mapreduce package (not a problem, since Cloudera has it backported to the current distribution), or is there a different, simpler way? Thank you. Sincerely, Mark -- Harsh J
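For context, this is roughly how the old-API driver wiring looks when a value-grouping comparator is meant to take effect; the JobConf calls are real, but the driver class and the partitioner are placeholders rather than Mark's code, and whether a missing partitioner is the actual culprit here is only a guess:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;

// Fragment from a hypothetical driver using the stable (mapred) API.
JobConf conf = new JobConf(MyDriver.class);                   // MyDriver is a placeholder
conf.setMapOutputKeyClass(KeyTuple.class);
conf.setMapOutputValueClass(Text.class);
conf.setOutputValueGroupingComparator(GroupComparator.class); // groups keys for each reduce() call
// The grouping comparator only sees keys that land on the same reducer, so a
// partitioner keyed on the same field (the SKU) is usually set alongside it.
conf.setPartitionerClass(SkuPartitioner.class);               // hypothetical partitioner on getSku()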
Default Compression
I recently added the following to my core-site.xml: <property> <name>io.compression.codecs</name> <value> org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec </value> </property> However when I try and test a simple MR job I am seeing the following errors in my log. java.lang.IllegalArgumentException: Compression codec org.apache.hadoop.io.compress.DefaultCodec not found. at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:116) at org.apache.hadoop.io.compress.CompressionCodecFactory.init(CompressionCodecFactory.java:156) at org.apache.hadoop.mapreduce.lib.input.TextInputFormat.isSplitable(TextInputFormat.java:51) at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:254) at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:944) Aren't these codecs installed by default? If not, how would I enable them? Thanks
Re: Default Compression
That did it. Thanks On 10/31/11 12:52 PM, Joey Echeverria wrote: Try getting rid of the extra spaces and new lines. -Joey On Mon, Oct 31, 2011 at 1:49 PM, Mark static.void@gmail.com wrote: I recently added the following to my core-site.xml: <property> <name>io.compression.codecs</name> <value> org.apache.hadoop.io.compress.DefaultCodec, org.apache.hadoop.io.compress.GzipCodec, org.apache.hadoop.io.compress.BZip2Codec </value> </property> However when I try and test a simple MR job I am seeing the following errors in my log. java.lang.IllegalArgumentException: Compression codec org.apache.hadoop.io.compress.DefaultCodec not found. at org.apache.hadoop.io.compress.CompressionCodecFactory.getCodecClasses(CompressionCodecFactory.java:116) at org.apache.hadoop.io.compress.CompressionCodecFactory.init(CompressionCodecFactory.java:156) at org.apache.hadoop.mapreduce.lib.input.TextInputFormat.isSplitable(TextInputFormat.java:51) at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:254) at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:944) Aren't these codecs installed by default? If not, how would I enable them? Thanks
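In other words, the value has to be one comma-separated string of class names with no embedded whitespace. The same thing expressed programmatically, just to show the exact string shape (setting it in core-site.xml works identically):

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// One line, commas only -- the stray spaces and newlines were what broke the codec lookup.
conf.set("io.compression.codecs",
    "org.apache.hadoop.io.compress.DefaultCodec,"
    + "org.apache.hadoop.io.compress.GzipCodec,"
    + "org.apache.hadoop.io.compress.BZip2Codec");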
Re: Cannot access JobTracker GUI (port 50030) via web browser while running on Amazon EC2
I have the same issue and the output of curl localhost:50030 is like yours, and I'm running on a remote cluster in pseudo-distributed mode. Can anyone help? Thanks, Mark On Mon, Oct 24, 2011 at 11:02 AM, Sameer Farooqui cassandral...@gmail.com wrote: Hi guys, I'm running a 1-node Hadoop 0.20.2 pseudo-distributed node with RedHat 6.1 on Amazon EC2 and while my node is healthy, I can't seem to get the JobTracker GUI working. Running 'curl localhost:50030' from the CMD line returns a valid HTML file. Ports 50030, 50060, 50070 are open in the Amazon Security Group. MapReduce jobs are starting and completing successfully, so my Hadoop install is working fine. But when I try to access the web GUI from a Chrome browser on my local computer, I get nothing. Any thoughts? I tried some Google searches and even did a hail-mary Bing search, but none of them were fruitful. Some troubleshooting I did is below: [root@ip-10-86-x-x ~]# jps 1337 QuorumPeerMain 1494 JobTracker 1410 DataNode 1629 SecondaryNameNode 1556 NameNode 1694 TaskTracker 1181 HRegionServer 1107 HMaster 11363 Jps [root@ip-10-86-x-x ~]# curl localhost:50030 <meta HTTP-EQUIV="REFRESH" content="0;url=jobtracker.jsp"/> <html> <head> <title>Hadoop Administration</title> </head> <body> <h1>Hadoop Administration</h1> <ul> <li><a href="jobtracker.jsp">JobTracker</a></li> </ul> </body> </html>
Re: Cannot access JobTracker GUI (port 50030) via web browser while running on Amazon EC2
Thank you, I'll try it. Mark On Mon, Oct 24, 2011 at 1:50 PM, Sameer Farooqui cassandral...@gmail.com wrote: Mark, We figured it out. It's an issue with RedHat's IPTables. You have to open up those ports: vim /etc/sysconfig/iptables Make the file look like this: # Firewall configuration written by system-config-firewall # Manual customization of this file is not recommended. *filter :INPUT ACCEPT [0:0] :FORWARD ACCEPT [0:0] :OUTPUT ACCEPT [0:0] -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT -A INPUT -p icmp -j ACCEPT -A INPUT -i lo -j ACCEPT -A INPUT -m state --state NEW -m tcp -p tcp --dport 22 -j ACCEPT -A INPUT -m state --state NEW -m tcp -p tcp --dport 80 -j ACCEPT -A INPUT -m state --state NEW -m tcp -p tcp --dport 50030 -j ACCEPT -A INPUT -m state --state NEW -m tcp -p tcp --dport 50060 -j ACCEPT -A INPUT -m state --state NEW -m tcp -p tcp --dport 50070 -j ACCEPT -A INPUT -j REJECT --reject-with icmp-host-prohibited -A FORWARD -j REJECT --reject-with icmp-host-prohibited COMMIT Then restart the service: /etc/init.d/iptables restart iptables: Flushing firewall rules: [ OK ] iptables: Setting chains to policy ACCEPT: filter [ OK ] iptables: Unloading modules: [ OK ] iptables: Applying firewall rules: [ OK ] On Mon, Oct 24, 2011 at 1:37 PM, Mark question markq2...@gmail.com wrote: I have the same issue and the output of curl localhost:50030 is like yours, and I'm running on a remote cluster in pseudo-distributed mode. Can anyone help? Thanks, Mark On Mon, Oct 24, 2011 at 11:02 AM, Sameer Farooqui cassandral...@gmail.com wrote: Hi guys, I'm running a 1-node Hadoop 0.20.2 pseudo-distributed node with RedHat 6.1 on Amazon EC2 and while my node is healthy, I can't seem to get the JobTracker GUI working. Running 'curl localhost:50030' from the CMD line returns a valid HTML file. Ports 50030, 50060, 50070 are open in the Amazon Security Group. MapReduce jobs are starting and completing successfully, so my Hadoop install is working fine. But when I try to access the web GUI from a Chrome browser on my local computer, I get nothing. Any thoughts? I tried some Google searches and even did a hail-mary Bing search, but none of them were fruitful. Some troubleshooting I did is below: [root@ip-10-86-x-x ~]# jps 1337 QuorumPeerMain 1494 JobTracker 1410 DataNode 1629 SecondaryNameNode 1556 NameNode 1694 TaskTracker 1181 HRegionServer 1107 HMaster 11363 Jps [root@ip-10-86-x-x ~]# curl localhost:50030 <meta HTTP-EQUIV="REFRESH" content="0;url=jobtracker.jsp"/> <html> <head> <title>Hadoop Administration</title> </head> <body> <h1>Hadoop Administration</h1> <ul> <li><a href="jobtracker.jsp">JobTracker</a></li> </ul> </body> </html>
Remote Blocked Transfer count
Hello, I wonder if there is a way to measure how many of the data blocks have been transferred over the network? Or, more generally, how many times was there a connection/contact between different machines? I thought of checking the Namenode log file, which usually shows blk_ from src= to dst ..., but I'm not sure if it's correct to count those lines. Any ideas are helpful. Mark
fixing the mapper percentage viewer
Hi all, I've written a custom MapRunner, but it seems to have ruined the percentage shown for maps on the console. I want to know which part of the code is responsible for adjusting the percentage of maps ... Is it the following in MapRunner: if (incrProcCount) { reporter.incrCounter(SkipBadRecords.COUNTER_GROUP, SkipBadRecords.COUNTER_MAP_PROCESSED_RECORDS, 1); } Thank you, Mark
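For reference, a hedged sketch of where the map percentage normally comes from in the old API: the framework derives it from the record reader's position in the split, and a custom MapRunnable can also push it explicitly through the Reporter. This is only an illustration of those hooks, not a reconstruction of Mark's runner:

import java.io.IOException;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapRunnable;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class ProgressAwareRunner<K1, V1, K2, V2> implements MapRunnable<K1, V1, K2, V2> {
  public void configure(JobConf job) { }

  public void run(RecordReader<K1, V1> input, OutputCollector<K2, V2> output,
                  Reporter reporter) throws IOException {
    K1 key = input.createKey();
    V1 value = input.createValue();
    while (input.next(key, value)) {
      // ... invoke the mapper here ...
      // getProgress() reflects how much of the split has been consumed; reporting it
      // keeps the console/web UI percentage honest while the runner does its own I/O.
      reporter.setProgress(input.getProgress());
    }
  }
}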
Re: hadoop input buffer size
Thanks for the clarifications guys :) Mark On Mon, Oct 10, 2011 at 8:27 AM, Uma Maheswara Rao G 72686 mahesw...@huawei.com wrote: I think below can give you more info about it. http://developer.yahoo.com/blogs/hadoop/posts/2009/08/the_anatomy_of_hadoop_io_pipel/ Nice explanation by Owen here. Regards, Uma - Original Message - From: Yang Xiaoliang yangxiaoliang2...@gmail.com Date: Wednesday, October 5, 2011 4:27 pm Subject: Re: hadoop input buffer size To: common-user@hadoop.apache.org Hi, Hadoop neither read one line each time, nor fetching dfs.block.size of lines into a buffer, Actually, for the TextInputFormat, it read io.file.buffer.size bytes of text into a buffer each time, this can be seen from the hadoop source file LineReader.java 2011/10/5 Mark question markq2...@gmail.com Hello, Correct me if I'm wrong, but when a program opens n-files at the same time to read from, and start reading from each file at a time 1 line at a time. Isn't hadoop actually fetching dfs.block.size of lines into a buffer? and not actually one line. If this is correct, I set up my dfs.block.size = 3MB and each line takes about 650 bytes only, then I would assume the performance for reading 1-4000 lines would be the same, but it isn't ! Do you know a way to find #n of lines to be read at once? Thank you, Mark
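So the knob that governs how much text the LineReader pulls in per read is io.file.buffer.size (commonly 4096 bytes by default), not dfs.block.size. A one-line sketch of changing it; the 128 KB value is an arbitrary example, not a recommendation from this thread:

import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Read buffer used by LineReader/TextInputFormat, in bytes.
conf.setInt("io.file.buffer.size", 128 * 1024);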
hadoop input buffer size
Hello, Correct me if I'm wrong, but when a program opens n-files at the same time to read from, and start reading from each file at a time 1 line at a time. Isn't hadoop actually fetching dfs.block.size of lines into a buffer? and not actually one line. If this is correct, I set up my dfs.block.size = 3MB and each line takes about 650 bytes only, then I would assume the performance for reading 1-4000 lines would be the same, but it isn't ! Do you know a way to find #n of lines to be read at once? Thank you, Mark
How to run Hadoop in standalone mode in Windows
Hi, I have cygwin, and I have NetBeans, and I have a maven Hadoop project that works on Linux. How do I combine them to work in Windows? Thank you, Mark
Am i crazy? - question about hadoop streaming
Hi, I am using the latest Cloudera distribution, and with that I am able to use the latest Hadoop API, which I believe is 0.21, for such things as import org.apache.hadoop.mapreduce.Reducer; So I am using mapreduce, not mapred, and everything works fine. However, in a small streaming job, trying it out with Java classes first, I get this error: Exception in thread "main" java.lang.RuntimeException: class mypackage.Map not org.apache.hadoop.mapred.Mapper -- which it really is not, it is a mapreduce.Mapper. So it seems that Cloudera backports some of the advances, but for streaming it is still the old API. So is it me or the world? Thank you, Mark
Re: Am i crazy? - question about hadoop streaming
I am sorry, you are right. mark On Wed, Sep 14, 2011 at 9:52 PM, Konstantin Boudnik c...@apache.org wrote: I am sure if you ask at provider's specific list you'll get a better answer than from common Hadoop list ;) Cos On Wed, Sep 14, 2011 at 09:48PM, Mark Kerzner wrote: Hi, I am using the latest Cloudera distribution, and with that I am able to use the latest Hadoop API, which I believe is 0.21, for such things as import org.apache.hadoop.mapreduce.Reducer; So I am using mapreduce, not mapred, and everything works fine. However, in a small streaming job, trying it out with Java classes first, I get this error Exception in thread main java.lang.RuntimeException: class mypackage.Map not org.apache.hadoop.mapred.Mapper -- which it really is not, it is a mapreduce.Mapper. So it seems that Cloudera backports some of the advances but for streaming it is still the old API. So it is me or the world? Thank you, Mark -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.10 (GNU/Linux) iF4EAREIAAYFAk5xaGIACgkQenyFlstYjhKtZAEAmNtHK9DqBFmZ2DTJgAxEbF+p P0Tek1iW1P1ZwlqGDRIA/AuVVaNiul1bQM0NRYuAVxLn7sJOTSCQG5PRGJUQdvjq =Z/hO -END PGP SIGNATURE-
Re: Am i crazy? - question about hadoop streaming
Thank you, Prashant, it seems so. I already verified this by refactoring the code to use 0.20 API as well as 0.21 API in two different packages, and streaming happily works with 0.20. Mark On Wed, Sep 14, 2011 at 11:46 PM, Prashant prashan...@imaginea.com wrote: On 09/15/2011 08:18 AM, Mark Kerzner wrote: Hi, I am using the latest Cloudera distribution, and with that I am able to use the latest Hadoop API, which I believe is 0.21, for such things as import org.apache.hadoop.mapreduce.**Reducer; So I am using mapreduce, not mapred, and everything works fine. However, in a small streaming job, trying it out with Java classes first, I get this error Exception in thread main java.lang.RuntimeException: class mypackage.Map not org.apache.hadoop.mapred.**Mapper -- which it really is not, it is a mapreduce.Mapper. So it seems that Cloudera backports some of the advances but for streaming it is still the old API. So it is me or the world? Thank you, Mark The world!
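For anyone hitting the same wall: the streaming jar in that distribution loads mappers against the old org.apache.hadoop.mapred interfaces, so the class it is handed has to look roughly like the sketch below. The package/class name and the Text-to-Text signature are placeholders, not Mark's actual code:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Old-API mapper: implements org.apache.hadoop.mapred.Mapper, which is the type
// the streaming driver checks for (hence the RuntimeException above).
public class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    output.collect(new Text("line"), value); // placeholder logic
  }
}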
Too many maps?
Hi, I am testing my Hadoop-based FreeEed http://frd.org/, an open source tool for eDiscovery, and I am using the Enron data set http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set-v2 for that. In my processing, each email with its attachments becomes a map, and it is later collected by a reducer and written to the output. With the (PST) mailboxes of around 2-5 Gigs, I begin to see email counts of about 50,000. I remember from Yahoo best practices that the number of maps should not exceed 75,000, and I can see that I will break this barrier soon. I could, potentially, combine a few emails into one map, but I would be doing it only to circumvent the size problem, not because my processing requires it. Besides, my keys are the MD5 hashes of the files, and I use them to find duplicates. If I combine a few emails into a map, I cannot use the hashes as keys in a meaningful way anymore. So my question is, can't I have millions of maps, if that's how many artifacts I need to process, and why not? Thank you. Sincerely, Mark
Re: Too many maps?
Harsh, I read one PST file, which contains many emails. But then I emit many maps, like this MapWritable mapWritable = createMapWritable(metadata, fileName); // use MD5 of the input file as Hadoop key FileInputStream fileInputStream = new FileInputStream(fileName); MD5Hash key = MD5Hash.digest(fileInputStream); fileInputStream.close(); // emit map context.write(key, mapWritable); and it is this context.write calls that I have a great number of. Is that a problem? Mark On Tue, Sep 6, 2011 at 10:06 PM, Harsh J ha...@cloudera.com wrote: You can use an input format that lets you read multiple files per map (like say, all local files. See CombineFileInputFormat for one implementation that does this). This way you get reduced map #s and you don't really have to clump your files. One record reader would be initialized per file, so I believe you should be free to generate unique identities per file/email with this approach (whenever a new record reader is initialized)? On Wed, Sep 7, 2011 at 7:12 AM, Mark Kerzner markkerz...@gmail.com wrote: Hi, I am testing my Hadoop-based FreeEed http://frd.org/, an open source tool for eDiscovery, and I am using the Enron data sethttp://www.edrm.net/resources/data-sets/edrm-enron-email-data-set-v2 for that. In my processing, each email with its attachments becomes a map, and it is later collected by a reducer and written to the output. With the (PST) mailboxes of around 2-5 Gigs, I begin to the see the numbers of emails of about 50,000. I remember in Yahoo best practices that the number of maps should not exceed 75,000, and I can see that I can break this barrier soon. I could, potentially, combine a few emails into one map, but I would be doing it only to circumvent the size problem, not because my processing requires it. Besides, my keys are the MD5 hashes of the files, and I use them to find duplicates. If I combine a few emails into a map, I cannot use the hashes as keys in a meaningful way anymore. So my question is, can't I have millions of maps, if that's how many artifacts I need to process, and why not? Thank you. Sincerely, Mark -- Harsh J
Re: Too many maps?
Thank you, Sonal, at least that big job I was looking at just finished :) Mark On Tue, Sep 6, 2011 at 11:56 PM, Sonal Goyal sonalgoy...@gmail.com wrote: Mark, Having a large number of emitted key values from the mapper should not be a problem. Just make sure that you have enough reducers to handle the data so that the reduce stage does not become a bottleneck. Best Regards, Sonal Crux: Reporting for HBase https://github.com/sonalgoyal/crux Nube Technologies http://www.nubetech.co http://in.linkedin.com/in/sonalgoyal On Wed, Sep 7, 2011 at 8:44 AM, Mark Kerzner markkerz...@gmail.com wrote: Harsh, I read one PST file, which contains many emails. But then I emit many maps, like this MapWritable mapWritable = createMapWritable(metadata, fileName); // use MD5 of the input file as Hadoop key FileInputStream fileInputStream = new FileInputStream(fileName); MD5Hash key = MD5Hash.digest(fileInputStream); fileInputStream.close(); // emit map context.write(key, mapWritable); and it is this context.write calls that I have a great number of. Is that a problem? Mark On Tue, Sep 6, 2011 at 10:06 PM, Harsh J ha...@cloudera.com wrote: You can use an input format that lets you read multiple files per map (like say, all local files. See CombineFileInputFormat for one implementation that does this). This way you get reduced map #s and you don't really have to clump your files. One record reader would be initialized per file, so I believe you should be free to generate unique identities per file/email with this approach (whenever a new record reader is initialized)? On Wed, Sep 7, 2011 at 7:12 AM, Mark Kerzner markkerz...@gmail.com wrote: Hi, I am testing my Hadoop-based FreeEed http://frd.org/, an open source tool for eDiscovery, and I am using the Enron data set http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set-v2 for that. In my processing, each email with its attachments becomes a map, and it is later collected by a reducer and written to the output. With the (PST) mailboxes of around 2-5 Gigs, I begin to the see the numbers of emails of about 50,000. I remember in Yahoo best practices that the number of maps should not exceed 75,000, and I can see that I can break this barrier soon. I could, potentially, combine a few emails into one map, but I would be doing it only to circumvent the size problem, not because my processing requires it. Besides, my keys are the MD5 hashes of the files, and I use them to find duplicates. If I combine a few emails into a map, I cannot use the hashes as keys in a meaningful way anymore. So my question is, can't I have millions of maps, if that's how many artifacts I need to process, and why not? Thank you. Sincerely, Mark -- Harsh J
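A hedged footnote on Sonal's point: many context.write() calls per map task are not the 75,000-map concern raised above, but all of the emitted (MD5, MapWritable) pairs funnel into the reduce phase, so the reducer count is the thing to size. In the new API that is a one-liner (the value 20 and the job name are arbitrary placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Configuration conf = new Configuration();
Job job = new Job(conf, "freeeed-stage");  // hypothetical job name
job.setNumReduceTasks(20);                 // size this to the volume of emitted pairs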
Re: tutorial on Hadoop/Hbase utility classes
Thank you, Sujee. StringUtils are useful, but so is Guava Mark On Wed, Aug 31, 2011 at 6:57 PM, Sujee Maniyam su...@sujee.net wrote: Here is a tutorial on some handy Hadoop classes - with sample source code. http://sujee.net/tech/articles/hadoop-useful-classes/ Would appreciate any feedback / suggestions. thanks all Sujee Maniyam http://sujee.net
Inaugural Indianapolis HUG Aug. 23 @ ChaCha
Hey all, I'd like to announce the inaugural meetup of the Indianapolis Hadoop User Group (IndyHUG), which will take place on August 23 at ChaCha Search Inc. The initial topic of discussion will be an intro to MapReduce, but we'll get as in-depth as the attendees would like. ChaCha has a nice area available for meetup space, and will be providing refreshments. We'll get things started around 6:00 pm. You can find more info and RSVP at http://www.meetup.com/IndyHUG/ If you live in the Indianapolis area or plan to be in the area at that time, please RSVP and stop by. We hope to see you then! -Mark Stetzer
The best architecture for EC2/Hadoop interface?
Hi, I want to give my users a GUI that would allow them to start Hadoop clusters and run applications that I will provide on the AMIs. What would be a good approach to make it simple for the user? Should I write a Java Swing app that will wrap around the EC2 commands? Should I use some more direct EC2 API? Or should I use a web browser interface? My idea was to give the user a Java Swing GUI, so that he gives his Amazon credentials to it, and it would be secure because the application is not exposed to the outside. Does this approach make sense? Thank you, Mark My project for which I want to do it: https://github.com/markkerzner/FreeEed
Re: First open source Predictive modeling framework on Apache hadoop
Congratulations, looks very interesting. Mark On Sun, Jul 24, 2011 at 1:15 AM, madhu phatak phatak@gmail.com wrote: Hi, We released Nectar,first open source predictive modeling on Apache Hadoop. Please check it out. Info page http://zinniasystems.com/zinnia.jsp?lookupPage=blogs/nectar.jsp Git Hub https://github.com/zinnia-phatak-dev/Nectar/downloads Reagards Madhukara Phatak,Zinnia Systems
Mapper Progress
Hi, I have my custom MapRunner which apparently seemed to affect the progress report of the mapper and showing 100% while the mapper is still reading files to process. Where can I change/add a progress object to be shown in UI ? Thank you, Mark
Re: Which release to use?
Steve, this is so well said, do you mind if I repeat it here, http://shmsoft.blogspot.com/2011/07/hadoop-commercial-support-options.html Thank you, Mark On Fri, Jul 15, 2011 at 4:00 PM, Steve Loughran ste...@apache.org wrote: On 15/07/2011 15:58, Michael Segel wrote: Unfortunately the picture is a bit more confusing. Yahoo! is now HortonWorks. Their stated goal is to not have their own derivative release but to sell commercial support for the official Apache release. So those selling commercial support are: *Cloudera *HortonWorks *MapRTech *EMC (reselling MapRTech, but had announced their own) *IBM (not sure what they are selling exactly... still seems like smoke and mirrors...) *DataStax + Amazon, indirectly, that do their own derivative work of some release of Hadoop (which version is it based on?) I've used 0.21, which was the first with the new APIs and, with MRUnit, has the best test framework. For my small-cluster uses, it worked well. (oh, and I didn't care about security)
Can't start the namenode
Hi, when I am trying to start a namenode in pseudo-mode sudo /etc/init.d/hadoop-0.20-namenode start I get a permission error java.io.FileNotFoundException: /usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-myservername.log (Permission denied) However, it does create another log file in the same directory ls /usr/lib/hadoop-0.20/logs hadoop-hadoop-namenode-myservername.out I am using CDH3, what am I doing wrong? Thank you, Mark
Re: Can't start the namenode
I kind of found the problem. If I open the logs directory, I see that this log file is created by hdfs -rw-r--r-- 1 hdfs hdfs 1399 Jul 6 21:48 hadoop-hadoop-namenode-myservername.log whereas the rest of the logs are created by root, and they have no problem doing this. I can adjust permissions on the logs directory, but I would expect this to be automatic. On Wed, Jul 6, 2011 at 11:38 PM, Mark Kerzner markkerz...@gmail.com wrote: Hi, when I am trying to start a namenode in pseudo-mode sudo /etc/init.d/hadoop-0.20-namenode start I get a permission error java.io.FileNotFoundException: /usr/lib/hadoop-0.20/logs/hadoop-hadoop-namenode-myservername.log (Permission denied) However, it does create another log file in the same directory ls /usr/lib/hadoop-0.20/logs hadoop-hadoop-namenode-myservername.out I am using CDH3, what am I doing wrong? Thank you, Mark
Writing out a single file
Is there any way I can write out the results of my MapReduce job into one local file... i.e., the opposite of getmerge? Thanks
Re: One file per mapper
Hi Govind, You should override the isSplitable function of FileInputFormat in a class, say myFileInputFormat extends FileInputFormat, as follows: @Override public boolean isSplitable(FileSystem fs, Path filename) { return false; } Then you use your myFileInputFormat class. To know the path, write the following in your mapper class: @Override public void configure(JobConf job) { Path inputPath = new Path(job.get("map.input.file")); } ~cheers, Mark On Tue, Jul 5, 2011 at 1:04 PM, Govind Kothari govindkoth...@gmail.com wrote: Hi, I am new to hadoop. I have a set of files and I want to assign each file to a mapper. Also in mapper there should be a way to know the complete path of the file. Can you please tell me how to do that ? Thanks, Govind -- Govind Kothari Graduate Student Dept. of Computer Science University of Maryland College Park ---Seek Excellence, Success will Follow ---
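Putting the two fragments above together, a fuller sketch under the same assumptions (old API; the class names are illustrative and the mapper body is a placeholder): a non-splittable TextInputFormat so each map task gets a whole file, plus a mapper that reads its input path from map.input.file. The driver would then call conf.setInputFormat(WholeFileTextInputFormat.class).

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.TextInputFormat;

// Never split, so one map task processes one whole file.
public class WholeFileTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path filename) {
    return false;
  }
}

// Mapper that learns which file it was given.
class PathAwareMapper extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {
  private Path inputPath;

  @Override
  public void configure(JobConf job) {
    inputPath = new Path(job.get("map.input.file"));
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    output.collect(new Text(inputPath.getName()), value); // placeholder logic
  }
}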
Re: Hadoop Summit - Poster 49
Ah, I just came from Santa Clara! Will there be sessions online? Thank you, Mark On Tue, Jun 28, 2011 at 2:43 PM, Bharath Mundlapudi bharathw...@yahoo.comwrote: Hello All, As you all know, tomorrow is the Hadoop Summit 2011. There will be many interesting talks tomorrow. Don't miss any talk if you want to see how long Hadoop progressed. Link: http://developer.yahoo.com/events/hadoopsummit2011 Among those many interesting talks or posters sessions, One small poster session is Hadoop Disk Fail Inplace. One of the common problems in managing Hadoop Cluster is disk failure. If you want to hear or share disk related problems in Hadoop, please visit us at Poster 49. I am very happy to share how we are dealing with disk failures and eager to learn from your experiences. Looking forward to meeting you all, Bharath
Re: Comparing two logs, finding missing records
Interesting, Bharath, I will look at these. Mark On Sun, Jun 26, 2011 at 5:12 PM, Bharath Mundlapudi bharathw...@yahoo.comwrote: If you have Serde or PigLoader for your log format, probably Pig or Hive will be a quicker solution with the join. -Bharath From: Mark Kerzner markkerz...@gmail.com To: Hadoop Discussion Group core-u...@hadoop.apache.org Sent: Saturday, June 25, 2011 9:39 PM Subject: Comparing two logs, finding missing records Hi, I have two logs which should have all the records for the same record_id, in other words, if this record_id is found in the first log, it should also be found in the second one. However, I suspect that the second log is filtered out, and I need to find the missing records. Anything is allowed: MapReduce job, Hive, Pig, and even a NoSQL database. Thank you. It is also a good time to express my thanks to all the members of the group who are always very helpful. Sincerely, Mark
Re: Comparing two logs, finding missing records
Bharath, how would a Pig query look like? Thank you, Mark On Sun, Jun 26, 2011 at 5:12 PM, Bharath Mundlapudi bharathw...@yahoo.comwrote: If you have Serde or PigLoader for your log format, probably Pig or Hive will be a quicker solution with the join. -Bharath From: Mark Kerzner markkerz...@gmail.com To: Hadoop Discussion Group core-u...@hadoop.apache.org Sent: Saturday, June 25, 2011 9:39 PM Subject: Comparing two logs, finding missing records Hi, I have two logs which should have all the records for the same record_id, in other words, if this record_id is found in the first log, it should also be found in the second one. However, I suspect that the second log is filtered out, and I need to find the missing records. Anything is allowed: MapReduce job, Hive, Pig, and even a NoSQL database. Thank you. It is also a good time to express my thanks to all the members of the group who are always very helpful. Sincerely, Mark
Comparing two logs, finding missing records
Hi, I have two logs which should have all the records for the same record_id, in other words, if this record_id is found in the first log, it should also be found in the second one. However, I suspect that the second log is filtered out, and I need to find the missing records. Anything is allowed: MapReduce job, Hive, Pig, and even a NoSQL database. Thank you. It is also a good time to express my thanks to all the members of the group who are always very helpful. Sincerely, Mark
Re: Comparing two logs, finding missing records
Kumar, thank you, that is the exact solution to my problem as I have formulated it. That's valid and it stands, but I should have added that the two logs each have time stamps and that we are looking for missing records with time stamps in reasonable proximity. I have come up with a solution where I make rounded time as the key, and then in the reducer sort all records that fall within the rounded time, and after that I am free to find the missing ones or anything else I want about them. What do you think? Sincerely, Mark On Sun, Jun 26, 2011 at 12:34 AM, Kumar Kandasami kumaravel.kandas...@gmail.com wrote: Mark - A thought around accomplishing this as a MapReduce Job - if you could add the the datasource information in the mapper phase with record id as the key, in the reducer phase you can look for record ids with missing datasource and print the record id. Driver Code: MultipleInputs.addInputPath(conf, log1path, InputFormat, Log1Mapper); MultipleInputs.addInputPath(conf, log2path, InputFormat, Log2Mapper); Mapper Phase - Output - Key - Record Id, Value contains the datasource in addition to other values. Logic - add the datasource information to the record. Reduce Phase - Output - Print the Record Id that does not have log2 or log1 datasource value. Logic - add to the output only records that does not have log1 or log2 datasource. Kumar_/|\_ On Sat, Jun 25, 2011 at 11:39 PM, Mark Kerzner markkerz...@gmail.com wrote: Hi, I have two logs which should have all the records for the same record_id, in other words, if this record_id is found in the first log, it should also be found in the second one. However, I suspect that the second log is filtered out, and I need to find the missing records. Anything is allowed: MapReduce job, Hive, Pig, and even a NoSQL database. Thank you. It is also a good time to express my thanks to all the members of the group who are always very helpful. Sincerely, Mark
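Kumar's outline maps fairly directly onto the old-API MultipleInputs helper; a hedged sketch of the driver side (paths, the tagging mappers, and the reducer are placeholders, and Mark's rounded-timestamp variant would simply swap the record_id key for a rounded-time key):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.lib.MultipleInputs;

// Fragment from a hypothetical driver: each mapper tags its records with its source,
// e.g. (record_id, "log1") or (record_id, "log2"); the reducer then reports any
// record_id whose value set is missing one of the two tags.
JobConf conf = new JobConf(LogDiffDriver.class);  // placeholder driver class
MultipleInputs.addInputPath(conf, new Path("/logs/log1"), TextInputFormat.class, Log1Mapper.class);
MultipleInputs.addInputPath(conf, new Path("/logs/log2"), TextInputFormat.class, Log2Mapper.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setReducerClass(MissingRecordReducer.class); // placeholder reducer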
Backup and upgrade practices?
Hi, I am planning a small Hadoop cluster but, looking ahead, are there cheap options for backing up the data? If I later want to upgrade the hardware, do I make a complete copy, or do I upgrade one node at a time? Thank you, Mark