RE: OutOfMemory Error
Yeah, that was the problem. Hama may well be useful for large-scale matrix operations, but for this problem I modified the code to pass only the ID information and to read the vector data only when it is needed, which in this case was only in the reducer phase. This avoids the out-of-memory error and is also faster now.

Thanks
Pallavi

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Edward J. Yoon
Sent: Friday, September 19, 2008 10:35 AM
To: core-user@hadoop.apache.org; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: OutOfMemory Error

> The key is of the form "ID :DenseVector Representation in mahout with

I guess the vector size is too large, so it will need a distributed vector architecture (or 2D partitioning strategies) for large-scale matrix operations. The Hama team is investigating these problem areas, so this may improve if Hama can be used for Mahout in the future.

/Edward

On Thu, Sep 18, 2008 at 12:28 PM, Pallavi Palleti <[EMAIL PROTECTED]> wrote:
>
> Hadoop Version - 17.1
> io.sort.factor = 10
> The key is of the form "ID :DenseVector Representation in mahout with
> dimensionality size = 160k"
> For example: C1:[,0.0011, 3.002, .. 1.001,]
> So, the typical size of the key of the mapper output can be 160K*6 (assuming
> a double in string form is represented in 5 bytes) + 5 (bytes for C1:[]) + the size
> required to store that the object is of type Text
>
> Thanks
> Pallavi
>
> Devaraj Das wrote:
>>
>> On 9/17/08 6:06 PM, "Pallavi Palleti" <[EMAIL PROTECTED]> wrote:
>>
>>> Hi all,
>>>
>>> I am getting an out-of-memory error as shown below when I run map-reduce on a
>>> huge amount of data:
>>> java.lang.OutOfMemoryError: Java heap space
>>>     at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:52)
>>>     at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:90)
>>>     at org.apache.hadoop.io.SequenceFile$Reader.nextRawKey(SequenceFile.java:1974)
>>>     at org.apache.hadoop.io.SequenceFile$Sorter$SegmentDescriptor.nextRawKey(SequenceFile.java:3002)
>>>     at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:2802)
>>>     at org.apache.hadoop.io.SequenceFile$Sorter.merge(SequenceFile.java:2511)
>>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1040)
>>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
>>>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)
>>> The above error comes almost at the end of the map job. I have set the heap
>>> size to 1GB. Still the problem persists. Can someone please help me with
>>> how to avoid this error?
>>
>> What is the typical size of your key? What is the value of io.sort.factor?
>> Hadoop version?
>
> --
> View this message in context:
> http://www.nabble.com/OutOfMemory-Error-tp19531174p19545298.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.

--
Best regards, Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org
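[Editor's note] The back-of-envelope arithmetic in the thread can be sanity-checked with a short script. The 160k dimensionality and ~6 bytes per serialized double come from the emails above; the merge-buffer reasoning is a rough sketch, not a measurement:

```python
# Rough size of one mapper-output key of the form "C1:[v1, v2, ..., v160000]",
# as described in the thread: ~6 bytes per double rendered as text
# (5 characters plus a separator) plus a small prefix for 'C1:[]'.

def key_size_bytes(dims=160_000, bytes_per_double=6, prefix_overhead=5):
    return dims * bytes_per_double + prefix_overhead

size = key_size_bytes()
print(f"one key ~ {size} bytes (~{size / 2**20:.2f} MiB)")

# Keys of ~1 MB each multiply quickly against a 1 GB task heap once the
# merge phase buffers several of them at a time (io.sort.factor segments).
keys_in_1gb_heap = (1 * 2**30) // size
print(f"a 1 GB heap holds at most ~{keys_in_1gb_heap} such keys")
```

This is why shipping only the ID through the shuffle, as Pallavi describes, sidesteps the problem entirely.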
Re: scp to namenode faster than dfs put?
Even if writes are happening in parallel from a single machine, wouldn't the network congestion cause a slowdown due to packet collisions?

- Prasad.

On Thursday 18 September 2008 10:47:48 pm Raghu Angadi wrote:
> Steve Loughran wrote:
> > [EMAIL PROTECTED] wrote:
> >> thanks for the replies. So it looks like replication might be the real
> >> overhead when compared to scp.
> >
> > Makes sense, but there's no reason why you couldn't have the first node you
> > copy the data up to continue and pass that data to the other nodes.
>
> Replication cannot account for a 50% slowdown. When the data is written,
> the writes on replicas are pipelined. So essentially data is written to
> the replicas in parallel.
>
> Raghu.
>
> > If it's in the same rack, you save on backbone bandwidth, and if it is in a
> > different rack, well, the client operation still finishes faster. A
> > feature for someone to implement, perhaps?
> >
> >>> Also dfs put copies multiple replicas, unlike scp.
> >>>
> >>> Lohit
> >>>
> >>> On Sep 17, 2008, at 6:03 AM, "明" <[EMAIL PROTECTED]> wrote:
> >>>
> >>> Actually, no.
> >>> As you said, I understand that "dfs -put" breaks the data into blocks
> >>> and then copies them to datanodes,
> >>> but scp does not break the data into blocks; it just copies the data
> >>> to the namenode!
> >>>
> >>> 2008/9/17, Prasad Pingali <[EMAIL PROTECTED]>:
> >>>
> >>> Hello,
> >>> I observe that scp of data to the namenode is faster than actually
> >>> putting it into dfs (all nodes coming from the same switch and having
> >>> the same ethernet cards, homogeneous nodes). I understand that "dfs -put"
> >>> breaks the data into blocks and then copies them to datanodes, but
> >>> shouldn't that be at least as fast as copying data to the namenode
> >>> from a single machine, if not faster?
> >>>
> >>> thanks and regards,
> >>> Prasad Pingali,
> >>> IIIT Hyderabad.
> >>>
> >>> --
> >>> Sorry for my english!! 明
> >>> Please help me to correct my english expression and error in syntax
Re: Hadoop tracing
I once tried to measure and report these on http://wiki.apache.org/hadoop/DataProcessingBenchmarks. I decided to stop because I just couldn't find time to do it. If you or anyone else has experience with Hadoop benchmarking, please report it on that page. :)

/Edward

On Thu, Sep 18, 2008 at 7:25 PM, Naama Kraus <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I am looking for information in the area of Hadoop tracing, instrumentation,
> benchmarking and so forth.
> What utilities exist? What's their maturity? Where can I get more info
> about them?
>
> I am curious about statistics on Hadoop behavior (per a typical workload?
> different workloads?). I am thinking of various metrics such as:
> percentage of time a Hadoop job spends in the various phases (map, sort &
> shuffle, reduce), on I/O, network, framework execution time, user code
> execution time ...
> Known bottlenecks?
> And whatever other interesting statistics.
>
> Has anyone already measured? Any documented statistics out there?
>
> I have already encountered various tools like the X-Trace-based tracing tool
> from Berkeley, the Hadoop metrics API, the Hadoop instrumentation API
> (HADOOP-3772), Hadoop Vaidya (HADOOP-4179), and the gridmix benchmark.
>
> Does anyone have input on any of those?
> Anything else I missed?
>
> Thanks for any direction,
> Naama
>
> --
> oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
> 00 oo 00 oo
> "If you want your children to be intelligent, read them fairy tales. If you
> want them to be more intelligent, read them more fairy tales." (Albert
> Einstein)

--
Best regards, Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org
Re: OutOfMemory Error
> The key is of the form "ID :DenseVector Representation in mahout with

I guess the vector size is too large, so it will need a distributed vector architecture (or 2D partitioning strategies) for large-scale matrix operations. The Hama team is investigating these problem areas, so this may improve if Hama can be used for Mahout in the future.

/Edward

On Thu, Sep 18, 2008 at 12:28 PM, Pallavi Palleti <[EMAIL PROTECTED]> wrote:
>
> Hadoop Version - 17.1
> io.sort.factor = 10
> The key is of the form "ID :DenseVector Representation in mahout with
> dimensionality size = 160k"
> For example: C1:[,0.0011, 3.002, .. 1.001,]
> So, the typical size of the key of the mapper output can be 160K*6 (assuming
> a double in string form is represented in 5 bytes) + 5 (bytes for C1:[]) + the size
> required to store that the object is of type Text
>
> Thanks
> Pallavi
>
> Devaraj Das wrote:
>>
>> On 9/17/08 6:06 PM, "Pallavi Palleti" <[EMAIL PROTECTED]> wrote:
>>
>>> Hi all,
>>>
>>> I am getting an out-of-memory error as shown below when I run map-reduce on a
>>> huge amount of data:
>>> java.lang.OutOfMemoryError: Java heap space
>>>     at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:52)
>>>     at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:90)
>>>     at org.apache.hadoop.io.SequenceFile$Reader.nextRawKey(SequenceFile.java:1974)
>>>     at org.apache.hadoop.io.SequenceFile$Sorter$SegmentDescriptor.nextRawKey(SequenceFile.java:3002)
>>>     at org.apache.hadoop.io.SequenceFile$Sorter$MergeQueue.merge(SequenceFile.java:2802)
>>>     at org.apache.hadoop.io.SequenceFile$Sorter.merge(SequenceFile.java:2511)
>>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1040)
>>>     at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:698)
>>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:220)
>>>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)
>>> The above error comes almost at the end of the map job. I have set the heap
>>> size to 1GB. Still the problem persists. Can someone please help me with
>>> how to avoid this error?
>>
>> What is the typical size of your key? What is the value of io.sort.factor?
>> Hadoop version?
>
> --
> View this message in context:
> http://www.nabble.com/OutOfMemory-Error-tp19531174p19545298.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.

--
Best regards, Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org
how to get the filenames stored in dfs as the key
Hi everybody, can anyone please help me with how to get the input filename in dfs as the key in the output?

Example: [filename, value]
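[Editor's note] For streaming jobs of this era, Hadoop exposed job-configuration properties to the task as environment variables with dots replaced by underscores, so the current input file is readable as `map_input_file` (for Java jobs, the same value is available from the JobConf as `map.input.file` in `configure()`). A minimal mapper sketch; the script name and tab-separated output format are illustrative:

```python
#!/usr/bin/env python
# Streaming mapper that emits [input filename, value] pairs.
# Hadoop streaming passes job-conf entries to the task as environment
# variables with '.' replaced by '_', so map.input.file becomes
# the env var map_input_file.
import os
import sys

def run(stdin, stdout, env):
    filename = env.get("map_input_file", "unknown")
    for line in stdin:
        value = line.rstrip("\n")
        stdout.write(f"{filename}\t{value}\n")

if __name__ == "__main__":
    run(sys.stdin, sys.stdout, os.environ)
```

It would be used as the `-mapper` of a streaming job; each output line is then keyed by the file the record came from.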
Re: slow copy makes reduce hang
This time, I set the task timeout to 10 minutes via -jobconf mapred.task.timeout=600000. However, I still see this "hang" at the shuffle stage, and lots of messages like the ones below appear in the log:

2008-09-19 12:34:02,289 INFO org.apache.hadoop.mapred.ReduceTask: task_200809190308_0007_r_01_1 Need 6 map output(s)
2008-09-19 12:34:02,290 INFO org.apache.hadoop.mapred.ReduceTask: task_200809190308_0007_r_01_1: Got 0 new map-outputs & 0 obsolete map-outputs from tasktracker and 0 map-outputs from previous failures
2008-09-19 12:34:02,290 INFO org.apache.hadoop.mapred.ReduceTask: task_200809190308_0007_r_01_1 Got 6 known map output location(s); scheduling...
2008-09-19 12:34:02,290 INFO org.apache.hadoop.mapred.ReduceTask: task_200809190308_0007_r_01_1 Scheduled 0 of 6 known outputs (6 slow hosts and 0 dup hosts)

When fetching map output from one weird node (which actually has a dead disk), the http daemon returns 500 Internal Server Error. It seems to me that the reducer is stuck in an infinite retry loop. I'm wondering whether this behavior is fixed in 0.18.x, or whether there are configuration parameters I should tune.

Thanks,
Rong-En Fan

On Fri, Sep 19, 2008 at 9:42 AM, Rong-en Fan <[EMAIL PROTECTED]> wrote:
> Reply to myself. I'm using streaming and the task timeout was set to 0,
> so that's why.
>
> On Fri, Sep 19, 2008 at 3:34 AM, Rong-en Fan <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> I'm using 0.17.2.1 and see a reduce hang in the shuffle phase due
>> to an unresponsive node. From the reduce log (sorry that I didn't
>> keep it around), it was stuck copying map output from a dead
>> node (I cannot ssh to that one). At that point, all maps were already
>> finished. I'm wondering why this slowness does not cause the reduce
>> task to fail and the corresponding map to be marked failed (even though
>> it finished), with the map then redone on another node so that the
>> reduce can proceed.
>>
>> Thanks,
>> Rong-En Fan
Data corruption when using Lzo Codec
Hello,

I am running a custom crawler (written internally) using Hadoop streaming. I am attempting to compress the output using LZO, but instead I am receiving corrupted output that is neither in the format I am aiming for nor a compressed lzo file. Is this a known issue? Is there anything I am doing inherently wrong?

Here is the command line I am using:

~/hadoop/bin/hadoop jar /home/hadoop/hadoop/contrib/streaming/hadoop-0.17.2.1-streaming.jar \
  -inputformat org.apache.hadoop.mapred.SequenceFileAsTextInputFormat \
  -mapper /home/hadoop/crawl_map -reducer NONE \
  -jobconf mapred.output.compress=true \
  -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec \
  -input pages -output crawl.lzo -jobconf mapred.reduce.tasks=0

The input is in the form of URLs stored as a SequenceFile. When running this without LZO compression, no such issue occurs. Is there any way for me to recover the corrupted data so as to be able to process it with other Hadoop jobs or offline?

Thanks,

--
Alex Feinberg
Platform Engineer, SocialMedia Networks
Re: slow copy makes reduce hang
Reply to myself. I'm using streaming and the task timeout was set to 0, so that's why.

On Fri, Sep 19, 2008 at 3:34 AM, Rong-en Fan <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I'm using 0.17.2.1 and see a reduce hang in the shuffle phase due
> to an unresponsive node. From the reduce log (sorry that I didn't
> keep it around), it was stuck copying map output from a dead
> node (I cannot ssh to that one). At that point, all maps were already
> finished. I'm wondering why this slowness does not cause the reduce
> task to fail and the corresponding map to be marked failed (even though
> it finished), with the map then redone on another node so that the
> reduce can proceed.
>
> Thanks,
> Rong-En Fan
Re: custom writable class
Here is the link: http://hadoop.apache.org/core/docs/current/mapred_tutorial.html

On Thu, Sep 18, 2008 at 9:16 PM, chanel <[EMAIL PROTECTED]> wrote:
> Where can you find the "Hadoop Map-Reduce Tutorial"?
>
> Shengkai Zhu wrote:
>> You can refer to the Hadoop Map-Reduce Tutorial.
>>
>> On Thu, Sep 18, 2008 at 8:40 PM, Shengkai Zhu <[EMAIL PROTECTED]> wrote:
>>
>>> Your custom implementation of any interface from hadoop-core should be
>>> archived together with the application (i.e. in the same jar).
>>> And the jar will be added to the CLASSPATH of the task runner, so your
>>> custom writable class can be found.
>>>
>>> On Thu, Sep 18, 2008 at 8:09 PM, Deepak Diwakar <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am new to Hadoop. For my map/reduce task I want to write my own custom
>>>> writable class. Could anyone please let me know where exactly to place
>>>> the customwritable.java file?
>>>>
>>>> I found that in {hadoop-home}/hadoop-{version}/src/java/org/apache/hadoop/io/
>>>> all types of writable class files are there.
>>>>
>>>> Then in the main task, we just include "import
>>>> org.apache.hadoop.io.{X}Writable;". But this is not working for me.
>>>> Basically, at compilation time the compiler doesn't find my custom
>>>> writable class, which I have placed in the mentioned folder.
>>>>
>>>> Please help me in this endeavor.
>>>>
>>>> Thanks,
>>>> deepak
>>>
>>> --
>>> 朱盛凯
>>> Jash Zhu
>>> 复旦大学软件学院
>>> Software School, Fudan University

--
朱盛凯
Jash Zhu
复旦大学软件学院
Software School, Fudan University
Re: streaming question
On 16-Sep-08, at 1:25 AM, Christian Ulrik Søttrup wrote:

> OK, I've tried what you suggested and all sorts of combinations, with no
> luck. Then I went through the source of the streaming lib. It looks like
> it checks for the existence of the combiner while it is building the
> jobconf, i.e. before the job is sent to the nodes. It calls Class.forName()
> on the combiner in goodClassOrNull() from StreamUtil.java, called from
> setJobConf() in StreamJob.java.
>
> Anybody have an idea how I can use a custom combiner? Would I have to
> package it into the streaming jar?

That's what the streaming docs say you have to do: make your own streaming jar with them included. I tried the cache and jar arguments myself once, and Hadoop wasn't able to find them to use for the framework hooks, even when my streaming executables themselves were able to find them.

Karl Anderson
[EMAIL PROTECTED]
http://monkey.org/~kra
Example code for map-side join
Hello all, Does anyone have some working example code for doing a map-side (inner) join? The documentation at http://tinyurl.com/43j5pp is less than enlightening... Thanks, -Stuart
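[Editor's note] While working example code for `CompositeInputFormat` is scarce, the operation it performs is an ordinary sorted-merge join over inputs that are sorted by key and identically partitioned. A plain-Python sketch of that merge logic (not the Hadoop API; it also ignores duplicate keys on both sides, which the real join handles via cross products):

```python
# Sketch of the inner merge-join that Hadoop's map-side join performs
# over inputs sorted by key and partitioned identically. Plain Python,
# not the Hadoop mapred.join API; duplicate-key handling is omitted.

def inner_join(left, right):
    """left, right: lists of (key, value) sorted by key.
    Returns a list of (key, left_value, right_value) for matching keys."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        lk, lv = left[i]
        rk, rv = right[j]
        if lk == rk:
            out.append((lk, lv, rv))
            i += 1
            j += 1
        elif lk < rk:
            i += 1          # left key has no partner; advance left
        else:
            j += 1          # right key has no partner; advance right
    return out

left = [(1, "a"), (2, "b"), (4, "d")]
right = [(2, "x"), (3, "y"), (4, "z")]
print(inner_join(left, right))
```

The sorted/identically-partitioned requirement exists precisely so each map task can run this merge over one partition without a shuffle.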
slow copy makes reduce hang
Hi,

I'm using 0.17.2.1 and see a reduce hang in the shuffle phase due to an unresponsive node. From the reduce log (sorry that I didn't keep it around), it was stuck copying map output from a dead node (I cannot ssh to that one). At that point, all maps were already finished. I'm wondering why this slowness does not cause the reduce task to fail and the corresponding map to be marked failed (even though it finished), with the map then redone on another node so that the reduce can proceed.

Thanks,
Rong-En Fan
[ANNOUNCE] Hadoop release 0.18.1 available
Release 0.18.1 fixes 9 critical bugs in 0.18.0. For Hadoop release details and downloads, visit: http://hadoop.apache.org/core/releases.html Hadoop 0.18.1 Release Notes are at http://hadoop.apache.org/core/docs/r0.18.1/releasenotes.html Thanks to all who contributed to this release! Nigel
Re: scp to namenode faster than dfs put?
James Moore wrote:
> Isn't one of the features of replication a guarantee that when my write
> finishes, I know there are N replicas written?

This is what happens normally, but it is not a guarantee. When there are errors, data might be written to fewer replicas.

Raghu.

> Seems like if you want the quicker behavior, you write with replication
> set to 1 for that file, then change the replication count when you're
> finished.
Re: scp to namenode faster than dfs put?
Steve Loughran wrote:
> [EMAIL PROTECTED] wrote:
>> thanks for the replies. So it looks like replication might be the real
>> overhead when compared to scp.
>
> Makes sense, but there's no reason why you couldn't have the first node you
> copy the data up to continue and pass that data to the other nodes.

Replication cannot account for a 50% slowdown. When the data is written, the writes on replicas are pipelined. So essentially data is written to the replicas in parallel.

Raghu.

> If it's in the same rack, you save on backbone bandwidth, and if it is in a
> different rack, well, the client operation still finishes faster. A feature
> for someone to implement, perhaps?
>
>> Also dfs put copies multiple replicas, unlike scp.
>>
>> Lohit
>>
>> On Sep 17, 2008, at 6:03 AM, "明" <[EMAIL PROTECTED]> wrote:
>>
>> Actually, no.
>> As you said, I understand that "dfs -put" breaks the data into blocks and
>> then copies them to datanodes, but scp does not break the data into blocks;
>> it just copies the data to the namenode!
>>
>> 2008/9/17, Prasad Pingali <[EMAIL PROTECTED]>:
>>
>> Hello,
>> I observe that scp of data to the namenode is faster than actually putting
>> it into dfs (all nodes coming from the same switch and having the same
>> ethernet cards, homogeneous nodes). I understand that "dfs -put" breaks
>> the data into blocks and then copies them to datanodes, but shouldn't that
>> be at least as fast as copying data to the namenode from a single machine,
>> if not faster?
>>
>> thanks and regards,
>> Prasad Pingali,
>> IIIT Hyderabad.
>>
>> --
>> Sorry for my english!! 明
>> Please help me to correct my english expression and error in syntax
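[Editor's note] Raghu's point about pipelining can be illustrated with a toy model: the client streams a block through the first datanode, which forwards each packet to the next replica, so total time grows by roughly one packet-transfer time per extra replica rather than by a full block copy. All numbers below are illustrative assumptions, not measurements of HDFS:

```python
# Toy model: sequential replication vs. HDFS-style write pipelining.
# All figures (block size, link speed, packet size) are illustrative.

def sequential_time(block_mb, link_mbps, replicas):
    # copy the whole block to each replica one after another
    per_copy = block_mb * 8 / link_mbps          # seconds per full copy
    return per_copy * replicas

def pipelined_time(block_mb, link_mbps, replicas, packet_kb=64):
    # packets stream through the pipeline; filling the pipeline adds only
    # (replicas - 1) extra packet-transfer times on top of one block time
    per_block = block_mb * 8 / link_mbps
    per_packet = (packet_kb / 1024) * 8 / link_mbps
    return per_block + (replicas - 1) * per_packet

seq = sequential_time(64, 1000, 3)
pipe = pipelined_time(64, 1000, 3)
print(f"sequential: {seq:.3f}s  pipelined: {pipe:.3f}s")
```

Under this model, a 64 MB block on a gigabit link takes roughly 3x longer replicated sequentially than pipelined, which is why pipelined replication alone cannot explain a 50% slowdown versus scp.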
Re: scp to namenode faster than dfs put?
Isn't one of the features of replication a guarantee that when my write finishes, I know there are N replicas written? Seems like if you want the quicker behavior, you write with replication set to 1 for that file, then change the replication count when you're finished. -- James Moore | [EMAIL PROTECTED] Ruby and Ruby on Rails consulting blog.restphone.com
Re: [Zookeeper-user] [ANN] katta-0.1.0 release - distribute lucene indexes in a grid
This is great to see, congratulations! If you have a few minutes please update the ZK "powered by" page: http://wiki.apache.org/hadoop/ZooKeeper/PoweredBy

BTW, we're in the process of moving to Apache from SourceForge. Our next release, 3.0, slated for Oct 22, will be on Apache.

Regards,
Patrick

Stefan Groschupf wrote:
> After 5 months of work we are happy to announce the first developer
> preview release of Katta. This release contains all the functionality
> needed to serve a large, sharded Lucene index on many servers. Katta
> stands on the shoulders of the giants Lucene, Hadoop and ZooKeeper.
>
> Main features:
> + Plays well with Hadoop
> + Apache Version 2 License
> + Node failure tolerance
> + Master failover
> + Shard replication
> + Plug-able network topologies (shard distribution and selection policies)
> + Node load balancing at client
>
> Please give Katta a test drive and give us some feedback!
>
> Download: http://sourceforge.net/project/platformdownload.php?group_id=225750
> Website: http://katta.sourceforge.net/
> Getting started in less than 3 min: http://katta.wiki.sourceforge.net/Getting+started
> Installation on a grid: http://katta.wiki.sourceforge.net/Installation
> Katta presentation today (09/17/08) at the Hadoop user group, Yahoo Mission
> College: http://upcoming.yahoo.com/event/1075456/
> * slides will be available online later
>
> Many thanks for the hard work: Johannes Zillmann, Marko Bauhardt, Martin
> Schaaf (101tec)
>
> I apologize for the cross-posting.
>
> Yours, the Katta Team.
> ~~~
> 101tec Inc., Menlo Park, California
> http://www.101tec.com
Re: custom writable class
Where can you find the "Hadoop Map-Reduce Tutorial"?

Shengkai Zhu wrote:
> You can refer to the Hadoop Map-Reduce Tutorial.
>
> On Thu, Sep 18, 2008 at 8:40 PM, Shengkai Zhu <[EMAIL PROTECTED]> wrote:
>
>> Your custom implementation of any interface from hadoop-core should be
>> archived together with the application (i.e. in the same jar).
>> And the jar will be added to the CLASSPATH of the task runner, so your
>> custom writable class can be found.
>>
>> On Thu, Sep 18, 2008 at 8:09 PM, Deepak Diwakar <[EMAIL PROTECTED]> wrote:
>>
>>> Hi,
>>>
>>> I am new to Hadoop. For my map/reduce task I want to write my own custom
>>> writable class. Could anyone please let me know where exactly to place
>>> the customwritable.java file?
>>>
>>> I found that in {hadoop-home}/hadoop-{version}/src/java/org/apache/hadoop/io/
>>> all types of writable class files are there.
>>>
>>> Then in the main task, we just include "import
>>> org.apache.hadoop.io.{X}Writable;". But this is not working for me.
>>> Basically, at compilation time the compiler doesn't find my custom
>>> writable class, which I have placed in the mentioned folder.
>>>
>>> Please help me in this endeavor.
>>>
>>> Thanks,
>>> deepak
>>
>> --
>> 朱盛凯
>> Jash Zhu
>> 复旦大学软件学院
>> Software School, Fudan University
Re: custom writable class
You can refer to the Hadoop Map-Reduce Tutorial.

On Thu, Sep 18, 2008 at 8:40 PM, Shengkai Zhu <[EMAIL PROTECTED]> wrote:
>
> Your custom implementation of any interface from hadoop-core should be
> archived together with the application (i.e. in the same jar).
> And the jar will be added to the CLASSPATH of the task runner, so your
> custom writable class can be found.
>
> On Thu, Sep 18, 2008 at 8:09 PM, Deepak Diwakar <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> I am new to Hadoop. For my map/reduce task I want to write my own custom
>> writable class. Could anyone please let me know where exactly to place
>> the customwritable.java file?
>>
>> I found that in {hadoop-home}/hadoop-{version}/src/java/org/apache/hadoop/io/
>> all types of writable class files are there.
>>
>> Then in the main task, we just include "import
>> org.apache.hadoop.io.{X}Writable;". But this is not working for me.
>> Basically, at compilation time the compiler doesn't find my custom
>> writable class, which I have placed in the mentioned folder.
>>
>> Please help me in this endeavor.
>>
>> Thanks,
>> deepak
>
> --
> 朱盛凯
> Jash Zhu
> 复旦大学软件学院
> Software School, Fudan University

--
朱盛凯
Jash Zhu
复旦大学软件学院
Software School, Fudan University
Re: custom writable class
Your custom implementation of any interface from hadoop-core should be archived together with the application (i.e. in the same jar). The jar will be added to the CLASSPATH of the task runner, so your custom writable class can be found.

On Thu, Sep 18, 2008 at 8:09 PM, Deepak Diwakar <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I am new to Hadoop. For my map/reduce task I want to write my own custom
> writable class. Could anyone please let me know where exactly to place
> the customwritable.java file?
>
> I found that in {hadoop-home}/hadoop-{version}/src/java/org/apache/hadoop/io/
> all types of writable class files are there.
>
> Then in the main task, we just include "import
> org.apache.hadoop.io.{X}Writable;". But this is not working for me.
> Basically, at compilation time the compiler doesn't find my custom
> writable class, which I have placed in the mentioned folder.
>
> Please help me in this endeavor.
>
> Thanks,
> deepak

--
朱盛凯
Jash Zhu
复旦大学软件学院
Software School, Fudan University
custom writable class
Hi,

I am new to Hadoop. For my map/reduce task I want to write my own custom writable class. Could anyone please let me know where exactly to place the customwritable.java file?

I found that in {hadoop-home}/hadoop-{version}/src/java/org/apache/hadoop/io/ all types of writable class files are there.

Then in the main task, we just include "import org.apache.hadoop.io.{X}Writable;". But this is not working for me. Basically, at compilation time the compiler doesn't find my custom writable class, which I have placed in the mentioned folder.

Please help me in this endeavor.

Thanks,
deepak
Re: scp to namenode faster than dfs put?
On Thursday 18 September 2008 04:12:13 pm Steve Loughran wrote:
> [EMAIL PROTECTED] wrote:
> > thanks for the replies. So it looks like replication might be the real
> > overhead when compared to scp.
>
> Makes sense, but there's no reason why you couldn't have the first node you
> copy the data up to continue and pass that data to the other nodes. If
> it's in the same rack, you save on backbone bandwidth, and if it is in a
> different rack, well, the client operation still finishes faster. A
> feature for someone to implement, perhaps?

Yeah, even I was wondering what the implications of such a feature would be in terms of failures/block corruption at the first node. If that is a non-issue, this seems to be something that could improve performance.

- Prasad.

> >> Also dfs put copies multiple replicas, unlike scp.
> >>
> >> Lohit
> >>
> >> On Sep 17, 2008, at 6:03 AM, "明" <[EMAIL PROTECTED]> wrote:
> >>
> >> Actually, no.
> >> As you said, I understand that "dfs -put" breaks the data into blocks
> >> and then copies them to datanodes,
> >> but scp does not break the data into blocks; it just copies the data
> >> to the namenode!
> >>
> >> 2008/9/17, Prasad Pingali <[EMAIL PROTECTED]>:
> >>
> >> Hello,
> >> I observe that scp of data to the namenode is faster than actually
> >> putting it into dfs (all nodes coming from the same switch and having
> >> the same ethernet cards, homogeneous nodes). I understand that
> >> "dfs -put" breaks the data into blocks and then copies them to
> >> datanodes, but shouldn't that be at least as fast as copying data to
> >> the namenode from a single machine, if not faster?
> >>
> >> thanks and regards,
> >> Prasad Pingali,
> >> IIIT Hyderabad.
> >>
> >> --
> >> Sorry for my english!! 明
> >> Please help me to correct my english expression and error in syntax
Re: scp to namenode faster than dfs put?
[EMAIL PROTECTED] wrote:
> thanks for the replies. So it looks like replication might be the real
> overhead when compared to scp.

Makes sense, but there's no reason why you couldn't have the first node you copy the data up to continue and pass that data to the other nodes. If it's in the same rack, you save on backbone bandwidth, and if it is in a different rack, well, the client operation still finishes faster. A feature for someone to implement, perhaps?

> Also dfs put copies multiple replicas, unlike scp.
>
> Lohit
>
> On Sep 17, 2008, at 6:03 AM, "明" <[EMAIL PROTECTED]> wrote:
>
> Actually, no.
> As you said, I understand that "dfs -put" breaks the data into blocks and
> then copies them to datanodes, but scp does not break the data into blocks;
> it just copies the data to the namenode!
>
> 2008/9/17, Prasad Pingali <[EMAIL PROTECTED]>:
>
> Hello,
> I observe that scp of data to the namenode is faster than actually putting
> it into dfs (all nodes coming from the same switch and having the same
> ethernet cards, homogeneous nodes). I understand that "dfs -put" breaks
> the data into blocks and then copies them to datanodes, but shouldn't that
> be at least as fast as copying data to the namenode from a single machine,
> if not faster?
>
> thanks and regards,
> Prasad Pingali,
> IIIT Hyderabad.
>
> --
> Sorry for my english!! 明
> Please help me to correct my english expression and error in syntax

--
Steve Loughran http://www.1060.org/blogxter/publish/5
Author: Ant in Action http://antbook.org/
Hadoop tracing
Hi,

I am looking for information in the area of Hadoop tracing, instrumentation, benchmarking and so forth. What utilities exist? What's their maturity? Where can I get more info about them?

I am curious about statistics on Hadoop behavior (per a typical workload? different workloads?). I am thinking of various metrics such as: percentage of time a Hadoop job spends in the various phases (map, sort & shuffle, reduce), on I/O, network, framework execution time, user code execution time ... Known bottlenecks? And whatever other interesting statistics.

Has anyone already measured? Any documented statistics out there?

I have already encountered various tools like the X-Trace-based tracing tool from Berkeley, the Hadoop metrics API, the Hadoop instrumentation API (HADOOP-3772), Hadoop Vaidya (HADOOP-4179), and the gridmix benchmark.

Does anyone have input on any of those? Anything else I missed?

Thanks for any direction,
Naama

--
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
"If you want your children to be intelligent, read them fairy tales. If you want them to be more intelligent, read them more fairy tales." (Albert Einstein)
Re: Trouble with SequenceFileOutputFormat.getReaders
Hi Chris,

I would guess that the IOException is because getReaders() is trying to treat _logs as a file, when it's actually a directory. I also see race conditions in getReaders(), since it lists the files and then tries to iterate through them, and they can disappear in between.

You probably need to delete the _logs directory before you pass the output directory to the second map. The _logs directory is also created by Hadoop 0.18.0.

cheers,
Barry

On Thursday 18 September 2008 05:49:30 Chris Dyer wrote:
> Hi all-
> I am having trouble with SequenceFileOutputFormat.getReaders on a
> Hadoop 0.17.2 cluster. I am trying to open a set of SequenceFiles that
> was created by one map process that has completed, from within a second
> map process, by passing in the job configuration for the running map
> process (not of the map process that created the set of sequence files)
> and the path to the output. When I run locally, this works fine, but
> when I run remotely on the cluster (using HDFS on the cluster), I get
> the following IOException:
>
> java.io.IOException: Cannot open filename /user/redpony/Model1.data.0/_logs
>
> However, the following works:
>
> hadoop dfs -ls /user/redpony/Model1.data.0/_logs
> Found 1 items
> /user/redpony/Model1.data.0/_logs/history  2008-09-18 00:43  rwxrwxrwx  redpony  supergroup
>
> This is probably something dumb, and quite likely related to me not
> having my settings configured properly, but I'm completely at a loss
> for how to proceed. Any ideas?
>
> Thanks!
> Chris
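[Editor's note] A workaround in the spirit of Barry's suggestion is to list the output directory yourself and skip "system" entries such as _logs before opening readers. A sketch of just that filtering logic (plain Python, not the Hadoop API; the paths are taken from the thread, and _SUCCESS is a marker file that later Hadoop versions also write):

```python
# Sketch: filter a job-output listing down to data files, dropping
# underscore/dot-prefixed entries like _logs (a directory) and
# _SUCCESS (a marker file) before handing paths to SequenceFile readers.

def data_files(listing):
    """Keep only part files, dropping underscore/dot-prefixed entries."""
    keep = []
    for path in listing:
        name = path.rstrip("/").split("/")[-1]
        if name.startswith(("_", ".")):
            continue
        keep.append(path)
    return sorted(keep)

listing = [
    "/user/redpony/Model1.data.0/_logs/",
    "/user/redpony/Model1.data.0/part-00000",
    "/user/redpony/Model1.data.0/part-00001",
]
print(data_files(listing))
```

This underscore/dot "hidden path" convention is the same one Hadoop's own input-path filters later adopted, which is why skipping such names is a safe heuristic here.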