Hadoop Environment Setup
Hi All,

I am a graduate student working on Hadoop for a college project, and I have a few questions about the setup. I am running Hadoop on Windows with Cygwin installed. In Eclipse, when I open the org.apache.hadoop.examples.WordCount example file, I see a main method for this class, so I am trying to run the program in standalone mode by passing command-line arguments.

1) Will I be able to run the program this way, given that I am running it through Windows? If not, how can I do a local setup so that I can make changes to the file and run them on my system to test?

2) I made changes to the file (just added a few System.out.println statements), built a jar from the examples package, and tried to run it from Cygwin; it failed again, saying it was not able to find the main class. I haven't made any other changes apart from the print statements.

3) In the example program, during the configuration stage, we set the input path for the program. Can we set two or more different paths this way? That is, suppose I have two different files to be read, and one file is already in memory: can I set the configuration so that the Map-Reduce job reads one file from disk and the other from memory in the same Map-Reduce iteration?

Please advise as to how I can proceed from here.

--
Thanks,
Radhika Sridhar
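One common way to test map/reduce logic locally, without a cluster or Cygwin, is to express the mapper and reducer as plain functions and run them over in-memory input. The sketch below mirrors the WordCount logic in Hadoop Streaming style; it is pure Python and none of the names here are part of the Hadoop API.

```python
# A minimal, cluster-free sketch of the WordCount map/reduce logic
# (illustrative only; nothing here is the Hadoop API).
from itertools import groupby
from operator import itemgetter

def map_words(lines):
    """Mapper: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reduce_counts(pairs):
    """Reducer: sum counts per word. Input must be grouped by key,
    which is what the shuffle/sort phase guarantees on a real cluster."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

# Local run, no cluster required:
lines = ["the quick brown fox", "the lazy dog"]
counts = dict(reduce_counts(map_words(lines)))
```

Testing the logic this way first, then packaging it for the cluster, sidesteps most of the Windows/Cygwin setup problems while you iterate.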
Re: Data corruption when using Lzo Codec
If you're using TextInputFormat, you need to add LzoCodec to the list of codecs in the io.compression.codecs property. LzopCodec is only for reading/writing files produced/consumed by the C tool; it's not in 0.17. The ".lzo" files produced in 0.17 are not "real" .lzo files, but that's how you can get the codec to recognize them in this version. In the future, you might want to just use the LZO codec with SequenceFileOutputFormat (use BLOCK compression).

-C

On Sep 19, 2008, at 8:46 AM, Alex Feinberg wrote:

> Hi Chris,
>
> I was also unable to decompress by simply running a map/reduce job with
> "cat" as a mapper and then doing dfs -get either. I will try using
> LzopCodec.
>
> Thanks,
> - Alex
>
> On Fri, Sep 19, 2008 at 2:34 AM, Chris Douglas <[EMAIL PROTECTED]> wrote:
>> It's probably not corrupted. If by "compressed lzo file" you mean
>> something readable with lzop, you should use LzopCodec, not LzoCodec.
>> LzoCodec doesn't write the header information required by that tool.
>> Guessing at the output format (length-encoded blocks of data compressed
>> by the lzo algorithm), it's probably readable by TextInputFormat, but
>> YMMV. If you wanted to use the C tool, you'll have to add the appropriate
>> header (see the lzop source or LzopCodec) using a hex editor, plus four
>> zero bytes at the end of the file. You can also use lzo compression in
>> SequenceFiles.
>>
>> -C
>>
>> On Sep 18, 2008, at 9:15 PM, Alex Feinberg wrote:
>>> Hello,
>>>
>>> I am running a custom crawler (written internally) using hadoop
>>> streaming. I am attempting to compress the output using LZO, but
>>> instead I am receiving corrupted output that is neither in the format
>>> I am aiming for nor a compressed lzo file. Is this a known issue? Is
>>> there anything I am doing inherently wrong?
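The codec registration Chris mentions would go in hadoop-site.xml, something like the fragment below (the exact default codec list may differ between versions; this is a sketch, not a verbatim default):

```xml
<property>
  <name>io.compression.codecs</name>
  <!-- Append LzoCodec so TextInputFormat recognizes .lzo output. -->
  <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.LzoCodec</value>
</property>
```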
Here is the command line I am using:

~/hadoop/bin/hadoop jar /home/hadoop/hadoop/contrib/streaming/hadoop-0.17.2.1-streaming.jar \
  -inputformat org.apache.hadoop.mapred.SequenceFileAsTextInputFormat \
  -mapper /home/hadoop/crawl_map \
  -reducer NONE \
  -jobconf mapred.output.compress=true \
  -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec \
  -input pages \
  -output crawl.lzo \
  -jobconf mapred.reduce.tasks=0

The input is in the form of URLs stored as a SequenceFile. When running this without LZO compression, no such issue occurs. Is there any way for me to recover the corrupted data so that I can process it with other Hadoop jobs or offline?

Thanks,

--
Alex Feinberg
Platform Engineer, SocialMedia Networks
Re: NotYetReplicated exceptions when pushing large files into HDFS
Yes, these are warnings unless they fail 3 times, in which case your dfs -put command would fail with a stack trace.

Thanks,
Lohit

----- Original Message ----
From: Ryan LeCompte <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Monday, September 22, 2008 5:18:01 PM
Subject: Re: NotYetReplicated exceptions when pushing large files into HDFS

I've noticed that although I get a few of these exceptions, the file is ultimately uploaded to the HDFS cluster. Does this mean that my file ended up getting there in one piece? The exceptions are just logged at the WARN level and indicate retry attempts.

Thanks,
Ryan

On Mon, Sep 22, 2008 at 11:08 AM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:
> Hello all,
>
> I'd love to be able to upload very large files (e.g., 8 or 10 GB) into
> HDFS, but it seems like my only option is to chop the file into smaller
> pieces. Otherwise, after a while I get NotYetReplicated exceptions while
> the transfer is in progress. I'm using 0.18.1. Is there any way I can do
> this? Perhaps use something else besides bin/hadoop -put input output?
>
> Thanks,
> Ryan
Re: NotYetReplicated exceptions when pushing large files into HDFS
I've noticed that although I get a few of these exceptions, the file is ultimately uploaded to the HDFS cluster. Does this mean that my file ended up getting there in one piece? The exceptions are just logged at the WARN level and indicate retry attempts.

Thanks,
Ryan

On Mon, Sep 22, 2008 at 11:08 AM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:
> Hello all,
>
> I'd love to be able to upload very large files (e.g., 8 or 10 GB) into
> HDFS, but it seems like my only option is to chop the file into smaller
> pieces. Otherwise, after a while I get NotYetReplicated exceptions while
> the transfer is in progress. I'm using 0.18.1. Is there any way I can do
> this? Perhaps use something else besides bin/hadoop -put input output?
>
> Thanks,
> Ryan
Re: Reduce tasks running out of memory on small hadoop cluster
On 20-Sep-08, at 7:07 PM, Ryan LeCompte wrote:

> Hello all,
>
> I'm setting up a small 3-node Hadoop cluster (1 node for the
> namenode/jobtracker and the other two for datanodes/tasktrackers). The
> map tasks finish fine, but the reduce tasks are failing at about 30% with
> an out-of-memory error. My guess is that the amount of data I'm crunching
> through just won't fit in memory during the reduce tasks on two machines
> (max of 2 reduce tasks on each machine). Is this expected? If I had a
> larger Hadoop cluster, I could increase the number of reduce tasks so
> that not all of the processing happens in just 4 JVMs on two machines
> like I currently have set up, correct? Is there any way to get the reduce
> task not to hold all of the data in memory, or is my only option to add
> more nodes to the cluster and thereby increase the number of reduce tasks?

You can set the number of reduce tasks with a configuration option. More tasks means less input per task; since the number of concurrent tasks doesn't change, this should help you. I'd like to be able to set the number of concurrent tasks myself, but haven't noticed a way.

In the end, I had to practice better design to reduce my memory footprint; sometimes one quick-and-dirty way to do this is to turn one job into a chain of jobs that each do less.

Karl Anderson
[EMAIL PROTECTED]
http://monkey.org/~kra
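Karl's point about memory footprint is worth making concrete: because reduce input arrives sorted by key, a reducer never needs to hold all the data at once; it can aggregate one key at a time and emit it. The sketch below shows this pattern in Streaming-style Python (illustrative names, not the Hadoop API):

```python
# Sketch: aggregating sorted (key, value) pairs one key at a time.
# Only the current key's running total is ever held in memory.
from itertools import groupby
from operator import itemgetter

def streaming_reduce(sorted_pairs):
    """Yield (key, sum-of-values) for a key-sorted stream of pairs."""
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield key, sum(v for _, v in group)

totals = dict(streaming_reduce(
    sorted([("a", 1), ("b", 2), ("a", 3), ("b", 4)])))
```

A reduce task that buffers every value for a key (or every key) in a container before emitting is the usual cause of the out-of-memory failures described above.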
Katta presentation slides
Hi Stefan,

Are the slides from the Katta presentation up somewhere? If not, could you please post them?

Thanks,
Deepika
Re: accessing the number of emitted keys
Thanks Owen!

-SM

On Mon, Sep 22, 2008 at 1:02 AM, Owen O'Malley <[EMAIL PROTECTED]> wrote:
> On Sep 21, 2008, at 9:33 PM, Sandy wrote:
>
>> Is there a way to get the total number of keys emitted by a particular
>> mapper at the beginning of the combiner function?
>
> The short answer is no. As I said in my previous email, the combiner will
> get called when the first spill is being dumped. This can happen while the
> map is still running in a different thread. Therefore, the number wouldn't
> make much sense. Also note that the combiner may be called a second (or
> third or fourth) time on a given record as the spills are merged.
>
> -- Owen
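Owen's point that the combiner may run more than once on the same data is why combiner logic must be safe to apply repeatedly (associative and commutative, like summing). A small illustration in plain Python (this is a sketch of the concept, not Hadoop code):

```python
# Sketch: why a combiner must tolerate being applied repeatedly.
# Summing counts is safe: combining partial results, then combining
# those partial results again, gives the same answer as one pass.
def combine(pairs):
    """Sum values per key over a list of (key, count) pairs."""
    totals = {}
    for k, v in pairs:
        totals[k] = totals.get(k, 0) + v
    return list(totals.items())

spill1 = [("x", 1), ("y", 1), ("x", 1)]
spill2 = [("x", 1), ("y", 1)]

# Combiner run once over everything...
once = sorted(combine(spill1 + spill2))
# ...or run per spill, then again over the merged partial results:
twice = sorted(combine(combine(spill1) + combine(spill2)))
```

A combiner that depended on seeing every mapper key exactly once (e.g. the total-key count asked about above) would break under this repeated application.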
Re: Hadoop Cluster Size Scalability Numbers?
Allen Wittenauer wrote:
> On 9/21/08 2:51 PM, "Dmitry Pushkarev" <[EMAIL PROTECTED]> wrote:
>> Speaking about the NFS-backup idea: if I have secure NFS storage which is
>> much slower than the network (3MB/s vs. the 100MB/s network we use
>> between nodes), will it adversely affect performance, or can I rely on
>> NFS caching to do the job?
>
> I think Konstantin has some benchmarks in a JIRA somewhere that show that
> the current bottleneck isn't the fsimage/edits writes.

HADOOP-3860 has name-node benchmark numbers. It concludes that for name-node operations the bottleneck is exactly the edits writes. But another conclusion is that real-world clusters do not put enough load on the name-node for it to reach that bottleneck. For NFS in particular, I found that although it slows down the name-node, the slowdown is less than 5%.

>> And if the NFS share dies, will it shut down the namenode as well?
>
> In our experience, the name node continues. But be warned that it will
> only put a message in the name node log that the NFS mount became
> unwritable. There is a JIRA open to fix this, though.

The name-node treats NFS shares the same as local ones; it does not distinguish between different storage directories. The name-node will continue to run as long as at least one storage directory is available. So if you have one NFS share and one local directory and NFS fails, the name-node will continue to run. But if NFS was the only storage directory, the name-node will shut down.

--Konstantin
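The multiple-storage-directory setup Konstantin describes is configured as a comma-separated list in dfs.name.dir. A sketch (the paths here are illustrative):

```xml
<property>
  <name>dfs.name.dir</name>
  <!-- One local directory plus one NFS-mounted directory; the
       name-node keeps running as long as at least one is writable. -->
  <value>/var/hadoop/name,/mnt/nfs/hadoop/name</value>
</property>
```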
NotYetReplicated exceptions when pushing large files into HDFS
Hello all,

I'd love to be able to upload very large files (e.g., 8 or 10 GB) into HDFS, but it seems like my only option is to chop the file into smaller pieces. Otherwise, after a while I get NotYetReplicated exceptions while the transfer is in progress. I'm using 0.18.1. Is there any way I can do this? Perhaps use something else besides bin/hadoop -put input output?

Thanks,
Ryan
Re: Hadoop Cluster Size Scalability Numbers?
On 9/21/08 2:51 PM, "Dmitry Pushkarev" <[EMAIL PROTECTED]> wrote:
> Speaking about the NFS-backup idea: if I have secure NFS storage which is
> much slower than the network (3MB/s vs. the 100MB/s network we use between
> nodes), will it adversely affect performance, or can I rely on NFS caching
> to do the job?

I think Konstantin has some benchmarks in a JIRA somewhere that show that the current bottleneck isn't the fsimage/edits writes.

> And if the NFS share dies, will it shut down the namenode as well?

In our experience, the name node continues. But be warned that it will only put a message in the name node log that the NFS mount became unwritable. There is a JIRA open to fix this, though.
Re: Format of the value of "fs.default.name" in hadoop-site.xml
You can check ${HADOOP_HOME}/conf/hadoop-default.xml to see information about fs.default.name:

  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
    <description>The name of the default file system. A URI whose scheme
    and authority determine the FileSystem implementation. The uri's scheme
    determines the config property (fs.SCHEME.impl) naming the FileSystem
    implementation class. The uri's authority is used to determine the
    host, port, etc. for a filesystem.</description>
  </property>

On Mon, Sep 22, 2008 at 7:38 PM, Latha <[EMAIL PROTECTED]> wrote:
> Hi,
>
> Please let me know if the value of fs.default.name in hadoop-site.xml
> should be in the format <host>:<port>, or can it also be in the format
> "hdfs://<host>:<port>"?
>
> Would request you to please let me know which one is correct.
>
> Thank you,
> Srilatha
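Given that description, an HDFS override in hadoop-site.xml would look something like this (hostname and port are illustrative; any unused port works, the hdfs:// scheme is what selects HDFS):

```xml
<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode.example.com:9000</value>
</property>
```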
Format of the value of "fs.default.name" in hadoop-site.xml
Hi,

Please let me know if the value of fs.default.name in hadoop-site.xml should be in the format <host>:<port>, or can it also be in the format "hdfs://<host>:<port>"?

Would request you to please let me know which one is correct.

Thank you,
Srilatha
Re: The statistical spam filtering
Edward J. Yoon wrote:
> Hi all,
>
> To reduce the effort of manual management for a planet-scale mail service,
> I'm considering statistical spam filtering with the SpamAssassin, Hadoop
> (distributed computing), and Hama (parallel matrix computing) projects.
> Please share any advice (or experience)!

Have you spoken to the SpamAssassin folks? They'd probably love to get involved in a streams-based filtering system. One thing to know there is that a lot of their test data is private, as they have to include lots of legitimate email alongside the spam, so their big datasets aren't always that public. Talk to Justin Mason and the SpamAssassin developers.

-steve