Hadoop Environment Setup

2008-09-22 Thread radhika sridhar
Hi All,

I am a graduate student. I am working on Hadoop for a college project and
have a few questions about the Hadoop setup.

> I am running Hadoop on Windows with Cygwin installed.
> In Eclipse, when I open the org.apache.hadoop.examples.WordCount example
> file, I see a void main written for this class, so I am trying to run
> this program in standalone mode by passing the command-line arguments.
>
> 1) Will I be able to run the program this way, given that I am running it
> through Windows? If not, how can I do a local setup so that I can make
> changes to the file and run them on my system to test them?
>
> 2) I made changes to the file (I have just added a few
> System.out.println statements), built a jar file from the examples
> package, and tried to run it from Cygwin, but it failed saying it is not
> able to find the main class. I haven't made any other changes apart from
> the print statements.
>
> 3) In the example program, during the configuration stage, we set the input
> path for the program. Will we be able to set two or more different paths
> this way? That is, suppose I have two different files to be read, and one
> file is already in memory: will I be able to set the configuration so that
> the input for the Map-Reduce job reads one file from disk and the other
> from memory in the same Map-Reduce iteration?
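
A minimal sketch of questions 1 and 3 in code, assuming an 0.17/0.18-era
mapred API (the class name and input/output paths are placeholders, and
depending on the exact release the path helpers may live on JobConf as
addInputPath/setOutputPath instead of FileInputFormat/FileOutputFormat).
It forces local, single-process execution and registers two input paths;
note that both inputs must be files on a filesystem, not in-memory data:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class LocalWordCountDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(LocalWordCountDriver.class);

    // Run everything in one local process (no JobTracker, local
    // filesystem), which is the easiest way to test changes under Cygwin.
    conf.set("mapred.job.tracker", "local");
    conf.set("fs.default.name", "file:///");

    // Two (or more) input paths can be registered for the same job;
    // each is an ordinary file or directory on disk.
    FileInputFormat.addInputPath(conf, new Path("input/file1.txt"));
    FileInputFormat.addInputPath(conf, new Path("input/file2.txt"));
    FileOutputFormat.setOutputPath(conf, new Path("output"));

    // Mapper, reducer, and key/value classes would be set here exactly
    // as in the WordCount example before submitting.
    JobClient.runJob(conf);
  }
}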

Please advise as to how I can proceed from here.
-- 
Thanks,
Radhika Sridhar


Re: Data corruption when using Lzo Codec

2008-09-22 Thread Chris Douglas
If you're using TextInputFormat, you need to add LzoCodec to the list of
codecs in the io.compression.codecs property.
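
As a sketch (assuming the stock 0.17 codec classes and that the native LZO
library is installed), that registration could be done in hadoop-site.xml or
programmatically along these lines:

import org.apache.hadoop.mapred.JobConf;

public class LzoCodecRegistration {
  // Appends LzoCodec to the codec list so that an input format can
  // transparently decompress inputs whose extension matches the codec.
  // The first two entries are the stock default codecs.
  public static void registerLzo(JobConf conf) {
    conf.set("io.compression.codecs",
        "org.apache.hadoop.io.compress.DefaultCodec,"
      + "org.apache.hadoop.io.compress.GzipCodec,"
      + "org.apache.hadoop.io.compress.LzoCodec");
  }
}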


LzopCodec is only for reading/writing files produced/consumed by the C
tool; it's not in 0.17. The ".lzo" files produced in 0.17 are not "real"
.lzo files, but that's how you can get the codec to recognize them in this
version. In the future, you might want to just use the lzo codec with
SequenceFileOutputFormat (use BLOCK compression). -C
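
A rough sketch of that suggestion against the old mapred API, using the same
property names as the streaming flags quoted later in this thread (it assumes
native LZO is available on the task nodes):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class LzoSequenceFileOutput {
  // Write block-compressed SequenceFiles with the LZO codec.
  public static void configure(JobConf conf) {
    conf.setOutputFormat(SequenceFileOutputFormat.class);
    conf.setBoolean("mapred.output.compress", true);
    // BLOCK compression compresses many records at a time, which is the
    // recommendation above.
    conf.set("mapred.output.compression.type", "BLOCK");
    conf.set("mapred.output.compression.codec",
             "org.apache.hadoop.io.compress.LzoCodec");
  }
}

Block compression generally compresses many small records together, which
tends to give a much better ratio than compressing each record on its own.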


On Sep 19, 2008, at 8:46 AM, Alex Feinberg wrote:


Hi Chris,

I was also unable to decompress by simply running a map/reduce job with
"cat" as a mapper and then doing dfs -get either.

I will try using LzopCodec.

Thanks,
- Alex

On Fri, Sep 19, 2008 at 2:34 AM, Chris Douglas <[EMAIL PROTECTED]> wrote:
It's probably not corrupted. If by "compressed lzo file" you mean
something readable with lzop, you should use LzopCodec, not LzoCodec.
LzoCodec doesn't write header information required by that tool.

Guessing at the output format (length-encoded blocks of data compressed by
the lzo algorithm), it's probably readable by TextInputFormat, but YMMV. If
you wanted to use the C tool, you'll have to add the appropriate header (see
the lzop source or LzopCodec) using a hex editor, and four zero bytes to the
end of the file. You can also use lzo compression in SequenceFiles. -C

On Sep 18, 2008, at 9:15 PM, Alex Feinberg wrote:


Hello,

I am running a custom crawler (written internally) using Hadoop streaming.
I am attempting to compress the output using LZO, but instead I am
receiving corrupted output that is neither in the format I am aiming for
nor a valid compressed lzo file. Is this a known issue? Is there anything
I am doing inherently wrong?

Here is the command line I am using:

~/hadoop/bin/hadoop jar
  /home/hadoop/hadoop/contrib/streaming/hadoop-0.17.2.1-streaming.jar
  -inputformat org.apache.hadoop.mapred.SequenceFileAsTextInputFormat
  -mapper /home/hadoop/crawl_map -reducer NONE
  -jobconf mapred.output.compress=true
  -jobconf mapred.output.compression.codec=org.apache.hadoop.io.compress.LzoCodec
  -input pages -output crawl.lzo -jobconf mapred.reduce.tasks=0

The input is in the form of URLs stored as a SequenceFile.

When running this without LZO compression, no such issue occurs.

Is there any way for me to recover the corrupted data so as to be able to
process it with other Hadoop jobs or offline?

Thanks,

--
Alex Feinberg
Platform Engineer, SocialMedia Networks







--
Alex Feinberg
Platform Engineer, SocialMedia Networks




Re: NotYetReplicated exceptions when pushing large files into HDFS

2008-09-22 Thread lohit
Yes, these are warnings unless they fail 3 times, in which case your
dfs -put command would fail with a stack trace.
Thanks,
Lohit



- Original Message 
From: Ryan LeCompte <[EMAIL PROTECTED]>
To: "core-user@hadoop.apache.org" 
Sent: Monday, September 22, 2008 5:18:01 PM
Subject: Re: NotYetReplicated exceptions when pushing large files into HDFS

I've noticed that although I get a few of these exceptions, the file
is ultimately uploaded to the HDFS cluster. Does this mean that my
file ended up getting there in 1 piece? The exceptions are just logged
at the WARN level and indicate retry attempts.

Thanks,
Ryan


On Mon, Sep 22, 2008 at 11:08 AM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:
> Hello all,
>
> I'd love to be able to upload into HDFS very large files (e.g., 8 or
> 10GB), but it seems like my only option is to chop up the file into
> smaller pieces. Otherwise, after a while I get NotYetReplication
> exceptions while the transfer is in progress. I'm using 0.18.1. Is
> there any way I can do this? Perhaps use something else besides
> bin/hadoop -put input output?
>
> Thanks,
> Ryan
>



Re: NotYetReplicated exceptions when pushing large files into HDFS

2008-09-22 Thread Ryan LeCompte
I've noticed that although I get a few of these exceptions, the file
is ultimately uploaded to the HDFS cluster. Does this mean that my
file ended up getting there in 1 piece? The exceptions are just logged
at the WARN level and indicate retry attempts.

Thanks,
Ryan


On Mon, Sep 22, 2008 at 11:08 AM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:
> Hello all,
>
> I'd love to be able to upload into HDFS very large files (e.g., 8 or
> 10GB), but it seems like my only option is to chop up the file into
> smaller pieces. Otherwise, after a while I get NotYetReplication
> exceptions while the transfer is in progress. I'm using 0.18.1. Is
> there any way I can do this? Perhaps use something else besides
> bin/hadoop -put input output?
>
> Thanks,
> Ryan
>


Re: Reduce tasks running out of memory on small hadoop cluster

2008-09-22 Thread Karl Anderson


On 20-Sep-08, at 7:07 PM, Ryan LeCompte wrote:


Hello all,

I'm setting up a small 3 node hadoop cluster (1 node for
namenode/jobtracker and the other two for datanode/tasktracker). The
map tasks finish fine, but the reduce tasks are failing at about 30%
with an out of memory error. My guess is because the amount of data
that I'm crunching through just won't be able to fit in memory during
the reduce tasks on two machines (max of 2 reduce tasks on each
machine). Is this expected? If I had a large hadoop cluster, then I
could increase the number of reduce tasks on each machine of the
cluster so that not all of the data to be processed is occurring in
just 4 JVMs on two machines like I currently have setup, correct? Is
there any way to get the reduce task to not try and hold all of the
data in memory, or is my only option to add more nodes to the cluster
to therefore increase the number of reduce tasks?


You can set the number of reduce tasks with a configuration option.
More tasks means less input per task; since the number of concurrent
tasks doesn't change, this should help you. I'd like to be able to set
the number of concurrent tasks myself, but haven't noticed a way.
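
For example, a minimal sketch of that configuration option against the old
mapred API (the value 16 is only an illustration; streaming jobs can pass
the equivalent -jobconf mapred.reduce.tasks=16):

import org.apache.hadoop.mapred.JobConf;

public class ReduceTaskTuning {
  // More reduce tasks means each task sees a smaller slice of the map
  // output, which lowers per-task memory pressure.
  public static void setReduces(JobConf conf) {
    conf.setNumReduceTasks(16);   // 16 is an arbitrary example value
  }
}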


In the end, I had to practice better design to reduce my memory  
footprint; sometimes one quick-and-dirty way to do this is to turn one  
job into a chain of jobs that each do less.



Karl Anderson
[EMAIL PROTECTED]
http://monkey.org/~kra





Katta presentation slides

2008-09-22 Thread Deepika Khera
Hi Stefan,

 

Are the slides from the Katta presentation up somewhere? If not then
could you please post them?

 

Thanks,
Deepika



Re: accessing the number of emitted keys

2008-09-22 Thread Sandy
Thanks Owen!

-SM

On Mon, Sep 22, 2008 at 1:02 AM, Owen O'Malley <[EMAIL PROTECTED]> wrote:

>
> On Sep 21, 2008, at 9:33 PM, Sandy wrote:
>
>> Is there a way to get the total number of keys emitted by a particular
>> mapper at the beginning of the combiner function?
>>
>
> The short answer is no. As I said in my previous email, the combiner will
> get called when the first spill is being dumped. This can happen while the
> map is still running in a different thread. Therefore, the number wouldn't
> make much sense. Also note that the combiner may be called a second (or
> third or fourth) time on a given record as the spills are merged.
>
> -- Owen
>
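
As an aside illustrating that last point (a sketch, not from the thread):
because the combiner can be re-applied to already-combined records across
spill merges, it has to compute something that stays correct under repeated
application, which is why a summing combiner like WordCount's is safe:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// A summing combiner: applying it once, twice, or three times to partial
// sums yields the same final total, so it is safe across spill merges.
public class SumCombiner extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}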


Re: Hadoop Cluster Size Scalability Numbers?

2008-09-22 Thread Konstantin Shvachko



Allen Wittenauer wrote:

On 9/21/08 2:51 PM, "Dmitry Pushkarev" <[EMAIL PROTECTED]> wrote:

Speaking about the NFS-backup idea:
If I have secure NFS storage which is much slower than the network (3MB/d vs
the 100MB/s network we use between nodes), will it adversely affect
performance, or can I rely on NFS caching to do the job?


I think Konstantin has some benchmarks in a JIRA somewhere that shows
that the current bottleneck isn't the fsimage/edits writes.


HADOOP-3860 has name-node benchmark numbers.
It concludes that for name-node operations the bottleneck is exactly the
edits writes. But another conclusion is that real-world clusters do not
provide enough load on the name-node for it to reach that bottleneck.
In particular, for NFS I found that although it slows down the name-node,
the slowdown is less than 5%.


And if the NFS share dies, will it shut down the namenode as well?


In our experience, the name node continues. But be warned that it will
only put a message in the name node log that the NFS mount became
unwritable. There is a JIRA open to fix this, though.


The name-node treats NFS shares the same as local ones; it does not
distinguish between different storage directories. The name-node will
continue to run as long as at least one storage directory is available.
So if you have one NFS share and one local directory and NFS fails, the
name-node will continue to run. But if NFS was the only storage directory,
the name-node will shut down.

--Konstantin


NotYetReplicated exceptions when pushing large files into HDFS

2008-09-22 Thread Ryan LeCompte
Hello all,

I'd love to be able to upload into HDFS very large files (e.g., 8 or
10GB), but it seems like my only option is to chop up the file into
smaller pieces. Otherwise, after a while I get NotYetReplication
exceptions while the transfer is in progress. I'm using 0.18.1. Is
there any way I can do this? Perhaps use something else besides
bin/hadoop -put input output?
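
For what it's worth, a sketch of doing the copy through the FileSystem API
instead of the shell (the paths are placeholders, and the configuration is
assumed to come from hadoop-site.xml on the classpath); it goes through the
same client-side write path as dfs -put, so by itself it probably won't
avoid the replication warnings:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUpload {
  public static void main(String[] args) throws Exception {
    // Picks up fs.default.name etc. from hadoop-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Copy a large local file into HDFS; both paths are placeholders.
    fs.copyFromLocalFile(new Path("/local/big-input.dat"),
                         new Path("/user/ryan/big-input.dat"));
    fs.close();
  }
}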

Thanks,
Ryan


Re: Hadoop Cluster Size Scalability Numbers?

2008-09-22 Thread Allen Wittenauer
On 9/21/08 2:51 PM, "Dmitry Pushkarev" <[EMAIL PROTECTED]> wrote:
> Speaking about the NFS-backup idea:
> If I have secure NFS storage which is much slower than the network (3MB/d vs
> the 100MB/s network we use between nodes), will it adversely affect
> performance, or can I rely on NFS caching to do the job?

I think Konstantin has some benchmarks in a JIRA somewhere that shows
that the current bottleneck isn't the fsimage/edits writes.
 
> And if the NFS share dies, will it shut down the namenode as well?

In our experience, the name node continues. But be warned that it will
only put a message in the name node log that the NFS mount became
unwritable. There is a JIRA open to fix this, though.



Re: Format of the value of "fs.default.name" in hadoop-site.xml

2008-09-22 Thread Samuel Guo
You can check ${HADOOP_HOME}/conf/hadoop-default.xml for information about
"fs.default.name":

 
<property>
  <name>fs.default.name</name>
  <value>file:///</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
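
For illustration, a small sketch of how a client resolves that URI (the
namenode host, port, and path below are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DefaultFsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The hdfs:// scheme selects the HDFS implementation; host and port
    // point at the namenode. Both are placeholders here.
    conf.set("fs.default.name", "hdfs://namenode.example.com:9000");

    // FileSystem.get() returns the implementation named by the URI scheme.
    FileSystem fs = FileSystem.get(conf);
    System.out.println("Default filesystem: " + fs.getUri());
    System.out.println("/user exists: " + fs.exists(new Path("/user")));
    fs.close();
  }
}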
 

On Mon, Sep 22, 2008 at 7:38 PM, Latha <[EMAIL PROTECTED]> wrote:

> Hi,
>
> Please let me know if the value of fs.default.name in the
> hadoop-site.xml should be in the format "<hostname>:<port>"?
>
> (or) Can it also be in the format of "hdfs://<hostname>:<port>"?
>
> Would request you to please let me know which one is correct.
>
> Thank you,
> Srilatha
>


Format of the value of "fs.default.name" in hadoop-site.xml

2008-09-22 Thread Latha
Hi,

Please let me know if the value of fs.default.name in the
hadoop-site.xml should be in the format "<hostname>:<port>"?

(or) Can it also be in the format of "hdfs://<hostname>:<port>"?

Would request you to please let me know which one is correct.

Thank you,
Srilatha


Re: The statistical spam filtering

2008-09-22 Thread Steve Loughran

Edward J. Yoon wrote:

Hi all,

To reduce the manual management effort for a planet-scale mail
service, I'm considering statistical spam filtering using the
SpamAssassin, Hadoop (distributed computing), and Hama (parallel
matrix computing) projects.

Please share any advice (or experience)!


Have you spoken to SpamAssassin? They'd probably love to get involved in 
a streams-based filtering system. One thing to know there is that a lot 
of their test data is private, as they have to include lots of 
legitimate email alongside the spam, so their big datasets aren't always 
that public.


Talk to Justin Mason and the spamassassin developers

-steve