Re: Need help in hdfs configuration fully distributed way in Mac OSX...

2008-09-15 Thread Mafish Liu
Hi,
  You need to configure your nodes so that node 1 can connect to node 2
without a password.
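
For example, something along these lines usually does it (only a sketch;
user@machine2 stands in for your own account and host):

# on machine 1, as the user that runs start-dfs.sh
ssh-keygen -t rsa                  # accept the defaults, empty passphrase
cat ~/.ssh/id_rsa.pub | \
  ssh user@machine2 'mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys'
ssh user@machine2 hostname         # should now log in without prompting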

On Tue, Sep 16, 2008 at 2:04 PM, souravm <[EMAIL PROTECTED]> wrote:

> Hi All,
>
> I'm facing a problem in configuring hdfs in a fully distributed way in Mac
> OSX.
>
> Here is the topology -
>
> 1. The namenode is in machine 1
> 2. There is 1 datanode in machine 2
>
> Now when I execute start-dfs.sh from machine 1, it connects to machine 2
> (after it asks for password for connecting to machine 2) and starts datanode
> in machine 2 (as the console message says).
>
> However -
> 1. When I go to http://machine1:50070 - it does not show the data node at
> all. It says 0 data node configured
> 2. In the log file in machine 2 what I see is -
> /
> STARTUP_MSG: Starting DataNode
> STARTUP_MSG:   host = rc0902b-dhcp169.apple.com/17.229.22.169
> STARTUP_MSG:   args = []
> STARTUP_MSG:   version = 0.17.2.1
> STARTUP_MSG:   build =
> https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.17 -r
> 684969; compiled by 'oom' on Wed Aug 20 22:29:32 UTC 2008
> /
> 2008-09-15 18:54:44,626 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: /17.229.23.77:9000. Already tried 1 time(s).
> 2008-09-15 18:54:45,627 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: /17.229.23.77:9000. Already tried 2 time(s).
> 2008-09-15 18:54:46,628 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: /17.229.23.77:9000. Already tried 3 time(s).
> 2008-09-15 18:54:47,629 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: /17.229.23.77:9000. Already tried 4 time(s).
> 2008-09-15 18:54:48,630 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: /17.229.23.77:9000. Already tried 5 time(s).
> 2008-09-15 18:54:49,631 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: /17.229.23.77:9000. Already tried 6 time(s).
> 2008-09-15 18:54:50,632 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: /17.229.23.77:9000. Already tried 7 time(s).
> 2008-09-15 18:54:51,633 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: /17.229.23.77:9000. Already tried 8 time(s).
> 2008-09-15 18:54:52,635 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: /17.229.23.77:9000. Already tried 9 time(s).
> 2008-09-15 18:54:53,640 INFO org.apache.hadoop.ipc.Client: Retrying connect
> to server: /17.229.23.77:9000. Already tried 10 time(s).
> 2008-09-15 18:54:54,641 INFO org.apache.hadoop.ipc.RPC: Server at /
> 17.229.23.77:9000 not available yet, Z...
>
> ... and this retrying keeps repeating
>
>
> The  hadoop-site.xmls are like this -
>
> 1. In machine 1
> -
> <configuration>
>
>   <property>
>     <name>fs.default.name</name>
>     <value>hdfs://localhost:9000</value>
>   </property>
>
>   <property>
>     <name>dfs.name.dir</name>
>     <value>/Users/souravm/hdpn</value>
>   </property>
>
>   <property>
>     <name>mapred.job.tracker</name>
>     <value>localhost:9001</value>
>   </property>
>   <property>
>     <name>dfs.replication</name>
>     <value>1</value>
>   </property>
> </configuration>
>
>
> 2. In machine 2
>
> <configuration>
>
>   <property>
>     <name>fs.default.name</name>
>     <value>hdfs://<machine 1 IP>:9000</value>
>   </property>
>   <property>
>     <name>dfs.data.dir</name>
>     <value>/Users/nirdosh/hdfsd1</value>
>   </property>
>   <property>
>     <name>dfs.replication</name>
>     <value>1</value>
>   </property>
> </configuration>
>
> The slaves file in machine 1 has a single entry - <user>@<machine 2 IP>
>
> The exact steps I did -
>
> 1. Reformat the namenode in machine 1
> 2. execute start-dfs.sh in machine 1
> 3. Then I try to see whether the datanode is created through http://<machine 1 IP>:50070
>
> Any pointer to resolve this issue would be appreciated.
>
> Regards,
> Sourav
>
>
>
>



-- 
[EMAIL PROTECTED]
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.


Need help in hdfs configuration fully distributed way in Mac OSX...

2008-09-15 Thread souravm
Hi All,

I'm facing a problem in configuring hdfs in a fully distributed way in Mac OSX.

Here is the topology -

1. The namenode is in machine 1
2. There is 1 datanode in machine 2

Now when I execute start-dfs.sh from machine 1, it connects to machine 2 (after 
it asks for password for connecting to machine 2) and starts datanode in 
machine 2 (as the console message says).

However -
1. When I go to http://machine1:50070 - it does not show the data node at all. 
It says 0 data node configured
2. In the log file in machine 2 what I see is -
/
STARTUP_MSG: Starting DataNode
STARTUP_MSG:   host = rc0902b-dhcp169.apple.com/17.229.22.169
STARTUP_MSG:   args = []
STARTUP_MSG:   version = 0.17.2.1
STARTUP_MSG:   build = 
https://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.17 -r 684969; 
compiled by 'oom' on Wed Aug 20 22:29:32 UTC 2008
/
2008-09-15 18:54:44,626 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: /17.229.23.77:9000. Already tried 1 time(s).
2008-09-15 18:54:45,627 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: /17.229.23.77:9000. Already tried 2 time(s).
2008-09-15 18:54:46,628 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: /17.229.23.77:9000. Already tried 3 time(s).
2008-09-15 18:54:47,629 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: /17.229.23.77:9000. Already tried 4 time(s).
2008-09-15 18:54:48,630 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: /17.229.23.77:9000. Already tried 5 time(s).
2008-09-15 18:54:49,631 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: /17.229.23.77:9000. Already tried 6 time(s).
2008-09-15 18:54:50,632 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: /17.229.23.77:9000. Already tried 7 time(s).
2008-09-15 18:54:51,633 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: /17.229.23.77:9000. Already tried 8 time(s).
2008-09-15 18:54:52,635 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: /17.229.23.77:9000. Already tried 9 time(s).
2008-09-15 18:54:53,640 INFO org.apache.hadoop.ipc.Client: Retrying connect to 
server: /17.229.23.77:9000. Already tried 10 time(s).
2008-09-15 18:54:54,641 INFO org.apache.hadoop.ipc.RPC: Server at 
/17.229.23.77:9000 not available yet, Z...

... and this retrying keeps repeating


The  hadoop-site.xmls are like this -

1. In machine 1
-
<configuration>

  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>

  <property>
    <name>dfs.name.dir</name>
    <value>/Users/souravm/hdpn</value>
  </property>

  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>


2. In machine 2

<configuration>

  <property>
    <name>fs.default.name</name>
    <value>hdfs://<machine 1 IP>:9000</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/Users/nirdosh/hdfsd1</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

The slaves file in machine 1 has a single entry - <user>@<machine 2 IP>

The exact steps I did -

1. Reformat the namenode in machine 1
2. execute start-dfs.sh in machine 1
3. Then I try to see whether the datanode is created through http://<machine 1 IP>:50070

Any pointer to resolve this issue would be appreciated.
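
A quick connectivity check along these lines (only a sketch; the IP is the
one from the datanode log above, the port comes from fs.default.name) shows
whether the namenode is reachable from machine 2 at all:

# on machine 1: is the namenode listening on 9000, and on which address?
# (if it only shows the loopback address, remote datanodes cannot reach it)
netstat -an | grep 9000

# on machine 2: can the namenode host and port be reached at all?
ping -c 3 17.229.23.77
telnet 17.229.23.77 9000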

Regards,
Sourav





[HELP] "CreateProcess error=193" When running "ant -Dcompile.c++=yes examples"

2008-09-15 Thread zheng daqi
Hello,

I ran into a problem when compiling Hadoop's Pipes examples (under Cygwin).
I have searched a lot and found nobody who has met the same problem.
Could you please give me some suggestions? Thanks very much.

when I run
$ ant
$ ant examples
both BUILD SUCCESSFUL

but when I run
$ ant -Dcompile.c++=yes examples

the error info list below:

Buildfile: build.xml

clover.setup:

clover.info:
 [echo]
 [echo]  Clover not found. Code coverage reports disabled.
 [echo]

clover:

init:
[touch] Creating c:\DOCUME~1\GODBLE~1\LOCALS~1\Temp\null1553129825
   [delete] Deleting: c:\DOCUME~1\GODBLE~1\LOCALS~1\Temp\null1553129825
 [exec] src/saveVersion.sh: line 22: svn: command not found
 [exec] src/saveVersion.sh: line 23: svn: command not found

record-parser:

compile-rcc-compiler:

compile-core-classes:
[javac] Compiling 1 source file to
E:\cygwin\home\godblesswho\hadoopinstall\
hadoop-0.18.0\build\classes

compile-core-native:

check-c++-makefiles:

create-c++-pipes-makefile:

BUILD FAILED
E:\cygwin\home\godblesswho\hadoopinstall\hadoop-0.18.0\build.xml:1115:
Execute failed: java.io.IOException: Cannot run program
"E:\cygwin\home\godblesswho\hadoopinstall\hadoop-0.18.0\src\c++\pipes\configure"
(in directory
"E:\cygwin\home\godblesswho\hadoopinstall\hadoop-0.18.0\build\c++-build\Windows_XP-x86-32\pipes"):
CreateProcess error=193, %1 is not a valid Win32 application


Lots of connections in TIME_WAIT

2008-09-15 Thread damien . cooke
Hi all,
We have a 10 node cluster and I noticed that there is a large number of
connections in TIME_WAIT status, over 400 to be more precise. Does anyone
else see so many unclosed connections, or have I done something wrong in my
config? They appear to be coming from and going to other datanodes.
I was running the javasort benchmark when I noticed it.
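
For reference, this is the rough tally I used (the netstat column that holds
the remote address may differ on your platform, so adjust the awk field):

netstat -an | grep TIME_WAIT | awk '{print $5}' | sort | uniq -c | sort -rn | head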

Regards
Damien






Re: Small Filesizes

2008-09-15 Thread Mafish Liu
Hi,
  I'm just working on the situation you described, with millions of small
files sized around 10 KB.
  My idea is to compact these files into big ones and create indexes for
them. It is a file system layered over the file system, and it supports
append updates and lazy deletes.
  Hope this helps.

-- 
[EMAIL PROTECTED]
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.


Re: How to manage a large cluster?

2008-09-15 Thread Paco NATHAN
We use an EC2 image onto which we install Java, Ant, Hadoop, etc. To
make it simple, pull those from S3 buckets. That provides a flexible
pattern for managing the frameworks involved, more so than needing to
re-do an EC2 image whenever you want to add a patch to Hadoop.

Given that approach, you can add your Hadoop application code
similarly. Just upload the current stable build out of SVN, Git,
whatever, to an S3 bucket.

We use a set of Python scripts to manage a daily, (mostly) automated
launch of 100+ EC2 nodes for a Hadoop cluster.  We also run a listener
on a local server, so that the Hadoop job can send notification when
it completes, and allow the local server to initiate download of
results.  Overall, that minimizes the need for having a sysadmin
dedicated to the Hadoop jobs -- a small dev team can handle it, while
focusing on algorithm development and testing.
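
As a rough sketch of the per-node bootstrap (the bucket and package names
below are made up, and versions will obviously differ):

#!/bin/bash
# runs on each freshly booted EC2 node: pull the frameworks from S3 instead
# of baking them into the image, then start the Hadoop daemons
BUCKET=http://my-deploy-bucket.s3.amazonaws.com   # hypothetical public bucket
cd /usr/local
for pkg in jdk.tar.gz ant.tar.gz hadoop-0.18.0.tar.gz myapp.tar.gz; do
  wget -q "$BUCKET/$pkg" && tar xzf "$pkg"
done
# drop in the cluster-specific config generated by the launch scripts
cp /mnt/hadoop-site.xml /usr/local/hadoop-0.18.0/conf/
/usr/local/hadoop-0.18.0/bin/hadoop-daemon.sh start datanode
/usr/local/hadoop-0.18.0/bin/hadoop-daemon.sh start tasktracker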


>>  Or on EC2 and its competitors, just build a new image whenever you
>>> need to update Hadoop itself.


Re: How to manage a large cluster?

2008-09-15 Thread 叶双明
Sorry, but I can't open it:
http://wiki.smartfrog.org/wiki/display/sf/Patterns+of+Hadoop+Deployment

2008/9/13 Steve Loughran <[EMAIL PROTECTED]>

> James Moore wrote:
>
>> On Thu, Sep 11, 2008 at 5:46 AM, Allen Wittenauer <[EMAIL PROTECTED]>
>> wrote:
>>
>>> On 9/11/08 2:39 AM, "Alex Loddengaard" <[EMAIL PROTECTED]> wrote:
>>>
 I've never dealt with a large cluster, though I'd imagine it is managed
 the same way as small clusters:

>>>   Maybe. :)
>>>
>>
> Depends how often you like to be paged, doesn't it :)
>
>
>>>   Instead, use a real system configuration management package such as
>>> bcfg2, smartfrog, puppet, cfengine, etc.  [Steve, you owe me for the
>>> plug. :) ]
>>>
>>
> Yes Allen, I owe you beer at the next apachecon we are both at.
> Actually, I think Y! were one of the sponsors at the UK event, so we owe
> you for that too.
>
>
>  Or on EC2 and its competitors, just build a new image whenever you
>> need to update Hadoop itself.
>>
>
>
> 1. It's still good to have as much automation of your image build as you
> can; if you can build new machine images on demand you can have fun/make a
> mess of things. Look at http://instalinux.com to see the web GUI for
> creating linux images on demand that is used inside HP.
>
> 2. When you try and bring up everything from scratch, you have a
> choreography problem. DNS needs to be up early, and your authentication
> system, the management tools, then the other parts of the system. If you
> have a project where hadoop is integrated with the front end site, for
> example, your app servers have to stay offline until HDFS is live. So it
> does get complex.
>
> 3. The Hadoop nodes are good here in that you aren't required to bring up
> the namenode first; the datanodes will wait; same for the task trackers and
> job tracker. But if you, say, need to point everything at a new hostname for
> the namenode, well, that's a config change that needs to be pushed out,
> somehow.
>
>
>
> I'm adding some stuff on different ways to deploy hadoop here:
>
> http://wiki.smartfrog.org/wiki/display/sf/Patterns+of+Hadoop+Deployment
>
> -steve
>



-- 
Sorry for my english!! 明
Please help me to correct my english expression and error in syntax


Re: Thinking about retriving DFS metadata from datanodes!!!

2008-09-15 Thread 叶双明
Thanks, Steve Loughran. I learned something from you!

2008/9/13 Steve Loughran <[EMAIL PROTECTED]>

> 叶双明 wrote:
>
>> Thanks for paying attention to my tentative idea!
>>
>> What I thought about isn't how to store the metadata, but a final (or
>> last-resort) way to recover valuable data in the cluster when the worst
>> happens (something that destroys the metadata on all of the multiple
>> NameNodes), e.g. a terrorist attack or natural disaster destroying half
>> of the cluster nodes including all NameNodes. We could then recover as
>> much data as possible through this mechanism, and have a good chance of
>> recovering the cluster's entire data because of the original replication.
>>
>
>
> You want to survive any event that loses a datacentre, you need to mirror
> the data off site, choosing that second site with an up-to-date fault line
> map of the city, geological knowledge of where recent eruptions ended up
> etc. Which is why nobody builds datacentres in Enumclaw WA that I'm aware
> of, the spec for the fabs in/near portland is they ought to withstand 1-2m
> of volcanic ash landing on them (what they'd have got if there'd been an
> easterly wind when Mount Saint Helens went). Then once you have some safe
> location for the second site, talk to your telco about how the
> high-bandwidth backbones in your city flow (Metropolitan Area Ethernet and
> the like), and try and find somewhere that meets your requirements.
>
> Then: come up with a protocol that efficiently keeps the two sites up to
> date. And reliably: S3 went down last month because they'd been using a
> Gossip-style update protocol but weren't checksumming everything, because
> there's no need on a LAN, but of course on a cross-city network more things
> can go wrong, and for them it did.
>
> Something to keep multiple hadoop filesystems synchronised efficiently and
> reliably across sites could be very useful to many people.
>
> -steve
>



-- 
Sorry for my english!! 明
Please help me to correct my english expression and error in syntax


Re: guaranteeing disk space?

2008-09-15 Thread Owen O'Malley


On Sep 15, 2008, at 11:24 AM, Kayla Jay wrote:

How does one do a check or guarantee there's enough disk space when  
running a hadoop job  that you're not sure how much it will produce  
in its results (temp files, etc) ?


In 0.19 there is new code that waits until the first N% of maps are  
run and estimates the amount of space required for each of the  
following tasks. You can see the discussion here:


https://issues.apache.org/jira/browse/HADOOP-657

The task tracker can also set the mapred.local.dir.minspacestart  
variable, which controls the minimum amount of disk space that must be  
free before it will ask for a new task.
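
A quick way to see what a given node has to work with (just a sketch; point
df at wherever mapred.local.dir actually lives on your machines):

# is the property set at all in this node's config?
grep -A 1 mapred.local.dir.minspacestart conf/hadoop-site.xml
# how much scratch space do the task tracker's local dirs really have?
df -h /path/to/mapred/local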


Or, what if you run out of disk space on the HDFS if you are running  
large jobs with large outputs ?  The job just fails .. but how can  
one assess this  resource allocation of disk space while running  
your jobs?


Map/Reduce works by re-executing tasks that fail, including tasks that  
fail for lack of disk space. If the task fails, the partial results  
are erased on the assumption that they will be run later. The tasks  
that finish will have their output in the output directory, even if
the job fails.


-- Owen


Re: guaranteeing disk space?

2008-09-15 Thread Sandy
I'm not sure if this completely answers your question, but I don't think
there is anything built into Hadoop that automates cleanup (between MR
phases). You may have to do it yourself afterwards.

Are you dealing with multiple map reduce phases? If so, your intermediary
files (which are stored in the intermediary directory of your choice) can
just be deleted afterwards. When I'm running multiple jobs I usually write a
wrapper script that runs the job and removes the intermediary files before
it runs the next job.

#!/bin/bash

hadoop jar myjar.jar myjob input intermediate output
hadoop dfs -rmr intermediate
hadoop jar myjar.jar myjob input2 intermediate output2
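
A variant of the same wrapper that checks HDFS first (only a sketch; the
exact wording of the dfsadmin report differs between Hadoop versions, so the
grep/sed will likely need tweaking for your release):

#!/bin/bash
# skip the next job if HDFS looks nearly full
REMAINING=$(hadoop dfsadmin -report | grep -i 'remaining%' | head -1 | sed 's/[^0-9.]//g')
if [ -n "$REMAINING" ] && [ "${REMAINING%%.*}" -lt 10 ]; then
  echo "HDFS nearly full (${REMAINING}% remaining), not starting the next job" >&2
  exit 1
fi
hadoop jar myjar.jar myjob input2 intermediate output2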


I had disk space errors when I was running my job on a previous machine. The
problem was that the logs were filling up too much (so I took out some
stdout statements that I was using for debugging purposes), and then I added
the whole -rmr step to my scripts.

But what will definitely work is getting a machine with more disk space.

Good luck,
-SM

On Mon, Sep 15, 2008 at 1:24 PM, Kayla Jay <[EMAIL PROTECTED]> wrote:

> How does one do a check or guarantee there's enough disk space when running
> a hadoop job  that you're not sure how much it will produce in its results
> (temp files, etc) ?
>
> I.e when you run a hadoop job and you're not exactly sure how much disk
> space it will eat up (given temp dirs), the job will fail if it does run
> out.
>
> How do you guarantee while your job is running that there's enough disk
> space on the nodes and kick off cleanup (so the job won't fail) if you're
> running into low disk space?
>
> For example, if your maps are failing since there isn't enough  temporary
> disk space on your nodes while you run a job, how can you fix that up front
> prior to running or better yet while the job is running from causing a
> failed job? The outputs of maps are stored on the local-disk of the nodes
> on which they were executed, and if your nodes don't have  enough while
> running jobs, how can you fix this at run time?  Can I catch this condition
> at all?
>
> Is there a way to fix this at run time?  How do others solve this issue
> when running jobs that you're not sure how much disk space it will consume?
>
> ---
> Or, what if you run out of disk space on the HDFS if you are running large
> jobs with large outputs ?  The job just fails .. but how can one assess this
>  resource allocation of disk space while running your jobs?
>
> If you run out of HDFS disk space, and you know you want the results of job
> X, is there a way to find out while running that you can do some smart
> cleanup as to not lose what data could've been produced by job X?
>
>
>
>


guaranteeing disk space?

2008-09-15 Thread Kayla Jay
How does one do a check or guarantee there's enough disk space when running a 
hadoop job  that you're not sure how much it will produce in its results (temp 
files, etc) ?

I.e when you run a hadoop job and you're not exactly sure how much disk space 
it will eat up (given temp dirs), the job will fail if it does run out.

How do you guarantee while your job is running that there's enough disk space
on the nodes and kick off cleanup (so the job won't fail) if you're running 
into low disk space?

For example, if your maps are failing since there isn't enough  temporary disk 
space on your nodes while you run a job, how can you fix that up front prior to 
running or better yet while the job is running from causing a failed job? The 
outputs of maps are stored on the local-disk of the nodes  
on which they were executed, and if your nodes don't have  enough while running 
jobs, how can you fix this at run time?  Can I catch this condition at all?

Is there a way to fix this at run time?  How do others solve this issue when 
running jobs that you're not sure how much disk space it will consume?

---
Or, what if you run out of disk space on the HDFS if you are running large jobs 
with large outputs ?  The job just fails .. but how can one assess this  
resource allocation of disk space while running your jobs?

If you run out of HDFS disk space, and you know you want the results of job X, 
is there a way to find out while running that you can do some smart cleanup as 
to not lose what data could've been produced by job X?



  

Re: Small Filesizes

2008-09-15 Thread Konstantin Shvachko

Peter,

You are likely to hit memory limitations on the name-node.
With 100 million small files it will need to support 200 million objects,
which will require roughly 30 GB of RAM on the name-node.
You may also consider hadoop archives or present your files as a
collection of records and use Pig, Hive etc.
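
(For scale: the 30 GB figure follows from the commonly quoted estimate of
roughly 150 bytes of name-node heap per object: 200,000,000 × 150 bytes
≈ 30 GB.)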

--Konstantin

Brian Vargas wrote:


Peter,

In my testing with files of that size (well, larger, but still well
below the block size) it was impossible to achieve any real throughput
on the data because of the overhead of looking up the locations to all
those files on the NameNode.  Your application spends so much time
looking up file names that most of the CPUs sit idle.

A simple solution is to just load all of the small files into a sequence
file, and process the sequence file instead.

Brian

Peter McTaggart wrote:

Hi All,



I am considering using HDFS for an application that potentially has many
small files – i.e. 10-100 million files with an estimated average file size of
50-100k (perhaps smaller) and is an online interactive application.

All of the documentation I have seen suggests that a block size of 64-128 MB
works best for Hadoop/HDFS and it is best used for batch oriented
applications.



Does anyone have any experience using it for files of this size  in an
online application environment?

Is it worth pursuing HDFS for this type of application?



Thanks

Peter





Re: SequenceFiles and binary data

2008-09-15 Thread Owen O'Malley


On Sep 14, 2008, at 7:15 PM, John Howland wrote:


If I want to read values out of input files as binary data, is this
what BytesWritable is for?


yes


I've successfully run my first task that uses a SequenceFile for
output. Are there any examples of SequenceFile usage out there? I'd
like to see the full range of what SequenceFile can do.


If you want serious usage, I'd suggest pulling up Nutch. Distcp also  
uses sequence files as its input.


You should also probably look at the TFile package that Hong is writing.

https://issues.apache.org/jira/browse/HADOOP-3315

Once it is ready, it will likely be exactly what you are looking for.


What are the
trade-offs between record compression and block compression?


You pretty much always want block compression. The only place where
record compression is OK is if your values are web pages or some other
huge chunk of text.



What are
the limits on the key and value sizes?


Large. I think I've seen keys and/or values of around 50-100 MB. They
certainly can't be bigger than 1 GB. I believe the TFile limit on keys
may be 64 KB.



How do you use the per-file
metadata?


It is just an application-specific string-to-string map in the header
of the file.



My intended use is to read files on a local filesystem into a
SequenceFile, with the value of each record being the contents of each
file. I hacked MultiFileWordCount to get the basic concept working...


You should also look at the Hadoop archives.
http://hadoop.apache.org/core/docs/r0.18.0/hadoop_archives.html
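
Roughly (paths here are just placeholders):

# pack a directory of small files into one archive (runs as a map/reduce job)
hadoop archive -archiveName files.har /user/john/smallfiles /user/john/archives
# the archived files are then addressable through the har:// filesystem
hadoop dfs -ls har:///user/john/archives/files.har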


but I'd appreciate any advice from the experts. In particular, what's
the most efficient way to read data from an
InputStreamReader/BufferedReader into a BytesWritable object?


The easiest way is the way you've done it. You probably want to use  
lzo compression too.


-- Owen


Re: Implementing own InputFormat and RecordReader

2008-09-15 Thread Owen O'Malley

On Sep 15, 2008, at 6:13 AM, Juho Mäkinen wrote:


1) The FileInputFormat.getSplits() returns InputSplit[] array. If my
input file is 128MB and my HDFS block size is 64MB, will it return one
InputSplit or two InputSplits?


Your InputFormat needs to define:

protected boolean isSplitable(FileSystem fs, Path filename) {
  return false;
}

which tells the FileInputFormat.getSplits to not split files. You will  
end up with a single split for each file.



2) If my file is split into two or more filesystem blocks, how will
hadoop handle the reading of those blocks? As the file must be read in
sequence, will hadoop first copy every block to a machine (if the
blocks aren't already in there) and then start the mapper in this
machine? Do I need to handle the reading and opening multiple blocks,
or will hadoop provide me a simple stream interface which I can use to
read the entire file without worrying if the file is larger than the
HDFS block size?


HDFS transparently handles the data motion for you. You can just use  
FileSystem.open(path) and HDFS will pull the file from the closest  
location. It doesn't actually move the block to your local disk, just  
gives it to the application. Basically, you don't need to worry about  
it.


There are two downsides to unsplitable files. The first is that if  
they are large, the map times can be very long. The second is that the  
map/reduce scheduler tries to place the tasks close to the data, which  
it can't do very well if the data spans blocks. Of course if data  
isn't splitable, you don't have a choice.


-- Owen

Re: streaming question

2008-09-15 Thread Dennis Kubes

If testlink is a package, it should be:

hadoop jar streaming/hadoop-0.17.0-streaming.jar -input store -output
cout -mapper MyProg -combiner testlink.combiner -reducer testlink.reduce 
-file /home/hadoop/MyProg -cacheFile /shared/part-0#in.cl 
-cacheArchive /related/MyJar.jar#testlink


if not a package, remove the testlink part.

Dennis

Christian Ulrik Søttrup wrote:
Ok, so I added the JAR to the cacheArchive option and my command looks 
like this:


hadoop jar streaming/hadoop-0.17.0-streaming.jar  -input /store/ -output 
/cout/ -mapper MyProg -combiner testlink/combiner.class -reducer 
testlink/reduce.class -file /home/hadoop/MyProg -cacheFile 
/shared/part-0#in.cl -cacheArchive /related/MyJar.jar#testlink


Now it fails because it cannot find the combiner. The cacheArchive
option creates a symlink in the local running directory, correct? Just
like the cacheFile option? If not, how can I then specify which class to
use?


cheers,
Christian

Amareshwari Sriramadasu wrote:

Dennis Kubes wrote:
If I understand what you are asking, you can use the -cacheArchive
option with the path to the jar to include the jar file in the classpath
of your streaming job.


Dennis

You can also use the -cacheArchive option to include a jar file and symlink
the unjarred directory from the cwd by providing the URI as
hdfs://<path>#link. You have to provide the -reducer and -combiner options
as appropriate paths in the unjarred directory.


Thanks
Amareshwari

Christian Søttrup wrote:

Hi all,

I have an application that I used to run with the "hadoop jar" command.
I have now written an optimized version of the mapper in C.
I have run this using the streaming library and everything looks OK
(using num.reducers=0).

Now I want to use this mapper together with the combiner and reducer
from my old .jar file. How do I do this? How can I distribute the jar
and run the reducer and combiner from it, while also running the C
program as the mapper in streaming mode?

cheers,
Christian






Implementing own InputFormat and RecordReader

2008-09-15 Thread Juho Mäkinen
I'm trying to implement my own InputFormat and RecordReader for my
data and I'm stuck as I can't find enough documentation about the
details.

My input format is tightly packed binary data consisting of individual
event packets. Each event packet contains its length, and the packets
are simply appended to the end of a file. Thus the file must be read as
a stream and cannot be split.

FileInputFormat seems like a reasonable place to start but I
immediately ran into problems and unanswered questions:

1) The FileInputFormat.getSplits() returns InputSplit[] array. If my
input file is 128MB and my HDFS block size is 64MB, will it return one
InputSplit or two InputSplits?

2) If my file is split into two or more filesystem blocks, how will
hadoop handle the reading of those blocks? As the file must be read in
sequence, will hadoop first copy every block to a machine (if the
blocks aren't already in there) and then start the mapper in this
machine? Do I need to handle the reading and opening multiple blocks,
or will hadoop provide me a simple stream interface which I can use to
read the entire file without worrying if the file is larger than the
HDFS block size?

 - Juho Mäkinen


Re: Small Filesizes

2008-09-15 Thread Brian Vargas

Peter,

In my testing with files of that size (well, larger, but still well
below the block size) it was impossible to achieve any real throughput
on the data because of the overhead of looking up the locations to all
those files on the NameNode.  Your application spends so much time
looking up file names that most of the CPUs sit idle.

A simple solution is to just load all of the small files into a sequence
file, and process the sequence file instead.

Brian

Peter McTaggart wrote:
> Hi All,
> 
> 
> 
> I am considering using HDFS for an application that potentially has many
> small files – i.e. 10-100 million files with an estimated average file size of
> 50-100k (perhaps smaller) and is an online interactive application.
> 
> All of the documentation I have seen suggests that a block size of 64-128 MB
> works best for Hadoop/HDFS and it is best used for batch oriented
> applications.
> 
> 
> 
> Does anyone have any experience using it for files of this size  in an
> online application environment?
> 
> Is it worth pursuing HDFS for this type of application?
> 
> 
> 
> Thanks
> 
> Peter
> 


Re: streaming question

2008-09-15 Thread Christian Ulrik Søttrup
Ok, so I added the JAR to the cacheArchive option and my command looks 
like this:


hadoop jar streaming/hadoop-0.17.0-streaming.jar  -input /store/ -output 
/cout/ -mapper MyProg -combiner testlink/combiner.class -reducer 
testlink/reduce.class -file /home/hadoop/MyProg -cacheFile 
/shared/part-0#in.cl -cacheArchive /related/MyJar.jar#testlink


Now it fails because it cannot find the combiner. The cacheArchive
option creates a symlink in the local running directory, correct? Just
like the cacheFile option? If not, how can I then specify which class to use?


cheers,
Christian

Amareshwari Sriramadasu wrote:

Dennis Kubes wrote:
If I understand what you are asking, you can use the -cacheArchive
option with the path to the jar to include the jar file in the classpath
of your streaming job.


Dennis

You can also use the -cacheArchive option to include a jar file and symlink
the unjarred directory from the cwd by providing the URI as
hdfs://<path>#link. You have to provide the -reducer and -combiner options
as appropriate paths in the unjarred directory.


Thanks
Amareshwari

Christian Søttrup wrote:

Hi all,

I have an application that I used to run with the "hadoop jar" command.
I have now written an optimized version of the mapper in C.
I have run this using the streaming library and everything looks OK
(using num.reducers=0).

Now I want to use this mapper together with the combiner and reducer
from my old .jar file. How do I do this? How can I distribute the jar
and run the reducer and combiner from it, while also running the C
program as the mapper in streaming mode?

cheers,
Christian






Re: Why can't Hadoop be used for online applications ?

2008-09-15 Thread Steve Loughran

James Moore wrote:

On Fri, Sep 12, 2008 at 12:28 PM, Ryan LeCompte <[EMAIL PROTECTED]> wrote:

Hey Camilo,



Most of the time, I'm working on code where we're just using SQL as a
second-rate way to serialize and deserialize objects.



Aah, O/R mapping...technology to hide SQL from developers who aren't 
deemed competent to understand it. Often seen in conjunction with WS-* 
toolkits to hide XML from the same developers.


I am co-authoring a set of slides/a paper on this whole topic that I 
will stick up online soon; bits of it surfaced at a ThoughtWorks event 
in London last week.  The key concepts are:


1. The past
   The whole architecture of Java Enterprise Edition is optimised for a
   deployment of applications over a small cluster of application servers
   with a (clustered) relational database behind the scenes to glue
   everything together.

2. The present
   Social applications have a scale, and a need for large datamining
   activities for the end users, management, operations and affiliates,
   that cannot be met by this architecture. Hadoop and a distributed
   filesystem provide some of the solution, but not all.

3. The future
   We propose "Java Cloud Computing Edition", which would be an application
   architecture for future Java applications designed for this world. It
   would start with dynamic machine deployment of a distributed filesystem,
   run Hadoop, HBase and the like on top, and have a front end designed to
   bind to this and scale out. And it would all be Open Source, to avoid
   getting lost in JSR committees and being sabotaged by the app server and
   database vendors.


Like I said, I'm still working on the idea. But I like it. And I don't
think trying to push EJB and similar into a world with data kept on
HBase and HDFS is the right approach.


-steve