Re: HDFS - millions of files in one directory?

2009-01-27 Thread Sagar Naik


Consider a system with 1 billion small files.
The namenode will need to maintain the data structures for all those files.
The system will have at least 1 block per file, and if you have the replication
factor set to 3, the system will have 3 billion block replicas.
Now, if you try to read all these files in a job, you will be making
as many as 1 billion socket connections to get these blocks. (Big
Brothers, correct me if I'm wrong.)


Datanodes routinely check for available disk space and generate block
reports. These operations are directly dependent on the number of blocks on
a datanode.


Putting all the data in one file avoids all this unnecessary I/O and reduces
the memory occupied by the namenode.


The number of maps in a map-reduce job is based on the number of blocks. With
many small files, we will have a very large number of map tasks.


-Sagar


Mark Kerzner wrote:

Carfield,

you might be right, and I may be able to combine them into one large file.
What would one use for a delimiter, so that it would never be encountered in
normal binary files? Performance does matter (it rarely doesn't). What are
the differences in performance between using multiple files and one large
file? I would guess that one file should in fact give better hardware/OS
performance, because access is more predictable and allows buffering.

thank you,
Mark

On Sun, Jan 25, 2009 at 9:50 PM, Carfield Yim wrote:

Really? I thought any files can be combined as long as you can figure
out a delimiter; can you really not find a workable delimiter,
like "X"? And in the worst case, or if performance is not
really an issue, maybe just encode all the binary to and from ASCII?

On Mon, Jan 26, 2009 at 5:49 AM, Mark Kerzner 
wrote:

Yes, flip suggested such a solution, but his files are text, so he could
combine them all in a large text file, with each line representing one of the
initial files. My files, however, are binary, so I do not see how I could
combine them.

However, since my numbers are limited to about 1 billion files total, I
should be OK to put them all in a few directories with under, say, 10,000
files each. Maybe a little balanced tree, but 3-4 levels should suffice.

Thank you,
Mark

On Sun, Jan 25, 2009 at 11:43 AM, Carfield Yim wrote:

Would simply having one large file instead of having a lot of
small files be possible?

On Sat, Jan 24, 2009 at 7:03 AM, Mark Kerzner 
wrote:

Hi,

there is a performance penalty in Windows (pardon the expression) if you put
too many files in the same directory. The OS becomes very slow, stops seeing
them, and lies about their status to my Java requests. I do not know if this
is also a problem in Linux, but in HDFS - do I need to balance a directory
tree if I want to store millions of files, or can I put them all in the same
directory?

Thank you,
Mark


Re: decommissioned node showing up as dead node in web-based interface to namenode (dfshealth.jsp)

2009-01-27 Thread paul
Once the nodes are listed as dead, if you still have the host names in your
conf/exclude file, remove the entries and then run hadoop dfsadmin
-refreshNodes.


This works for us on our cluster.



-paul


On Tue, Jan 27, 2009 at 5:08 PM, Bill Au  wrote:

> I was able to decommission a datanode successfully without having to stop
> my
> cluster.  But I noticed that after a node has been decommissioned, it shows
> up as a dead node in the web-based interface to the namenode (i.e.
> dfshealth.jsp).  My cluster is relatively small and losing a datanode will
> have a performance impact.  So I have a need to monitor the health of my
> cluster and take steps to revive any dead datanode in a timely fashion.  So
> is there any way to altogether "get rid of" any decommissioned datanode
> from
> the web interface of the namenode?  Or is there a better way to monitor the
> health of the cluster?
>
> Bill
>


decommissioned node showing up as dead node in web-based interface to namenode (dfshealth.jsp)

2009-01-27 Thread Bill Au
I was able to decommission a datanode successfully without having to stop my
cluster.  But I noticed that after a node has been decommissioned, it shows
up as a dead node in the web-based interface to the namenode (i.e.
dfshealth.jsp).  My cluster is relatively small and losing a datanode will
have a performance impact.  So I have a need to monitor the health of my
cluster and take steps to revive any dead datanode in a timely fashion.  So
is there any way to altogether "get rid of" any decommissioned datanode from
the web interface of the namenode?  Or is there a better way to monitor the
health of the cluster?

Bill


Re: DBOutputFormat and auto-generated keys

2009-01-27 Thread Kevin Peterson
On Mon, Jan 26, 2009 at 5:40 PM, Vadim Zaliva  wrote:

> Is it possible to obtain auto-generated IDs when writing data using
> DBOutputFormat?
>
> For example, is it possible to write a Mapper which stores records in a DB
> and returns the auto-generated
> IDs of these records?

...

> which I would like to store in normalized form in two tables. The first
> table will store
> keys (string). Each key will have a unique int id auto-generated by MySQL.
>
> Second table will have (key_id,value) pairs, key_id being foreign key,
> pointing to first table.
>

A job can only have one output format, and that output format can't pass any
data back into the map, so that approach won't work. DBOutputFormat doesn't
provide any way to do it either.

If you wanted to add this kind of functionality, you would need to write
your own output format: one that is aware of your foreign keys and probably
wouldn't look much like DBOutputFormat. It would quickly get very
complicated.
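
For a sense of what the core of such a writer would have to do, here is a
rough, untested sketch in plain JDBC (the table and column names are invented,
and it is deliberately not wired into an OutputFormat): insert the key row,
read back the generated id, then insert the dependent row.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch only: "key_table" and "kv_table" are made-up table names.
public class TwoTableWriter {
  private final Connection conn;

  public TwoTableWriter(String jdbcUrl, String user, String pass) throws SQLException {
    conn = DriverManager.getConnection(jdbcUrl, user, pass);
  }

  // Insert the key, fetch its auto-generated id, then insert the (key_id, value) row.
  public void write(String key, String value) throws SQLException {
    long keyId;
    PreparedStatement insertKey = conn.prepareStatement(
        "INSERT INTO key_table (key_str) VALUES (?)",
        Statement.RETURN_GENERATED_KEYS);
    try {
      insertKey.setString(1, key);
      insertKey.executeUpdate();
      ResultSet generated = insertKey.getGeneratedKeys();
      generated.next();
      keyId = generated.getLong(1);
    } finally {
      insertKey.close();
    }

    PreparedStatement insertValue = conn.prepareStatement(
        "INSERT INTO kv_table (key_id, val) VALUES (?, ?)");
    try {
      insertValue.setLong(1, keyId);
      insertValue.setString(2, value);
      insertValue.executeUpdate();
    } finally {
      insertValue.close();
    }
  }

  public void close() throws SQLException {
    conn.close();
  }
}

Wrapping something like that in a RecordWriter is the easy part; batching,
retries and partial failures are where it gets complicated.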

One possibility that comes to mind is writing a "HibernateOutputFormat" or
similar, which would give you a way to express the relationships between
tables, leaving your only task as hooking your persistence logic up to a
Hadoop output format.

I had a similar problem with writing out reports to be used by a Rails app,
and solved it by restructuring things so that I don't need to write to two
tables from the same map task.


Re: Using HDFS for common purpose

2009-01-27 Thread Jim Twensky
You may also want to have a look at this to reach a decision based on your
needs:

http://www.swaroopch.com/notes/Distributed_Storage_Systems

Jim

On Tue, Jan 27, 2009 at 1:22 PM, Jim Twensky  wrote:

> Rasit,
>
> What kind of data will you be storing on Hbase or directly on HDFS? Do you
> aim to use it as a data source to do some key/value lookups for small
> strings/numbers or do you want to store larger files labeled with some sort
> of a key and retrieve them during a map reduce run?
>
> Jim
>
>
> On Tue, Jan 27, 2009 at 11:51 AM, Jonathan Gray  wrote:
>
>> Perhaps what you are looking for is HBase?
>>
>> http://hbase.org
>>
>> HBase is a column-oriented, distributed store that sits on top of HDFS and
>> provides random access.
>>
>> JG
>>
>> > -Original Message-
>> > From: Rasit OZDAS [mailto:rasitoz...@gmail.com]
>> > Sent: Tuesday, January 27, 2009 1:20 AM
>> > To: core-user@hadoop.apache.org
>> > Cc: arif.yil...@uzay.tubitak.gov.tr; emre.gur...@uzay.tubitak.gov.tr;
>> > hilal.tara...@uzay.tubitak.gov.tr; serdar.ars...@uzay.tubitak.gov.tr;
>> > hakan.kocaku...@uzay.tubitak.gov.tr; caglar.bi...@uzay.tubitak.gov.tr
>> > Subject: Using HDFS for common purpose
>> >
>> > Hi,
>> > I wanted to ask, if HDFS is a good solution just as a distributed db
>> > (no
>> > running jobs, only get and put commands)
>> > A review says that "HDFS is not designed for low latency" and besides,
>> > it's
>> > implemented in Java.
>> > Do these disadvantages prevent us using it?
>> > Or could somebody suggest a better (faster) one?
>> >
>> > Thanks in advance..
>> > Rasit
>>
>>
>


Re: Using HDFS for common purpose

2009-01-27 Thread Jim Twensky
Rasit,

What kind of data will you be storing on Hbase or directly on HDFS? Do you
aim to use it as a data source to do some key/value lookups for small
strings/numbers or do you want to store larger files labeled with some sort
of a key and retrieve them during a map reduce run?

Jim

On Tue, Jan 27, 2009 at 11:51 AM, Jonathan Gray  wrote:

> Perhaps what you are looking for is HBase?
>
> http://hbase.org
>
> HBase is a column-oriented, distributed store that sits on top of HDFS and
> provides random access.
>
> JG
>
> > -Original Message-
> > From: Rasit OZDAS [mailto:rasitoz...@gmail.com]
> > Sent: Tuesday, January 27, 2009 1:20 AM
> > To: core-user@hadoop.apache.org
> > Cc: arif.yil...@uzay.tubitak.gov.tr; emre.gur...@uzay.tubitak.gov.tr;
> > hilal.tara...@uzay.tubitak.gov.tr; serdar.ars...@uzay.tubitak.gov.tr;
> > hakan.kocaku...@uzay.tubitak.gov.tr; caglar.bi...@uzay.tubitak.gov.tr
> > Subject: Using HDFS for common purpose
> >
> > Hi,
> > I wanted to ask, if HDFS is a good solution just as a distributed db
> > (no
> > running jobs, only get and put commands)
> > A review says that "HDFS is not designed for low latency" and besides,
> > it's
> > implemented in Java.
> > Do these disadvantages prevent us using it?
> > Or could somebody suggest a better (faster) one?
> >
> > Thanks in advance..
> > Rasit
>
>


Re: files are inaccessible after HDFS upgrade from 0.18.1 to 0.19.0

2009-01-27 Thread Yuanyuan Tian

Yes, I did run fsck after upgrade. No error message. Everything is "OK".

yy



   
From: Brian Bockelman
To: core-user@hadoop.apache.org
Date: 01/27/2009 08:57 AM
Subject: Re: files are inaccessible after HDFS upgrade from 0.18.1 to 0.19.0
Please respond to: core-u...@hadoop.apache.org




Hey YY,

At a more basic level -- have you run fsck on that file?  What were
the results?

Brian

On Jan 27, 2009, at 10:54 AM, Bill Au wrote:

> Did you start your namenode with the -upgrade after upgrading from
> 0.18.1 to
> 0.19.0?
>
> Bill
>
> On Mon, Jan 26, 2009 at 8:18 PM, Yuanyuan Tian 
> wrote:
>
>>
>>
>> Hi,
>>
>> I just upgraded hadoop from 0.18.1 to 0.19.0 following the
>> instructions on
>> http://wiki.apache.org/hadoop/Hadoop_Upgrade. After upgrade, I run
>> fsck,
>> everything seems fine. All the files can be listed in hdfs and the
>> sizes
>> are also correct. But when a mapreduce job tries to read the files as
>> input, the following error messages are returned for some of the
>> files:
>>
>> java.io.IOException: Could not obtain block:
>> blk_-2827537120880440835_1131
>> file=/user/hmail/NSF/50k_nntp_clean2.nsf.fs.kvp
>>at org.apache.hadoop.hdfs.DFSClient
>> $DFSInputStream.chooseDataNode(DFSClient.java:1708)
>>at org.apache.hadoop.hdfs.DFSClient
>> $DFSInputStream.blockSeekTo
>> (DFSClient.java:1536)
>>at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read
>> (DFSClient.java:1663)
>>at java.io.DataInputStream.read(DataInputStream.java:150)
>>at java.io.ObjectInputStream$PeekInputStream.read
>> (ObjectInputStream.java:2283)
>>at java.io.ObjectInputStream$PeekInputStream.readFully
>> (ObjectInputStream.java:2296)
>>at java.io.ObjectInputStream
>> $BlockDataInputStream.readShort
>> (ObjectInputStream.java:2767)
>>at java.io.ObjectInputStream.readStreamHeader
>> (ObjectInputStream.java:798)
>>at java.io.ObjectInputStream.<init>(ObjectInputStream.java:298)
>>at
>> emailanalytics.importer.parallelimport.EmailContentRecordReader.<init>(EmailContentRecordReader.java:32)
>>
>>at
>> emailanalytics
>> .importer.parallelimport.EmailContentFormat.getRecordReader
>> (EmailContentFormat.java:20)
>>at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321)
>>at org.apache.hadoop.mapred.Child.main(Child.java:155)
>>
>> I also tried to browse these files through the HDFS web interface,
>> java.io.EOFException is returned.
>>
>> Is there any way to recover the files?
>>
>> Thanks very much,
>>
>> YY



Re: files are inaccessible after HDFS upgrade from 0.18.1 to 0.19.0

2009-01-27 Thread Yuanyuan Tian

Yes, I did that, but there was an error message asking me to rollback
first. So I ended up doing a -rollback first and then an -upgrade.

yy



   
From: Bill Au
To: core-user@hadoop.apache.org
Date: 01/27/2009 08:54 AM
Subject: Re: files are inaccessible after HDFS upgrade from 0.18.1 to 0.19.0
Please respond to: core-u...@hadoop.apache.org




Did you start your namenode with the -upgrade after upgrading from 0.18.1
to
0.19.0?

Bill

On Mon, Jan 26, 2009 at 8:18 PM, Yuanyuan Tian  wrote:

>
>
> Hi,
>
> I just upgraded hadoop from 0.18.1 to 0.19.0 following the instructions
on
> http://wiki.apache.org/hadoop/Hadoop_Upgrade. After upgrade, I run fsck,
> everything seems fine. All the files can be listed in hdfs and the sizes
> are also correct. But when a mapreduce job tries to read the files as
> input, the following error messages are returned for some of the files:
>
> java.io.IOException: Could not obtain block:
blk_-2827537120880440835_1131
> file=/user/hmail/NSF/50k_nntp_clean2.nsf.fs.kvp
> at org.apache.hadoop.hdfs.DFSClient
> $DFSInputStream.chooseDataNode(DFSClient.java:1708)
> at org.apache.hadoop.hdfs.DFSClient
$DFSInputStream.blockSeekTo
> (DFSClient.java:1536)
> at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read
> (DFSClient.java:1663)
> at java.io.DataInputStream.read(DataInputStream.java:150)
> at java.io.ObjectInputStream$PeekInputStream.read
> (ObjectInputStream.java:2283)
> at java.io.ObjectInputStream$PeekInputStream.readFully
> (ObjectInputStream.java:2296)
> at java.io.ObjectInputStream$BlockDataInputStream.readShort
> (ObjectInputStream.java:2767)
> at java.io.ObjectInputStream.readStreamHeader
> (ObjectInputStream.java:798)
> at java.io.ObjectInputStream.<init>(ObjectInputStream.java:298)
> at
> emailanalytics.importer.parallelimport.EmailContentRecordReader.<init>(EmailContentRecordReader.java:32)

>
> at
> emailanalytics.importer.parallelimport.EmailContentFormat.getRecordReader
> (EmailContentFormat.java:20)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321)
> at org.apache.hadoop.mapred.Child.main(Child.java:155)
>
> I also tried to browse these files through the HDFS web interface,
> java.io.EOFException is returned.
>
> Is there any way to recover the files?
>
> Thanks very much,
>
> YY


RE: Using HDFS for common purpose

2009-01-27 Thread Jonathan Gray
Perhaps what you are looking for is HBase?

http://hbase.org

HBase is a column-oriented, distributed store that sits on top of HDFS and 
provides random access.
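
For a flavour of that random access from client code, here is a minimal
sketch (written against a later HBase client API than the one current at the
time of this thread, with an invented table name "mytable" and column family
"data", so treat it as illustrative only):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseGetPut {
  public static void main(String[] args) throws Exception {
    // Assumes a table "mytable" with column family "data" already exists.
    HTable table = new HTable(HBaseConfiguration.create(), "mytable");

    // put: store a value under (row, family:qualifier)
    Put put = new Put(Bytes.toBytes("row-1"));
    put.add(Bytes.toBytes("data"), Bytes.toBytes("payload"), Bytes.toBytes("hello"));
    table.put(put);

    // get: random access to the same row
    Get get = new Get(Bytes.toBytes("row-1"));
    Result result = table.get(get);
    byte[] value = result.getValue(Bytes.toBytes("data"), Bytes.toBytes("payload"));
    System.out.println(Bytes.toString(value));

    table.close();
  }
}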

JG

> -Original Message-
> From: Rasit OZDAS [mailto:rasitoz...@gmail.com]
> Sent: Tuesday, January 27, 2009 1:20 AM
> To: core-user@hadoop.apache.org
> Cc: arif.yil...@uzay.tubitak.gov.tr; emre.gur...@uzay.tubitak.gov.tr;
> hilal.tara...@uzay.tubitak.gov.tr; serdar.ars...@uzay.tubitak.gov.tr;
> hakan.kocaku...@uzay.tubitak.gov.tr; caglar.bi...@uzay.tubitak.gov.tr
> Subject: Using HDFS for common purpose
> 
> Hi,
> I wanted to ask, if HDFS is a good solution just as a distributed db
> (no
> running jobs, only get and put commands)
> A review says that "HDFS is not designed for low latency" and besides,
> it's
> implemented in Java.
> Do these disadvantages prevent us using it?
> Or could somebody suggest a better (faster) one?
> 
> Thanks in advance..
> Rasit



Re: HDFS - millions of files in one directory?

2009-01-27 Thread Philip (flip) Kromer
Tossing one more onto this king of all threads:
Stuart Sierra of AltLaw wrote a nice little tool to serialize tar.bz2 files
into a SequenceFile, with the filename as key and its contents as a
BLOCK-compressed blob.
  http://stuartsierra.com/2008/04/24/a-million-little-files
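
The core of that trick is small; a rough, untested sketch of packing a local
directory of small files into one BLOCK-compressed SequenceFile (file name as
key, raw bytes as value; the paths here are only placeholders, and this is
not Stuart's actual code) would look something like:

import java.io.File;
import java.io.FileInputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    File inputDir = new File("/tmp/small-files");     // placeholder
    Path output = new Path("/user/mark/packed.seq");  // placeholder

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, output, Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK);
    try {
      for (File f : inputDir.listFiles()) {
        byte[] bytes = new byte[(int) f.length()];
        FileInputStream in = new FileInputStream(f);
        try {
          in.read(bytes);  // good enough for a sketch; real code should read in a loop
        } finally {
          in.close();
        }
        // one record per original file: (file name, file contents)
        writer.append(new Text(f.getName()), new BytesWritable(bytes));
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}

Reading the records back in a map-reduce job is then just a matter of using
SequenceFileInputFormat.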

flip


On Mon, Jan 26, 2009 at 3:20 PM, Mark Kerzner  wrote:

> Jason, this is awesome, thank you.
> By the way, is there a book or manual with "best practices?"
>
> On Mon, Jan 26, 2009 at 3:13 PM, jason hadoop  >wrote:
>
> > Sequence files rock, and you can use the bin/hadoop dfs -text FILENAME
> > command line tool to get a toString-level unpacking of the sequence file
> > key,value pairs.
> >
> > If you provide your own key or value classes, you will need to implement a
> > toString method to get some use out of this. Also, your classpath will need
> > to include the jars with your custom key/value classes.
> >
> > HADOOP_CLASSPATH="myjar1:myjar2..." bin/hadoop dfs -text FILENAME
> >
> >
> > On Mon, Jan 26, 2009 at 1:08 PM, Mark Kerzner 
> > wrote:
> >
> > > Thank you, Doug, then all is clear in my head.
> > > Mark
> > >
> > > On Mon, Jan 26, 2009 at 3:05 PM, Doug Cutting 
> > wrote:
> > >
> > > > Mark Kerzner wrote:
> > > >
> > > >> Okay, I am convinced. I only noticed that Doug, the originator, was
> > not
> > > >> happy about it - but in open source one has to give up control
> > > sometimes.
> > > >>
> > > >
> > > > I think perhaps you misunderstood my remarks.  My point was that, if
> > you
> > > > looked to Nutch's Content class for an example, it is, for historical
> > > > reasons, somewhat more complicated than it needs to be and is thus a
> > less
> > > > than perfect example.  But using SequenceFile to store web content is
> > > > certainly a best practice and I did not mean to imply otherwise.
> > > >
> > > > Doug
> > > >
> > >
> >
>



-- 
http://www.infochimps.org
Connected Open Free Data


Re: files are inaccessible after HDFS upgrade from 0.18.1 to 0.19.0

2009-01-27 Thread Brian Bockelman

Hey YY,

At a more basic level -- have you run fsck on that file?  What were  
the results?


Brian

On Jan 27, 2009, at 10:54 AM, Bill Au wrote:

Did you start your namenode with the -upgrade after upgrading from  
0.18.1 to

0.19.0?

Bill

On Mon, Jan 26, 2009 at 8:18 PM, Yuanyuan Tian   
wrote:





Hi,

I just upgraded hadoop from 0.18.1 to 0.19.0 following the  
instructions on
http://wiki.apache.org/hadoop/Hadoop_Upgrade. After upgrade, I run  
fsck,
everything seems fine. All the files can be listed in hdfs and the  
sizes

are also correct. But when a mapreduce job tries to read the files as
input, the following error messages are returned for some of the  
files:


java.io.IOException: Could not obtain block:  
blk_-2827537120880440835_1131

file=/user/hmail/NSF/50k_nntp_clean2.nsf.fs.kvp
   at org.apache.hadoop.hdfs.DFSClient
$DFSInputStream.chooseDataNode(DFSClient.java:1708)
   at org.apache.hadoop.hdfs.DFSClient 
$DFSInputStream.blockSeekTo

(DFSClient.java:1536)
   at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read
(DFSClient.java:1663)
   at java.io.DataInputStream.read(DataInputStream.java:150)
   at java.io.ObjectInputStream$PeekInputStream.read
(ObjectInputStream.java:2283)
   at java.io.ObjectInputStream$PeekInputStream.readFully
(ObjectInputStream.java:2296)
   at java.io.ObjectInputStream 
$BlockDataInputStream.readShort

(ObjectInputStream.java:2767)
   at java.io.ObjectInputStream.readStreamHeader
(ObjectInputStream.java:798)
   at java.io.ObjectInputStream.<init>(ObjectInputStream.java:298)
   at
emailanalytics.importer.parallelimport.EmailContentRecordReader.<init>(EmailContentRecordReader.java:32)


   at
emailanalytics 
.importer.parallelimport.EmailContentFormat.getRecordReader

(EmailContentFormat.java:20)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321)
   at org.apache.hadoop.mapred.Child.main(Child.java:155)

I also tried to browse these files through the HDFS web interface,
java.io.EOFException is returned.

Is there any way to recover the files?

Thanks very much,

YY




Re: files are inaccessible after HDFS upgrade from 0.18.1 to 0.19.0

2009-01-27 Thread Bill Au
Did you start your namenode with the -upgrade after upgrading from 0.18.1 to
0.19.0?

Bill

On Mon, Jan 26, 2009 at 8:18 PM, Yuanyuan Tian  wrote:

>
>
> Hi,
>
> I just upgraded hadoop from 0.18.1 to 0.19.0 following the instructions on
> http://wiki.apache.org/hadoop/Hadoop_Upgrade. After upgrade, I run fsck,
> everything seems fine. All the files can be listed in hdfs and the sizes
> are also correct. But when a mapreduce job tries to read the files as
> input, the following error messages are returned for some of the files:
>
> java.io.IOException: Could not obtain block: blk_-2827537120880440835_1131
> file=/user/hmail/NSF/50k_nntp_clean2.nsf.fs.kvp
> at org.apache.hadoop.hdfs.DFSClient
> $DFSInputStream.chooseDataNode(DFSClient.java:1708)
> at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.blockSeekTo
> (DFSClient.java:1536)
> at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.read
> (DFSClient.java:1663)
> at java.io.DataInputStream.read(DataInputStream.java:150)
> at java.io.ObjectInputStream$PeekInputStream.read
> (ObjectInputStream.java:2283)
> at java.io.ObjectInputStream$PeekInputStream.readFully
> (ObjectInputStream.java:2296)
> at java.io.ObjectInputStream$BlockDataInputStream.readShort
> (ObjectInputStream.java:2767)
> at java.io.ObjectInputStream.readStreamHeader
> (ObjectInputStream.java:798)
> at java.io.ObjectInputStream.<init>(ObjectInputStream.java:298)
> at
> emailanalytics.importer.parallelimport.EmailContentRecordReader.<init>(EmailContentRecordReader.java:32)
>
> at
> emailanalytics.importer.parallelimport.EmailContentFormat.getRecordReader
> (EmailContentFormat.java:20)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:321)
> at org.apache.hadoop.mapred.Child.main(Child.java:155)
>
> I also tried to browse these files through the HDFS web interface,
> java.io.EOFException is returned.
>
> Is there any way to recover the files?
>
> Thanks very much,
>
> YY


Funding opportunities for Academic Research on/near Hadoop

2009-01-27 Thread Steve Loughran



This is a little note to advise universities working on Hadoop-related
projects that they may be able to get some money and cluster time for some
fun things:

http://www.hpl.hp.com/open_innovation/irp/

"The HP Labs Innovation Research Program is designed to create 
opportunities -- at colleges,
universities and research institutes around the world -- for 
breakthrough collaborative research

with HP.

HP Labs is proud to announce the 2009 Innovation Research Program (IRP). 
Through this open Call for
Proposals, we are soliciting your best ideas on a range of topics with 
the goal of establishing new
research collaborations. Proposals will be invited against targeted IRP 
Research Topics, and will be
accepted via an online submission tool. They will be reviewed by HP Labs 
scientists for alignment

with the selected research topic and impact of the proposed research.

Awards under the 2009 HP Labs Innovation Research Program are primarily 
intended to provide
financial support for a graduate student to assist the Principal 
Investigator conducting a
collaborative research project with HP Labs. Consequently, awards will 
provide cash support for one

year in the range of USD $50,000 to $75,000, including any overhead."

If you look at the research topics there is a PDF file listing topics of
interest, of which three general categories may be relevant:
http://www.hpl.hp.com/open_innovation/irp/topics_2009.html

1. "Intelligent Infrastructure" - very large scale storage systems,
   management, etc.
2. Sustainability - especially sustainable datacentres: how to measure
   application power consumption, and improve it; how to include knowledge
   of the physical infrastructure in computation
3. "Cloud" - Large-scale computing frameworks, Data management and security,
   Federation of heterogeneous cloud sites, Programming tools and mash-ups,
   Complex event processing and management, Massive-Scale Data Analytics,
   Cloud monitoring and management.


If you look at that Cloud topic, not only does Hadoop-related work seem to
fit in, the call for proposals is fairly explicit in mentioning the
ecosystem's suitability as a platform for your work. Which makes sense, as it
is the only very-large-scale data-centric computing platform out there for
which the source code is freely available. Yet also, because it is open
source, it is a place where, university permitting, your research can be
contributed back to the community, and used by grateful users the world over.

What is also interesting is that little line at the bottom: "We encourage
investigators to utilize the capabilities in the Open Cirrus testbed as well
as to share their experience, data, and algorithms with other researchers
using the testbed."  Which implies that cluster time on the new
cross-company, cross-university homogeneous datacentre test bed should be
available to test your ideas.


If you are at university, have a look at the proposals and see if you can
come up with a proposal for innovative work in this area. The timescales are
fairly aggressive - that is to ensure that proposers will know early on
whether or not they were successful, and the money will be in their
University's hands for the next academic year.

-Steve

(for followup queries, follow the links on the site or email me direct; 
I am vaguely involved in some of this)


[ANNOUNCE] Registration for ApacheCon Europe 2009 is now open!

2009-01-27 Thread Owen O'Malley

All,
   I'm broadcasting this to all of the Hadoop dev and user lists;
however, in the future I'll only send cross-subproject announcements
to gene...@hadoop.apache.org. Please subscribe over there too! It is
very low traffic.
   Anyway, ApacheCon Europe is coming up in March. There is a range
of Hadoop talks being given:


Introduction to Hadoop by Owen O'Malley
Hadoop Map/Reduce: Tuning and Debugging by Arun Murthy
Pig - Making Hadoop Easy by Olga Natkovich
Running Hadoop in the Cloud by Tom White
Architectures for the Cloud by Steve Loughran
Configuring Hadoop for Grid Services by Allen Wittenauer
Dynamic Hadoop Clusters by Steve Loughran
HBasics: An Introduction to Hadoop's Big Data Database by Michael Stack
Hadoop Tools and Tricks for Data Pipelines by Christophe Bisciglia
Introducing Mahout: Apache Machine Learning by Grant Ingersoll

-- Owen

Begin forwarded message:


From: Shane Curcuru 
Date: January 27, 2009 6:15:25 AM PST
Subject: [ANN] Registration for ApacheCon Europe 2009 is now open!

PMC moderators - please forward the below to any appropriate dev@ or  
users@ lists so your larger community can hear about ApacheCon  
Europe. Remember, ACEU09 has scheduled sessions spanning the breadth  
of the ASF's projects, subprojects, and podlings, including at  
least: ActiveMQ, ServiceMix, CXF, Axis2, Hadoop, Felix, Sling,  
Maven, Struts, Roller, Shindig, Geronimo, Lucene, Solr, BSF, Mina,  
Directory, Tomcat, httpd, Mahout, Bayeux, CouchDB, AntUnit,  
Jackrabbit, Archiva, Wicket, POI, Pig, Synapse, Droids, Continuum.



ApacheCon EU 2009 registration is now open!
23-27 March -- Mövenpick Hotel, Amsterdam, Netherlands
http://www.eu.apachecon.com/


Registration for ApacheCon Europe 2009 is now open - act before early
bird prices expire 6 February.  Remember to book a room at the Mövenpick
and use the Registration Code: Special package attendees for the
conference registration, and get 150 Euros off your full conference
registration.

Lower Costs - Thanks to new VAT laws, our prices this year are 19%
lower than last year in Europe!  We've also negotiated a Mövenpick rate
of a maximum of 155 Euros per night for attendees in our room block.

Quick Links:

  http://xrl.us/aceu09sp  See the schedule
  http://xrl.us/aceu09hp  Get your hotel room
  http://xrl.us/aceu09rp  Register for the conference

Other important notes:

- Geeks for Geeks is a new mini-track where we can feature advanced
technical content from project committers.  And our Hackathon on Monday
and Tuesday is open to all attendees - be sure to check it off in your
registration.

- The Call for Papers for ApacheCon US 2009, held 2-6 November
2009 in Oakland, CA, is open through 28 February, so get your
submissions in now.  This ApacheCon will feature special events with
some of the ASF's original founders in celebration of the 10th
anniversary of The Apache Software Foundation.

  http://www.us.apachecon.com/c/acus2009/

- Interested in sponsoring the ApacheCon conferences?  There are plenty
of sponsor packages available - please contact Delia Frees at
de...@apachecon.com for further information.

==
ApacheCon EU 2009: A week of Open Source at its best!

Hackathon - open to all! | Geeks for Geeks | Lunchtime Sessions
In-Depth Trainings | Multi-Track Sessions | BOFs | Business Panel
Lightning Talks | Receptions | Fast Feather Track | Expo... and more!

- Shane Curcuru, on behalf of
 Noirin Shirley, Conference Lead,
 and the whole ApacheCon Europe 2009 Team
 http://www.eu.apachecon.com/  23-27 March -- Amsterdam, Netherlands






Number of records in a MapFile

2009-01-27 Thread Andy Liu
Is there a way to programmatically get the number of records in a MapFile
without doing a complete scan?


Re: Where are the meta data on HDFS ?

2009-01-27 Thread Rasit OZDAS
Hi Tien,

Configuration config = new Configuration(true);
config.addResource(new Path("/etc/hadoop-0.19.0/conf/hadoop-site.xml"));

FileSystem fileSys = FileSystem.get(config);
BlockLocation[] locations = fileSys.getFileBlockLocations(.

I copied some lines from my code; it may also help if you prefer using the
API.
It has other useful features (methods) as well.
http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/fs/FileSystem.html
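
To fill that out into something runnable (the file path below is only an
example), the call takes a FileStatus plus an offset and length:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    Configuration config = new Configuration(true);
    config.addResource(new Path("/etc/hadoop-0.19.0/conf/hadoop-site.xml"));

    FileSystem fileSys = FileSystem.get(config);

    Path file = new Path("/user/someuser/somefile");  // example path only
    FileStatus status = fileSys.getFileStatus(file);
    BlockLocation[] locations =
        fileSys.getFileBlockLocations(status, 0, status.getLen());

    for (BlockLocation loc : locations) {
      // each BlockLocation lists the datanodes holding that block
      System.out.println(loc.getOffset() + " -> "
          + java.util.Arrays.toString(loc.getHosts()));
    }
  }
}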


2009/1/24 tienduc_dinh 

>
> that's what I needed !
>
> Thank you so much.
> --
> View this message in context:
> http://www.nabble.com/Where-are-the-meta-data-on-HDFS---tp21634677p21644206.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


-- 
M. Raşit ÖZDAŞ


Re: Zeroconf for hadoop

2009-01-27 Thread Steve Loughran

Edward Capriolo wrote:

Zeroconf is more focused on simplicity than security. One of the
original problems, which may have since been fixed, is that any program can
announce any service, e.g. my laptop can announce that it is the DNS for
google.com, etc.



-1 to zeroconf, as it is way too chatty. Every DNS lookup is multicast, and
in a busy network a lot of CPU time is spent discarding requests. Nor does it
handle failure that well. It's OK on a home LAN for finding a music player,
but not what you want for an HA infrastructure in the datacentre.


Our LAN discovery tool - Anubis - uses multicast only to do the initial
discovery; the nodes then vote to select a nominated server that everyone
just unicasts to from that point on; failure of that node or a network
partition triggers a rebinding.


See: http://wiki.smartfrog.org/wiki/display/sf/Anubis ; the paper
discusses some of the fun you have, though it doesn't cover the clock
drift issues you can encounter when running Xen or VMware-hosted nodes.




I want to mention a related topic to the list. People are approaching
auto-discovery in a number of ways in the jira. There are a few ways I
can think of to discover hadoop. A very simple way might be to publish
the configuration over a web interface. I use a network storage system
called gluster-fs. Gluster can be configured so the server holds the
configuration for each client. If the hadoop namenode held the entire
configuration for all the nodes, each node would only need to be
aware of the namenode and could retrieve its configuration from it.

Having a central configuration management or discovery system would
be very useful. HOD is, I think, the closest thing; it is more
of a top-down deployment system.


Allen is a fan of a well managed cluster; he pushes out Hadoop as RPMs 
via PXE and Kickstart and uses LDAP as the central CM tool. I am 
currently exploring bringing up virtual clusters by
 * putting the relevant RPMs out to all nodes; same files/conf for 
every node,
 * having custom configs for Namenode and job tracker; everything else 
becomes a Datanode with a task tracker bound to the masters.
I will start worrying about discovery afterwards, because without the 
ability for the Job Tracker or Namenode to do failover to a fallback Job 
Tracker or Namenode, you don't really need so much in the way of dynamic 
cluster binding.


-steve


Re: Interrupting JobClient.runJob

2009-01-27 Thread Amareshwari Sriramadasu

Edwin wrote:

Hi

I am looking for a way to interrupt a thread that entered
JobClient.runJob(). The runJob() method keeps polling the JobTracker until
the job is completed. After reading the source code, I know that the
InterruptException is caught in runJob(). Thus, I can't interrupt it using
Thread.interrupt() call. Is there anyway I can interrupt a polling thread
without terminating the job? If terminating the job is the only way to
escape, how can I terminate the current job?

Thank you very much.

Regards
Edwin

  

Yes, there is no way to stop the client from polling.
If you want to stop the client thread, use Ctrl+C or kill the client
process itself.


You can kill a job using the command:
bin/hadoop job -kill 
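
If you need the waiting itself to be interruptible, one workaround is to skip
runJob() and poll on your own via submitJob(); a rough, untested sketch
against the org.apache.hadoop.mapred API:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class InterruptiblePoll {
  // Submit without blocking, then poll; the polling thread can be interrupted.
  public static void runInterruptibly(JobConf conf) throws Exception {
    JobClient client = new JobClient(conf);
    RunningJob job = client.submitJob(conf);
    try {
      while (!job.isComplete()) {
        Thread.sleep(5000);  // InterruptedException propagates to the caller
      }
    } catch (InterruptedException ie) {
      // decide here whether to leave the job running on the cluster,
      // or kill it as well:
      // job.killJob();
      throw ie;
    }
  }
}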

-Amareshwari


Interrupting JobClient.runJob

2009-01-27 Thread Edwin
Hi

I am looking for a way to interrupt a thread that entered
JobClient.runJob(). The runJob() method keeps polling the JobTracker until
the job is completed. After reading the source code, I know that the
InterruptException is caught in runJob(). Thus, I can't interrupt it using
Thread.interrupt() call. Is there anyway I can interrupt a polling thread
without terminating the job? If terminating the job is the only way to
escape, how can I terminate the current job?

Thank you very much.

Regards
Edwin


Using HDFS for common purpose

2009-01-27 Thread Rasit OZDAS
Hi,
I wanted to ask whether HDFS is a good solution just as a distributed DB (no
running jobs, only get and put commands).
A review says that "HDFS is not designed for low latency", and besides, it's
implemented in Java.
Do these disadvantages prevent us from using it?
Or could somebody suggest a better (faster) alternative?

Thanks in advance..
Rasit