Re: hadoop hardware configuration

2009-05-29 Thread stephen mulcahy

Brian Bockelman wrote:
I'm not a hardware guy anymore, but I'd personally prefer a software 
RAID.  I've seen mirrored disks go down because the RAID controller 
decided to puke.


+1 on this. I've seen a number of hardware RAID failures in the last 2 
years and in each case the controller mangled the disks before finally 
giving up the ghost.


I'm very much inclined to prefer software RAID in light of this (and the 
fact that low-end RAID controllers have poor performance).


-stephen

--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.ie  http://webstar.deri.ie  http://sindice.com


Re: org.apache.hadoop.ipc.client : trying connect to server failed

2009-05-29 Thread Steve Loughran

ashish pareek wrote:

Yes, I am able to ping and ssh between the two virtual machines, and I have
even set the IP addresses of both virtual machines in their respective
/etc/hosts files ...

Thanks for the reply ... if you could suggest anything else I might have
missed, or any remedy ...

Regards,
Ashish Pareek.


VMs? VMWare? Xen? Something else?

I've encountered problems on virtual networks where the machines aren't 
locatable via DNS, and can't be sure they are who they say they are.


1. Start the machines individually, instead of using the start-all script, 
which needs SSH to be working too.

2. Check with netstat -a to see which ports/interfaces they are listening on.

-steve



Hadoop on demand intelligence with persistent HDFS.

2009-05-29 Thread Brock Palen

We have an existing HPC cluster which runs torque+moab.

There have been requests for a persistent HDFS on a subset of the  
nodes (~20 nodes of the 800).


We can use Moab to force HOD jobs to go only to those 20 nodes.  Thus  
there should be hadoop map-reduce workers on nodes that have HDFS  
local also.


The question, then, is: for work given to an HOD instance, will Hadoop  
move the computation to the nodes closest to the data on HDFS?


I know Hadoop does this when it runs the whole cluster, but what  
does HOD do with an external HDFS, even if the HOD nodes overlap some  
(maybe not all) of the HDFS nodes?


Rack locality won't matter currently; the 20 nodes will all be in the  
same blade chassis.


Our goal is to keep running our normal HPC workload, but to provide an  
HDFS that sticks around and also gives decent performance  
relative to normal Hadoop clusters.




Brock Palen
bro...@mlds-networks.com
www.mlds-networks.com
MLDS Owner Senior Tech.




London Open Source Search meetup - Mon 15th June

2009-05-29 Thread René Kriegler

Hi all,

We are organising another open source search social evening (OSSSE?) in
London on Monday the 15th of June.

The plan is to get together and chat about search technology, from Lucene to
Solr, Hadoop, Mahout, Ferret and the like. We are planning on making this a
regular event, bringing together people from across
the field to discuss ideas and ask questions over a quiet drink.

Please come along if you can.

René & Rich (Marr)



Public notice:
http://richmarr.wordpress.com/2009/05/28/open-source-search-social/

Yahoo event page:
http://upcoming.yahoo.com/event/2788561/
-- 
View this message in context: 
http://www.nabble.com/London-Open-Source-Search-meetup---Mon-15th-June-Click-to-flag-this-post-tp23781865p23781865.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: New version/API stable?

2009-05-29 Thread David Rosenstrauch
Thanks much for the advice, Alex.  Sounds like I'm probably best off 
sticking with 0.18.3 for right now, with an eye on upgrading to 0.20 soon.


Thanks again!

DR

Alex Loddengaard wrote:

0.19 is considered unstable by us at Cloudera and by the Y! folks; they
never deployed it to their clusters.  That said, we recommend 0.18.3 as the
most stable version of Hadoop right now.  Y! has (or will soon) deploy(ed)
0.20, which implies that it's at least stable enough for them to give it a
go.  Cloudera plans to support 0.20 as soon as a few more bugs get flushed
out, which will probably happen in its next minor release.

So anyway, that said, it might make sense for you to start with 0.20.0, as
long as you understand that the first major release usually is pretty buggy,
and is basically considered a beta.  If you're not willing to take the
stability risk, then I'd recommend going with 0.18.3, though the upgrade
from 0.18.3 to 0.20.X is going to be a headache (APIs changed, configuration
files changed, etc.).

Hope this is insightful.

Alex
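
To give a feel for the API change mentioned above, here is a rough sketch of
the same trivial mapper written against the old 0.18.x API and the new 0.20
API (the class names are made up for illustration, and this is not taken from
either poster's code):

// --- OldApiMapper.java: the 0.18.x API (org.apache.hadoop.mapred) ---
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class OldApiMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  // Emits each input line keyed by its byte offset, via an OutputCollector.
  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    output.collect(new Text(key.toString()), value);
  }
}

// --- NewApiMapper.java: the 0.20 API (org.apache.hadoop.mapreduce) ---
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NewApiMapper extends Mapper<LongWritable, Text, Text, Text> {
  // Same logic, but Mapper is now a class and output goes through a Context.
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    context.write(new Text(key.toString()), value);
  }
}

The configuration change referred to is, broadly, the split of hadoop-site.xml
into core-site.xml, hdfs-site.xml and mapred-site.xml in 0.20.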




Re: Reduce() time takes ~4x Map()

2009-05-29 Thread David Batista
Hi Jothi, thanks for your answer.

I have:

Average time taken by Map tasks: 37sec

Average time taken by Shuffle: 35mins, 12sec

Average time taken by Reduce tasks: 1sec

--
./david



2009/5/29 Jothi Padmanabhan :
> Hi David,
>
> If you go to JobTrackerHistory and then click on this job and then do
> Analyse This Job, you should be able to get the split up timings for the
> individual phases of the map and reduce tasks, including the average, best
> and worst times. Could you provide those numbers so that we can get a better
> idea of how the job progressed.
>
> Jothi
>
>
> On 5/28/09 10:11 PM, "David Batista"  wrote:
>
>> Hi everyone,
>>
>> I'm processing XML files, around 500MB each with several documents,
>> for the map() function I pass a document from the XML file, which
>> takes some time to process depending on the size - I'm applying NER to
>> texts.
>>
>> Each document has a unique identifier, so I'm using that identifier as
>> a key and the results of parsing the document in one string as the
>> output:
>>
>> so at the end of  the map function():
>> output.collect( new Text(identifier), new Text(outputString));
>>
>> usually the outputString is around 1k-5k size
>>
>> reduce():
>> public void reduce(Text key, Iterator<Text> values,
>> OutputCollector<Text, Text> output, Reporter reporter) {
>> while (values.hasNext()) {
>> Text text = values.next();
>> try {
>> output.collect(key, text);
>> } catch (IOException e) {
>> // TODO Auto-generated catch block
>> e.printStackTrace();
>> }
>> }
>> }
>>
>> I did a test using only 1 machine with 8 cores, and only 1 XML file,
>> it took around 3 hours to process all maps and  ~12hours for the
>> reduces!
>>
>> the XML file has 139 945 documents
>>
>> I set the jobconf for 1000 maps() and 200 reduces()
>>
>> I took a look at the graphs on the web interface during the reduce
>> phase, and indeed it's the copy phase that's taking much of the time;
>> the sort and reduce phases are done almost instantly.
>>
>> Why does the copy phase take so long? I understand that the copies
>> are made using HTTP, and the data was in really small chunks of 1k-5k
>> in size, but even so, with everything on the same physical machine
>> it should have been faster, no?
>>
>> Any suggestions on what might be causing the copies in reduce to take so 
>> long?
>> --
>> ./david
>
>


Re: Reduce() time takes ~4x Map()

2009-05-29 Thread David Batista
Hi Jason,

Yes, my keys are unique, that I'm certain of.

I don't actually need to sort them, but it would save me some extra
work if the output came out under the same file name as the input, for
instance:

having as input:

big_xml_file1.xml
big_xml_file2.xml
..
big_xml_file40.xml

with each containing several documents, it would be good to have the output
data organised according to the file each document was taken from:

output_xml_file1.xml
output_xml_file2.xml
...
output_xml_file40.xml


I ran a test just now, while writing this email, and it seems it's
working as expected. What I did was use this as the output format:

conf.setOutputFormat(KeyBasedMultipleTextOutputFormat.class)

which I found here:
http://www.mail-archive.com/core-user@hadoop.apache.org/msg05707.html

The output data is stored in a file with the same name as the input file
it was taken from,

and I've set the number of reduces to 0, a suggestion from Miles Osborne, so
no sorting is done.
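
For reference, a minimal sketch of what an output format along those lines
might look like, assuming it is built on MultipleTextOutputFormat and that the
map emits the source file name as the key (the class in the linked post may
well differ):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat;

// Sketch: write each record to a file named after its key instead of the
// usual part-NNNNN file.
public class KeyBasedMultipleTextOutputFormat
    extends MultipleTextOutputFormat<Text, Text> {

  @Override
  protected String generateFileNameForKeyValue(Text key, Text value, String name) {
    // Assumes the map emits the originating file name (or something derived
    // from it) as the key; that key becomes the output file name.
    return key.toString();
  }
}

With zero reduces this should be applied directly to the map output, so each
map writes straight into a file named after its key.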

It seems everything is working perfectly now!

Thanks for all the feedback

--
./david



2009/5/29 jason hadoop :
> At the minimal level, enable map output compression
> (mapred.compress.map.output); it may make some difference.
> Sorting is very expensive when there are many keys and the values are large.
> Are you quite certain your keys are unique?
> Also, do you need them sorted by document id?
>
>
> On Thu, May 28, 2009 at 8:51 PM, Jothi Padmanabhan 
> wrote:
>
>> Hi David,
>>
>> If you go to JobTrackerHistory and then click on this job and then do
>> Analyse This Job, you should be able to get the split up timings for the
>> individual phases of the map and reduce tasks, including the average, best
>> and worst times. Could you provide those numbers so that we can get a
>> better
>> idea of how the job progressed.
>>
>> Jothi
>>
>>
>> On 5/28/09 10:11 PM, "David Batista"  wrote:
>>
>> > Hi everyone,
>> >
>> > I'm processing XML files, around 500MB each with several documents,
>> > for the map() function I pass a document from the XML file, which
>> > takes some time to process depending on the size - I'm applying NER to
>> > texts.
>> >
>> > Each document has a unique identifier, so I'm using that identifier as
>> > a key and the results of parsing the document in one string as the
>> > output:
>> >
>> > so at the end of  the map function():
>> > output.collect( new Text(identifier), new Text(outputString));
>> >
>> > usually the outputString is around 1k-5k size
>> >
>> > reduce():
>> > public void reduce(Text key, Iterator<Text> values,
>> > OutputCollector<Text, Text> output, Reporter reporter) {
>> > while (values.hasNext()) {
>> > Text text = values.next();
>> > try {
>> > output.collect(key, text);
>> > } catch (IOException e) {
>> > // TODO Auto-generated catch block
>> > e.printStackTrace();
>> > }
>> > }
>> > }
>> >
>> > I did a test using only 1 machine with 8 cores, and only 1 XML file,
>> > it took around 3 hours to process all maps and  ~12hours for the
>> > reduces!
>> >
>> > the XML file has 139 945 documents
>> >
>> > I set the jobconf for 1000 maps() and 200 reduces()
>> >
>> > I took a look at the graphs on the web interface during the reduce
>> > phase, and indeed it's the copy phase that's taking much of the time;
>> > the sort and reduce phases are done almost instantly.
>> >
>> > Why does the copy phase take so long? I understand that the copies
>> > are made using HTTP, and the data was in really small chunks of 1k-5k
>> > in size, but even so, with everything on the same physical machine
>> > it should have been faster, no?
>> >
>> > Any suggestions on what might be causing the copies in reduce to take so
>> long?
>> > --
>> > ./david
>>
>>
>
>
> --
> Alpha Chapters of my book on Hadoop are available
> http://www.apress.com/book/view/9781430219422
> www.prohadoopbook.com a community for Hadoop Professionals
>
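
For anyone who wants to try the map output compression suggestion quoted
above, a minimal sketch of setting it against the old JobConf API (the class
name is just a placeholder, and the codec line is optional):

import org.apache.hadoop.mapred.JobConf;

public class CompressMapOutputExample {
  public static void main(String[] args) {
    // Enable compression of intermediate map output to reduce shuffle/copy volume.
    JobConf conf = new JobConf(CompressMapOutputExample.class);
    conf.setCompressMapOutput(true); // same effect as mapred.compress.map.output=true
    // A codec can also be chosen explicitly; the default codec is used otherwise:
    // conf.setMapOutputCompressorClass(org.apache.hadoop.io.compress.GzipCodec.class);
    System.out.println("mapred.compress.map.output = "
        + conf.getBoolean("mapred.compress.map.output", false));
  }
}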


Question: index package in contrib (lucene index)

2009-05-29 Thread Tenaali Ram
Anyone ?
Any help to understand this package is appreciated.

Thanks,
T

On Thu, May 28, 2009 at 3:18 PM, Tenaali Ram  wrote:

> Hi,
>
> I am trying to understand the code of index package to build a distributed
> Lucene index. I have some very basic questions and would really appreciate
> if someone can help me understand this code-
>
> 1) If I already have Lucene index (divided into shards), should I upload
> these indexes into HDFS and provide its location or the code will pick these
> shards from local file system ?
>
> 2) How is the code adding a document in the lucene index, I can see there
> is an index selection policy. Assuming round robin policy is chosen, how is
> the code adding a document in the lucene index? This is related to first
> question - is the index where the new document is to be added in HDFS or in
> local file system. I read in the README that the index is first created on
> local file system, then copied back to HDFS. Can someone please point me to
> the code that is doing this.
>
> 3) After the map reduce job finishes, where are the final indexes ? In HDFS
> ?
>
> 4) Correct me if I am wrong- the code builds multiple indexes, where each
> index is an instance of Lucene Index having a disjoint subset of documents
> from the corpus. So, if I have to search a term, I have to search each index
> and then merge the result. If this is correct, then how is the IDF of a term
> which is a global statistic computed and updated in each index ? I mean each
> index can compute the IDF wrt. to the subset of documents that it has, but
> can not compute the global IDF of a term (since it knows nothing about other
> indexes, which might have the same term in other documents).
>
> Thanks,
> -T
>
>
>


Re: Question: index package in contrib (lucene index)

2009-05-29 Thread Jun Rao
Reply inlined below.

Jun
IBM Almaden Research Center
K55/B1, 650 Harry Road, San Jose, CA  95120-6099

jun...@almaden.ibm.com


Tenaali Ram  wrote on 05/28/2009 03:18:53 PM:

> Hi,
>
> I am trying to understand the code of index package to build a distributed
> Lucene index. I have some very basic questions and would really appreciate
> if someone can help me understand this code-
>
> 1) If I already have Lucene index (divided into shards), should I upload
> these indexes into HDFS and provide its location or the code will pick these
> shards from local file system ?

Yes, you need to put the old index to HDFS first.

>
> 2) How is the code adding a document in the lucene index, I can see there
> is an index selection policy. Assuming round robin policy is chosen, how is
> the code adding a document in the lucene index? This is related to first
> question - is the index where the new document is to be added in HDFS or in
> local file system. I read in the README that the index is first created on
> local file system, then copied back to HDFS. Can someone please point me to
> the code that is doing this.
>

See contrib.index.example.

> 3) After the map reduce job finishes, where are the final indexes ? In HDFS
> ?

They will be in HDFS.

>
> 4) Correct me if I am wrong- the code builds multiple indexes, where each
> index is an instance of Lucene Index having a disjoint subset of documents
> from the corpus. So, if I have to search a term, I have to search each index
> and then merge the result. If this is correct, then how is the IDF of a term
> which is a global statistic computed and updated in each index ? I mean each
> index can compute the IDF wrt. to the subset of documents that it has, but
> can not compute the global IDF of a term (since it knows nothing about other
> indexes, which might have the same term in other documents).
>

This package only deals with index builds. The shards are disjoint and it's
up to the index server to calculate the ranks. For distributed TF/IDF
support, you may want to look into Katta.

> Thanks,
> -T

Re: Question: index package in contrib (lucene index)

2009-05-29 Thread Tenaali Ram
Thanks Jun!

On Fri, May 29, 2009 at 2:49 PM, Jun Rao  wrote:

> Reply inlined below.
>
> Jun
> IBM Almaden Research Center
> K55/B1, 650 Harry Road, San Jose, CA  95120-6099
>
> jun...@almaden.ibm.com
>
>
> Tenaali Ram  wrote on 05/28/2009 03:18:53 PM:
>
> > Hi,
> >
> > I am trying to understand the code of index package to build a distributed
> > Lucene index. I have some very basic questions and would really appreciate
> > if someone can help me understand this code-
> >
> > 1) If I already have Lucene index (divided into shards), should I upload
> > these indexes into HDFS and provide its location or the code will pick these
> > shards from local file system ?
>
> Yes, you need to put the old index to HDFS first.
>
> >
> > 2) How is the code adding a document in the lucene index, I can see there
> > is an index selection policy. Assuming round robin policy is chosen, how is
> > the code adding a document in the lucene index? This is related to first
> > question - is the index where the new document is to be added in HDFS or in
> > local file system. I read in the README that the index is first created on
> > local file system, then copied back to HDFS. Can someone please point me to
> > the code that is doing this.
> >
>
> See contrib.index.example.
>
> > 3) After the map reduce job finishes, where are the final indexes ? In HDFS
> > ?
>
> They will be in HDFS.
>
> >
> > 4) Correct me if I am wrong- the code builds multiple indexes, where each
> > index is an instance of Lucene Index having a disjoint subset of documents
> > from the corpus. So, if I have to search a term, I have to search each index
> > and then merge the result. If this is correct, then how is the IDF of a term
> > which is a global statistic computed and updated in each index ? I mean each
> > index can compute the IDF wrt. to the subset of documents that it has, but
> > can not compute the global IDF of a term (since it knows nothing about other
> > indexes, which might have the same term in other documents).
> >
>
> This package only deals with index builds. The shards are disjoint and it's
> up to the index server to calculate the ranks. For distributed TF/IDF
> support, you may want to look into Katta.
>
> > Thanks,
> > -T


Re: SequenceFile and streaming

2009-05-29 Thread Scott Carey
Well, I don't know much about the tar tool at all.  But bz2 is a VERY slow
compression scheme (though quite fascinating to read about how it works).  A
plain tar, or tar.gz will be faster if it is supported.


On 5/28/09 10:10 PM, "walter steffe"  wrote:

> Hi Tom,
> 
>   I have seen the tar-to-seq tool, but the person who made it says it is
> very slow:
> "It took about an hour and a half to convert a 615MB tar.bz2 file to an
> 868MB sequence file". To me that is not acceptable.
> Normally, generating a tar file from 615MB of data takes less than
> one minute. And, in my view, the generation of a sequence file should be
> even simpler. You just have to append files and headers without worrying
> about hierarchy.
> 
> Regarding the SequenceFileAsTextInputFormat I am not sure it will do the
> job I am looking for.
> The hadoop documentation says: SequenceFileAsTextInputFormat generates
> SequenceFileAsTextRecordReader which converts the input keys and values
> to their String forms by calling toString() method.
> Let us suppose that the keys and values were generated using tar-to-seq
> on a tar archive. Each value is a byte array that stores the content of a
> file, which can be any kind of data (for example a jpeg picture). It
> doesn't make sense to convert this data into a string.
> 
> What is needed is a tool to simply extract the file, as with
> tar -xf archive.tar filename. The hadoop framework can be used to do the
> extraction from a Java class, but then you have to do it within a java
> program. The streaming package is meant to be used in a unix shell without
> the need for java programming. But I think it is not very useful if the
> sequence file (which is the principal data structure of hadoop) is not
> accessible from a shell command.
> 
> 
> Walter
> 
> 
> 
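
For what it's worth, a small standalone sketch of the kind of extraction being
asked for - pulling one named entry out of a sequence file onto the local
disk, roughly like tar -xf archive.tar filename - could look like the
following. It assumes Text keys holding file names and BytesWritable values
holding file contents, which may not match what tar-to-seq actually writes,
and it does not handle entries whose names contain directory components:

import java.io.FileOutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Usage: SeqExtract <sequence file> <entry name>
public class SeqExtract {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path seqPath = new Path(args[0]);
    String wanted = args[1];
    FileSystem fs = seqPath.getFileSystem(conf);
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, seqPath, conf);
    try {
      Text key = new Text();
      BytesWritable value = new BytesWritable();
      // Scan the file for the requested entry and write its bytes to local disk.
      while (reader.next(key, value)) {
        if (key.toString().equals(wanted)) {
          FileOutputStream out = new FileOutputStream(wanted);
          out.write(value.getBytes(), 0, value.getLength());
          out.close();
          break;
        }
      }
    } finally {
      reader.close();
    }
  }
}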



question about when shuffle/sort start working

2009-05-29 Thread Jianmin Woo
Hi, 
I am confused by the protocol between the mapper and the reducer. When a 
mapper is done emitting its (key, value) pairs, is there any signal the mapper 
sends to the Hadoop framework to indicate that the map is done and the 
shuffle/sort can begin for the reducer? If there is no such signal in the 
protocol, when does the framework begin the shuffle/sort?

Thanks,
Jianmin