Re: one key per output part file
No. That is a limitation of streaming.

On 4/1/08 6:42 PM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:
> This seems like a reasonable solution - but I am using Hadoop streaming
> and my reducer is a perl script. Is it possible to handle side-effect
> files in streaming? I haven't found anything that indicates that you
> can...
>
> Ashish
Re: one key per output part file
This seems like a reasonable solution - but I am using Hadoop streaming and my reducer is a perl script. Is it possible to handle side-effect files in streaming? I haven't found anything that indicates that you can...

Ashish

On Tue, Apr 1, 2008 at 9:13 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> Try opening the desired output file in the reduce method. Make sure that
> the output files are relative to the correct task-specific directory
> (look for side-effect files on the wiki).
Re: one key per output part file
Try opening the desired output file in the reduce method. Make sure that the output files are relative to the correct task-specific directory (look for side-effect files on the wiki).

On 4/1/08 5:57 PM, "Ashish Venugopal" <[EMAIL PROTECTED]> wrote:
> Hi, I am using Hadoop streaming and I am trying to create a MapReduce
> that will generate output where a single key is found in a single
> output part file.
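For Java jobs, a minimal sketch of this side-effect-file approach might look like the code below. Hedges: FileOutputFormat.getWorkOutputPath() only appeared in later releases of the old mapred API than the 0.15.x discussed in this thread (on older versions you would construct the task-specific path by hand, per the side-effect files wiki page), and the key-to-filename mapping is an illustrative assumption.

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Sketch: write one side-effect file per key from the reduce method,
// instead of emitting through the collector's single part-XXXXX file.
public class OneFilePerKeyReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private JobConf conf;

  public void configure(JobConf job) {
    this.conf = job;
  }

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // The work output path is the task attempt's temporary directory;
    // files written here are promoted to the job output only if the
    // task commits, so failed/speculative attempts leave no debris.
    Path dir = FileOutputFormat.getWorkOutputPath(conf);
    // Assumes keys are safe to use as file names (no '/' etc.).
    Path file = new Path(dir, "key-" + key.toString());
    FileSystem fs = file.getFileSystem(conf);
    FSDataOutputStream out = fs.create(file);
    try {
      while (values.hasNext()) {
        out.writeBytes(values.next().toString());
        out.writeBytes("\n");
      }
    } finally {
      out.close();
    }
  }
}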
one key per output part file
Hi, I am using Hadoop streaming and I am trying to create a MapReduce job that will generate output where a single key is found in a single output part file. Does anyone know how to ensure this condition? I want each reduce task (no matter how many are specified) to receive the key-value output for a single key at a time, process the key-value pairs for that key, write an output part-XXX file, and only then process the next key.

Here is the task that I am trying to accomplish:

Input: Corpus T (lines of text), Corpus V (each line has 1 word)
Output: Each part-XXX should contain the lines of T that contain the word from line XXX in V.

Any help/ideas are appreciated.

Ashish
Re: distcp fails: Input source not found
Here is what is happening, directly from the EC2 screen. The ID and Secret Key are the only things changed. I'm running hadoop 0.15.3 from the public AMI. I launched a 2 machine cluster using the ec2 scripts in src/contrib/ec2/bin. The file I try to copy is 9KB (I noticed previous discussion on empty files and files that are > 10MB).

First I make sure that we can copy the file from S3:

[EMAIL PROTECTED] hadoop-0.15.3]# bin/hadoop fs -copyToLocal s3://ID:[EMAIL PROTECTED]/InputFileFormat.xml /usr/InputFileFormat.xml

Now I see that the file is copied to the ec2 master (where I'm logged in):

[EMAIL PROTECTED] hadoop-0.15.3]# dir /usr/Input*
/usr/InputFileFormat.xml

Next I make sure I can access the HDFS and that the input directory is there:

[EMAIL PROTECTED] hadoop-0.15.3]# bin/hadoop fs -ls /
Found 2 items
/input          2008-04-01 15:45
/mnt            2008-04-01 15:42
[EMAIL PROTECTED] hadoop-0.15.3]# bin/hadoop fs -ls /input/
Found 0 items

I make sure hadoop is running just fine by running an example:

[EMAIL PROTECTED] hadoop-0.15.3]# bin/hadoop jar hadoop-0.15.3-examples.jar pi 10 1000
Number of Maps = 10 Samples per Map = 1000
Wrote input for Map #0
Wrote input for Map #1
Wrote input for Map #2
Wrote input for Map #3
Wrote input for Map #4
Wrote input for Map #5
Wrote input for Map #6
Wrote input for Map #7
Wrote input for Map #8
Wrote input for Map #9
Starting Job
08/04/01 17:38:14 INFO mapred.FileInputFormat: Total input paths to process : 10
08/04/01 17:38:14 INFO mapred.JobClient: Running job: job_200804011542_0001
08/04/01 17:38:15 INFO mapred.JobClient: map 0% reduce 0%
08/04/01 17:38:22 INFO mapred.JobClient: map 20% reduce 0%
08/04/01 17:38:24 INFO mapred.JobClient: map 30% reduce 0%
08/04/01 17:38:25 INFO mapred.JobClient: map 40% reduce 0%
08/04/01 17:38:27 INFO mapred.JobClient: map 50% reduce 0%
08/04/01 17:38:28 INFO mapred.JobClient: map 60% reduce 0%
08/04/01 17:38:31 INFO mapred.JobClient: map 80% reduce 0%
08/04/01 17:38:33 INFO mapred.JobClient: map 90% reduce 0%
08/04/01 17:38:34 INFO mapred.JobClient: map 100% reduce 0%
08/04/01 17:38:43 INFO mapred.JobClient: map 100% reduce 20%
08/04/01 17:38:44 INFO mapred.JobClient: map 100% reduce 100%
08/04/01 17:38:45 INFO mapred.JobClient: Job complete: job_200804011542_0001
08/04/01 17:38:45 INFO mapred.JobClient: Counters: 9
08/04/01 17:38:45 INFO mapred.JobClient: Job Counters
08/04/01 17:38:45 INFO mapred.JobClient: Launched map tasks=10
08/04/01 17:38:45 INFO mapred.JobClient: Launched reduce tasks=1
08/04/01 17:38:45 INFO mapred.JobClient: Data-local map tasks=10
08/04/01 17:38:45 INFO mapred.JobClient: Map-Reduce Framework
08/04/01 17:38:45 INFO mapred.JobClient: Map input records=10
08/04/01 17:38:45 INFO mapred.JobClient: Map output records=20
08/04/01 17:38:45 INFO mapred.JobClient: Map input bytes=240
08/04/01 17:38:45 INFO mapred.JobClient: Map output bytes=320
08/04/01 17:38:45 INFO mapred.JobClient: Reduce input groups=2
08/04/01 17:38:45 INFO mapred.JobClient: Reduce input records=20
Job Finished in 31.028 seconds
Estimated value of PI is 3.1556

Finally, I try to copy the file over:

[EMAIL PROTECTED] hadoop-0.15.3]# bin/hadoop distcp s3://ID:[EMAIL PROTECTED]/InputFileFormat.xml /input/InputFileFormat.xml
With failures, global counters are inaccurate; consider running with -i
Copy failed: org.apache.hadoop.mapred.InvalidInputException: Input source s3://ID:[EMAIL PROTECTED]/InputFileFormat.xml does not exist.
at org.apache.hadoop.util.CopyFiles.copy(CopyFiles.java:470)
at org.apache.hadoop.util.CopyFiles.run(CopyFiles.java:550)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.util.CopyFiles.main(CopyFiles.java:563)

[EMAIL PROTECTED] wrote:
> Your command looks right to me. Would you mind showing me the exact error
> message you see?
>
> Nicholas
What happens if a namenode fails?
What happens to your data if the namenode fails (hardware failure)? Assuming you replace it with a fresh box, can you restore all of your data from the slaves?

-Xavier
Re: Hadoop input path - can it have subdirectories
thanks Ted for that info [or hack :) ]

I had this directory structure -

monthData
|- week1
|- week2

If I give the monthData directory as the input path, I get an exception -

08/04/01 14:24:30 INFO mapred.FileInputFormat: Total input paths to process : 2
Exception in thread "main" java.io.IOException: Not a file: hdfs://master:54310/user/hadoop/monthData/week1
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:170)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:515)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:753)
at com.bdc.dod.dashboard.BDCQueryStatsViewer.run(BDCQueryStatsViewer.java:829)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at com.bdc.dod.dashboard.BDCQueryStatsViewer.main(BDCQueryStatsViewer.java:796)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.hadoop.util.RunJar.main(RunJar.java:155)

However, if I give monthData/* as the input path, the job runs :)

I feel the default behavior should be to recurse into the parent directory and get all the files in the subdirectories. It is not difficult to imagine a situation where one wants to partition the data into subdirectories under one base directory, so that a job can be run either on the base directory or on a single subdirectory.

-Taran

On Tue, Apr 1, 2008 at 12:04 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> But wildcards that match directories that contain files work well.
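For reference, the glob workaround in Java job-setup terms is just a wildcard in the input path. A minimal sketch using the old (pre-0.17) JobConf API and the directory names from above; the driver class and output path are hypothetical:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Hypothetical driver fragment: point the job at the week subdirectories
// via a glob, so the input paths resolve to directories of files rather
// than to monthData itself (which contains directories and triggers the
// "Not a file" exception above).
public class MonthJobDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MonthJobDriver.class);
    conf.setJobName("month-data");
    conf.setInputPath(new Path("monthData/*"));   // matches week1, week2, ...
    conf.setOutputPath(new Path("monthData-out"));
    // ... mapper/reducer classes set here as usual ...
    JobClient.runJob(conf);
  }
}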
Re: distcp fails: Input source not found
> That was a typo in my email. I do have s3:// in my command when it fails.
> Not sure what's wrong.

Your command looks right to me. Would you mind showing me the exact error message you see?

Nicholas
Re: distcp fails: Input source not found
Are you missing a colon in the first command? Probably just a transcription error when you composed your email (but I have made similar mistakes often enough and been unable to see them).

On 4/1/08 1:18 PM, "Prasan Ary" <[EMAIL PROTECTED]> wrote:
> Just to make sure that I am specifying the parameters correctly:
>
> bin/hadoop distcp s3//:@/fileone.txt /somefolder_on_hdfs/fileone.txt : Fails - Input source doesn't exist.
>
> bin/hadoop fs -copyToLocal s3://:@bucket/fileone.txt /somefolder_on_local/fileone.txt : succeeds
>
> Any correction?
>
> Thanks.
Re: distcp fails: Input source not found
That was a typo in my email. I do have s3:// in my command when it fails. Not sure what's wrong.

--- [EMAIL PROTECTED] wrote:
> Should "s3//..." be "s3://..."?
>
> Nicholas
Re: distcp fails: Input source not found
> bin/hadoop distcp s3//:@/fileone.txt /somefolder_on_hdfs/fileone.txt :
> Fails - Input source doesn't exist.

Should "s3//..." be "s3://..."?

Nicholas
distcp fails: Input source not found
Hi,

I am running hadoop 0.15.3 on 2 EC2 instances from a public AMI (ami-381df851). Our input files are on S3. When I try to do a distcp for an input file from S3 onto HDFS on EC2, the copy fails with an error that the file does not exist. However, if I run copyToLocal from S3 onto the local machine, the copy succeeds, confirming that the input file does exist in the S3 bucket. Furthermore, I can see the input file in the S3 bucket from the Mozilla S3 Firefox Organizer as well. I am left with no explanation as to why I am getting the File Not Found error on S3 when I have every confirmation that the file does exist over there.

Just to make sure that I am specifying the parameters correctly:

> bin/hadoop distcp s3//:@/fileone.txt /somefolder_on_hdfs/fileone.txt : Fails - Input source doesn't exist.
> bin/hadoop fs -copyToLocal s3://:@bucket/fileone.txt /somefolder_on_local/fileone.txt : succeeds

Any correction?

Thanks.
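For comparison, the working URI form for distcp matches the copyToLocal one; the credentials and bucket below are hypothetical placeholders:

bin/hadoop distcp s3://AWS_KEY_ID:AWS_SECRET_KEY@bucket/fileone.txt /somefolder_on_hdfs/fileone.txt

One known pitfall of this era: an AWS secret key containing a '/' character breaks the s3:// URI form, and the usual workaround was to URL-escape the slash as %2F or regenerate the key.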
Re: Hadoop input path - can it have subdirectories
But wildcards that match directories that contain files work well.

On 4/1/08 10:41 AM, "Peeyush Bishnoi" <[EMAIL PROTECTED]> wrote:
> Hello,
>
> No, Hadoop can't traverse recursively inside subdirectories with a Java
> Map-Reduce program. It has to be just a directory containing files (and
> no sub-directories).
Re: Hadoop input path - can it have subdirectories
Sorry, I was wrong. I just checked my installation, and by default, streaming appears to work as people have described -- it doesn't recurse subdirectories. If I pass a directory containing only directories as the -input parameter, I get the following error:

08/04/01 15:00:56 ERROR streaming.StreamJob: Error Launching job : Not a file: hdfs://10.188.239.122:9000/kpi2/kpi3
Streaming Job Failed!

It will, however, enumerate all files in the input directory, provided the input directory contains only files. I think that's what confused me.

On Tue, Apr 1, 2008 at 1:51 PM, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> This is actually a characteristic of the InputFormat that you're using.
> Hadoop reads data using InputFormat-s, and standard implementations may
> indeed not support subdirectories - but if you need this functionality
> you can implement your own InputFormat.
Very basic question about sharing load across different machines
Hi, we are trying to make use of hadoop to distribute the load of our application to different hosts. However, as we have already identified that the bottleneck of the system is database access, just using map reduce to load-balance the application tier doesn't help much. Can we use Map/Reduce to also distribute the load on the DB layer?
Re: Hadoop input path - can it have subdirectories
Peeyush Bishnoi wrote:
> Hello,
>
> No, Hadoop can't traverse recursively inside subdirectories with a Java
> Map-Reduce program. It has to be just a directory containing files (and
> no sub-directories).

That's not the case.

This is actually a characteristic of the InputFormat that you're using. Hadoop reads data using InputFormat-s, and standard implementations may indeed not support subdirectories - but if you need this functionality you can implement your own InputFormat.

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web; Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
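As a rough illustration of that point, a sketch of an InputFormat that recurses into subdirectories follows. It assumes a Hadoop release where FileInputFormat.listStatus(JobConf) is overridable - true for the old mapred API in 0.18-era and later releases, not for the 0.15/0.16 versions discussed in this thread, where the equivalent hook differs.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;

// Sketch: a TextInputFormat that walks into subdirectories instead of
// failing with "Not a file" when an input path contains directories.
public class RecursiveTextInputFormat extends TextInputFormat {

  @Override
  protected FileStatus[] listStatus(JobConf job) throws IOException {
    List<FileStatus> files = new ArrayList<FileStatus>();
    for (FileStatus status : super.listStatus(job)) {
      FileSystem fs = status.getPath().getFileSystem(job);
      collect(fs, status, files);
    }
    return files.toArray(new FileStatus[files.size()]);
  }

  // Depth-first walk: keep plain files, descend into directories.
  private void collect(FileSystem fs, FileStatus status,
                       List<FileStatus> files) throws IOException {
    if (status.isDir()) {
      for (FileStatus child : fs.listStatus(status.getPath())) {
        collect(fs, child, files);
      }
    } else {
      files.add(status);
    }
  }
}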
RE: Hadoop input path - can it have subdirectories
Hello,

No, Hadoop can't traverse recursively inside subdirectories with a Java Map-Reduce program. It has to be just a directory containing files (and no sub-directories).

---
Peeyush

-----Original Message-----
From: Tarandeep Singh [mailto:[EMAIL PROTECTED]
Sent: Tue 4/1/2008 9:15 AM
To: [EMAIL PROTECTED]
Subject: Hadoop input path - can it have subdirectories

Hi,

Can I give a directory (having subdirectories) as the input path to a Hadoop Map-Reduce job? I tried, but got an error.

Can Hadoop recursively traverse the input directory and collect all the file names, or does the input path have to be just a directory containing files (and no sub-directories)?

-Taran
RE: Hadoop input path - can it have subdirectories
My experience running with the Java API is that subdirectories in the input path do cause an exception, so the streaming file input processing must be different.

Jeff Eastman

> -----Original Message-----
> From: Norbert Burger [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, April 01, 2008 9:46 AM
> To: core-user@hadoop.apache.org
> Cc: [EMAIL PROTECTED]
> Subject: Re: Hadoop input path - can it have subdirectories
>
> Yes, this is fine, at least for Hadoop Streaming. I specify the root of
> my logs directory as my -input parameter, and Hadoop correctly finds all
> of the child directories. What's the error you're seeing? Is a stack
> trace available?
>
> Norbert
Re: Hadoop input path - can it have subdirectories
Yes, this is fine, at least for Hadoop Streaming. I specify the root of my logs directory as my -input parameter, and Hadoop correctly finds all of the child directories. What's the error you're seeing? Is a stack trace available?

Norbert

On Tue, Apr 1, 2008 at 12:15 PM, Tarandeep Singh <[EMAIL PROTECTED]> wrote:
> Can I give a directory (having subdirectories) as the input path to a
> Hadoop Map-Reduce job? I tried, but got an error.
Hadoop input path - can it have subdirectories
Hi,

Can I give a directory (having subdirectories) as the input path to a Hadoop Map-Reduce job? I tried, but got an error.

Can Hadoop recursively traverse the input directory and collect all the file names, or does the input path have to be just a directory containing files (and no sub-directories)?

-Taran
Re: Nutch and Distributed Lucene
Hi,

Nutch builds Lucene indexes. But Nutch is much more than that. It is a web search application that crawls the web, inverts links and builds indexes. Each step is one or more Map/Reduce jobs. You can find more information at http://lucene.apache.org/nutch/

The Map/Reduce job that builds Lucene indexes in Nutch is customized to the data schema/structures used in Nutch. The index contrib package in Hadoop provides a general/configurable process to build Lucene indexes in parallel using a Map/Reduce job. That's the main difference. There is also the difference that the index build job in Nutch builds indexes only in reduce tasks, while the index contrib package builds indexes in both map and reduce tasks, and there are advantages in doing that...

Regards,
Ning

On 4/1/08, Naama Kraus <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I'd like to know if Nutch is running on top of Lucene, or is it unrelated
> to Lucene. I.e. indexing, parsing, crawling, internal data structures ...
> - all written from scratch using MapReduce (my impression)?
>
> What is the relation between Nutch and the distributed Lucene patch that
> was added lately to Hadoop?
>
> Thanks for any enlightening,
> Naama
Re: Performance impact of underlying file system?
I would expect that most file systems can saturate the disk bandwidth for the large sequential reads that hadoop does. We use ext3 with good results.

On 4/1/08 8:08 AM, "Colin Freas" <[EMAIL PROTECTED]> wrote:
> Is the performance of Hadoop impacted by the underlying file system on
> the nodes at all?
>
> All my nodes are ext3. I'm wondering if using XFS, Reiser, or ZFS might
> improve performance.
>
> Does anyone have any offhand knowledge about this?
>
> -Colin
Hadoop Reduce Problem
Hi,

I am having a problem in the reduce phase when the firewall is enabled. I am using the default settings from hadoop-default.xml and hadoop-site.xml, and I only changed the port number in mapred.task.tracker.report.address. I created a firewall rule to allow the port range 5:50100 between the slaves and the master. But reduce on the slaves seems to use some other ports, so reduce always hangs with the firewall enabled. If I disable the firewall, it works fine.

Could you please let me know what I am missing, or where to control Hadoop's random port creation?

Thanks,
Senthil
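In case it helps later readers (no reply appears in this thread): in releases of this era the reducers fetch map output from each TaskTracker's embedded HTTP server (default port 50060), while mapred.task.tracker.report.address normally binds to the loopback interface and is only contacted by local child tasks, so changing it should not affect traffic between machines. The DataNode and NameNode/JobTracker IPC ports must be reachable between nodes as well. A sketch for hadoop-site.xml pinning the shuffle port; the property name is taken from 0.16-era hadoop-default.xml and should be verified against your release:

<!-- Sketch for hadoop-site.xml: pin the TaskTracker HTTP server that
     serves map output to reducers during the shuffle. -->
<property>
  <name>mapred.task.tracker.http.address</name>
  <value>0.0.0.0:50060</value>
</property>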
Performance impact of underlying file system?
Is the performance of Hadoop impacted by the underlying file system on the nodes at all?

All my nodes are ext3. I'm wondering if using XFS, Reiser, or ZFS might improve performance.

Does anyone have any offhand knowledge about this?

-Colin
Nutch and Distributed Lucene
Hi,

I'd like to know if Nutch is running on top of Lucene, or is it unrelated to Lucene. I.e. indexing, parsing, crawling, internal data structures ... - all written from scratch using MapReduce (my impression)?

What is the relation between Nutch and the distributed Lucene patch that was added lately to Hadoop?

Thanks for any enlightening,
Naama

--
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
"If you want your children to be intelligent, read them fairy tales. If you want them to be more intelligent, read them more fairy tales." (Albert Einstein)
Re: Performance / cluster scaling question
Hi Chris & Hadoopers,

we changed our system architecture so that most of the data is now streamed directly from the spider/crawler nodes instead of using/creating temporary files on the DFS - now it performs way better and the exceptions are gone :-) ...seems to be a good decision when having only a relatively small cluster (like ours w/ 8 data nodes), where the deletion of blocks seems not to catch up with the creation of new temp files (through the max 100 blocks/3 seconds deletion "restriction").

Cu on the 'net,
Bye - bye, André

Chris K Wensel wrote:

If it's any consolation, I'm seeing similar behaviors on 0.16.0 when running on EC2 when I push the number of nodes in the cluster past 40.

On Mar 24, 2008, at 6:31 AM, André Martin wrote:

Thanks for the clarification, dhruba :-) Anyway, what can cause those other exceptions, such as "Could not get block locations" and "DataXceiver: java.io.EOFException"? Can anyone give me a little more insight into those exceptions? And does anyone have a similar workload (frequent writes and deletion of small files), and what could cause the performance degradation (see first post)? I think HDFS should be able to handle two million and more files/blocks... Also, I observed that some of my datanodes do not "heartbeat" to the namenode for several seconds (up to 400 :-() from time to time - when I check those specific datanodes and do a "top", I see the "du" command running, which seems to have gotten stuck?!?

dhruba Borthakur wrote:

The namenode lazily instructs a Datanode to delete blocks. As a response to every heartbeat from a Datanode, the Namenode instructs it to delete a maximum of 100 blocks. Typically, the heartbeat periodicity is 3 seconds. The heartbeat thread in the Datanode deletes the block files synchronously before it can send the next heartbeat. That's the reason a small number (like 100) was chosen. If you have 8 datanodes, your system will probably delete about 800 blocks every 3 seconds.

Thanks,
dhruba

-----Original Message-----
From: André Martin [mailto:[EMAIL PROTECTED]
Sent: Friday, March 21, 2008 3:06 PM
To: core-user@hadoop.apache.org
Subject: Re: Performance / cluster scaling question

After waiting a few hours (without having any load), the block number and "DFS Used" space seem to go down... My question is: is the hardware simply too weak/slow to send the block deletion requests to the datanodes in a timely manner, or do those "crappy" HDDs simply cause the delay, since I noticed that it can take up to 40 minutes to delete ~400,000 files at once manually using "rm -r"... Actually - my main concern is why the performance, i.e. the throughput, goes down - any ideas?
Re: small sized files - how to use MultiFileInputFormat
Hi,

An example extracting one record per file would be:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MultiFileInputFormat;
import org.apache.hadoop.mapred.MultiFileSplit;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class FooInputFormat extends MultiFileInputFormat {

  @Override
  public RecordReader getRecordReader(InputSplit split, JobConf job,
      Reporter reporter) throws IOException {
    return new FooRecordReader(job, (MultiFileSplit) split);
  }

  public static class FooRecordReader implements RecordReader {

    private MultiFileSplit split;
    private long offset;      // bytes consumed so far, for progress
    private long totLength;   // total bytes across all files in the split
    private FileSystem fs;
    private int count = 0;    // index of the next file to read
    private Path[] paths;

    public FooRecordReader(Configuration conf, MultiFileSplit split)
        throws IOException {
      this.split = split;
      this.fs = FileSystem.get(conf);
      this.paths = split.getPaths();
      this.totLength = split.getLength();
      this.offset = 0;
    }

    // The original sketch left the key/value types open; Text is one
    // reasonable choice and is used here.
    public WritableComparable createKey() { return new Text(); }
    public Writable createValue() { return new Text(); }

    public void close() throws IOException { }

    public long getPos() throws IOException { return offset; }

    public float getProgress() throws IOException {
      return ((float) offset) / totLength;
    }

    // One record per file; the original left the parsing open ("fill in
    // key and value") - here: key = file name, value = the first line.
    public boolean next(Writable key, Writable value) throws IOException {
      if (offset >= totLength) return false;
      if (count >= split.numPaths()) return false;
      Path file = paths[count];
      FSDataInputStream stream = fs.open(file);
      BufferedReader reader =
          new BufferedReader(new InputStreamReader(stream));
      String line = reader.readLine();
      ((Text) key).set(file.toString());
      ((Text) value).set(line == null ? "" : line);
      reader.close();
      stream.close();
      offset += split.getLength(count);
      count++;
      return true;
    }
  }
}

I guess I should add example code to the mapred tutorial and the examples directory.

Jason Curtes wrote:
> Hello,
>
> I have been trying to run Hadoop on a set of small text files, not
> larger than 10k each. The total input size is 15MB. If I try to run the
> example word count application, it takes about 2000 seconds, more than
> half an hour, to complete. However, if I merge all the files into one
> large file, it takes much less than a minute. I think using
> MultiFileInputFormat can be helpful at this point. However, the API
> documentation is not really helpful. I wonder if MultiFileInputFormat
> can really solve my problem, and if so, can you suggest a reference on
> how to use it, or a few lines to be added to the word count example to
> make things more clear?
>
> Thanks in advance.
>
> Regards,
>
> Jason Curtes
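For context, a hypothetical driver wiring for the sketch above, using the old (pre-0.17) JobConf API; the class and path names are illustrative, not from the original mail:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class FooDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(FooDriver.class);
    conf.setJobName("small-files-demo");
    conf.setInputFormat(FooInputFormat.class);  // one record per small file
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setInputPath(new Path("smallfiles"));       // directory of inputs
    conf.setOutputPath(new Path("smallfiles-out"));
    JobClient.runJob(conf);  // identity map/reduce by default
  }
}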
Re: small sized files - how to use MultiFileInputFormat
In principle I agree with you, Ted. However, in many cases we have multiple large jobs generating outputs that are not that big, and as a result the number of small files (significantly smaller than a Hadoop block) is large; the default splitting logic then triggers jobs with a large number of tasks that inefficiently clog the cluster.

The MultiFileInputFormat helps to avoid that, but it has a problem: if the file set is a mix of small and large files, the splits are uneven, and it does not split single large files.

To address this we've written our own InputFormat (for Text and SequenceFiles) that collapses small files into splits up to the block size and splits big files at the block size. It has a twist that lets you specify either the max number of MAPs you want or the BLOCK size you want to use for the splits. When a particular split contains multiple small files, the suggested hosts for the split are ordered based on the host that has most of the data for those files.

We still have to do some cleanup on the code, and then we'll submit it to Hadoop.

A

On Sat, Mar 29, 2008 at 10:20 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> Small files are a bad idea for high throughput no matter what technology
> you use. The issue is that you need a larger file in order to avoid disk
> seeks.
>
> On 3/28/08 7:34 PM, "Jason Curtes" <[EMAIL PROTECTED]> wrote:
> > I have been trying to run Hadoop on a set of small text files, not
> > larger than 10k each. The total input size is 15MB.