Grep example does not run with eclipse ..

2008-03-12 Thread Pritam Kanchan

Hi,

I am new to this group.
I have compiled hadoop-0.15.3 with Eclipse on my Fedora system.
*I am able to run the example for grep through command prompt successfully.*
(bin/hadoop jar hadoop-0.15.3-examples.jar grep Wdfs/  Woutput/  hey)

But it gives me the following error when I try to run it through Eclipse.
I use the Run dialog of Eclipse and specify the main class
(org.apache.hadoop.examples.Grep) and the program arguments correctly (Wdfs/
Woutput/  hey).

Please help


2008-03-13 11:24:59,639 INFO  jvm.JvmMetrics (JvmMetrics.java:init(67)) 
- Initializing JVM Metrics with processName=JobTracker, sessionId=
2008-03-13 11:24:59,782 INFO  mapred.FileInputFormat 
(FileInputFormat.java:validateInput(153)) - Total input paths to process : 1
2008-03-13 11:24:59,978 WARN  conf.Configuration 
(Configuration.java:loadResource(907)) - 
build/test/mapred/local/localRunner/job_local_1.xml:a attempt to 
override final parameter: hadoop.tmp.dir;  Ignoring.
2008-03-13 11:24:59,982 INFO  mapred.JobClient 
(JobClient.java:runJob(811)) - Running job: job_local_1
2008-03-13 11:25:00,985 INFO  mapred.JobClient 
(JobClient.java:runJob(834)) -  map 0% reduce 0%
2008-03-13 11:25:11,587 INFO  mapred.MapTask (MapTask.java:run(174)) - 
numReduceTasks: 1
2008-03-13 11:25:17,793 WARN  mapred.LocalJobRunner 
(LocalJobRunner.java:run(223)) - job_local_1

java.lang.NullPointerException
   at org.apache.hadoop.io.serializer.SerializationFactory.<init>(SerializationFactory.java:52)
   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.<init>(MapTask.java:325)
   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:177)
   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:150)

Exception in thread "main" java.io.IOException: Job failed!
   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:900)
   at org.apache.hadoop.examples.Grep.run(Grep.java:71)
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
   at org.apache.hadoop.examples.Grep.main(Grep.java:97)
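
A note on the trace above: a NullPointerException in the SerializationFactory
constructor is commonly a sign that the Configuration came up empty, i.e.
hadoop-default.xml and hadoop-site.xml were never loaded, which easily happens
when the conf/ directory is not on the Eclipse run classpath. A minimal sketch
for checking this (the paths below are placeholders, not taken from the report
above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class ConfCheck {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Placeholder paths: point these at the real conf/ directory, or simply
    // add that directory to the Eclipse run configuration's classpath instead.
    conf.addResource(new Path("/path/to/hadoop-0.15.3/conf/hadoop-default.xml"));
    conf.addResource(new Path("/path/to/hadoop-0.15.3/conf/hadoop-site.xml"));
    // hadoop.tmp.dir is defined in hadoop-default.xml, so a null here means
    // the default configuration is still not being picked up.
    System.out.println("hadoop.tmp.dir = " + conf.get("hadoop.tmp.dir"));
  }
}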




Re: Pipes example wordcount-nopipe.cc failed when reading from input splits

2008-03-12 Thread 11 Nov.
I tried to specify "WordCountInputFormat" as the input format, here is the
command line:

bin/hadoop pipes -conf src/examples/pipes/conf/word-nopipe.xml  -input
inputdata/ -output outputdata -inputformat
org.apache.hadoop.mapred.pipes.WordCountInputFormat

The MapReduce job does not seem to actually execute, and I only get the
following output on screen:

08/03/13 13:17:44 WARN mapred.JobClient: No job jar file set.  User classes
may not be found. See JobConf(Class) or JobConf#setJar(String).
08/03/13 13:17:45 INFO mapred.JobClient: Running job: job_200803131137_0004
08/03/13 13:17:46 INFO mapred.JobClient:  map 100% reduce 100%
08/03/13 13:17:47 INFO mapred.JobClient: Job complete: job_200803131137_0004
08/03/13 13:17:47 INFO mapred.JobClient: Counters: 0

What could the problem be, then?

In the former discussion, Owen said:
The nopipe example needs more documentation.  It assumes that it is
run with the InputFormat from src/test/org/apache/hadoop/mapred/pipes/
WordCountInputFormat.java, which has a very specific input split
format. By running with a TextInputFormat, it will send binary bytes
as the input split and won't work right. The nopipe example should
probably be recoded to use libhdfs too, but that is more complicated
to get running as a unit test. Also note that since the C++ example
is using local file reads, it will only work on a cluster if you have
nfs or something working across the cluster.

Can anybody give more details?
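
To make Owen's point above concrete: the C++ reader deserializes a plain string
file name out of context.getInputSplit(), so the split's serialized form has to
be exactly a length-prefixed string. A rough illustration of such a split
(illustration only, with hypothetical names; this is not the actual
WordCountInputFormat source):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.InputSplit;

// A split whose wire format is just a vint-length-prefixed UTF-8 file name,
// which appears to be what HadoopUtils::deserializeString on the C++ side
// reads back (an assumption based on the snippet quoted below). A FileSplit
// from TextInputFormat serializes more than that, so the reader sees binary
// bytes instead of a usable path.
public class FileNameSplit implements InputSplit {
  private String filename = "";

  public FileNameSplit() { }                        // needed for deserialization
  public FileNameSplit(String filename) { this.filename = filename; }

  public long getLength() throws IOException { return 0L; }
  public String[] getLocations() throws IOException { return new String[0]; }

  public void write(DataOutput out) throws IOException {
    Text.writeString(out, filename);                // vint length + bytes
  }
  public void readFields(DataInput in) throws IOException {
    filename = Text.readString(in);
  }
}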

2008/3/7, 11 Nov. <[EMAIL PROTECTED]>:
>
> Thanks a lot!
>
> 2008/3/4, Amareshwari Sri Ramadasu <[EMAIL PROTECTED]>:
> >
> > Hi,
> >
> > Here is some discussion on how to run wordcount-nopipe :
> > http://www.nabble.com/pipe-application-error-td13840804.html
> > It probably applies to your question as well.
> >
> > Thanks
> >
> > Amareshwari
> >
> > 11 Nov. wrote:
> > > I traced into the c++ recordreader code:
> > >   WordCountReader(HadoopPipes::MapContext& context) {
> > > std::string filename;
> > > HadoopUtils::StringInStream stream(context.getInputSplit());
> > > HadoopUtils::deserializeString(filename, stream);
> > > struct stat statResult;
> > > stat(filename.c_str(), &statResult);
> > > bytesTotal = statResult.st_size;
> > > bytesRead = 0;
> > > cout << filename << endl;
> > > file = fopen(filename.c_str(), "rt");
> > > HADOOP_ASSERT(file != NULL, "failed to open " + filename);
> > >   }
> > >
> > > I got nothing for the filename variable, which showed the InputSplit is
> > > empty.
> > >
> > > 2008/3/4, 11 Nov. <[EMAIL PROTECTED]>:
> > >
> > >> hi colleagues,
> > >>I have set up the single node cluster to test pipes examples.
> > >>wordcount-simple and wordcount-part work just fine. but
> > >> wordcount-nopipe can't run. Here is my command line:
> > >>
> > >>  bin/hadoop pipes -conf src/examples/pipes/conf/word-nopipe.xml -input
> > >> input/ -output out-dir-nopipe1
> > >>
> > >> and here is the error message printed on my console:
> > >>
> > >> 08/03/03 23:23:06 WARN mapred.JobClient: No job jar file set.  User
> > >> classes may not be found. See JobConf(Class) or
> > JobConf#setJar(String).
> > >> 08/03/03 23:23:06 INFO mapred.FileInputFormat: Total input paths to
> > >> process : 1
> > >> 08/03/03 23:23:07 INFO mapred.JobClient: Running job:
> > >> job_200803032218_0004
> > >> 08/03/03 23:23:08 INFO mapred.JobClient:  map 0% reduce 0%
> > >> 08/03/03 23:23:11 INFO mapred.JobClient: Task Id :
> > >> task_200803032218_0004_m_00_0, Status : FAILED
> > >> java.io.IOException: pipe child exception
> > >> at org.apache.hadoop.mapred.pipes.Application.abort(
> > >> Application.java:138)
> > >> at org.apache.hadoop.mapred.pipes.PipesMapRunner.run(
> > >> PipesMapRunner.java:83)
> > >> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
> > >> at org.apache.hadoop.mapred.TaskTracker$Child.main(
> > >> TaskTracker.java:1787)
> > >> Caused by: java.io.EOFException
> > >> at java.io.DataInputStream.readByte(DataInputStream.java:250)
> > >> at org.apache.hadoop.io.WritableUtils.readVLong(
> > WritableUtils.java
> > >> :313)
> > >> at org.apache.hadoop.io.WritableUtils.readVInt(
> > WritableUtils.java
> > >> :335)
> > >> at
> > >> org.apache.hadoop.mapred.pipes.BinaryProtocol$UplinkReaderThread.run(
> > >> BinaryProtocol.java:112)
> > >>
> > >> task_200803032218_0004_m_00_0:
> > >> task_200803032218_0004_m_00_0:
> > >> task_200803032218_0004_m_00_0:
> > >> task_200803032218_0004_m_00_0: Hadoop Pipes Exception: failed to
> > open
> > >> at /home/hadoop/hadoop-0.15.2-single-cluster
> > >> /src/examples/pipes/impl/wordcount-nopipe.cc:67 in
> > >> WordCountReader::WordCountReader(HadoopPipes::MapContext&)
> > >>
> > >>
> > >> Could anybody tell me how to fix this? That will be appreciated.
> > >> Thanks a lot!
> > >>
> > >>
> > >
> > >
> >
> >
>


Re: hadoop dfs -ls command not working

2008-03-12 Thread Amar Kamat
Assuming that you are using HADOOP in the distributed mode.
On Thu, 13 Mar 2008, christopher pax wrote:

> i run something like this:
> $: bin/hadoop dfs -ls /home/cloud/wordcount/input/
This path should exist in the dfs (i.e. HADOOP's filesystem) and not on the
local filesystem. Looking at the jar file (see below) I assume that you
are trying to give it a local filesystem path. Put the file in the dfs
using 'bin/hadoop dfs -put' and then provide the path in the dfs as the
source and the target. In 'stand alone' mode, where the default filesystem is
the local one, this would work.
Amar
> and get this:
> ls: Could not get listing for /home/cloud/wordcount/input
>
>
> the file input does exist in that directory listing
>
> there are 2 documents in that directory, file01 and file02, both of which 
> have text in them.
>
> what i am doing is running the word count example from
> http://hadoop.apache.org/core/docs/r0.16.0/mapred_tutorial.html
> the program compiles fine.
>
> running the dfs commands in the example is not working.
> this is not working for me either:
> $: bin/hadoop jar /home/cloud/wordcount.jar org.myorg.WordCount
^ ^ ^ ^ ^ ^
> /home/cloud/wordcount/input /home/cloud/wordcount/output
>
> hope you guys can help,
> thanks
>
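
Following Amar's suggestion above, here is the same put-then-use-the-DFS-path
flow through the Java API, as a rough sketch (the DFS destination path below is
an assumption):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutAndList {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Copy the local input directory into the DFS first...
    fs.copyFromLocalFile(new Path("/home/cloud/wordcount/input"),
                         new Path("/user/cloud/wordcount/input"));
    // ...then list (and later run the job against) the DFS path.
    Path[] files = fs.listPaths(new Path("/user/cloud/wordcount/input"));
    for (Path p : files) {
      System.out.println(p);
    }
  }
}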


hadoop dfs -ls command not working

2008-03-12 Thread christopher pax
i run something like this:
$: bin/hadoop dfs -ls /home/cloud/wordcount/input/
and get this:
ls: Could not get listing for /home/cloud/wordcount/input


the file input does exist in that directory listing

there are 2 documents in that directory, file01 and file02, both of which have text in them.

what i am doing is running the word count example from
http://hadoop.apache.org/core/docs/r0.16.0/mapred_tutorial.html
the program compiles fine.

running the dfs commands in the example is not working.
this is not working for me either:
$: bin/hadoop jar /home/cloud/wordcount.jar org.myorg.WordCount
/home/cloud/wordcount/input /home/cloud/wordcount/output

hope you guys can help,
thanks


Re: file permission problem

2008-03-12 Thread Johannes Zillmann

Hi Nicholas,

i'm using the 0.16.0 distribution.

Johannes



[EMAIL PROTECTED] wrote:

Hi Johannes,

Which version of hadoop are you using?  There is a known bug in some nightly 
builds.

Nicholas


- Original Message 
From: Johannes Zillmann <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Wednesday, March 12, 2008 5:47:27 PM
Subject: file permission problem

Hi,

i have a question regarding the file permissions.
I have a kind of workflow where i submit a job from my laptop to a 
remote hadoop cluster.

After the job finishes I do some file operations on the generated output.
The "cluster-user" is different from the "laptop-user". As output I 
specify a directory inside the user's home. This output directory, 
created by the map-reduce job, has "cluster-user" permissions, so 
it does not allow me to move or delete the output folder as my 
"laptop-user".


So it looks as follows:
/user/jz/         rwxrwxrwx   jz       supergroup
/user/jz/output   rwxr-xr-x   hadoop   supergroup

I tried different things to achieve what i want (moving/deleting the 
output folder):

- jobConf.setUser("hadoop") on the client side
- System.setProperty("user.name","hadoop") before jobConf instantiation 
on the client side

- add user.name node in the hadoop-site.xml on the client side
- setPermission(777) on the home folder on the client side (does not work 
recursively)
- setPermission(777) on the output folder on the client side (permission 
denied)
- create the output folder before running the job (Output directory 
already exists exception)


None of the things I tried worked. Is there a way to achieve what I want?
Any ideas appreciated!

cheers
Johannes


  



--
~~~ 
101tec GmbH


Halle (Saale), Saxony-Anhalt, Germany
http://www.101tec.com



Re: file permission problem

2008-03-12 Thread s29752-hadoopuser
Hi Johannes,

Which version of hadoop are you using?  There is a known bug in some nightly 
builds.

Nicholas


- Original Message 
From: Johannes Zillmann <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Wednesday, March 12, 2008 5:47:27 PM
Subject: file permission problem

Hi,

i have a question regarding the file permissions.
I have a kind of workflow where i submit a job from my laptop to a 
remote hadoop cluster.
After the job finishes I do some file operations on the generated output.
The "cluster-user" is different from the "laptop-user". As output I 
specify a directory inside the user's home. This output directory, 
created by the map-reduce job, has "cluster-user" permissions, so 
it does not allow me to move or delete the output folder as my 
"laptop-user".

So it looks as follows:
/user/jz/         rwxrwxrwx   jz       supergroup
/user/jz/output   rwxr-xr-x   hadoop   supergroup

I tried different things to achieve what i want (moving/deleting the 
output folder):
- jobConf.setUser("hadoop") on the client side
- System.setProperty("user.name","hadoop") before jobConf instantiation 
on the client side
- add user.name node in the hadoop-site.xml on the client side
- setPermission(777) on the home folder on the client side (does not work 
recursively)
- setPermission(777) on the output folder on the client side (permission 
denied)
- create the output folder before running the job (Output directory 
already exists exception)

None of the things I tried worked. Is there a way to achieve what I want?
Any ideas appreciated!

cheers
Johannes


-- 
~~~ 
101tec GmbH

Halle (Saale), Saxony-Anhalt, Germany
http://www.101tec.com






file permission problem

2008-03-12 Thread Johannes Zillmann

Hi,

i have a question regarding the file permissions.
I have a kind of workflow where i submit a job from my laptop to a 
remote hadoop cluster.

After the job finishes I do some file operations on the generated output.
The "cluster-user" is different from the "laptop-user". As output I 
specify a directory inside the user's home. This output directory, 
created by the map-reduce job, has "cluster-user" permissions, so 
it does not allow me to move or delete the output folder as my 
"laptop-user".


So it looks as follows:
/user/jz/         rwxrwxrwx   jz       supergroup
/user/jz/output   rwxr-xr-x   hadoop   supergroup

I tried different things to achieve what i want (moving/deleting the 
output folder):

- jobConf.setUser("hadoop") on the client side
- System.setProperty("user.name","hadoop") before jobConf instantiation 
on the client side

- add user.name node in the hadoop-site.xml on the client side
- setPermission(777) on the home folder on the client side (does not work 
recursively)
- setPermission(777) on the output folder on the client side (permission 
denied)
- create the output folder before running the job (Output directory 
already exists exception)


None of the things I tried worked. Is there a way to achieve what I want?
Any ideas appreciated!

cheers
Johannes


--
~~~ 
101tec GmbH


Halle (Saale), Saxony-Anhalt, Germany
http://www.101tec.com
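
For reference, the setPermission attempts described above would look like the
sketch below through the API. It still runs as the client-side user, so it hits
the same permission problem unless that user (or the superuser) owns the path;
it is shown only to make the attempted call concrete (the path is the one from
the listing above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class OpenUpOutput {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // 0777 == rwxrwxrwx; only the owner of /user/jz/output (or the superuser)
    // may do this, which is exactly the limitation described in the post.
    fs.setPermission(new Path("/user/jz/output"), new FsPermission((short) 0777));
  }
}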



Re: HDFS interface

2008-03-12 Thread Hairong Kuang
If you add the configuration directory to the class path, the configuration
files will be automatically loaded.

Hairong


On 3/12/08 5:32 PM, "Cagdas Gerede" <[EMAIL PROTECTED]> wrote:

> I found the solution.  Please let me know if you have a better idea.
> 
> I added the following addResource lines.
> 
> Configuration conf = new Configuration();
> 
> conf.addResource(new Path("location_of_hadoop-default.xml"));
> conf.addResource(new Path("location_of_hadoop-site.xml"));
> 
> FileSystem fs = FileSystem.get(conf);
> 
> (Would be good to update the wiki page).
> 
> - CEG
> 
> 
> On Wed, Mar 12, 2008 at 5:04 PM, Cagdas Gerede <[EMAIL PROTECTED]>
> wrote:
> 
>> I see the following paragraphs in the wiki (
>> http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample)
>> 
>>> Create a FileSystem instance by passing a new Configuration object. Please
>>> note that the following example code assumes that the Configuration object
>>> will automatically load the *hadoop-default.xml* and *hadoop-site.xml*
>>> configuration files. You may need to explicitly add these resource paths if
>>> you are not running inside of the Hadoop runtime environment.
>> 
>> and
>> 
>>> Configuration conf = new Configuration();
>>>FileSystem fs = FileSystem.get(conf);
>> 
>> When I do
>> 
>> Path[] apples = fs.globPaths(new Path("*"));
>> for(Path apple : apples) {
>> System.out.println(apple);
>> }
>> 
>> 
>> It prints out all the local file names.
>> 
>> How do I point my application to running HDFS instance?
>> What does "explicitly add these resource paths if you are not running
>> inside of the Hadoop runtime environment." mean?
>> 
>> Thanks,
>> 
>> - CEG
>> 
>> 
>> 
>> 
> 



Re: HDFS interface

2008-03-12 Thread Cagdas Gerede
I found the solution.  Please let me know if you have a better idea.

I added the following addResource lines.

Configuration conf = new Configuration();

conf.addResource(new Path("location_of_hadoop-default.xml"));
conf.addResource(new Path("location_of_hadoop-site.xml"));

FileSystem fs = FileSystem.get(conf);

(Would be good to update the wiki page).

- CEG


On Wed, Mar 12, 2008 at 5:04 PM, Cagdas Gerede <[EMAIL PROTECTED]>
wrote:

> I see the following paragraphs in the wiki (
> http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample)
>
> >Create a FileSystem instance by passing a new Configuration object. Please
> >note that the following example code assumes that the Configuration object
> >will automatically load the *hadoop-default.xml* and *hadoop-site.xml*
> >configuration files. You may need to explicitly add these resource paths if
> >you are not running inside of the Hadoop runtime environment.
>
> and
>
> > Configuration conf = new Configuration();
> >FileSystem fs = FileSystem.get(conf);
>
> When I do
>
> Path[] apples = fs.globPaths(new Path("*"));
> for(Path apple : apples) {
> System.out.println(apple);
> }
>
>
> It prints out all the local file names.
>
> How do I point my application to running HDFS instance?
> What does "explicitly add these resource paths if you are not running
> inside of the Hadoop runtime environment." mean?
>
> Thanks,
>
> - CEG
>
>
>
>


-- 

Best Regards, Cagdas Evren Gerede
Home Page: http://www.cs.ucsb.edu/~gerede
Pronunciation: http://www.cs.ucsb.edu/~gerede/cagdas.html


Re: HDFS interface

2008-03-12 Thread Cagdas Gerede
I see the following paragraphs in the wiki (
http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample)

>Create a FileSystem instance by passing a new Configuration object. Please
>note that the following example code assumes that the Configuration object
>will automatically load the *hadoop-default.xml* and *hadoop-site.xml*
>configuration files. You may need to explicitly add these resource paths if
>you are not running inside of the Hadoop runtime environment.

and

> Configuration conf = new Configuration();
>FileSystem fs = FileSystem.get(conf);

When I do

Path[] apples = fs.globPaths(new Path("*"));
for(Path apple : apples) {
System.out.println(apple);
}


It prints out all the local file names.

How do I point my application to running HDFS instance?
What does "explicitly add these resource paths if you are not running inside
of the Hadoop runtime environment." mean?

Thanks,

- CEG
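
Besides adding the configuration files (or the conf/ directory) to the
classpath, another way to point the client at a running HDFS is to set
fs.default.name explicitly. A sketch, assuming a NameNode at a placeholder
host/port; the exact value should be copied from the cluster's hadoop-site.xml:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsList {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder address: use the fs.default.name value from hadoop-site.xml.
    conf.set("fs.default.name", "hdfs://namenode.example.com:9000");
    FileSystem fs = FileSystem.get(conf);
    for (Path p : fs.globPaths(new Path("*"))) {   // now lists HDFS, not local files
      System.out.println(p);
    }
  }
}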


Re: scaling experiments on a static cluster?

2008-03-12 Thread Ted Dunning

Yes.

Increase the replication.  Wait.  Drop the replication.
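
As a rough sketch of that replication dance through the API (placeholder path
and factors; setReplication applies per file, so a directory tree needs one
call per file or the dfs -setrep shell command):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class Rereplicate {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path data = new Path("/user/chris/input/part-00000");  // placeholder file
    // Raise replication so the remaining nodes each get a copy...
    fs.setReplication(data, (short) 8);
    // ...wait for the NameNode to finish re-replicating (watch dfsadmin -report),
    // then drop it back to the normal factor.
    fs.setReplication(data, (short) 3);
  }
}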


On 3/12/08 3:44 PM, "Chris Dyer" <[EMAIL PROTECTED]> wrote:

> Thanks-- that should work.  I'll follow up with the cluster
> administrators to see if I can get this to happen.  To rebalance the
> file storage can I just set the replication factor using "hadoop dfs"?
> Chris
> 
> On Wed, Mar 12, 2008 at 6:36 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>> 
>>  What about just taking down half of the nodes and then loading your data
>>  into the remainder?  Should take about 20 minutes each time you remove nodes
>>  but only a few seconds each time you add some.  Remember that you need to
>>  reload the data each time (or rebalance it if growing the cluster) to get
>>  realistic numbers.
>> 
>>  My suggested procedure would be to take all but 2 nodes down, and then
>> 
>>  - run test
>>  - double number of nodes
>>  - rebalance file storage
>>  - lather, rinse, repeat
>> 
>> 
>> 
>> 
>>  On 3/12/08 3:28 PM, "Chris Dyer" <[EMAIL PROTECTED]> wrote:
>> 
>>> Hi Hadoop mavens-
>>> I'm hoping someone out there will have a quick solution for me.  I'm
>>> trying to run some very basic scaling experiments for a rapidly
>>> approaching paper deadline on a 16.0 Hadoop cluster that has ~20 nodes
>>> with 2 procs/node.  Ideally, I would want to run my code on clusters
>>> of different numbers of nodes (1, 2, 4, 8, 16) or some such thing.
>>> The problem is that I am not able to reconfigure the cluster (in the
>>> long run, i.e., before a final version of the paper, I assume this
>>> will be possible, but for now it's not).  Setting the number of
>>> mappers/reducers does not seem to be a viable option, at least not in
>>> the trivial way, since the physical layout of the input files makes
>>> hadoop run different tasks of processes than I may request (most of my
>>> jobs consist of multiple MR steps, the initial one always running on a
>>> relatively small data set, which fits into a single block, and
>>> therefore the Hadoop framework does honor my task number request on
>>> the first job-- but during the later ones it does not).
>>> 
>>> My questions:
>>> 1) can I get around this limitation programmatically?  I.e., is there
>>> a way to tell the framework to only use a subset of the nodes for DFS
>>> / mapping / reducing?
>>> 2) if not, what statistics would be good to report if I can only have
>>> two data points -- a legacy "single-core" implementation of the
>>> algorithms and a MapReduce version running on a full cluster?
>>> 
>>> Thanks for any suggestions!
>>> Chris
>> 
>> 



Re: scaling experiments on a static cluster?

2008-03-12 Thread Chris Dyer
Thanks-- that should work.  I'll follow up with the cluster
administrators to see if I can get this to happen.  To rebalance the
file storage can I just set the replication factor using "hadoop dfs"?
Chris

On Wed, Mar 12, 2008 at 6:36 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
>  What about just taking down half of the nodes and then loading your data
>  into the remainder?  Should take about 20 minutes each time you remove nodes
>  but only a few seconds each time you add some.  Remember that you need to
>  reload the data each time (or rebalance it if growing the cluster) to get
>  realistic numbers.
>
>  My suggested procedure would be to take all but 2 nodes down, and then
>
>  - run test
>  - double number of nodes
>  - rebalance file storage
>  - lather, rinse, repeat
>
>
>
>
>  On 3/12/08 3:28 PM, "Chris Dyer" <[EMAIL PROTECTED]> wrote:
>
>  > Hi Hadoop mavens-
>  > I'm hoping someone out there will have a quick solution for me.  I'm
>  > trying to run some very basic scaling experiments for a rapidly
>  > approaching paper deadline on a 16.0 Hadoop cluster that has ~20 nodes
>  > with 2 procs/node.  Ideally, I would want to run my code on clusters
>  > of different numbers of nodes (1, 2, 4, 8, 16) or some such thing.
>  > The problem is that I am not able to reconfigure the cluster (in the
>  > long run, i.e., before a final version of the paper, I assume this
>  > will be possible, but for now it's not).  Setting the number of
>  > mappers/reducers does not seem to be a viable option, at least not in
>  > the trivial way, since the physical layout of the input files makes
>  > hadoop run different tasks of processes than I may request (most of my
>  > jobs consist of multiple MR steps, the initial one always running on a
>  > relatively small data set, which fits into a single block, and
>  > therefore the Hadoop framework does honor my task number request on
>  > the first job-- but during the later ones it does not).
>  >
>  > My questions:
>  > 1) can I get around this limitation programmatically?  I.e., is there
>  > a way to tell the framework to only use a subset of the nodes for DFS
>  > / mapping / reducing?
>  > 2) if not, what statistics would be good to report if I can only have
>  > two data points -- a legacy "single-core" implementation of the
>  > algorithms and a MapReduce version running on a full cluster?
>  >
>  > Thanks for any suggestions!
>  > Chris
>
>


Re: scaling experiments on a static cluster?

2008-03-12 Thread Ted Dunning

What about just taking down half of the nodes and then loading your data
into the remainder?  Should take about 20 minutes each time you remove nodes
but only a few seconds each time you add some.  Remember that you need to
reload the data each time (or rebalance it if growing the cluster) to get
realistic numbers.

My suggested procedure would be to take all but 2 nodes down, and then

- run test
- double number of nodes
- rebalance file storage
- lather, rinse, repeat


On 3/12/08 3:28 PM, "Chris Dyer" <[EMAIL PROTECTED]> wrote:

> Hi Hadoop mavens-
> I'm hoping someone out there will have a quick solution for me.  I'm
> trying to run some very basic scaling experiments for a rapidly
> approaching paper deadline on a 16.0 Hadoop cluster that has ~20 nodes
> with 2 procs/node.  Ideally, I would want to run my code on clusters
> of different numbers of nodes (1, 2, 4, 8, 16) or some such thing.
> The problem is that I am not able to reconfigure the cluster (in the
> long run, i.e., before a final version of the paper, I assume this
> will be possible, but for now it's not).  Setting the number of
> mappers/reducers does not seem to be a viable option, at least not in
> the trivial way, since the physical layout of the input files makes
> hadoop run different tasks of processes than I may request (most of my
> jobs consist of multiple MR steps, the initial one always running on a
> relatively small data set, which fits into a single block, and
> therefore the Hadoop framework does honor my task number request on
> the first job-- but during the later ones it does not).
> 
> My questions:
> 1) can I get around this limitation programmatically?  I.e., is there
> a way to tell the framework to only use a subset of the nodes for DFS
> / mapping / reducing?
> 2) if not, what statistics would be good to report if I can only have
> two data points -- a legacy "single-core" implementation of the
> algorithms and a MapReduce version running on a full cluster?
> 
> Thanks for any suggestions!
> Chris



scaling experiments on a static cluster?

2008-03-12 Thread Chris Dyer
Hi Hadoop mavens-
I'm hoping someone out there will have a quick solution for me.  I'm
trying to run some very basic scaling experiments for a rapidly
approaching paper deadline on a 16.0 Hadoop cluster that has ~20 nodes
with 2 procs/node.  Ideally, I would want to run my code on clusters
of different numbers of nodes (1, 2, 4, 8, 16) or some such thing.
The problem is that I am not able to reconfigure the cluster (in the
long run, i.e., before a final version of the paper, I assume this
will be possible, but for now it's not).  Setting the number of
mappers/reducers does not seem to be a viable option, at least not in
the trivial way, since the physical layout of the input files makes
hadoop run different tasks of processes than I may request (most of my
jobs consist of multiple MR steps, the initial one always running on a
relatively small data set, which fits into a single block, and
therefore the Hadoop framework does honor my task number request on
the first job-- but during the later ones it does not).

My questions:
1) can I get around this limitation programmatically?  I.e., is there
a way to tell the framework to only use a subset of the nodes for DFS
/ mapping / reducing?
2) if not, what statistics would be good to report if I can only have
two data points -- a legacy "single-core" implementation of the
algorithms and a MapReduce version running on a full cluster?

Thanks for any suggestions!
Chris


naming output files from Reduce

2008-03-12 Thread Prasan Ary
I have two Map/Reduce jobs and each of them outputs a file. Is there a way 
I can name these output files differently from the default "part-" names?
   
  thanks.
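
One low-tech option, sketched below with placeholder paths and names: leave the
jobs alone and rename the part files afterwards via the FileSystem API. A
custom OutputFormat is the other route, but the rename keeps the jobs untouched.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RenameOutputs {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path outDir = new Path("/user/me/job1-output");          // placeholder
    Path[] parts = fs.globPaths(new Path(outDir, "part-*"));
    for (int i = 0; i < parts.length; i++) {
      // e.g. part-00000 -> job1-result-0, part-00001 -> job1-result-1, ...
      fs.rename(parts[i], new Path(outDir, "job1-result-" + i));
    }
  }
}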


Re: HDFS interface

2008-03-12 Thread Eddie C
I used code like this inside of a Tomcat web application. It
works. Shared webserver filesystem :)

On Wed, Mar 12, 2008 at 4:50 PM, Hairong Kuang <[EMAIL PROTECTED]> wrote:
> http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample
>
>  Hairong
>
>
>
>
>  On 3/12/08 1:21 PM, "Arun C Murthy" <[EMAIL PROTECTED]> wrote:
>
>  >
>  > http://hadoop.apache.org/core/docs/r0.16.0/hdfs_user_guide.html
>  >
>  > Arun
>  >
>  > On Mar 12, 2008, at 1:16 PM, Cagdas Gerede wrote:
>  >
>  >> I would like to use HDFS component of Hadoop but not interested in
>  >> MapReduce.
>  >> All the Hadoop examples I have seen so far uses MapReduce classes
>  >> and from
>  >> these examples there is no reference to HDFS classes including File
>  >> System
>  >> API of Hadoop
>  >> (http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/fs/FileSystem.html)
>  >> Everything seems to happen under the hood.
>  >>
>  >> I was wondering if there is any example source code that is using HDFS
>  >> directly.
>  >>
>  >>
>  >> Thanks,
>  >>
>  >> - CEG
>  >
>
>


Re: HDFS interface

2008-03-12 Thread Hairong Kuang
http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample

Hairong


On 3/12/08 1:21 PM, "Arun C Murthy" <[EMAIL PROTECTED]> wrote:

> 
> http://hadoop.apache.org/core/docs/r0.16.0/hdfs_user_guide.html
> 
> Arun
> 
> On Mar 12, 2008, at 1:16 PM, Cagdas Gerede wrote:
> 
>> I would like to use HDFS component of Hadoop but not interested in
>> MapReduce.
>> All the Hadoop examples I have seen so far uses MapReduce classes
>> and from
>> these examples there is no reference to HDFS classes including File
>> System
>> API of Hadoop
>> (http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/fs/FileSystem.html)
>> Everything seems to happen under the hood.
>> 
>> I was wondering if there is any example source code that is using HDFS
>> directly.
>> 
>> 
>> Thanks,
>> 
>> - CEG
> 



Re: Does Hadoop Honor Reserved Space?

2008-03-12 Thread Eric Baldeschwieler

Hi Pete, Joydeep,

These sound like thoughts that could lead to excellent suggestions  
with a little more investment of your time.


We'd love it if you could invest some effort into contributing to the  
release process!  Hadoop is open source and becoming active  
contributors is the best possible way to address shortcomings that  
impact your organization.


Thanks for your help!

E14



On Mar 10, 2008, at 8:43 PM, Pete Wyckoff wrote:



+1

(obviously :))


On 3/10/08 5:26 PM, "Joydeep Sen Sarma" <[EMAIL PROTECTED]> wrote:


I have left some comments behind on the jira.

We could argue over what's the right thing to do (and we will on the
Jira) - but the higher level problem is that this is another case  
where
backwards compatibility with existing semantics of this option was  
not

carried over. Neither was there any notification to admins about this
change. The change notes just do not convey the import of this  
change to

existing deployments (incidentally 1463 was classified as 'Bug Fix' -
not that putting under 'Incompatible Fix' would have helped imho).

Would request the board/committers to consider setting up something
along the lines of:

1. have something better than Change Notes to convey interface  
changes

2. a field in the JIRA that marks it out as important from interface
change point of view (with notes on what's changing). This could  
be used

to auto-populate #1
3. Some way of auto-subscribing to bugs that are causing interface
changes (even an email filter on the jira mails would do).

As Hadoop user base keeps growing - and gets used for 'production'  
tasks
- I think it's absolutely essential that users/admins can keep in  
tune
with changes that affect their deployments. Otherwise - any  
organization

other than Yahoo would have tough time upgrading.

(I am new to open-source - but surely this has been solved before?)

Joydeep

-Original Message-
From: Hairong Kuang [mailto:[EMAIL PROTECTED]
Sent: Monday, March 10, 2008 5:17 PM
To: core-user@hadoop.apache.org
Subject: Re: Does Hadoop Honor Reserved Space?

I think you have a misunderstanding of the reserved parameter. As I
commented on hadoop-1463, remember that dfs.du.reserve is the  
space for
non-dfs usage, including the space for map/reduce, other  
application, fs

meta-data etc. In your case since /usr already takes 45GB, it far
exceeds
the reserved limit 1G. You should set the reserved space to be 50G.

Hairong


On 3/10/08 4:54 PM, "Joydeep Sen Sarma" <[EMAIL PROTECTED]> wrote:


Filed https://issues.apache.org/jira/browse/HADOOP-2991

-Original Message-
From: Joydeep Sen Sarma [mailto:[EMAIL PROTECTED]
Sent: Monday, March 10, 2008 12:56 PM
To: core-user@hadoop.apache.org; core-user@hadoop.apache.org
Cc: Pete Wyckoff
Subject: RE: Does Hadoop Honor Reserved Space?

folks - Jimmy is right - as we have unfortunately hit it as well:

https://issues.apache.org/jira/browse/HADOOP-1463 caused a  
regression.

we have left some comments on the bug - but can't reopen it.

this is going to be affecting all 0.15 and 0.16 deployments!


-Original Message-
From: Hairong Kuang [mailto:[EMAIL PROTECTED]
Sent: Thu 3/6/2008 2:01 PM
To: core-user@hadoop.apache.org
Subject: Re: Does Hadoop Honor Reserved Space?

In addition to the version, could you please send us a copy of the
datanode
report by running the command bin/hadoop dfsadmin -report?

Thanks,
Hairong


On 3/6/08 11:56 AM, "Joydeep Sen Sarma" <[EMAIL PROTECTED]>  
wrote:



but intermediate data is stored in a different directory from dfs/data
(something like mapred/local by default i think).

what version are u running?


-Original Message-
From: Ashwinder Ahluwalia on behalf of [EMAIL PROTECTED]
Sent: Thu 3/6/2008 10:14 AM
To: core-user@hadoop.apache.org
Subject: RE: Does Hadoop Honor Reserved Space?

I've run into a similar issue in the past. From what I understand, this
parameter only controls the HDFS space usage. However, the intermediate data
in the map reduce job is stored on the local file system (not HDFS) and is not
subject to this configuration.

In the past I have used mapred.local.dir.minspacekill and
mapred.local.dir.minspacestart to control the amount of space that is
allowable for use by this temporary data.
for use by this temporary data.

Not sure if that is the best approach though, so I'd love to hear

what

other

people have done. In your case, you have a map-red job that will

consume too

much space (without setting a limit, you didn't have enough disk

capacity for

the job), so looking at mapred.output.compress and

mapred.compress.map.output

might be useful to decrease the job's disk requirements.

--Ash

-Original Message-
From: Jimmy Wan [mailto:[EMAIL PROTECTED]
Sent: Thursday, March 06, 2008 9:56 AM
To: core-user@hadoop.apache.org
Subject: Does Hadoop Honor Reserved Space?

I've got 2 datanodes setup with the following configuration parameter:

<property>
  <name>dfs.datanode.du.reserved</name>
  <value>429496729600</value>
  <description>Reserved space in bytes per volume</description>
</property>

Re: Searching email list

2008-03-12 Thread Daryl C. W. O'Shea
On 12/03/2008 4:18 PM, Cagdas Gerede wrote:
> Is there an easy way to search this email list?
> I couldn't find any web interface.
> 
> Please help.

http://wiki.apache.org/hadoop/MailingListArchives

Daryl



Re: HDFS interface

2008-03-12 Thread Arun C Murthy


http://hadoop.apache.org/core/docs/r0.16.0/hdfs_user_guide.html

Arun

On Mar 12, 2008, at 1:16 PM, Cagdas Gerede wrote:


I would like to use HDFS component of Hadoop but not interested in
MapReduce.
All the Hadoop examples I have seen so far use MapReduce classes and from
these examples there is no reference to HDFS classes including the FileSystem
API of Hadoop
(http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/fs/FileSystem.html)

Everything seems to happen under the hood.

I was wondering if there is any example source code that is using HDFS
directly.


Thanks,

- CEG




Searching email list

2008-03-12 Thread Cagdas Gerede
Is there an easy way to search this email list?
I couldn't find any web interface.

Please help.


CEG


Re: HDFS interface

2008-03-12 Thread Cagdas Gerede
I would like to use HDFS component of Hadoop but not interested in
MapReduce.
All the Hadoop examples I have seen so far use MapReduce classes and from
these examples there is no reference to HDFS classes including the FileSystem
API of Hadoop
(http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/fs/FileSystem.html)
Everything seems to happen under the hood.

I was wondering if there is any example source code that is using HDFS
directly.


Thanks,

- CEG


Re: Summit Move: More Seats, new Venue (Re: Hadoop summit on March 25th)

2008-03-12 Thread Marc Boucher
Great news Jeremy, thank you for this.

Marc Boucher
Hyperix

On Wed, Mar 12, 2008 at 11:02 AM, Jeremy Zawodny <[EMAIL PROTECTED]> wrote:

> Good news!
>
> We've located a new venue for the Hadoop Summit (not far from Yahoo), have
> capacity for another 75 people, and are still keeping the event free.
> Thanks to Amazon Web Services for chipping in some food money. :-)
>
> Sign up now and pass the word:
>
> http://upcoming.yahoo.com/event/436226/
> http://developer.yahoo.com/blogs/hadoop/2008/03/hadoop-summit-move.html
>
> We're in the process of updating the summit site and notifying a few other
> groups as well.
>
> Thanks for all the interest.  We're looking forward to a day packed full
> of
> Hadoop.
>
> Jeremy
>
> On 3/5/08, Ted Dunning <[EMAIL PROTECTED]> wrote:
> >
> >
> > +1
> >
> > I have a colleague I would like to bring as well.
> >
> > Maybe we need to have an unconference in a park next door and take turns
> > actually being in the hall for the talks.
> >
> >
> >
> > On 3/5/08 1:58 PM, "Bruce Williams" <[EMAIL PROTECTED]> wrote:
> >
> > > It seems like a bigger room in Sunnyvale could be found, doesn't it?
> > > There are tech presentations of different sizes going on everyday in
> the
> > > San Jose area.
> > >
> > > I registered, but have people who I work with who would love to
> attend.
> > > There are many new people coming into Hadoop who would benefit from
> the
> > > Summit. As it has turned out, with the people presently attending,
>  the
> > > result may somewhat be "preaching to the choir"  and  informing the
> > > already  well informed compared to what could happen. ( to the great
> > > benefit of Hadoop )
> > >
> > > Anyway, I am looking forward to a great time! :-)
> > >
> > > Bruce Williams
> > >
> > >
> > > Marc Boucher wrote:
> > >> I'm on the waiting list as well and i'll be in the area anyway on a
> > >> business trip so I'm wondering with so many people wanting to attend
> > >> is there no way to get a bigger venue?
> > >>
> > >> Marc Boucher
> > >> Hyperix
> > >>
> > >>
> > >> On 3/5/08, mickey hsieh <[EMAIL PROTECTED]> wrote:
> > >>
> > >>> Hi Jeremy,
> > >>>
> > >>> It is full again. Current size is 140. The demand is really high, I
> am
> > >>> desperately looking for opportunity to attend.
> > >>>
> > >>> Is there any chance to get up a couple more slots?
> > >>>
> > >>> Thanks,
> > >>>
> > >>> Mickey Hsieh
> > >>> Fox Interactive Media
> > >>>
> > >>>
> > >>>
> > >>> On 2/28/08, Jeremy Zawodny <[EMAIL PROTECTED]> wrote:
> > >>>
> >  I've bumped up the numbers on Upcoming.org to allow more folks to
> > attend.
> >  The room might be a little crowded, but we'll make it work.
> > 
> >  We're also looking at webcasting in addition to posting video after
> > the
> >  summit.
> > 
> > 
> > 
> > 
> > >>>
> >
> http://developer.yahoo.com/blogs/hadoop/2008/02/hadoop_summit_nearly_full_we
> > >>> ll.html
> > >>>
> >  http://upcoming.yahoo.com/event/436226/
> >  http://developer.yahoo.com/hadoop/summit/
> > 
> >  Register soon if you haven't already.
> > 
> >  Thanks!
> > 
> >  Jeremy
> > 
> >  On 2/25/08, chris <[EMAIL PROTECTED]> wrote:
> > 
> > > I see the class is full with more than 50 watchers. Any chance the
> > size
> > > will
> > > expand? If not, any date in mind for a second one?
> > >
> > >
> > >>
> > >>
> > >>
> > >
> >
> >
>


RE: reading input file only once for multiple map functions

2008-03-12 Thread Joydeep Sen Sarma
the short answer is no - can't do this.

there are some special cases:

if the map output key for the same xml record is the same for both the jobs 
(ie. sort/partition/grouping is based on same value) - then you can do this in 
the application layer.

if the map output keys differ - then there's no way to do this. you can combine 
both the jobs and send tagged outputs from each job's map function - but that 
doesn't achieve much (saves a file scan - but unrelated data has now got to be 
sorted together - which on the whole may be a loss rather than a win).

if the map itself can achieve a dramatic reduction in data size - then you can 
consider running a first job that has no reduces - it just applies the map 
functions to produce two sets of (much smaller) files that are written out to 
hdfs. then you can launch two jobs (with an identity mapper and the original reduce 
functions) that work against these data sets - see the sketch below. so you will 
have three jobs - but only a single scan of the initial file.
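
a rough sketch of that three-job layout with the 0.16-era API (paths are 
placeholders; the IdentityMapper/IdentityReducer stand-ins mark where the real 
map and reduce classes would go):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ThreeJobPlan {
  public static void main(String[] args) throws Exception {
    // Job 0: map-only pass over the big XML; the (much smaller) map output
    // goes straight to HDFS because there are no reduces.
    JobConf scan = new JobConf(ThreeJobPlan.class);
    scan.setJobName("xml-scan");
    scan.setInputPath(new Path("/data/big-input"));      // placeholder
    scan.setOutputPath(new Path("/tmp/scan-out"));       // placeholder
    scan.setMapperClass(IdentityMapper.class);           // stand-in for the real map
    scan.setNumReduceTasks(0);                           // no sort, no reduce
    JobClient.runJob(scan);

    // Job 1 (and likewise Job 2): identity map over the scan output,
    // plus the original reduce.
    JobConf job1 = new JobConf(ThreeJobPlan.class);
    job1.setJobName("job1");
    job1.setInputPath(new Path("/tmp/scan-out"));
    job1.setOutputPath(new Path("/out/job1"));           // placeholder
    job1.setMapperClass(IdentityMapper.class);
    job1.setReducerClass(IdentityReducer.class);         // stand-in for the real reduce
    JobClient.runJob(job1);
  }
}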

---

it would help if the nature of the problem was described (size/schema of the input 
data, outputs desired) - rather than the solution that you are trying to 
implement.

-Original Message-
From: Prasan Ary [mailto:[EMAIL PROTECTED]
Sent: Wed 3/12/2008 10:38 AM
To: core-user@hadoop.apache.org
Subject: Re: reading input file only once for multiple map functions
 
Ted,
  Say I have two Mapper classes. The map function for both of these classes gets 
its input splits from a very large XML file.
   
  Right now I am creating two different jobs, Job_1 and Job_2, and both of 
these jobs have the same input path (to the XML file). However, since I am 
using a custom InputFormat to split the XML at record boundaries, all splits for 
Job_1 and Job_2 should be the same (equal to the number of records in the XML).
   
  So basically I am splitting the XML twice, and getting the same splits each time. 
It would be nice if I could split the XML once, and send those splits to the maps 
of Job_1 and Job_2. 

  ===
  
Ted Dunning <[EMAIL PROTECTED]> wrote:
  
Your request sounds very strange.

First off, different map objects are created on different machines (that IS
the point, after all) and thus any reading of data has to be done on at
least all of those machines. The map object is only created once per split,
though, so that might be a bit more what you are getting at.

Your basic requirement is a little odd, however, since you say that the
input to all of the maps is the same. What is the point of parallelism in
that case? Are your maps random in some sense? Are they really operating
on different parts of the single input? If so, shouldn't they just be
getting the part of the input that they will be working on?


Perhaps you should describe what you are trying to do at a higher level. It
really sounds like you have taken a bit of an odd turn somewhere in your
porting your algorithm to a parallel form.


On 3/12/08 9:24 AM, "Prasan Ary" wrote:

> I have a very large xml file as input and a couple of Map/Reduce functions.
> Input key/value pair to all of my map functions is the same.
> I was wondering if there is a way that I read the input xml file only once,
> then create key/value pair (also once) and give these k/v pairs as input to my
> map functions as opposed to having to read the xml and generate key/value once
> for each map functions?
> 
> thanks.
> 
> 
> 



   



Re: reading input file only once for multiple map functions

2008-03-12 Thread Ted Dunning

Ahhh...

There is an old saying for this.  I think you are pulling fly specks out of
pepper.

Unless your input format is very, very strange, doing the split again for
two jobs does, indeed, lead to some small inefficiency, but this cost should
be so low compared to other inefficiencies that you are wasting your time to
try to optimize that away.  Remember, you don't know where the maps will
execute so getting the split to the correct nodes will be a nightmare.

If your splitting is actually so expensive that you can measure it, then you
should consider changing formats.  This is analogous to having a single
gzipped input file.  Splitting such a file involves reading the file from
the beginning because gzip is a stream compression algorithm.  There are a
few proposals going around to optimize that by concatenating gzip files with
special marker files in between, but the real answer is to either not use
gzipped input files or to split the files before gzipping.


On 3/12/08 10:38 AM, "Prasan Ary" <[EMAIL PROTECTED]> wrote:

>   So basically I am splitting the XML twice, and getting same split each time.
> It would be nice if I could split the XML once, and send those splits to Map
> of Job_1 and Job_2.



Summit Move: More Seats, new Venue (Re: Hadoop summit on March 25th)

2008-03-12 Thread Jeremy Zawodny
Good news!

We've located a new venue for the Hadoop Summit (not far from Yahoo), have
capacity for another 75 people, and are still keeping the event free.
Thanks to Amazon Web Services for chipping in some food money. :-)

Sign up now and pass the word:

http://upcoming.yahoo.com/event/436226/
http://developer.yahoo.com/blogs/hadoop/2008/03/hadoop-summit-move.html

We're in the process of updating the summit site and notifying a few other
groups as well.

Thanks for all the interest.  We're looking forward to a day packed full of
Hadoop.

Jeremy

On 3/5/08, Ted Dunning <[EMAIL PROTECTED]> wrote:
>
>
> +1
>
> I have a colleague I would like to bring as well.
>
> Maybe we need to have an unconference in a park next door and take turns
> actually being in the hall for the talks.
>
>
>
> On 3/5/08 1:58 PM, "Bruce Williams" <[EMAIL PROTECTED]> wrote:
>
> > It seems like a bigger room in Sunnyvale could be found, doesn't it?
> > There are tech presentations of different sizes going on everyday in the
> > San Jose area.
> >
> > I registered, but have people who I work with who would love to attend.
> > There are many new people coming into Hadoop who would benefit from the
> > Summit. As it has turned out, with the people presently attending,  the
> > result may somewhat be "preaching to the choir"  and  informing the
> > already  well informed compared to what could happen. ( to the great
> > benefit of Hadoop )
> >
> > Anyway, I am looking forward to a great time! :-)
> >
> > Bruce Williams
> >
> >
> > Marc Boucher wrote:
> >> I'm on the waiting list as well and i'll be in the area anyway on a
> >> business trip so I'm wondering with so many people wanting to attend
> >> is there no way to get a bigger venue?
> >>
> >> Marc Boucher
> >> Hyperix
> >>
> >>
> >> On 3/5/08, mickey hsieh <[EMAIL PROTECTED]> wrote:
> >>
> >>> Hi Jeremy,
> >>>
> >>> It is full again. Current size is 140. The demand is really high, I am
> >>> desperately looking for opportunity to attend.
> >>>
> >>> Is there any chance to get up a couple more slots?
> >>>
> >>> Thanks,
> >>>
> >>> Mickey Hsieh
> >>> Fox Interactive Media
> >>>
> >>>
> >>>
> >>> On 2/28/08, Jeremy Zawodny <[EMAIL PROTECTED]> wrote:
> >>>
>  I've bumped up the numbers on Upcoming.org to allow more folks to
> attend.
>  The room might be a little crowded, but we'll make it work.
> 
>  We're also looking at webcasting in addition to posting video after
> the
>  summit.
> 
> 
> 
> 
> >>>
> http://developer.yahoo.com/blogs/hadoop/2008/02/hadoop_summit_nearly_full_we
> >>> ll.html
> >>>
>  http://upcoming.yahoo.com/event/436226/
>  http://developer.yahoo.com/hadoop/summit/
> 
>  Register soon if you haven't already.
> 
>  Thanks!
> 
>  Jeremy
> 
>  On 2/25/08, chris <[EMAIL PROTECTED]> wrote:
> 
> > I see the class is full with more than 50 watchers. Any chance the
> size
> > will
> > expand? If not, any date in mind for a second one?
> >
> >
> >>
> >>
> >>
> >
>
>


Re: reading input file only once for multiple map functions

2008-03-12 Thread Prasan Ary
Ted,
  Say I have two Mapper classes. The map function for both of these classes gets 
its input splits from a very large XML file.
   
  Right now I am creating two different jobs, Job_1 and Job_2, and both of 
these jobs have the same input path (to the XML file). However, since I am 
using a custom InputFormat to split the XML at record boundaries, all splits for 
Job_1 and Job_2 should be the same (equal to the number of records in the XML).
   
  So basically I am splitting the XML twice, and getting the same splits each time. 
It would be nice if I could split the XML once, and send those splits to the maps 
of Job_1 and Job_2. 

  ===
  
Ted Dunning <[EMAIL PROTECTED]> wrote:
  
Your request sounds very strange.

First off, different map objects are created on different machines (that IS
the point, after all) and thus any reading of data has to be done on at
least all of those machines. The map object is only created once per split,
though, so that might be a bit more what you are getting at.

Your basic requirement is a little odd, however, since you say that the
input to all of the maps is the same. What is the point of parallelism in
that case? Are your maps random in some sense? Are they really operating
on different parts of the single input? If so, shouldn't they just be
getting the part of the input that they will be working on?


Perhaps you should describe what you are trying to do at a higher level. It
really sounds like you have taken a bit of an odd turn somewhere in your
porting your algorithm to a parallel form.


On 3/12/08 9:24 AM, "Prasan Ary" wrote:

> I have a very large xml file as input and a couple of Map/Reduce functions.
> Input key/value pair to all of my map functions is the same.
> I was wondering if there is a way that I read the input xml file only once,
> then create key/value pair (also once) and give these k/v pairs as input to my
> map functions as opposed to having to read the xml and generate key/value once
> for each map functions?
> 
> thanks.
> 
> 
> 



   

Re: reading input file only once for multiple map functions

2008-03-12 Thread Ted Dunning

Your request sounds very strange.

First off, different map objects are created on different machines (that IS
the point, after all) and thus any reading of data has to be done on at
least all of those machines.  The map object is only created once per split,
though, so that might be a bit more what you are getting at.

Your basic requirement is a little odd, however, since you say that the
input to all of the maps is the same.  What is the point of parallelism in
that case?  Are your maps random in some sense?  Are they really operating
on different parts of the single input?  If so, shouldn't they just be
getting the part of the input that they will be working on?


Perhaps you should describe what you are trying to do at a higher level.  It
really sounds like you have taken a bit of an odd turn somewhere in
porting your algorithm to a parallel form.


On 3/12/08 9:24 AM, "Prasan Ary" <[EMAIL PROTECTED]> wrote:

> I have a very large xml file as input and a  couple of Map/Reduce functions.
> Input key/value pair to all of my map functions is the same.
>   I was wondering if there is a way that I read the input xml file only once,
> then create key/value pair (also once) and give these k/v pairs as input to my
> map functions as opposed to having to read the xml and generate key/value once
> for each map functions?
>
>   thanks.
>
> 
>



reading input file only once for multiple map functions

2008-03-12 Thread Prasan Ary
I have a very large xml file as input and a  couple of Map/Reduce functions. 
Input key/value pair to all of my map functions is the same. 
  I was wondering if there is a way that I read the input xml file only once, 
then create key/value pair (also once) and give these k/v pairs as input to my 
map functions as opposed to having to read the xml and generate key/value once 
for each map functions?
   
  thanks.
   

   

Re: performance

2008-03-12 Thread Ted Dunning


Identity reduce is nice because the result values can be sorted.


On 3/12/08 8:21 AM, "Jason Rennie" <[EMAIL PROTECTED]> wrote:

> Map could perform all the dot-products, which is the heavy lifting
> in what we're trying to do.  Might want to do a reduce after that, not
> sure...



Re: performance

2008-03-12 Thread Theodore Van Rooy
I have been using the HDFS, setting the block size to some appropriate level
and the replication as well.  When submitting the job keep in mind that each
block of the file in the HDFS will be passed into your mapping script as
Standard Input.  The datafile calls will be done locally if possible.  This
gives you a lot of options in regard to your replication and block size
settings.

Overall, it's very possible to optimize mapReduce for your specific job, you
just have to know how it does things.  Root around inside the file system
and watch it as it loads up the actual jobs.

Check out the streaming documentation for more ideas on how to optimize your
streaming experience.

On Wed, Mar 12, 2008 at 9:21 AM, Jason Rennie <[EMAIL PROTECTED]> wrote:

> Hmm... sounds promising :)  How do you distribute the data?  Do you use
> HDFS?  Pass the data directly to the individual nodes?  We really only
> need
> to do the map operation like you.  We need to distribute a matrix * vector
> operation, so we want rows of the matrix distributed across different
> nodes.  Map could perform all the dot-products, which is the heavy lifting
> in what we're trying to do.  Might want to do a reduce after that, not
> sure...
>
> Jason
>
> On Tue, Mar 11, 2008 at 6:36 PM, Theodore Van Rooy <[EMAIL PROTECTED]>
> wrote:
>
> > There is overhead in grabbing local data, moving it in and out of the
> > system
> > and especially if you are running a map reduce job (like wc) which ends
> up
> > mapping, sorting, copying, reducing, and writing again.
> >
> > One way I've found to get around the overhead is to use Hadoop streaming
> > and
> > perform map only tasks.  While they recommend doing it properly with
> >
> > hstream -mapper /bin/cat -reducer /bin/wc
> >
> > I tried:
> >
> > hstream -input "myinputfile" -output "myoutput" -mapper /bin/wc
> > -numReduceTasks 0
> >
> > (hstream is just an alias to do Hadoop streaming)
> >
> > And saw an immediate speedup on a 1 Gig and 10 Gig file.
> >
> > In the end you may have several output files with the wordcount for each
> > file, but adding those files together is pretty quick and easy.
> >
> > My recommendation is to explore how you can get away with either
> > Identity Reduces, Maps or no reduces at all.
> >
> > Theo
> >
>
> --
> Jason Rennie
> Head of Machine Learning Technologies, StyleFeeder
> http://www.stylefeeder.com/
> Samantha's blog & pictures: http://samanthalyrarennie.blogspot.com/
>



-- 
Theodore Van Rooy
http://greentheo.scroggles.com


Re: performance

2008-03-12 Thread Jason Rennie
Hmm... sounds promising :)  How do you distribute the data?  Do you use
HDFS?  Pass the data directly to the individual nodes?  We really only need
to do the map operation like you.  We need to distribute a matrix * vector
operation, so we want rows of the matrix distributed across different
nodes.  Map could perform all the dot-products, which is the heavy lifting
in what we're trying to do.  Might want to do a reduce after that, not
sure...

Jason

On Tue, Mar 11, 2008 at 6:36 PM, Theodore Van Rooy <[EMAIL PROTECTED]>
wrote:

> There is overhead in grabbing local data, moving it in and out of the
> system
> and especially if you are running a map reduce job (like wc) which ends up
> mapping, sorting, copying, reducing, and writing again.
>
> One way I've found to get around the overhead is to use Hadoop streaming
> and
> perform map only tasks.  While they recommend doing it properly with
>
> hstream -mapper /bin/cat -reducer /bin/wc
>
> I tried:
>
> hstream -input "myinputfile" -output "myoutput" -mapper /bin/wc
> -numReduceTasks 0
>
> (hstream is just an alias to do Hadoop streaming)
>
> And saw an immediate speedup on a 1 Gig and 10 Gig file.
>
> In the end you may have several output files with the wordcount for each
> file, but adding those files together is pretty quick and easy.
>
> My recommendation is to explore how you can get away with either
> Identity Reduces, Maps or no reduces at all.
>
> Theo
>

-- 
Jason Rennie
Head of Machine Learning Technologies, StyleFeeder
http://www.stylefeeder.com/
Samantha's blog & pictures: http://samanthalyrarennie.blogspot.com/


0.16.0 reduce NullPointEreceptions - is this a known issue, is there a fix?

2008-03-12 Thread Jason Venner
We see this in some of our reduce jobs, on one particular cluster. This 
is the only message in the log file for the failed reduce.


Exception in thread "main" java.lang.NullPointerException
at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2059)

 /** 
  * The main() for child processes. 
  */

2039  public static class Child {
   
2041public static void main(String[] args) throws Throwable {


2058  Task task = umbilical.getTask(taskid);
2059  JobConf job = new JobConf(task.getJobFile());
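      // NPE at line 2059 implies 'task' is null, i.e. umbilical.getTask(taskid)
      // returned nothing for this child (an observation, not a confirmed diagnosis).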
2060  TaskLog.cleanup(job.getInt("mapred.userlog.retain.hours", 24));
2061  task.setConf(job);



--
Jason Venner
Attributor - Publish with Confidence 
Attributor is hiring Hadoop Wranglers, contact if interested



Re: Hadoop streaming question

2008-03-12 Thread Andrey Pankov

Hi Amareshwari,

I have applied that patch and run my job successfully. I had to specify the 
jar file with the '-file' option, even though it is available via $CLASSPATH:


$HSTREAMING -mapper org.company.TestMapper -reducer "cat" -input /data 
-output /out4 -file /path/to/test_mapper.jar


Thanks a lot!


Amareshwari Sriramadasu wrote:

Hi Andrey,

I think that is classpath problem.
Can you try using patch at 
https://issues.apache.org/jira/browse/HADOOP-2622 and see you still have 
the problem?


Thanks
Amareshwari.

Andrey Pankov wrote:

Hi all,

I'm still new to Hadoop. I'd like to use Hadoop streaming in order to 
combine a mapper written as a Java class with a reducer that is a C++ program. 
Currently I'm at the beginning of this task and now I have trouble with the 
Java class. It looks something like



package org.company;
 ...
public class TestMapper extends MapReduceBase implements Mapper {
 ...
  public void map(WritableComparable key, Writable value,
OutputCollector output, Reporter reporter) throws IOException {
 ...


I created a jar file with my class and it is accessible via $CLASSPATH. 
I'm running the stream job using


$HSTREAMING -mapper org.company.TestMapper -reducer "wc -l" -input 
/data -output /out1


Hadoop cannot find TestMapper class. I'm using hadoop-0.16.0. The 
error is


===
2008-03-07 18:58:07,734 INFO org.apache.hadoop.metrics.jvm.JvmMetrics: 
Initializing JVM Metrics with processName=MAP, sessionId=
2008-03-07 18:58:07,833 INFO org.apache.hadoop.mapred.MapTask: 
numReduceTasks: 1
2008-03-07 18:58:07,910 WARN org.apache.hadoop.mapred.TaskTracker: 
Error running child
java.lang.RuntimeException: java.lang.RuntimeException: 
java.lang.ClassNotFoundException: org.company.TestMapper
at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:639)
at 
org.apache.hadoop.mapred.JobConf.getMapperClass(JobConf.java:728)
at 
org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:36)
at 
org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
at 
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:82) 


at org.apache.hadoop.mapred.MapTask.run(MapTask.java:204)
at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2071)
Caused by: java.lang.RuntimeException: 
java.lang.ClassNotFoundException: org.company.TestMapper
at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:607)
at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:631)

... 6 more
Caused by: java.lang.ClassNotFoundException: org.company.TestMapper
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:319)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:247)
at 
org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:587) 

at 
org.apache.hadoop.conf.Configuration.getClass(Configuration.java:605)

... 7 more
===

What is interesting to me: I had put some debugging println() calls into 
Hadoop streaming (StreamJob.java and StreamUtil.java). Streaming can see 
TestMapper at the job configuration stage (the StreamJob.setJobConf() routine) 
but not later. The following code creates a new instance of TestMapper and 
calls the toString() defined in TestMapper. It works.


if (mapCmd_ != null) {
  c = StreamUtil.goodClassOrNull(mapCmd_, defaultPackage);
  if (c != null) {
System.out.println("###");
try {
System.out.println(c.newInstance().toString());
} catch (Exception e) { }
System.out.println("###");
jobConf_.setMapperClass(c);
  } else {
...
  }
}


I tried to add the jar file with TestMapper using the option
"-file test_mapper.jar". The result is the same.

Could anybody advise me on this? Thanks in advance,

---
Andrey Pankov.






---
Andrey Pankov.