MultiFileInputFormat - Not enough mappers

2008-07-11 Thread Goel, Ankur
Hi Folks,
  I am using hadoop to process some temporal data which is
split into a lot of small files (~ 3 - 4 MB).
Using TextInputFormat resulted in too many mappers (1 per file), creating
a lot of overhead, so I switched to
MultiFileInputFormat (MultiFileWordCount.MyInputFormat), which resulted
in just 1 mapper.
 
I was hoping not to have to set the number of mappers myself, so that
hadoop automatically takes care of generating the right
number of map tasks.
 
Looks like when using MultiFileInputFormat one has to rely on the
application to specify the right number of mappers,
or am I missing something? Please advise.
 
Thanks
-Ankur


Re: Namenode Exceptions with S3

2008-07-11 Thread Tom White
On Thu, Jul 10, 2008 at 10:06 PM, Lincoln Ritter
<[EMAIL PROTECTED]> wrote:
> Thank you, Tom.
>
> Forgive me for being dense, but I don't understand your reply:
>

Sorry! I'll try to explain it better (see below).

>
> Do you mean that it is possible to use the Hadoop daemons with S3 but
> the default filesystem must be HDFS?

The HDFS daemons use the value of "fs.default.name" to set the
namenode host and port, so if you set it to an s3 URI, you can't run
the HDFS daemons. In that case you would use the start-mapred.sh
script instead of start-all.sh.

> If that is the case, can I
> specify the output filesystem on a per-job basis and can that be an S3
> FS?

Yes, that's exactly how you do it.
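
For example, a job can point just its output at S3 while everything else
stays on HDFS. A minimal sketch (the bucket name is made up, and the AWS
credentials are assumed to be set via fs.s3.awsAccessKeyId /
fs.s3.awsSecretAccessKey in the configuration):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class S3OutputJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(S3OutputJob.class);
    conf.setJobName("s3-output-example");
    // Input is read from the default FS (HDFS here); only the output
    // of this particular job goes to S3.
    FileInputFormat.setInputPaths(conf, new Path("/user/hadoop/input"));
    FileOutputFormat.setOutputPath(conf, new Path("s3://my-bucket/output"));
    JobClient.runJob(conf);
  }
}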

>
> Also, is there a particular reason to not allow S3 as the default FS?

You can allow S3 as the default FS; it's just that then you can't run
HDFS at all. You would only do this if you don't want to
use HDFS, for example, if you were running a MapReduce job
which read from S3 and wrote to S3.

It might be less confusing if the HDFS daemons didn't use
fs.default.name to define the namenode host and port. Just like
mapred.job.tracker defines the host and port for the jobtracker,
dfs.namenode.address (or similar) could define the namenode. Would
this be a good change to make?

Tom


Re: MultiFileInputFormat - Not enough mappers

2008-07-11 Thread Enis Soztutar
MultiFileSplit currently does not support automatic map task count 
computation. You can manually
set the number of maps via jobConf#setNumMapTasks() or via command line 
arg -D mapred.map.tasks=
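
For example (a sketch; MyJob and the value 100 are placeholders):

    JobConf conf = new JobConf(MyJob.class);
    conf.setNumMapTasks(100);   // hint the framework to use about 100 map tasks

The -D form above sets the same mapred.map.tasks property from the command line.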



Goel, Ankur wrote:

Hi Folks,
  I am using hadoop to process some temporal data which is
split into a lot of small files (~ 3 - 4 MB).
Using TextInputFormat resulted in too many mappers (1 per file), creating
a lot of overhead, so I switched to
MultiFileInputFormat (MultiFileWordCount.MyInputFormat), which resulted
in just 1 mapper.

I was hoping not to have to set the number of mappers myself, so that
hadoop automatically takes care of generating the right
number of map tasks.

Looks like when using MultiFileInputFormat one has to rely on the
application to specify the right number of mappers,
or am I missing something? Please advise.

Thanks
-Ankur


RE: MultiFileInputFormat - Not enough mappers

2008-07-11 Thread Goel, Ankur
In this case I have to compute the number of map tasks in the
application - (totalSize / blockSize), which is what I am doing as a
work-around.
I think this should be the default behaviour in MultiFileInputFormat.
Should a JIRA be opened for the same ?
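
For reference, the work-around amounts to something like this (a sketch of
the idea, not the actual code; names are made up and error handling is
omitted):

import java.io.IOException;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.JobConf;

public class MapCountHelper {
  // Hint the framework with roughly (total input size / block size) map tasks.
  public static void setMapCount(JobConf conf) throws IOException {
    FileSystem fs = FileSystem.get(conf);
    long totalSize = 0;
    for (Path dir : FileInputFormat.getInputPaths(conf)) {
      for (FileStatus file : fs.listStatus(dir)) {
        totalSize += file.getLen();
      }
    }
    long numMaps = Math.max(1, totalSize / fs.getDefaultBlockSize());
    conf.setNumMapTasks((int) numMaps);
  }
}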

-Ankur


-Original Message-
From: Enis Soztutar [mailto:[EMAIL PROTECTED] 
Sent: Friday, July 11, 2008 7:21 PM
To: core-user@hadoop.apache.org
Subject: Re: MultiFileInputFormat - Not enough mappers

MultiFileSplit currently does not support automatic map task count
computation. You can manually set the number of maps via
jobConf#setNumMapTasks() or via command line arg -D
mapred.map.tasks=


Goel, Ankur wrote:
> Hi Folks,
>   I am using hadoop to process some temporal data which is
> split into a lot of small files (~ 3 - 4 MB). Using TextInputFormat
> resulted in too many mappers (1 per file), creating a lot of overhead,
> so I switched to MultiFileInputFormat
> (MultiFileWordCount.MyInputFormat), which resulted in just 1 mapper.
>
> I was hoping not to have to set the number of mappers myself, so that
> hadoop automatically takes care of generating the right number of map tasks.
>
> Looks like when using MultiFileInputFormat one has to rely on the
> application to specify the right number of mappers, or am I missing
> something? Please advise.
>
> Thanks
> -Ankur


Maven

2008-07-11 Thread Larry Compton
Are the Hadoop JAR files housed in a Maven repository somewhere? If so,
please post the repository URL.

TIA

-- 
Larry Compton


Re: MultiFileInputFormat - Not enough mappers

2008-07-11 Thread Enis Soztutar
Yes, please open a jira for this. We should ensure that
avgLengthPerSplit in MultiFileInputFormat does not exceed the default file
block size. Note, however, that unlike FileInputFormat, the files in a split
will each come from different blocks.



Goel, Ankur wrote:

In this case I have to compute the number of map tasks in the
application - (totalSize / blockSize), which is what I am doing as a
work-around.
I think this should be the default behaviour in MultiFileInputFormat.
Should a JIRA be opened for the same ?

-Ankur


-Original Message-
From: Enis Soztutar [mailto:[EMAIL PROTECTED]
Sent: Friday, July 11, 2008 7:21 PM
To: core-user@hadoop.apache.org
Subject: Re: MultiFileInputFormat - Not enough mappers

MultiFileSplit currently does not support automatic map task count
computation. You can manually set the number of maps via
jobConf#setNumMapTasks() or via command line arg -D
mapred.map.tasks=

Goel, Ankur wrote:

> Hi Folks,
>   I am using hadoop to process some temporal data which is
> split into a lot of small files (~ 3 - 4 MB). Using TextInputFormat
> resulted in too many mappers (1 per file), creating a lot of overhead,
> so I switched to MultiFileInputFormat
> (MultiFileWordCount.MyInputFormat), which resulted in just 1 mapper.
>
> I was hoping not to have to set the number of mappers myself, so that
> hadoop automatically takes care of generating the right number of map tasks.
>
> Looks like when using MultiFileInputFormat one has to rely on the
> application to specify the right number of mappers, or am I missing
> something? Please advise.
>
> Thanks
> -Ankur


Re: Maven

2008-07-11 Thread tim robertson
No there isn't unfortunately...

I use this, so I can quickly change versions:


<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop</artifactId>
  <name>Hadoop (${hadoop.version})</name>
  <packaging>jar</packaging>
  <version>${hadoop.version}</version>
  <build>
    <plugins>
      <plugin>
        <artifactId>maven-install-plugin</artifactId>
        <executions>
          <execution>
            <id>install-hadoop</id>
            <phase>install</phase>
            <goals>
              <goal>install-file</goal>
            </goals>
            <configuration>
              <file>hadoop-jars/hadoop-${hadoop.version}-core.jar</file>
              <groupId>org.apache.hadoop</groupId>
              <artifactId>hadoop-core</artifactId>
              <packaging>jar</packaging>
              <version>${hadoop.version}</version>
              <generatePom>true</generatePom>
              <createChecksum>true</createChecksum>
            </configuration>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>

On Fri, Jul 11, 2008 at 4:11 PM, Larry Compton <[EMAIL PROTECTED]>
wrote:

> Are the Hadoop JAR files housed in a Maven repository somewhere? If so,
> please post the repository URL.
>
> TIA
>
> --
> Larry Compton
>


RE: MapReduce with multi-languages

2008-07-11 Thread Koji Noguchi
Hi.

Asked Runping about this.
Here's his reply.

Koji 


=
On 7/10/08 11:16 PM, "Koji Noguchi" <[EMAIL PROTECTED]> wrote:
> > Runping,
> > 
> > Can they use Buffer class?
> > 
> > Koji

Yes, use Buffer or ByteWritable for the key/value classes.
But the critical point is to implement their own record reader/input
format classes.
Runping

=
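
One lightweight variant of that idea is to decode the raw line bytes with an
explicit charset inside the map function rather than in a custom record
reader. A minimal sketch (not Runping's code; it assumes an encoding such as
ISO-8859-1 whose '\n' bytes are unambiguous, so TextInputFormat's line
splitting still works):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class Latin1LineMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  public void map(LongWritable offset, Text value,
                  OutputCollector<Text, LongWritable> output, Reporter reporter)
      throws IOException {
    // Text only holds the raw bytes of the line; decode them with the
    // real input charset instead of relying on Text.toString() (UTF-8).
    String line = new String(value.getBytes(), 0, value.getLength(), "ISO-8859-1");
    output.collect(new Text(line), new LongWritable(1));
  }
}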

-Original Message-
From: NOMURA Yoshihide [mailto:[EMAIL PROTECTED] 
Sent: Thursday, July 10, 2008 10:36 PM
To: core-user@hadoop.apache.org
Subject: Re: MapReduce with multi-languages

Mr. Taeho Kang,

I need to analyze text in different character encodings too.
And I have suggested supporting an encoding configuration in TextInputFormat.

https://issues.apache.org/jira/browse/HADOOP-3481

But at present I think you should convert the text file encoding to
UTF-8.

Regards,

Taeho Kang:
> Dear Hadoop User Group,
> 
> What are elegant ways to do mapred jobs on text-based data encoded
with
> something other than UTF-8?
> 
> It looks like Hadoop assumes the text data is always in UTF-8 and
handles
> data that way - encoding with UTF-8 and decoding with UTF-8.
> And whenever the data is not in UTF-8 encoded format, problems arise.
> 
> Here is what I'm thinking of to clear the situation.. correct and
advise me
> if you see my approaches look bad!
> 
> (1) Re-encode the original data with UTF-8?
> (2) Replace the part of source code where UTF-8 encoder and decoder
are
> used?
> 
> Or has anyone of you guys had trouble with running map-red job on data
with
> multi-languages?
> 
> Any suggestions/advices are welcome and appreciated!
> 
> Regards,
> 
> Taeho
> 

-- 
NOMURA Yoshihide:
 Software Innovation Laboratory, Fujitsu Labs. Ltd., Japan
 Tel: 044-754-2675 (Ext: 7106-6916)
 Fax: 044-754-2570 (Ext: 7108-7060)
 E-Mail: [EMAIL PROTECTED]



Re: java.io.IOException: All datanodes are bad. Aborting...

2008-07-11 Thread Shengkai Zhu
Did you do the cleanup on all the datanodes?
rm -Rf /path/to/my/hadoop/dfs/data


On 6/20/08, novice user <[EMAIL PROTECTED]> wrote:
>
>
> Hi Mori Bellamy,
> I did this twice.  and still the same problem is persisting. I don't know
> how to solve this issue. If any one know the answer, please let me know.
>
> Thanks
>
> Mori Bellamy wrote:
> >
> > That's bizarre. I'm not sure why your DFS would have magically gotten
> > full. Whenever hadoop gives me trouble, i try the following sequence
> > of commands
> >
> > stop-all.sh
> > rm -Rf /path/to/my/hadoop/dfs/data
> > hadoop namenode -format
> > start-all.sh
> >
> > maybe you would get some luck if you ran that on all of the machines?
> > (of course, don't run it if you don't want to lose all of that "data")
> > On Jun 19, 2008, at 4:32 AM, novice user wrote:
> >
> >>
> >> Hi Every one,
> >> I am running a simple map-red application similar to k-means. But,
> >> when I
> >> ran it in on single machine, it went fine with out any issues. But,
> >> when I
> >> ran the same on a hadoop cluster of 9 machines. It fails saying
> >> java.io.IOException: All datanodes are bad. Aborting...
> >>
> >> Here is more explanation about the problem:
> >> I tried to upgrade my hadoop cluster to hadoop-17. During this
> >> process, I
> >> made a mistake of not installing hadoop on all machines. So, the
> >> upgrade
> >> failed. Nor I was able to roll back.  So, I re-formatted the name node
> >> afresh. and then hadoop installation was successful.
> >>
> >> Later, when I ran my map-reduce job, it ran successfully,but  the
> >> same job
> >> with zero reduce tasks is failing with the error as:
> >> java.io.IOException: All datanodes  are bad. Aborting...
> >>
> >> When I looked into the data nodes, I figured out that file system is
> >> 100%
> >> full with different directories of name "subdir" in
> >> hadoop-username/dfs/data/current directory. I am wondering where I
> >> went
> >> wrong.
> >> Can some one please help me on this?
> >>
> >> The same job went fine on a single machine with same amount of input
> >> data.
> >>
> >> Thanks
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >>
> http://www.nabble.com/java.io.IOException%3A-All-datanodes-are-bad.-Aborting...-tp18006296p18006296.html
> >> Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >>
> >
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/java.io.IOException%3A-All-datanodes-are-bad.-Aborting...-tp18006296p18022330.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


-- 

朱盛凯

Jash Zhu

复旦大学软件学院

Software School, Fudan University


Re: dfs copyFromLocal/put fails

2008-07-11 Thread Shengkai Zhu
It is probably a firewall problem, or you need entries in /etc/hosts.


On 6/18/08, Alexander Arimond <[EMAIL PROTECTED]> wrote:
>
> hi,
>
> I'm new to hadoop and I'm just testing it at the moment.
> I set up a cluster with 2 nodes and it seems like they are running
> normally; the log files of the namenode and the datanodes don't show errors.
> The firewall should be set up right.
> But when I try to upload a file to the dfs I get the following message:
>
> [EMAIL PROTECTED]:~/hadoop$ bin/hadoop dfs -put file.txt file.txt
> 08/06/12 14:44:19 INFO dfs.DFSClient: Exception in
> createBlockOutputStream java.net.ConnectException: Connection refused
> 08/06/12 14:44:19 INFO dfs.DFSClient: Abandoning block
> blk_5837981856060447217
> 08/06/12 14:44:28 INFO dfs.DFSClient: Exception in
> createBlockOutputStream java.net.ConnectException: Connection refused
> 08/06/12 14:44:28 INFO dfs.DFSClient: Abandoning block
> blk_2573458924311304120
> 08/06/12 14:44:37 INFO dfs.DFSClient: Exception in
> createBlockOutputStream java.net.ConnectException: Connection refused
> 08/06/12 14:44:37 INFO dfs.DFSClient: Abandoning block
> blk_1207459436305221119
> 08/06/12 14:44:46 INFO dfs.DFSClient: Exception in
> createBlockOutputStream java.net.ConnectException: Connection refused
> 08/06/12 14:44:46 INFO dfs.DFSClient: Abandoning block
> blk_-8263828216969765661
> 08/06/12 14:44:52 WARN dfs.DFSClient: DataStreamer Exception:
> java.io.IOException: Unable to create new block.
> 08/06/12 14:44:52 WARN dfs.DFSClient: Error Recovery for block
> blk_-8263828216969765661 bad datanode[0]
>
>
> I don't know what that means and didn't find anything about it.
> I hope somebody can help with that.
>
> Thank you!
>
>


-- 

朱盛凯

Jash Zhu

复旦大学软件学院

Software School, Fudan University


Re: Outputting to different paths from the same input file

2008-07-11 Thread Jason Venner
We open side-effect files in our map and reduce jobs to 'tee' off
additional data streams.

We open them in the configure() method and close them in the close() method.
The configure() method provides access to the JobConf.

We create our files relative to the value of conf.get("mapred.output.dir"),
in the map/reduce object instances.

The files end up in the conf.getOutputPath() directory. After the job
finishes, we move them out to another location, using a filename-based
filter that relies on knowing the shape of the file names.
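
A rough sketch of that pattern (invented names, not the actual code; it
assumes the task-unique "mapred.task.id" property for the file name, and
error handling is simplified):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class TeeMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private FSDataOutputStream side;  // the extra "tee" stream

  public void configure(JobConf conf) {
    try {
      // Create the side file under the job output dir, with a name we can
      // recognize (and move) after the job finishes.
      Path outDir = new Path(conf.get("mapred.output.dir"));
      Path sideFile = new Path(outDir, "budget-" + conf.get("mapred.task.id"));
      side = FileSystem.get(conf).create(sideFile);
    } catch (IOException e) {
      throw new RuntimeException("could not create side-effect file", e);
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    output.collect(value, new Text(""));        // normal job output
    side.writeBytes(value.toString() + "\n");   // side-effect output
  }

  public void close() throws IOException {
    side.close();
  }
}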

schnitzi wrote:

Okay, I've found some similar discussions in the archive, but I'm still not
clear on this.  I'm new to Hadoop, so 'scuse my ignorance...

I'm writing a Hadoop tool to read in an event log, and I want to produce two
separate outputs as a result -- one for statistics, and one for budgeting. 
Because the event log I'm reading in can be massive, I would like to only
process it once.  But the outputs will each be read by further M/R
processes, and will be significantly different from each other.

I've looked at MultipleOutputFormat, but it seems to just want to partition
data that looks basically the same into this file or that.

What's the proper way to do this?  Ideally, whatever solution I implement
should be atomic, in that if any one of the writes fails, neither output
will be produced.


AdTHANKSvance,
Mark
  

--
Jason Venner
Attributor - Program the Web 
Attributor is hiring Hadoop Wranglers and coding wizards, contact if 
interested


Re: Namenode Exceptions with S3

2008-07-11 Thread Lincoln Ritter
Thanks Tom!

Your explanation makes things a lot clearer.  I think that changing
the 'fs.default.name' to something like 'dfs.namenode.address' would
certainly be less confusing since it would clarify the purpose of
these values.

-lincoln

--
lincolnritter.com



On Fri, Jul 11, 2008 at 4:21 AM, Tom White <[EMAIL PROTECTED]> wrote:
> On Thu, Jul 10, 2008 at 10:06 PM, Lincoln Ritter
> <[EMAIL PROTECTED]> wrote:
>> Thank you, Tom.
>>
>> Forgive me for being dense, but I don't understand your reply:
>>
>
> Sorry! I'll try to explain it better (see below).
>
>>
>> Do you mean that it is possible to use the Hadoop daemons with S3 but
>> the default filesystem must be HDFS?
>
> The HDFS daemons use the value of "fs.default.name" to set the
> namenode host and port, so if you set it to a s3 URI, you can't run
> the HDFS daemons. So in this case you would use the start-mapred.sh
> script instead of start-all.sh.
>
>> If that is the case, can I
>> specify the output filesystem on a per-job basis and can that be an S3
>> FS?
>
> Yes, that's exactly how you do it.
>
>>
>> Also, is there a particular reason to not allow S3 as the default FS?
>
> You can allow S3 as the default FS, it's just that then you can't run
> HDFS at all in this case. You would only do this if you don't want to
> use HDFS at all, for example, if you were running a MapReduce job
> which read from S3 and wrote to S3.
>
> It might be less confusing if the HDFS daemons didn't use
> fs.default.name to define the namenode host and port. Just like
> mapred.job.tracker defines the host and port for the jobtracker,
> dfs.namenode.address (or similar) could define the namenode. Would
> this be a good change to make?
>
> Tom
>


JobClient question

2008-07-11 Thread Larry Compton
I'm coming up to speed on the Hadoop APIs. I need to be able to invoke a job
from within a Java application (as opposed to running from the command-line
"hadoop" executable). The JobConf and JobClient appear to support this and
I've written a test program to configure and run a job. However, the job
doesn't appear to be submitted to the JobTracker. Here's a code excerpt from
my client...

String rdfInputPath = args[0];
String outputPath = args[1];
String uriInputPath = args[2];
String jarPath = args[3];

JobConf conf = new JobConf(MaterializeMap.class);
conf.setJobName("materialize");

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);

conf.setMapperClass(MaterializeMapper.class);
conf.setCombinerClass(MaterializeReducer.class);
conf.setReducerClass(MaterializeReducer.class);
conf.setJar(jarPath);

DistributedCache.addCacheFile(new Path(uriInputPath).toUri(), conf);

FileInputFormat.setInputPaths(conf, new Path(rdfInputPath));
FileOutputFormat.setOutputPath(conf, new Path(outputPath));

conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(TextOutputFormat.class);

JobClient.runJob(conf);

It seems like I should be providing a URL to the JobTracker somewhere, but I
can't figure out where to provide the information.

-- 
Larry Compton


Re: JobClient question

2008-07-11 Thread Shengkai Zhu
You should provide JobTracker address and port through configuration.


On 7/11/08, Larry Compton <[EMAIL PROTECTED]> wrote:
>
> I'm coming up to speed on the Hadoop APIs. I need to be able to invoke a
> job
> from within a Java application (as opposed to running from the command-line
> "hadoop" executable). The JobConf and JobClient appear to support this and
> I've written a test program to configure and run a job. However, the job
> doesn't appear to be submitted to the JobTracker. Here's a code excerpt
> from
> my client...
>
>String rdfInputPath = args[0];
>String outputPath = args[1];
>String uriInputPath = args[2];
>String jarPath = args[3];
>
>JobConf conf = new JobConf(MaterializeMap.class);
>conf.setJobName("materialize");
>
>conf.setOutputKeyClass(Text.class);
>conf.setOutputValueClass(Text.class);
>
>conf.setMapperClass(MaterializeMapper.class);
>conf.setCombinerClass(MaterializeReducer.class);
>conf.setReducerClass(MaterializeReducer.class);
>conf.setJar(jarPath);
>
>DistributedCache.addCacheFile(new Path(uriInputPath).toUri(), conf);
>
>FileInputFormat.setInputPaths(conf, new Path(rdfInputPath));
>FileOutputFormat.setOutputPath(conf, new Path(outputPath));
>
>conf.setInputFormat(TextInputFormat.class);
>conf.setOutputFormat(TextOutputFormat.class);
>
>JobClient.runJob(conf);
>
> It seems like I should be providing a URL to the JobTracker somewhere, but
> I
> can't figure out where to provide the information.
>
> --
> Larry Compton
>



-- 

朱盛凯

Jash Zhu

复旦大学软件学院

Software School, Fudan University


Re: JobClient question

2008-07-11 Thread Larry Compton
Thanks. Is this the correct syntax?

conf.set("mapred.job.tracker", "localhost:54311");

It does appear to be communicating with the JobTracker now, but I get the
following stack trace. Is there anything else that needs to be done to
configure the job?

Exception in thread "main" org.apache.hadoop.ipc.RemoteException:
java.io.IOException:
/home/larry/pkg/hadoop/hdfs/mapred/system/job_20080714_0001/job.xml: No
such file or directory
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:215)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:149)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1155)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1136)
at org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:175)
at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:1755)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)

at org.apache.hadoop.ipc.Client.call(Client.java:557)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
at $Proxy0.submitJob(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy0.submitJob(Unknown Source)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:758)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
at jobclient.MaterializeMain.main(MaterializeMain.java:44)


On Fri, Jul 11, 2008 at 11:41 AM, Shengkai Zhu <[EMAIL PROTECTED]> wrote:

> You should provide JobTracker address and port through configuration.
>
>
> On 7/11/08, Larry Compton <[EMAIL PROTECTED]> wrote:
> >
> > I'm coming up to speed on the Hadoop APIs. I need to be able to invoke a
> > job
> > from within a Java application (as opposed to running from the
> command-line
> > "hadoop" executable). The JobConf and JobClient appear to support this
> and
> > I've written a test program to configure and run a job. However, the job
> > doesn't appear to be submitted to the JobTracker. Here's a code excerpt
> > from
> > my client...
> >
> >String rdfInputPath = args[0];
> >String outputPath = args[1];
> >String uriInputPath = args[2];
> >String jarPath = args[3];
> >
> >JobConf conf = new JobConf(MaterializeMap.class);
> >conf.setJobName("materialize");
> >
> >conf.setOutputKeyClass(Text.class);
> >conf.setOutputValueClass(Text.class);
> >
> >conf.setMapperClass(MaterializeMapper.class);
> >conf.setCombinerClass(MaterializeReducer.class);
> >conf.setReducerClass(MaterializeReducer.class);
> >conf.setJar(jarPath);
> >
> >DistributedCache.addCacheFile(new Path(uriInputPath).toUri(),
> conf);
> >
> >FileInputFormat.setInputPaths(conf, new Path(rdfInputPath));
> >FileOutputFormat.setOutputPath(conf, new Path(outputPath));
> >
> >conf.setInputFormat(TextInputFormat.class);
> >conf.setOutputFormat(TextOutputFormat.class);
> >
> >JobClient.runJob(conf);
> >
> > It seems like I should be providing a URL to the JobTracker somewhere,
> but
> > I
> > can't figure out where to provide the information.
> >
> > --
> > Larry Compton
> >
>
>
>
> --
>
> 朱盛凯
>
> Jash Zhu
>
> 复旦大学软件学院
>
> Software School, Fudan University
>



-- 
Larry Compton


Re: JobClient question

2008-07-11 Thread Shengkai Zhu
Yes, you have already invoked the submitJob method through RPC.

But your client has no configuration describing your hadoop system dir, which
defaults to "/tmp/hadoop/mapred/system".
So your client program saved job.xml under that default dir.

The JobTracker's configuration, however, makes the system
dir "/home/larry/pkg/hadoop/hdfs/mapred/system/".
So the JobTracker can't find the job.xml under it.
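
One way to make the two agree (a sketch; it assumes the property name
"mapred.system.dir", and pointing the client at the cluster's hadoop-site.xml
achieves the same thing):

    conf.set("mapred.system.dir", "/home/larry/pkg/hadoop/hdfs/mapred/system");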


On 7/11/08, Larry Compton <[EMAIL PROTECTED]> wrote:
>
> Thanks. Is this the correct syntax?
>
> conf.set("mapred.job.tracker", "localhost:54311");
>
> It does appear to be communicating with the JobTracker now, but I get the
> following stack trace. Is there anything else that needs to be done to
> configure the job?
>
> Exception in thread "main" org.apache.hadoop.ipc.RemoteException:
> java.io.IOException:
> /home/larry/pkg/hadoop/hdfs/mapred/system/job_20080714_0001/job.xml: No
> such file or directory
>at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:215)
>at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:149)
>at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1155)
>at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1136)
>at org.apache.hadoop.mapred.JobInProgress.(JobInProgress.java:175)
>at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:1755)
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>at java.lang.reflect.Method.invoke(Method.java:585)
>at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
>at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
>
>at org.apache.hadoop.ipc.Client.call(Client.java:557)
>at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
>at $Proxy0.submitJob(Unknown Source)
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>at
>
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>at java.lang.reflect.Method.invoke(Method.java:585)
>at
>
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>at
>
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>at $Proxy0.submitJob(Unknown Source)
>at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:758)
>at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
>at jobclient.MaterializeMain.main(MaterializeMain.java:44)
>
>
> On Fri, Jul 11, 2008 at 11:41 AM, Shengkai Zhu <[EMAIL PROTECTED]>
> wrote:
>
> > You should provide JobTracker address and port through configuration.
> >
> >
> > On 7/11/08, Larry Compton <[EMAIL PROTECTED]> wrote:
> > >
> > > I'm coming up to speed on the Hadoop APIs. I need to be able to invoke
> a
> > > job
> > > from within a Java application (as opposed to running from the
> > command-line
> > > "hadoop" executable). The JobConf and JobClient appear to support this
> > and
> > > I've written a test program to configure and run a job. However, the
> job
> > > doesn't appear to be submitted to the JobTracker. Here's a code excerpt
> > > from
> > > my client...
> > >
> > >String rdfInputPath = args[0];
> > >String outputPath = args[1];
> > >String uriInputPath = args[2];
> > >String jarPath = args[3];
> > >
> > >JobConf conf = new JobConf(MaterializeMap.class);
> > >conf.setJobName("materialize");
> > >
> > >conf.setOutputKeyClass(Text.class);
> > >conf.setOutputValueClass(Text.class);
> > >
> > >conf.setMapperClass(MaterializeMapper.class);
> > >conf.setCombinerClass(MaterializeReducer.class);
> > >conf.setReducerClass(MaterializeReducer.class);
> > >conf.setJar(jarPath);
> > >
> > >DistributedCache.addCacheFile(new Path(uriInputPath).toUri(),
> > conf);
> > >
> > >FileInputFormat.setInputPaths(conf, new Path(rdfInputPath));
> > >FileOutputFormat.setOutputPath(conf, new Path(outputPath));
> > >
> > >conf.setInputFormat(TextInputFormat.class);
> > >conf.setOutputFormat(TextOutputFormat.class);
> > >
> > >JobClient.runJob(conf);
> > >
> > > It seems like I should be providing a URL to the JobTracker somewhere,
> > but
> > > I
> > > can't figure out where to provide the information.
> > >
> > > --
> > > Larry Compton
> > >
> >
> >
> >
> > --
> >
> > 朱盛凯
> >
> > Jash Zhu
> >
> > 复旦大学软件学院
> >
> > Software School, Fudan University
> >
>
>
>
> --
> Larry Compton
>



-- 

朱盛凯

Jash Zhu

复旦大学软件学院

Software School, Fudan University


Re: JobClient question

2008-07-11 Thread Matt Kent
The best way to configure all that stuff is in hadoop-site.xml, which
lives in the hadoop conf directory. Make sure that directory is on the
classpath of your application.

On Fri, 2008-07-11 at 11:55 -0400, Larry Compton wrote:
> Thanks. Is this the correct syntax?
> 
> conf.set("mapred.job.tracker", "localhost:54311");
> 
> It does appear to be communicating with the JobTracker now, but I get the
> following stack trace. Is there anything else that needs to be done to
> configure the job?
> 
> Exception in thread "main" org.apache.hadoop.ipc.RemoteException:
> java.io.IOException:
> /home/larry/pkg/hadoop/hdfs/mapred/system/job_20080714_0001/job.xml: No
> such file or directory
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:215)
> at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:149)
> at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1155)
> at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1136)
> at org.apache.hadoop.mapred.JobInProgress.(JobInProgress.java:175)
> at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:1755)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:585)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
> 
> at org.apache.hadoop.ipc.Client.call(Client.java:557)
> at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
> at $Proxy0.submitJob(Unknown Source)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:585)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
> at
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
> at $Proxy0.submitJob(Unknown Source)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:758)
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
> at jobclient.MaterializeMain.main(MaterializeMain.java:44)
> 
> 
> On Fri, Jul 11, 2008 at 11:41 AM, Shengkai Zhu <[EMAIL PROTECTED]> wrote:
> 
> > You should provide JobTracker address and port through configuration.
> >
> >
> > On 7/11/08, Larry Compton <[EMAIL PROTECTED]> wrote:
> > >
> > > I'm coming up to speed on the Hadoop APIs. I need to be able to invoke a
> > > job
> > > from within a Java application (as opposed to running from the
> > command-line
> > > "hadoop" executable). The JobConf and JobClient appear to support this
> > and
> > > I've written a test program to configure and run a job. However, the job
> > > doesn't appear to be submitted to the JobTracker. Here's a code excerpt
> > > from
> > > my client...
> > >
> > >String rdfInputPath = args[0];
> > >String outputPath = args[1];
> > >String uriInputPath = args[2];
> > >String jarPath = args[3];
> > >
> > >JobConf conf = new JobConf(MaterializeMap.class);
> > >conf.setJobName("materialize");
> > >
> > >conf.setOutputKeyClass(Text.class);
> > >conf.setOutputValueClass(Text.class);
> > >
> > >conf.setMapperClass(MaterializeMapper.class);
> > >conf.setCombinerClass(MaterializeReducer.class);
> > >conf.setReducerClass(MaterializeReducer.class);
> > >conf.setJar(jarPath);
> > >
> > >DistributedCache.addCacheFile(new Path(uriInputPath).toUri(),
> > conf);
> > >
> > >FileInputFormat.setInputPaths(conf, new Path(rdfInputPath));
> > >FileOutputFormat.setOutputPath(conf, new Path(outputPath));
> > >
> > >conf.setInputFormat(TextInputFormat.class);
> > >conf.setOutputFormat(TextOutputFormat.class);
> > >
> > >JobClient.runJob(conf);
> > >
> > > It seems like I should be providing a URL to the JobTracker somewhere,
> > but
> > > I
> > > can't figure out where to provide the information.
> > >
> > > --
> > > Larry Compton
> > >
> >
> >
> >
> > --
> >
> > 朱盛凯
> >
> > Jash Zhu
> >
> > 复旦大学软件学院
> >
> > Software School, Fudan University
> >
> 
> 
> 



Re: Compiling Word Count in C++ : Hadoop Pipes

2008-07-11 Thread Sandy
hadoop-0.17.0 should work. I took a closer look at your error message. It
seems you need to change permissions on some of your files.

Try:

 chmod 755 /home/jobs/hadoop-0.17.0/src/examples/pipes/configure


At this point you probably will get another "build failed" message, because
you need to do the same thing on another file (I don't remember it off the
top of my head). But you can find it out by inspecting this part of the
error message:

BUILD FAILED
/home/jobs/hadoop-0.17.0/build.xml:1040: Execute failed:
java.io.IOException: Cannot run program
"/home/jobs/hadoop-0.17.0/src/examples/pipes/configure" (in directory
"/home/jobs/hadoop-0.17.0/build/c++-build/Linux-i386-32/examples/pipes"):
java.io.IOException: error=13, Permission denied

This means that the file that you can't run is:
"/home/jobs/hadoop-0.17.0/src/examples/pipes/configure"

due to permission issues. A chmod 755 will fix this. You'll need to do this
for any "permission denied" message that you get associated with this.

Hope this helps!

-SM
On Thu, Jul 10, 2008 at 10:03 PM, chaitanya krishna <
[EMAIL PROTECTED]> wrote:

> I'm using hadoop-0.17.0. Should I be using a more latest version?
> Please tell me which version did you use?
>
> On Fri, Jul 11, 2008 at 2:35 AM, Sandy <[EMAIL PROTECTED]> wrote:
>
> > One last thing:
> >
> > If that doesn't work, try following the instructions on the ubuntu
> setting
> > up hadoop tutorial. Even if you aren't running ubuntu, I think it may be
> > possible to use those instructions to set up things properly. That's what
> I
> > eventually did.
> >
> > Link is here:
> >
> >
> http://wiki.apache.org/hadoop/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)
> <
> http://wiki.apache.org/hadoop/Running_Hadoop_On_Ubuntu_Linux_%28Single-Node_Cluster%29
> >
> >
> > -SM
> >
> > On Thu, Jul 10, 2008 at 4:02 PM, Sandy <[EMAIL PROTECTED]>
> wrote:
> >
> > > So, I had run into a similar issue. What version of Hadoop are you
> using?
> > >
> > > Make sure you are using the latest version of hadoop. That actually
> fixed
> > > it for me. There was something wrong with the build.xml file in earlier
> > > versions that prevented me from being able to get it to work properly.
> > Once
> > > I upgraded to the latest, it went away.
> > >
> > > Hope this helps!
> > >
> > > -SM
> > >
> > >
> > > On Thu, Jul 10, 2008 at 1:39 PM, chaitanya krishna <
> > > [EMAIL PROTECTED]> wrote:
> > >
> > >> Hi,
> > >>
> > >>  I faced the similar problem as Sandy. But this time I even had the
> jdk
> > >> set
> > >> properly.
> > >>
> > >> when i executed:
> > >> ant -Dcompile.c++=yes examples
> > >>
> > >> the following was displayed:
> > >>
> > >> Buildfile: build.xml
> > >>
> > >> clover.setup:
> > >>
> > >> clover.info:
> > >> [echo]
> > >> [echo]  Clover not found. Code coverage reports disabled.
> > >> [echo]
> > >>
> > >> clover:
> > >>
> > >> init:
> > >> [touch] Creating /tmp/null358480626
> > >>   [delete] Deleting: /tmp/null358480626
> > >>  [exec] svn: '.' is not a working copy
> > >> [exec] svn: '.' is not a working copy
> > >>
> > >> record-parser:
> > >>
> > >> compile-rcc-compiler:
> > >>
> > >> compile-core-classes:
> > >>[javac] Compiling 2 source files to
> > >> /home/jobs/hadoop-0.17.0/build/classes
> > >>
> > >> compile-core-native:
> > >>
> > >> check-c++-makefiles:
> > >>
> > >> create-c++-pipes-makefile:
> > >>
> > >> BUILD FAILED
> > >> /home/jobs/hadoop-0.17.0/build.xml:1017: Execute failed:
> > >> java.io.IOException: Cannot run program
> > >> "/home/jobs/hadoop-0.17.0/src/c++/pipes/configure" (in directory
> > >> "/home/jobs/hadoop-0.17.0/build/c++-build/Linux-i386-32/pipes"):
> > >> java.io.IOException: error=13, Permission denied
> > >>
> > >>
> > >>
> > >> when,as suggested by Lohith, following was executed,
> > >>
> > >> ant -Dcompile.c++=yes compile-c++-examples
> > >>
> > >> the following was displayed
> > >> Buildfile: build.xml
> > >>
> > >> init:
> > >>[touch] Creating /tmp/null1037468845
> > >>   [delete] Deleting: /tmp/null1037468845
> > >>  [exec] svn: '.' is not a working copy
> > >> [exec] svn: '.' is not a working copy
> > >>
> > >> check-c++-makefiles:
> > >>
> > >> create-c++-examples-pipes-makefile:
> > >>[mkdir] Created dir:
> > >> /home/jobs/hadoop-0.17.0/build/c++-build/Linux-i386-32/examples/pipes
> > >>
> > >> BUILD FAILED
> > >> /home/jobs/hadoop-0.17.0/build.xml:1040: Execute failed:
> > >> java.io.IOException: Cannot run program
> > >> "/home/jobs/hadoop-0.17.0/src/examples/pipes/configure" (in directory
> > >>
> > "/home/jobs/hadoop-0.17.0/build/c++-build/Linux-i386-32/examples/pipes"):
> > >> java.io.IOException: error=13, Permission denied
> > >>
> > >> Total time: 0 seconds
> > >>
> > >>
> > >> Please help me out with this problem.
> > >>
> > >> Thank you.
> > >>
> > >> V.V.Chaitanya Krishna
> > >>
> > >>
> > >> On Thu, Jun 26, 2008 at 9:49 PM, San

Issue adding hadoop-core.jar to project's build path

2008-07-11 Thread Khanh Nguyen
Hello,

I am new to Hadoop. I am currently working on a small project for the
Internet Archive's web crawler, Heritrix. My work requires me to use
Hadoop, but I am running into a strange problem. As soon as I add
hadoop-core.jar to my project build path, the junit tests in Heritrix run
into problems with loggers. Please help.

If it helps, here is the stack trace

java.lang.ExceptionInInitializerError
at org.archive.net.LaxURI.decode(LaxURI.java:125)
at org.archive.net.LaxURI.decode(LaxURI.java:112)
at org.archive.net.LaxURI.getPath(LaxURI.java:96)
at 
org.archive.modules.extractor.ExtractorHTML.isHtmlExpectedHere(ExtractorHTML.java:686)
at 
org.archive.modules.extractor.ExtractorHTML.shouldExtract(ExtractorHTML.java:543)
at 
org.archive.modules.extractor.ContentExtractor.shouldProcess(ContentExtractor.java:74)
at org.archive.modules.Processor.process(Processor.java:121)
at 
org.archive.modules.extractor.StringExtractorTestBase.testOne(StringExtractorTestBase.java:69)
at 
org.archive.modules.extractor.StringExtractorTestBase.testExtraction(StringExtractorTestBase.java:48)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at junit.framework.TestCase.runTest(TestCase.java:164)
at junit.framework.TestCase.runBare(TestCase.java:130)
at junit.framework.TestResult$1.protect(TestResult.java:106)
at junit.framework.TestResult.runProtected(TestResult.java:124)
at junit.framework.TestResult.run(TestResult.java:109)
at junit.framework.TestCase.run(TestCase.java:120)
at junit.framework.TestSuite.runTest(TestSuite.java:230)
at junit.framework.TestSuite.run(TestSuite.java:225)
at 
org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130)
at 
org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
at 
org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)
Caused by: org.apache.commons.logging.LogConfigurationException:
org.apache.commons.logging.LogConfigurationException: No suitable Log
constructor [Ljava.lang.Class;@1ea2dfe for
org.apache.commons.logging.impl.Log4JLogger (Caused by
java.lang.NoClassDefFoundError: org/apache/log4j/Category) (Caused by
org.apache.commons.logging.LogConfigurationException: No suitable Log
constructor [Ljava.lang.Class;@1ea2dfe for
org.apache.commons.logging.impl.Log4JLogger (Caused by
java.lang.NoClassDefFoundError: org/apache/log4j/Category))
at 
org.apache.commons.logging.impl.LogFactoryImpl.newInstance(LogFactoryImpl.java:543)
at 
org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:235)
at 
org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:209)
at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:351)
at 
org.apache.commons.httpclient.util.EncodingUtil.<clinit>(EncodingUtil.java:54)
... 27 more
Caused by: org.apache.commons.logging.LogConfigurationException: No
suitable Log constructor [Ljava.lang.Class;@1ea2dfe for
org.apache.commons.logging.impl.Log4JLogger (Caused by
java.lang.NoClassDefFoundError: org/apache/log4j/Category)
at 
org.apache.commons.logging.impl.LogFactoryImpl.getLogConstructor(LogFactoryImpl.java:413)
at 
org.apache.commons.logging.impl.LogFactoryImpl.newInstance(LogFactoryImpl.java:529)
... 31 more
Caused by: java.lang.NoClassDefFoundError: org/apache/log4j/Category
at java.lang.Class.getDeclaredConstructors0(Native Method)
at java.lang.Class.privateGetDeclaredConstructors(Class.java:2389)
at java.lang.Class.getConstructor0(Class.java:2699)
at java.lang.Class.getConstructor(Class.java:1657)
at 
org.apache.commons.logging.impl.LogFactoryImpl.getLogConstructor(LogFactoryImpl.java:410)
... 32 more
Caused by: java.lang.ClassNotFoundException: org.apache.log4j.Category
at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:276)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at java.lang

Re: JobClient question

2008-07-11 Thread Larry Compton
Adding the directory to my classpath worked!

Thank you both for the prompt assistance.

On Fri, Jul 11, 2008 at 12:33 PM, Matt Kent <[EMAIL PROTECTED]> wrote:

> The best way to configure all that stuff is in hadoop-site.xml, which
> lives in the hadoop conf directory. Make sure that directory is on the
> classpath of your application.
>
> On Fri, 2008-07-11 at 11:55 -0400, Larry Compton wrote:
> > Thanks. Is this the correct syntax?
> >
> > conf.set("mapred.job.tracker", "localhost:54311");
> >
> > It does appear to be communicating with the JobTracker now, but I get the
> > following stack trace. Is there anything else that needs to be done to
> > configure the job?
> >
> > Exception in thread "main" org.apache.hadoop.ipc.RemoteException:
> > java.io.IOException:
> > /home/larry/pkg/hadoop/hdfs/mapred/system/job_20080714_0001/job.xml:
> No
> > such file or directory
> > at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:215)
> > at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:149)
> > at
> org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1155)
> > at
> org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1136)
> > at
> org.apache.hadoop.mapred.JobInProgress.(JobInProgress.java:175)
> > at
> org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:1755)
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > at java.lang.reflect.Method.invoke(Method.java:585)
> > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
> > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
> >
> > at org.apache.hadoop.ipc.Client.call(Client.java:557)
> > at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
> > at $Proxy0.submitJob(Unknown Source)
> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> > at
> >
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> > at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> > at java.lang.reflect.Method.invoke(Method.java:585)
> > at
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
> > at
> >
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
> > at $Proxy0.submitJob(Unknown Source)
> > at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:758)
> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
> > at jobclient.MaterializeMain.main(MaterializeMain.java:44)
> >
> >
> > On Fri, Jul 11, 2008 at 11:41 AM, Shengkai Zhu <[EMAIL PROTECTED]>
> wrote:
> >
> > > You should provide JobTracker address and port through configuration.
> > >
> > >
> > > On 7/11/08, Larry Compton <[EMAIL PROTECTED]> wrote:
> > > >
> > > > I'm coming up to speed on the Hadoop APIs. I need to be able to
> invoke a
> > > > job
> > > > from within a Java application (as opposed to running from the
> > > command-line
> > > > "hadoop" executable). The JobConf and JobClient appear to support
> this
> > > and
> > > > I've written a test program to configure and run a job. However, the
> job
> > > > doesn't appear to be submitted to the JobTracker. Here's a code
> excerpt
> > > > from
> > > > my client...
> > > >
> > > >String rdfInputPath = args[0];
> > > >String outputPath = args[1];
> > > >String uriInputPath = args[2];
> > > >String jarPath = args[3];
> > > >
> > > >JobConf conf = new JobConf(MaterializeMap.class);
> > > >conf.setJobName("materialize");
> > > >
> > > >conf.setOutputKeyClass(Text.class);
> > > >conf.setOutputValueClass(Text.class);
> > > >
> > > >conf.setMapperClass(MaterializeMapper.class);
> > > >conf.setCombinerClass(MaterializeReducer.class);
> > > >conf.setReducerClass(MaterializeReducer.class);
> > > >conf.setJar(jarPath);
> > > >
> > > >DistributedCache.addCacheFile(new Path(uriInputPath).toUri(),
> > > conf);
> > > >
> > > >FileInputFormat.setInputPaths(conf, new Path(rdfInputPath));
> > > >FileOutputFormat.setOutputPath(conf, new Path(outputPath));
> > > >
> > > >conf.setInputFormat(TextInputFormat.class);
> > > >conf.setOutputFormat(TextOutputFormat.class);
> > > >
> > > >JobClient.runJob(conf);
> > > >
> > > > It seems like I should be providing a URL to the JobTracker
> somewhere,
> > > but
> > > > I
> > > > can't figure out where to provide the information.
> > > >
> > > > --
> > > > Larry Compton
> > > >
> > >
> > >
> > >
> > > --
> > >
> > > 朱盛凯
> > >
> > > Jash Zhu
> > >
> > > 复旦大学软件学院
> > >
> > > Software School, Fudan University
> > >
> >
> >
> >
>
>


-- 
Larry Compton


hadoop Writeable class conversions

2008-07-11 Thread Sandy
Hello,

Just a quick question. Suppose I have a value y that is of type
LongWritable. How can I convert it to a long? I tried casting, and I also
looked at the LongWritable class and did not see a definition for a suitable
conversion function. I also was not able to find a solution through the
forum archives. Could someone please point me in the right direction?

Much thanks.

-SM


Is it possible to input two different files under same mapper

2008-07-11 Thread Muhammad Ali Amer

Hi,
My requirement is to compare the contents of one very large file (GB
to TB size) with a bunch of smaller files (100s of MB to GB sizes).
Is there a way I can give the mapper the 1st file independently of the
remaining bunch?

Amer


Re: JobClient question

2008-07-11 Thread Larry Compton
For anyone with a similar issue, you can get away with not having the "conf"
directory in your classpath, if you set some configuration properties,
something similar to the following:
conf.set("fs.default.name", "hdfs://localhost:54310");
conf.set("mapred.job.tracker", "localhost:54311");
conf.set("hadoop.tmp.dir", System.getProperty("user.home")
+ "/pkg/hadoop/hdfs");


On Fri, Jul 11, 2008 at 1:51 PM, Larry Compton <[EMAIL PROTECTED]>
wrote:

> Adding the directory to my classpath worked!
>
> Thank you both for the prompt assistance.
>
>
> On Fri, Jul 11, 2008 at 12:33 PM, Matt Kent <[EMAIL PROTECTED]> wrote:
>
>> The best way to configure all that stuff is in hadoop-site.xml, which
>> lives in the hadoop conf directory. Make sure that directory is on the
>> classpath of your application.
>>
>> On Fri, 2008-07-11 at 11:55 -0400, Larry Compton wrote:
>> > Thanks. Is this the correct syntax?
>> >
>> > conf.set("mapred.job.tracker", "localhost:54311");
>> >
>> > It does appear to be communicating with the JobTracker now, but I get
>> the
>> > following stack trace. Is there anything else that needs to be done to
>> > configure the job?
>> >
>> > Exception in thread "main" org.apache.hadoop.ipc.RemoteException:
>> > java.io.IOException:
>> > /home/larry/pkg/hadoop/hdfs/mapred/system/job_20080714_0001/job.xml:
>> No
>> > such file or directory
>> > at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:215)
>> > at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:149)
>> > at
>> org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1155)
>> > at
>> org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1136)
>> > at
>> org.apache.hadoop.mapred.JobInProgress.(JobInProgress.java:175)
>> > at
>> org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:1755)
>> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> > at
>> >
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>> > at
>> >
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> > at java.lang.reflect.Method.invoke(Method.java:585)
>> > at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
>> > at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)
>> >
>> > at org.apache.hadoop.ipc.Client.call(Client.java:557)
>> > at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
>> > at $Proxy0.submitJob(Unknown Source)
>> > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> > at
>> >
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>> > at
>> >
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>> > at java.lang.reflect.Method.invoke(Method.java:585)
>> > at
>> >
>> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
>> > at
>> >
>> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
>> > at $Proxy0.submitJob(Unknown Source)
>> > at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:758)
>> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
>> > at jobclient.MaterializeMain.main(MaterializeMain.java:44)
>> >
>> >
>> > On Fri, Jul 11, 2008 at 11:41 AM, Shengkai Zhu <[EMAIL PROTECTED]>
>> wrote:
>> >
>> > > You should provide JobTracker address and port through configuration.
>> > >
>> > >
>> > > On 7/11/08, Larry Compton <[EMAIL PROTECTED]> wrote:
>> > > >
>> > > > I'm coming up to speed on the Hadoop APIs. I need to be able to
>> invoke a
>> > > > job
>> > > > from within a Java application (as opposed to running from the
>> > > command-line
>> > > > "hadoop" executable). The JobConf and JobClient appear to support
>> this
>> > > and
>> > > > I've written a test program to configure and run a job. However, the
>> job
>> > > > doesn't appear to be submitted to the JobTracker. Here's a code
>> excerpt
>> > > > from
>> > > > my client...
>> > > >
>> > > >String rdfInputPath = args[0];
>> > > >String outputPath = args[1];
>> > > >String uriInputPath = args[2];
>> > > >String jarPath = args[3];
>> > > >
>> > > >JobConf conf = new JobConf(MaterializeMap.class);
>> > > >conf.setJobName("materialize");
>> > > >
>> > > >conf.setOutputKeyClass(Text.class);
>> > > >conf.setOutputValueClass(Text.class);
>> > > >
>> > > >conf.setMapperClass(MaterializeMapper.class);
>> > > >conf.setCombinerClass(MaterializeReducer.class);
>> > > >conf.setReducerClass(MaterializeReducer.class);
>> > > >conf.setJar(jarPath);
>> > > >
>> > > >DistributedCache.addCacheFile(new Path(uriInputPath).toUri(),
>> > > conf);
>> > > >
>> > > >FileInputFormat.setInputPaths(conf, new Path(rdfInputPath));
>> > > >FileOutputFormat.setOutputPath(conf

Re: Namenode Exceptions with S3

2008-07-11 Thread slitz
I've been learning a lot from this thread, and Tom just helped me
understanding some things about S3 and HDFS, thank you.
To wrap everything up, if we want to use S3 with EC2 we can:

a) Use S3 only, without HDFS and configuring fs.default.name as s3://bucket
  -> PROBLEM: we are getting ERROR org.apache.hadoop.dfs.NameNode:
java.lang.RuntimeException: Not a host:port pair: X
b) Use HDFS as the default FS, specifying S3 only as input for the first Job
and output for the last(assuming one has multiple jobs on same data)
  -> PROBLEM: https://issues.apache.org/jira/browse/HADOOP-3733


So, in my case I cannot use S3 at all for now because of these 2 problems.
Any advice?

slitz

On Fri, Jul 11, 2008 at 4:31 PM, Lincoln Ritter <[EMAIL PROTECTED]>
wrote:

> Thanks Tom!
>
> Your explanation makes things a lot clearer.  I think that changing
> the 'fs.default.name' to something like 'dfs.namenode.address' would
> certainly be less confusing since it would clarify the purpose of
> these values.
>
> -lincoln
>
> --
> lincolnritter.com
>
>
>
> On Fri, Jul 11, 2008 at 4:21 AM, Tom White <[EMAIL PROTECTED]> wrote:
> > On Thu, Jul 10, 2008 at 10:06 PM, Lincoln Ritter
> > <[EMAIL PROTECTED]> wrote:
> >> Thank you, Tom.
> >>
> >> Forgive me for being dense, but I don't understand your reply:
> >>
> >
> > Sorry! I'll try to explain it better (see below).
> >
> >>
> >> Do you mean that it is possible to use the Hadoop daemons with S3 but
> >> the default filesystem must be HDFS?
> >
> > The HDFS daemons use the value of "fs.default.name" to set the
> > namenode host and port, so if you set it to a s3 URI, you can't run
> > the HDFS daemons. So in this case you would use the start-mapred.sh
> > script instead of start-all.sh.
> >
> >> If that is the case, can I
> >> specify the output filesystem on a per-job basis and can that be an S3
> >> FS?
> >
> > Yes, that's exactly how you do it.
> >
> >>
> >> Also, is there a particular reason to not allow S3 as the default FS?
> >
> > You can allow S3 as the default FS, it's just that then you can't run
> > HDFS at all in this case. You would only do this if you don't want to
> > use HDFS at all, for example, if you were running a MapReduce job
> > which read from S3 and wrote to S3.
> >
> > It might be less confusing if the HDFS daemons didn't use
> > fs.default.name to define the namenode host and port. Just like
> > mapred.job.tracker defines the host and port for the jobtracker,
> > dfs.namenode.address (or similar) could define the namenode. Would
> > this be a good change to make?
> >
> > Tom
> >
>


Re: Issue adding hadoop-core.jar to project's build path

2008-07-11 Thread tim robertson
You need Log4J on the classpath, I think... "Caused
by java.lang.NoClassDefFoundError: org/apache/log4j/Category"

Cheers

Tim


On Fri, Jul 11, 2008 at 7:13 PM, Khanh Nguyen <[EMAIL PROTECTED]> wrote:

> Hello,
>
> I am new to Hadoop. I am currently working on a small project for the
> Internet Archive's web crawler, Heritrix. My work requires me to use
> Hadoop but I am running in a strange problem. As soon as I add
> hadoop-core.jar to my project build path, junit tests in Heritrix ran
> into problem with loggers. Please help.
>
> If it helps, here is the stack trace
>
> java.lang.ExceptionInInitializerError
>at org.archive.net.LaxURI.decode(LaxURI.java:125)
>at org.archive.net.LaxURI.decode(LaxURI.java:112)
>at org.archive.net.LaxURI.getPath(LaxURI.java:96)
>at
> org.archive.modules.extractor.ExtractorHTML.isHtmlExpectedHere(ExtractorHTML.java:686)
>at
> org.archive.modules.extractor.ExtractorHTML.shouldExtract(ExtractorHTML.java:543)
>at
> org.archive.modules.extractor.ContentExtractor.shouldProcess(ContentExtractor.java:74)
>at org.archive.modules.Processor.process(Processor.java:121)
>at
> org.archive.modules.extractor.StringExtractorTestBase.testOne(StringExtractorTestBase.java:69)
>at
> org.archive.modules.extractor.StringExtractorTestBase.testExtraction(StringExtractorTestBase.java:48)
>at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>at java.lang.reflect.Method.invoke(Method.java:597)
>at junit.framework.TestCase.runTest(TestCase.java:164)
>at junit.framework.TestCase.runBare(TestCase.java:130)
>at junit.framework.TestResult$1.protect(TestResult.java:106)
>at junit.framework.TestResult.runProtected(TestResult.java:124)
>at junit.framework.TestResult.run(TestResult.java:109)
>at junit.framework.TestCase.run(TestCase.java:120)
>at junit.framework.TestSuite.runTest(TestSuite.java:230)
>at junit.framework.TestSuite.run(TestSuite.java:225)
>at
> org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130)
>at
> org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
>at
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
>at
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
>at
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
>at
> org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)
> Caused by: org.apache.commons.logging.LogConfigurationException:
> org.apache.commons.logging.LogConfigurationException: No suitable Log
> constructor [Ljava.lang.Class;@1ea2dfe for
> org.apache.commons.logging.impl.Log4JLogger (Caused by
> java.lang.NoClassDefFoundError: org/apache/log4j/Category) (Caused by
> org.apache.commons.logging.LogConfigurationException: No suitable Log
> constructor [Ljava.lang.Class;@1ea2dfe for
> org.apache.commons.logging.impl.Log4JLogger (Caused by
> java.lang.NoClassDefFoundError: org/apache/log4j/Category))
>at
> org.apache.commons.logging.impl.LogFactoryImpl.newInstance(LogFactoryImpl.java:543)
>at
> org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:235)
>at
> org.apache.commons.logging.impl.LogFactoryImpl.getInstance(LogFactoryImpl.java:209)
>at org.apache.commons.logging.LogFactory.getLog(LogFactory.java:351)
>at
> org.apache.commons.httpclient.util.EncodingUtil.(EncodingUtil.java:54)
>... 27 more
> Caused by: org.apache.commons.logging.LogConfigurationException: No
> suitable Log constructor [Ljava.lang.Class;@1ea2dfe for
> org.apache.commons.logging.impl.Log4JLogger (Caused by
> java.lang.NoClassDefFoundError: org/apache/log4j/Category)
>at
> org.apache.commons.logging.impl.LogFactoryImpl.getLogConstructor(LogFactoryImpl.java:413)
>at
> org.apache.commons.logging.impl.LogFactoryImpl.newInstance(LogFactoryImpl.java:529)
>... 31 more
> Caused by: java.lang.NoClassDefFoundError: org/apache/log4j/Category
>at java.lang.Class.getDeclaredConstructors0(Native Method)
>at java.lang.Class.privateGetDeclaredConstructors(Class.java:2389)
>at java.lang.Class.getConstructor0(Class.java:2699)
>at java.lang.Class.getConstructor(Class.java:1657)
>at
> org.apache.commons.logging.impl.LogFactoryImpl.getLogConstructor(LogFactoryImpl.java:410)
>... 32 more
> Caused by: java.lang.ClassNotFoundException: org.apache.log4j.Category
>at java.net.URLClassLoader$1.run(URLClassLoader.java:200)
>at java.security.AccessController.doPri

Re: hadoop Writeable class conversions

2008-07-11 Thread Chris Douglas

http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/LongWritable.html#get()

-C
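In code the conversion is just a call to get(), with set() going the other
way. A minimal sketch with made-up values:

import org.apache.hadoop.io.LongWritable;

public class LongWritableDemo {
  public static void main(String[] args) {
    LongWritable y = new LongWritable(42L); // the Writable holding the value
    long primitive = y.get();               // LongWritable -> long
    y.set(primitive + 1);                   // long -> LongWritable
    System.out.println(primitive + " / " + y);
  }
}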

On Jul 11, 2008, at 12:33 PM, Sandy wrote:


Hello,

Just a quick question. Suppose I have a value y that is of type
LongWritable. How can I convert it to a long? I tried casting, and I  
also
looked at the LongWritable class and did not see a definition for a  
suitable
conversion function. I also was not able to find a solution through  
the

forum archives. Could someone please point me in the right direction?

Much thanks.

-SM




question on tasktracker status

2008-07-11 Thread Mori Bellamy

hey all,
what dictates the "% complete" bars for maptasks and reduce tasks? i  
ask because, for one of my map jobs, the tasks hang at 0% for a long  
time until they jump to 100%.


thanks!


Re: Is it possible to input two different files under same mapper

2008-07-11 Thread Mori Bellamy

Hey Amer,
It sounds to me like you're going to have to write your own input
format (or at least modify an existing one). Take a look here:

http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/FileSplit.html

I'm not sure how you'd go about doing this, but i hope this helps you.

(Also, have you considered preprocessing your input so that any
arbitrary mapper can know whether or not it's looking at a line from
the "large file"?)

On Jul 11, 2008, at 12:31 PM, Muhammad Ali Amer wrote:


HI,
My requirement is to compare the contents of one very large file (GB  
to TB size) with a bunch of smaller files (100s of MB to GB  sizes).  
Is there a way I can give the mapper the 1st file independently of  
the remaining bunch?

Amer




Re: hadoop Writeable class conversions

2008-07-11 Thread Sandy
... WOW. I can't believe I missed that! Thanks. I think I've been staring at
my code for too long.

-SM

On Fri, Jul 11, 2008 at 3:16 PM, Chris Douglas <[EMAIL PROTECTED]>
wrote:

>
> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/io/LongWritable.html#get()
>
> -C
>
>
> On Jul 11, 2008, at 12:33 PM, Sandy wrote:
>
>  Hello,
>>
>> Just a quick question. Suppose I have a value y that is of type
>> LongWritable. How can I convert it to a long? I tried casting, and I also
>> looked at the LongWritable class and did not see a definition for a
>> suitable
>> conversion function. I also was not able to find a solution through the
>> forum archives. Could someone please point me in the right direction?
>>
>> Much thanks.
>>
>> -SM
>>
>
>


Re: question on tasktracker status

2008-07-11 Thread Arun C Murthy


On Jul 11, 2008, at 1:35 PM, Mori Bellamy wrote:


hey all,
what dictates the "% complete" bars for maptasks and reduce tasks?  
i ask because, for one of my map jobs, the tasks hang at 0% for a  
long time until they jump to 100%.




Maps -> the fraction of the input consumed (this is the normal case when
you are processing data on HDFS).
Reduces -> shuffle is 0-33% (the shuffle is the phase where the map outputs
are copied), merge is 33-66% (where the sorted map outputs are merged), and
the rest is the reduce phase itself (where the user's Reducer.reduce method
is invoked).


Arun
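One detail worth adding: the "amount of input consumed" for a map comes from
the RecordReader's getProgress(), so an input format whose reader only
reports progress when it finishes its split will make tasks sit at 0% and
then jump to 100%. Below is a minimal sketch of a position-based
getProgress() against the old mapred RecordReader interface; everything
apart from that interface is made up.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.RecordReader;

// Sketch of a reader whose getProgress() tracks how much of the split has
// been consumed, so the web UI can show intermediate map progress.
public class ProgressAwareReader implements RecordReader<LongWritable, Text> {
  private long start; // first byte of this split
  private long end;   // byte just past the end of this split
  private long pos;   // current position, advanced as records are read

  public boolean next(LongWritable key, Text value) throws IOException {
    // ... read one record and advance pos; return false at the end of the split ...
    return false; // placeholder body
  }

  public LongWritable createKey() { return new LongWritable(); }
  public Text createValue() { return new Text(); }
  public long getPos() throws IOException { return pos; }
  public void close() throws IOException { }

  // Fraction of the split consumed. A reader that effectively returns 0
  // until it is done is what makes a map task look stuck at 0%.
  public float getProgress() throws IOException {
    if (end == start) {
      return 0.0f;
    }
    return Math.min(1.0f, (pos - start) / (float) (end - start));
  }
}

With the stock text input formats this is already handled, so maps that hang
at 0% there usually just mean each record (or the first record) takes a long
time to process.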


Re: Is Hadoop Really the right framework for me?

2008-07-11 Thread Sandy
Much thanks! I am going to take a look.

In the meantime I think there is a workaround.

From what I can tell, at least from running hadoop locally, a mapper is
assigned to each line of a single file by default in the WordCount.java
example. If one just cares about establishing the uniqueness of each line,
without needing the specific numbering (line 1, 2, etc. vs. 3332, 34234,
42323), one can just use the key value handed to that mapper, since it gives
the byte offset of the line. Since the offset is always increasing, and since
a mapper is always attached to one line, there is no worry about uniqueness.
Of course, this depends on the fact that a mapper will be attached to one
line.

Since it works (or appears to be working) on a local run of hadoop, I think
I can guarantee that a mapper will map to a single line in a distributed run
of hadoop, though all of this is grey-box speculation on my part. However,
considering my lack of understanding of how hadoop may actually work, I
wonder if this is a guarantee I can safely make. It is the "guarantee" that
I am curious about; even if it works on some files of a certain size, I
wonder if it will work on files of arbitrarily large size? I would love to
hear the insight of some of the more experienced users on this matter.

Thanks again,

-SM
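Roughly, that idea looks like the following in the old mapred API. This is
only a sketch (the class name and output types are made up), and note that
the key TextInputFormat hands you is the byte offset of the line within its
file, not a line number, so it is only guaranteed unique within a single
input file.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch: use the byte offset supplied by TextInputFormat as a per-line id.
public class OffsetTaggingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    for (String word : line.toString().split("\\s+")) {
      if (word.length() > 0) {
        out.collect(new Text(word), offset); // emit (word, offset of its line)
      }
    }
  }
}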

On Thu, Jul 10, 2008 at 6:50 PM, lohit <[EMAIL PROTECTED]> wrote:

> Its not released yet. There are 2 options
> 1. download the un-released 0.18 branch from here
> http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18
> svn co 
> http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18branch-0.18
>
> 2. get the NLineInputFormat.java from
> http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.18/src/mapred/org/apache/hadoop/mapred/lib/NLineInputFormat.java
> copy it to your .mapred/lib directory, rebuild everything and try it
> out. I assume it should work, but I havent tried it out yet.
>
> Thanks,
> Lohit
>
> - Original Message 
> From: Sandy <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Thursday, July 10, 2008 3:45:34 PM
> Subject: Re: Is Hadoop Really the right framework for me?
>
> Thank for the responses..
>
> Lohit and Mahadev: this sounds fantastic; however, where may I got hadoop
> 0.18? I went to http://hadoop.apache.org/core/releases.html
>
> But did not see a link for hadoop 0.18. After I did a brief search on
> google, it did not seem that Hadoop has been officially released yes. If
> this is indeed the case, when is the release date scheduled? In the
> meantime, could you please point me in the direction on where to acquire
> it?
> If it a better idea for me to wait for the release?
>
> Thank you kindly.
>
> -SM
>
> On Thu, Jul 10, 2008 at 5:18 PM, lohit <[EMAIL PROTECTED]> wrote:
>
> > Hello Sandy,
> >
> > If you are using hadoop 0.18, you can use NLineInputFormat input format
> to
> > get you job done. What this says is give exactly one line for each
> mapper.
> > In your mapper you might have to encode your keys something like
> > <word:linenumber>
> > So the output from your mapper would be a key/value pair of <word:linenumber>, 1
> > Reducer would sum up all word:linenumber keys, and in your reduce function you
> > would have to extract the word, linenumber and its count. The delimiter ':'
> > should not be part of your word though.
> >
> > You might want to take a look at the example usage of NLineInputFormat
> from
> > this test src/test/org/apache/hadoop/mapred/lib/TestLineInputFormat.java
> >
> > HTH,
> > Lohit
> > - Original Message 
> > From: Sandy <[EMAIL PROTECTED]>
> > To: core-user@hadoop.apache.org
> > Sent: Thursday, July 10, 2008 2:47:21 PM
> > Subject: Is Hadoop Really the right framework for me?
> >
> > Hello,
> >
> > I have been posting on the forums for a couple of weeks now, and I really
> > appreciate all the help that I've been receiving. I am fairly new to
> Java,
> > and even newer to the Hadoop framework. While I am sufficiently impressed
> > with the Hadoop, quite a bit of the underlying functionality is masked to
> > the user (which, while I understand is the point of a Map Reduce
> Framework,
> > can be a touch frustrating for someone who is still trying to learn their
> > way around), and the documentation is sometimes difficult to navigate. I
> > have been thusfar unable to sufficiently find an answer to this question
> on
> > my own.
> >
> > My goal is to implement a fairly simple map reduce algorithm. My question
> > is, "Is Hadoop really the right framework to use for this algorithm?"
> >
> > I have one very large file containing multiple lines of text. I want to
> > assign a mapper job to each line. Furthermore, the mapper needs to be
> able
> > to know what line it is processing. If we were thinking about this in
> terms
> > of the Word Count Example, let's say we have a modification where we want
> > to
> > just see where the words came from, rather than just the count of the
> > words.
> >
> >
> > For this example, we have the file:
> >
> > Hello Wo

Re: Is it possible to input two different files under same mapper

2008-07-11 Thread Muhammad Ali Amer

Thanks Mori,
 So far I cannot touch the large file; it's just a very, very long
string, and I have to "approximately" match smaller strings against
it. I will give FileSplit a try and make sure I am not
merging the two together.


On Jul 11, 2008, at 1:41 PM, Mori Bellamy wrote:


Hey Amer,
It sounds to me like you're going to have to write your own input  
format (or atleast modify an existing one). Take a look here:

http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/FileSplit.html

I'm not sure how you'd go about doing this, but i hope this helps you.

(Also, have you considered preprocessing your input so that any  
arbitrary mapper can know whether or not its looking at a line from  
the "large file"?)

On Jul 11, 2008, at 12:31 PM, Muhammad Ali Amer wrote:


HI,
My requirement is to compare the contents of one very large file  
(GB to TB size) with a bunch of smaller files (100s of MB to GB   
sizes). Is there a way I can give the mapper the 1st file  
independently of the remaining bunch?

Amer





Muhammad Ali Amer
Center For Grid Technologies
Information Sciences Institute
USC Viterbi School Of Engg
Tel : (310) 448-8349



Re: Is it possible to input two different files under same mapper

2008-07-11 Thread Miles Osborne
Why not just pass the large file's name as an argument to your mappers? Each
mapper could then access that file as it sees fit, without having to go
through contortions.

Miles
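A rough sketch of that suggestion against the old mapred API; the property
name, class name and matching logic are all made up, and the only real
pieces are JobConf, FileSystem and the Mapper plumbing. The driver would do
something like conf.set("myjob.big.file", "/data/big.txt") before
submitting, and the small files would be the normal job input.

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Sketch: the small files are the job input; the big reference file is named
// in the JobConf and opened directly by each mapper.
public class MatchAgainstBigFileMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private FSDataInputStream bigFile;

  public void configure(JobConf job) {
    try {
      Path big = new Path(job.get("myjob.big.file")); // hypothetical property
      bigFile = FileSystem.get(job).open(big);
    } catch (IOException e) {
      throw new RuntimeException("Cannot open reference file", e);
    }
  }

  public void map(LongWritable key, Text smallRecord,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    // ... scan/seek through bigFile and emit approximate matches for smallRecord ...
  }

  public void close() throws IOException {
    if (bigFile != null) {
      bigFile.close();
    }
  }
}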

2008/7/11 Muhammad Ali Amer <[EMAIL PROTECTED]>:

> Thanks Mori,
>  So far I cannot touch the large file, its just a very very long string ,
> and I have to "approximately" match smaller strings against it. I will give
> it a try with the FileSplit and see if I am not merging the two together.
>
> On Jul 11, 2008, at 1:41 PM, Mori Bellamy wrote:
>
>  Hey Amer,
>> It sounds to me like you're going to have to write your own input format
>> (or atleast modify an existing one). Take a look here:
>>
>> http://hadoop.apache.org/core/docs/current/api/org/apache/hadoop/mapred/FileSplit.html
>>
>> I'm not sure how you'd go about doing this, but i hope this helps you.
>>
>> (Also, have you considered preprocessing your input so that any arbitrary
>> mapper can know whether or not its looking at a line from the "large file"?)
>> On Jul 11, 2008, at 12:31 PM, Muhammad Ali Amer wrote:
>>
>>  HI,
>>> My requirement is to compare the contents of one very large file (GB to
>>> TB size) with a bunch of smaller files (100s of MB to GB  sizes). Is there a
>>> way I can give the mapper the 1st file independently of the remaining bunch?
>>> Amer
>>>
>>
>>
>>
> Muhammad Ali Amer
> Center For Grid Technologies
> Information Sciences Institute
> USC Viterbi School Of Engg
> Tel : (310) 448-8349
>
>


-- 
The University of Edinburgh is a charitable body, registered in Scotland,
with registration number SC005336.


Re: Namenode Exceptions with S3

2008-07-11 Thread Tom White
On Fri, Jul 11, 2008 at 9:09 PM, slitz <[EMAIL PROTECTED]> wrote:
> a) Use S3 only, without HDFS and configuring fs.default.name as s3://bucket
>  -> PROBLEM: we are getting ERROR org.apache.hadoop.dfs.NameNode:
> java.lang.RuntimeException: Not a host:port pair: X

What command are you using to start Hadoop?

> b) Use HDFS as the default FS, specifying S3 only as input for the first Job
> and output for the last(assuming one has multiple jobs on same data)
>  -> PROBLEM: https://issues.apache.org/jira/browse/HADOOP-3733

Yes, this is a problem. I've added a comment to the Jira description
describing a workaround.

Tom
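For reference, wiring S3 into a single job while keeping HDFS as the default
filesystem would look roughly like this. It is a sketch only: the bucket,
paths and credentials are placeholders, it assumes the 0.17-era JobConf
driver style, the credential property names are the ones I believe the S3
FileSystem reads, and it is not the workaround described in the Jira.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Sketch: HDFS stays the default FS; only this job reads from and writes to S3.
public class S3JobDriver {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(S3JobDriver.class);
    conf.setJobName("s3-in-s3-out");

    // Credentials supplied via configuration instead of being embedded in the URI.
    conf.set("fs.s3.awsAccessKeyId", "YOUR_ACCESS_KEY");     // placeholder
    conf.set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_KEY"); // placeholder

    FileInputFormat.setInputPaths(conf, new Path("s3://my-bucket/input"));
    FileOutputFormat.setOutputPath(conf, new Path("s3://my-bucket/output"));

    JobClient.runJob(conf); // identity map/reduce by default
  }
}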


How to add/remove slave nodes on run time

2008-07-11 Thread Kevin
Hi,

I searched a bit but could not find the answer. What is the right way
to add (and remove) new slave nodes on run time? Thank you.

-Kevin


Re: How to add/remove slave nodes on run time

2008-07-11 Thread lohit
To add new datanodes, use the same hadoop version already running on your 
cluster, point the new node at the right config, and start a datanode on it. The 
datanode is configured to talk to the namenode by reading the configs, and it 
will join the cluster. To remove datanode(s), you can decommission them and, once 
decommissioned, just kill the DataNode process. This is described here: 
http://wiki.apache.org/hadoop/FAQ#17

Thanks,
Lohit

- Original Message 
From: Kevin <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Friday, July 11, 2008 3:43:41 PM
Subject: How to add/remove slave nodes on run time

Hi,

I searched a bit but could not find the answer. What is the right way
to add (and remove) new slave nodes on run time? Thank you.

-Kevin



Re: Hudson Patch Verifier's Output

2008-07-11 Thread Abdul Qadeer
According to JIRA, HADOOP-3653 is resolved. I uploaded a new patch file
for HADOOP-3646, but I cannot find the "Submit Patch" option among the
available workflow options. I was wondering why that is?

Thanks,
Abdul Qadeer

On Thu, Jul 3, 2008 at 5:35 PM, Nigel Daley <[EMAIL PROTECTED]> wrote:

> A bug was introduced by HADOOP-3480.  HADOOP-3653 will fix it.
>
> Nige
>
>
> On Jul 3, 2008, at 5:24 PM, Abdul Qadeer wrote:
>
>> Hi,
>>
>> I submitted a patch using JIRA and the Hudson system told
>> that " -1 contrib tests.  The patch failed contrib unit tests."
>> Seeing the console output, I noticed that it says build successful
>> for contrib tests.  So I am confused that what failed contrib
>> test are referred to in Hudson output?
>>
>> This link https://issues.apache.org/jira/browse/HADOOP-3646
>> has the comments produced by Hudson patch verifier.
>>
>> Thanks,
>> Abdul Qadeer
>>
>
>


Re: How to add/remove slave nodes on run time

2008-07-11 Thread Keliang Zhao
May I ask what is the right command to start a datanode on a slave?

I used a simple one "bin/hadoop datanode &", but I am not sure.

Also, should I start the tasktracker manually as well?

-Kevin


On Fri, Jul 11, 2008 at 3:56 PM, lohit <[EMAIL PROTECTED]> wrote:
> To add new datanodes, use the same hadoop version already running on your 
> cluster, the right config and start datanode on any node. The datanode would 
> be configured to talk to the namenode by reading the configs and it would 
> join the cluster. To remove datanode(s) you could decommission the datanode 
> and once decommissioned just kill DataNode process. This is described in 
> there http://wiki.apache.org/hadoop/FAQ#17
>
> Thanks,
> Lohit
>
> - Original Message 
> From: Kevin <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Friday, July 11, 2008 3:43:41 PM
> Subject: How to add/remove slave nodes on run time
>
> Hi,
>
> I searched a bit but could not find the answer. What is the right way
> to add (and remove) new slave nodes on run time? Thank you.
>
> -Kevin
>
>


FileInput / RecordReader question

2008-07-11 Thread Kylie McCormick
Hello Again:
I'm currently working with the code for inputs (and InputSplit) in Hadoop.
There is some helpful information in the Map-Reduce tutorial, but I'm having
some issues with the coding end of it.

I would like to have a file that lists each of the endpoints I want to
contact, with the following information also listed: URL, client class, and
name. I see that I need to use a RecordReader, since logical splitting
of the file could cause larger entries to be cut in half or shorter entries
to be bunched together. Right now, StreamXmlRecordReader is the
closest variation to what I want to use.

(StreamXMLRecordReader information @
http://hadoop.apache.org/core/docs/r0.17.0/api/org/apache/hadoop/streaming/StreamXmlRecordReader.html
)

However, I'm not certain it will provide the functionality that I need. I
would need to extract the three strings to generate the appropriate value.
Is there another tutorial on Input/InputSplit for Hadoop? I am attempting to
write my own RecordReader, and I'm uncertain whether that is necessary...
and, if it is, what the code should look like.

Thanks,
Kylie


Re: How to add/remove slave nodes on run time

2008-07-11 Thread lohit
That should also work if you have HADOOP_CONF_DIR set in your environment. The 
best way is to follow the shell script ./bin/start-all.sh, which invokes 
./bin/start-dfs.sh, which starts the datanode like this:
"$bin"/hadoop-daemons.sh --config $HADOOP_CONF_DIR start datanode $dataStartOpt
Yes, you need to start the tasktracker as well.
Thanks,
Lohit

- Original Message 
From: Keliang Zhao <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Friday, July 11, 2008 4:31:05 PM
Subject: Re: How to add/remove slave nodes on run time

May I ask what is the right command to start a datanode on a slave?

I used a simple one "bin/hadoop datanode &", but I am not sure.

Also. Should I start the tasktracker manually as well?

-Kevin


On Fri, Jul 11, 2008 at 3:56 PM, lohit <[EMAIL PROTECTED]> wrote:
> To add new datanodes, use the same hadoop version already running on your 
> cluster, the right config and start datanode on any node. The datanode would 
> be configured to talk to the namenode by reading the configs and it would 
> join the cluster. To remove datanode(s) you could decommission the datanode 
> and once decommissioned just kill DataNode process. This is described in 
> there http://wiki.apache.org/hadoop/FAQ#17
>
> Thanks,
> Lohit
>
> - Original Message 
> From: Kevin <[EMAIL PROTECTED]>
> To: core-user@hadoop.apache.org
> Sent: Friday, July 11, 2008 3:43:41 PM
> Subject: How to add/remove slave nodes on run time
>
> Hi,
>
> I searched a bit but could not find the answer. What is the right way
> to add (and remove) new slave nodes on run time? Thank you.
>
> -Kevin
>
>



Re: Compiling Word Count in C++ : Hadoop Pipes

2008-07-11 Thread chaitanya krishna
Thanks a lot for the reply!

I'll try to sort out the permission issues. Hopefully, it should work then.

On Fri, Jul 11, 2008 at 10:04 PM, Sandy <[EMAIL PROTECTED]> wrote:

> hadoop-0.17.0 should work. I took a closer look at your error message. It
> seems you need to change permission on some of your files
>
> Try:
>
>  chmod 755 /home/jobs/hadoop-0.17.0/src/examples/pipes/configure
>
>
> At this point you probably will get another "build failed" message, because
> you need to do the same thing on another file (I don't remember it off the
> top of my head). But you can find it out by inspecting this part of the
> error message:
>
> BUILD FAILED
> /home/jobs/hadoop-0.17.0/build.xml:1040: Execute failed:
> java.io.IOException: Cannot run program
> "/home/jobs/hadoop-0.17.0/src/examples/pipes/configure" (in directory
> "/home/jobs/hadoop-0.17.0/build/c++-build/Linux-i386-32/examples/pipes"):
> java.io.IOException: error=13, Permission denied
>
> This means that the file that you can't run is:
> "/home/jobs/hadoop-0.17.0/src/examples/pipes/configure"
>
> due to permission issues. a chmod 755 will fix this. you'll need to do this
> with any "permission denied" message that you get associated with this.
>
> Hope this helps!
>
> -SM
> On Thu, Jul 10, 2008 at 10:03 PM, chaitanya krishna <
> [EMAIL PROTECTED]> wrote:
>
> > I'm using hadoop-0.17.0. Should I be using a more latest version?
> > Please tell me which version did you use?
> >
> > On Fri, Jul 11, 2008 at 2:35 AM, Sandy <[EMAIL PROTECTED]>
> wrote:
> >
> > > One last thing:
> > >
> > > If that doesn't work, try following the instructions on the ubuntu
> > setting
> > > up hadoop tutorial. Even if you aren't running ubuntu, I think it may
> be
> > > possible to use those instructions to set up things properly. That's
> what
> > I
> > > eventually did.
> > >
> > > Link is here:
> > >
> > > http://wiki.apache.org/hadoop/Running_Hadoop_On_Ubuntu_Linux_(Single-Node_Cluster)
> > >
> > > -SM
> > >
> > > On Thu, Jul 10, 2008 at 4:02 PM, Sandy <[EMAIL PROTECTED]>
> > wrote:
> > >
> > > > So, I had run into a similar issue. What version of Hadoop are you
> > using?
> > > >
> > > > Make sure you are using the latest version of hadoop. That actually
> > fixed
> > > > it for me. There was something wrong with the build.xml file in
> earlier
> > > > versions that prevented me from being able to get it to work
> properly.
> > > Once
> > > > I upgraded to the latest, it went away.
> > > >
> > > > Hope this helps!
> > > >
> > > > -SM
> > > >
> > > >
> > > > On Thu, Jul 10, 2008 at 1:39 PM, chaitanya krishna <
> > > > [EMAIL PROTECTED]> wrote:
> > > >
> > > >> Hi,
> > > >>
> > > >>  I faced the similar problem as Sandy. But this time I even had the
> > jdk
> > > >> set
> > > >> properly.
> > > >>
> > > >> when i executed:
> > > >> ant -Dcompile.c++=yes examples
> > > >>
> > > >> the following was displayed:
> > > >>
> > > >> Buildfile: build.xml
> > > >>
> > > >> clover.setup:
> > > >>
> > > >> clover.info:
> > > >> [echo]
> > > >> [echo]  Clover not found. Code coverage reports disabled.
> > > >> [echo]
> > > >>
> > > >> clover:
> > > >>
> > > >> init:
> > > >> [touch] Creating /tmp/null358480626
> > > >>   [delete] Deleting: /tmp/null358480626
> > > >>  [exec] svn: '.' is not a working copy
> > > >> [exec] svn: '.' is not a working copy
> > > >>
> > > >> record-parser:
> > > >>
> > > >> compile-rcc-compiler:
> > > >>
> > > >> compile-core-classes:
> > > >>[javac] Compiling 2 source files to
> > > >> /home/jobs/hadoop-0.17.0/build/classes
> > > >>
> > > >> compile-core-native:
> > > >>
> > > >> check-c++-makefiles:
> > > >>
> > > >> create-c++-pipes-makefile:
> > > >>
> > > >> BUILD FAILED
> > > >> /home/jobs/hadoop-0.17.0/build.xml:1017: Execute failed:
> > > >> java.io.IOException: Cannot run program
> > > >> "/home/jobs/hadoop-0.17.0/src/c++/pipes/configure" (in directory
> > > >> "/home/jobs/hadoop-0.17.0/build/c++-build/Linux-i386-32/pipes"):
> > > >> java.io.IOException: error=13, Permission denied
> > > >>
> > > >>
> > > >>
> > > >> when,as suggested by Lohith, following was executed,
> > > >>
> > > >> ant -Dcompile.c++=yes compile-c++-examples
> > > >>
> > > >> the following was displayed
> > > >> Buildfile: build.xml
> > > >>
> > > >> init:
> > > >>[touch] Creating /tmp/null1037468845
> > > >>   [delete] Deleting: /tmp/null1037468845
> > > >>  [exec] svn: '.' is not a working copy
> > > >> [exec] svn: '.' is not a working copy
> > > >>
> > > >> check-c++-makefiles:
> > > >>
> > > >> create-c++-examples-pipes-makefile:
> > > >>[mkdir] Created dir:
> > > >>
> /home/jobs/hadoop-0.17.0/build/c++-build/Linux-i386-32/examples/pipe