Re: task assignment management.

2008-09-07 Thread Alejandro Abdelnur
We need something similar
(https://issues.apache.org/jira/browse/HADOOP-3740); the problem with
the TaskScheduler is that it does not have hooks into the lifecycle of a
task.

A

On Mon, Sep 8, 2008 at 10:25 AM, Devaraj Das <[EMAIL PROTECTED]> wrote:
> No, that is not possible today. However, you might want to look at the
> TaskScheduler to see if you can implement a scheduler to provide this kind
> of task scheduling.
>
> In the current Hadoop, one point regarding computationally intensive tasks is
> that if a machine is not able to keep up with the rest of the machines
> (and the task on that machine is running slower than the others), speculative
> execution, if enabled, can help a lot. Also, implicitly, faster/better
> machines get more work than slower machines.
>
>
> On 9/8/08 3:27 AM, "Dmitry Pushkarev" <[EMAIL PROTECTED]> wrote:
>
>> Dear Hadoop users,
>>
>>
>>
>> Is it possible, without using Java, to manage task assignment to implement
>> some simple rules? Like: do not launch more than one instance of a crawling
>> task on a machine, do not run data-intensive tasks on remote machines, and do
>> not run computationally intensive tasks on single-core machines, etc.
>>
>>
>>
>> Right now this is done by failing tasks that end up on the wrong machine, but I
>> hope to find some solution on the jobtracker side.
>>
>>
>>
>> ---
>>
>> Dmitry
>>
>
>
>


Re: Parhely (ORM for HBase) released!

2008-09-07 Thread Mork0075
Hello Marcus,

Is there also a download available, or only the website?

Marcus Herou schrieb:
> Hi guys.
> 
> Finally I released the first draft of an ORM for HBase named Parhely.
> Check it out at http://dev.tailsweep.com/
> 
> Kindly
> 
> //Marcus
> 



Re: task assignment management.

2008-09-07 Thread Devaraj Das
No, that is not possible today. However, you might want to look at the
TaskScheduler to see if you can implement a scheduler to provide this kind
of task scheduling.

In the current Hadoop, one point regarding computationally intensive tasks is
that if a machine is not able to keep up with the rest of the machines
(and the task on that machine is running slower than the others), speculative
execution, if enabled, can help a lot. Also, implicitly, faster/better
machines get more work than slower machines.
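
For reference, turning speculative execution on or off per job is just a couple
of properties; a rough sketch (the exact property names should be checked
against the hadoop-default.xml of your version):

import org.apache.hadoop.mapred.JobConf;

public class SpeculativeConf {
  public static JobConf enableSpeculation(Class<?> jobClass) {
    JobConf conf = new JobConf(jobClass);
    // Allow slow map/reduce attempts to be re-launched speculatively on other
    // nodes. Property names assumed from hadoop-default.xml of this era.
    conf.setBoolean("mapred.map.tasks.speculative.execution", true);
    conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);
    return conf;
  }
}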


On 9/8/08 3:27 AM, "Dmitry Pushkarev" <[EMAIL PROTECTED]> wrote:

> Dear Hadoop users,
> 
>  
> 
> Is it possible, without using Java, to manage task assignment to implement
> some simple rules? Like: do not launch more than one instance of a crawling
> task on a machine, do not run data-intensive tasks on remote machines, and do
> not run computationally intensive tasks on single-core machines, etc.
> 
>  
> 
> Right now this is done by failing tasks that end up on the wrong machine, but I
> hope to find some solution on the jobtracker side.
> 
>  
> 
> ---
> 
> Dmitry
> 




Parhely (ORM for HBase) released!

2008-09-07 Thread Marcus Herou
Hi guys.

Finally I released the first draft of an ORM for HBase named Parhely.
Check it out at http://dev.tailsweep.com/

Kindly

//Marcus


Re: Getting started questions

2008-09-07 Thread Dennis Kubes



John Howland wrote:

I've been reading up on Hadoop for a while now and I'm excited that I'm
finally getting my feet wet with the examples + my own variations. If anyone
could answer any of the following questions, I'd greatly appreciate it.

1. I'm processing document collections, with the number of documents ranging
from 10,000 - 10,000,000. What is the best way to store this data for
effective processing?


AFAIK Hadoop doesn't do well with (although it can handle) a large
number of small files.  So it would be better to read in the documents
and store them in SequenceFile or MapFile format.  This would be similar
to the way the Fetcher works in Nutch.  10M documents in a sequence/map
file on DFS is comparatively small and can be handled efficiently.
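
A loader along these lines would do it (just a sketch: it assumes the documents
sit in a local directory and that plain Text is acceptable for your bodies):

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DocLoader {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args[1]);                      // e.g. /user/you/docs.seq

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, Text.class);
    try {
      for (File doc : new File(args[0]).listFiles()) { // local document directory
        byte[] body = new byte[(int) doc.length()];
        FileInputStream in = new FileInputStream(doc);
        try {
          int off = 0;
          while (off < body.length) {                  // read the whole file
            int n = in.read(body, off, body.length - off);
            if (n < 0) break;
            off += n;
          }
        } finally {
          in.close();
        }
        writer.append(new Text(doc.getName()),         // key: document id
                      new Text(body));                 // value: document body
      }
    } finally {
      writer.close();
    }
  }
}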




 - The bodies of the documents usually range from 1K-100KB in size, but some
outliers can be as big as 4-5GB.


I would say store your document objects as Text objects; I'm not sure if
Text has a max size (I think it does, but I'm not sure what it is).  If it
does, you can always store them as a BytesWritable, which is just an array
of bytes.  But you are going to have memory issues reading in and writing
out records that large.



 - I will also need to store some metadata for each document which I figure
could be stored as JSON or XML.
 - I'll typically filter on the metadata and then do standard operations
on the bodies, like word frequency and searching.


It is possible to create an OutputFormat that writes out multiple files. 
 You could also use a MapWritable as the value to store the document 
and associated metadata.
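
A sketch of the MapWritable route in the map (the field names here are made up):

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Keys are document ids, values carry the body plus metadata in one record.
public class DocMapper extends MapReduceBase
    implements Mapper<Text, Text, Text, MapWritable> {

  public void map(Text docId, Text body,
                  OutputCollector<Text, MapWritable> out, Reporter reporter)
      throws IOException {
    MapWritable record = new MapWritable();
    record.put(new Text("body"), body);                          // document body
    record.put(new Text("length"), new IntWritable(body.getLength()));
    // ... add whatever metadata fields you compute here ...
    out.collect(docId, record);
  }
}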




Is there a canned FileInputFormat that makes sense? Should I roll my own?
How can I access the bodies as streams so I don't have to read them into RAM


A Writable is read into RAM, so even treating it like a stream doesn't
get around that.


One thing you might want to consider is to tar up, say, X documents at a
time and store that as a file in DFS.  You would have many of these
files.  Then have an index that has the offsets of the files and their
keys (document ids).  That index can be passed as input into a MR job
that can then go to DFS and stream out the file as you need it.  The job
will be slower this way, but it is a solution for handling such large
documents as streams.
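
On the job side that could look roughly like this (a sketch only; it assumes an
index line of the form "tarPath <tab> offset <tab> length" written by your
indexing step):

import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Each input record is an index line; the mapper seeks into the archive on DFS
// and streams just that member instead of loading it into memory.
public class StreamFromIndexMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, LongWritable> {

  private FileSystem fs;

  public void configure(JobConf job) {
    try {
      fs = FileSystem.get(job);
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable key, Text line,
                  OutputCollector<Text, LongWritable> out, Reporter reporter)
      throws IOException {
    String[] f = line.toString().split("\t");
    Path archive = new Path(f[0]);
    long offset = Long.parseLong(f[1]);
    long length = Long.parseLong(f[2]);

    FSDataInputStream in = fs.open(archive);
    try {
      in.seek(offset);                           // jump straight to the document
      byte[] buf = new byte[64 * 1024];
      long remaining = length;
      while (remaining > 0) {
        int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
        if (n < 0) break;
        remaining -= n;
        // ... process buf[0..n) incrementally; the body never sits in RAM whole ...
      }
      out.collect(new Text(f[0] + "@" + f[1]), new LongWritable(length - remaining));
    } finally {
      in.close();
    }
  }
}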



all at once? Am I right in thinking that I should treat each document as a
record and map across them, or do I need to be more creative in what I'm
mapping across?

2. Some of the tasks I want to run are pure map operations (no reduction),
where I'm calculating new metadata fields on each document. To end up with a
good result set, I'll need to copy the entire input record + new fields into
another set of output files. Is there a better way? I haven't wanted to go
down the HBase road because it can't handle very large values (for the
bodies) and it seems to make the most sense to keep the document bodies
together with the metadata, to allow for the greatest locality of reference
on the datanodes.


If you don't specify a reducer, the IdentityReducer is run, which simply
passes the map output through.
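
If you do not need the sort at all, setting the reduce count to zero makes the
maps write straight to the output. A sketch with the old JobConf API (the class
names refer to the sketches above):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;

public class MapOnlyJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MapOnlyJob.class);
    conf.setJobName("add-metadata");

    conf.setInputFormat(SequenceFileInputFormat.class);   // the Text/Text doc file
    conf.setOutputFormat(SequenceFileOutputFormat.class);

    conf.setMapperClass(DocMapper.class);                 // the MapWritable mapper sketched earlier
    conf.setNumReduceTasks(0);                            // no shuffle/reduce; map output is the job output

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(MapWritable.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}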




3. I'm sure this is not a new idea, but I haven't seen anything regarding
it... I'll need to run several MR jobs as a pipeline... is there any way for
the map tasks in a subsequent stage to begin processing data from previous
stage's reduce task before that reducer has fully finished?


Yup, just use FileOutputFormat.getOutputPath(previousJobConf);
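
i.e. in the driver, something like this (sketch; note that runJob blocks, so
each stage only starts after the previous one has fully finished):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class Pipeline {
  public static void main(String[] args) throws Exception {
    JobConf first = new JobConf(Pipeline.class);
    // ... mapper/reducer for stage 1 ...
    FileInputFormat.setInputPaths(first, new Path(args[0]));
    FileOutputFormat.setOutputPath(first, new Path(args[1]));  // intermediate data
    JobClient.runJob(first);                                   // blocks until stage 1 is done

    JobConf second = new JobConf(Pipeline.class);
    // ... mapper/reducer for stage 2 ...
    // Stage 2 reads whatever stage 1 wrote.
    FileInputFormat.setInputPaths(second, FileOutputFormat.getOutputPath(first));
    FileOutputFormat.setOutputPath(second, new Path(args[2]));
    JobClient.runJob(second);
  }
}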

Dennis


Whatever insight folks could lend me would be a big help in crossing the
chasm from the Word Count and associated examples to something more "real".
A whole heap of thanks in advance,

John



distcp failing

2008-09-07 Thread Michael Di Domenico
I'm attempting to load data into hadoop (version 0.17.1) from a
non-datanode machine in the cluster.  I can run jobs and copyFromLocal works
fine, but when I try to use distcp I get the error below.  I don't understand
what the error means; can anyone help?
Thanks

blue:hadoop-0.17.1 mdidomenico$ time bin/hadoop distcp -overwrite
file:///Users/mdidomenico/hadoop/1gTestfile /user/mdidomenico/1gTestfile
08/09/07 23:56:06 INFO util.CopyFiles:
srcPaths=[file:/Users/mdidomenico/hadoop/1gTestfile]
08/09/07 23:56:06 INFO util.CopyFiles:
destPath=/user/mdidomenico/1gTestfile1
08/09/07 23:56:07 INFO util.CopyFiles: srcCount=1
With failures, global counters are inaccurate; consider running with -i
Copy failed: org.apache.hadoop.ipc.RemoteException: java.io.IOException:
/tmp/hadoop-hadoop/mapred/system/job_200809072254_0005/job.xml: No such file
or directory
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:215)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:149)
at
org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1155)
at
org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1136)
at
org.apache.hadoop.mapred.JobInProgress.(JobInProgress.java:175)
at
org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:1755)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)

at org.apache.hadoop.ipc.Client.call(Client.java:557)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
at $Proxy1.submitJob(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy1.submitJob(Unknown Source)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:758)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
at org.apache.hadoop.util.CopyFiles.copy(CopyFiles.java:604)
at org.apache.hadoop.util.CopyFiles.run(CopyFiles.java:743)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.util.CopyFiles.main(CopyFiles.java:763)


Re: Basic code organization questions + scheduling

2008-09-07 Thread Alex Loddengaard
Hi Tarjei,

You should take a look at Nutch.  It's a search engine built on Lucene,
though it can be set up on top of Hadoop.  Take a look:


-and-


Hope this helps!

Alex

On Mon, Sep 8, 2008 at 2:54 AM, Tarjei Huse <[EMAIL PROTECTED]> wrote:

> Hi, I'm planning to use Hadoop for a set of typical crawler/indexer
> tasks. The basic flow is
>
> input:array of urls
> actions:  |
> 1.  get pages
>  |
> 2.  extract new urls from pages -> start new job
>extract text  -> index / filter (as new jobs)
>
> What I'm considering is how I should build this application to fit into the
> map/reduce context. I'm thinking that step 1 and 2 should be separate
> map/reduce tasks that then pipe things on to the next step.
>
> This is where I am a bit at a loss to see how it is smart to organize the
> code in logical units and also how to spawn new tasks when an old one is
> over.
>
> Is the usual way to control the flow of a set of tasks to have an external
> application running that listens to jobs ending via the endNotificationUri
> and then spawns new tasks or should the job itself contain code to create
> new jobs? Would it be a good idea to use Cascading here?
>
> I'm also considering how I should do job scheduling (I have a lot of
> recurring tasks). Has anyone found a good framework for job control of
> recurring tasks, or should I plan to build my own using Quartz?
>
> Any tips/best practices with regard to the issues described above are most
> welcome. Feel free to ask further questions if you find my descriptions of
> the issues lacking.
>
> Kind regards,
> Tarjei
>
>
>


Re: How to debug org.apache.hadoop.streaming.TestMultipleCachefiles.testMultipleCachefiles

2008-09-07 Thread Shihua Ma


You can put src in your classpath, because the HttpServer scans for the webapps
folder in your classpath to start the Jetty web server.

Josh Ma

http://www.hadoop.org.cn


Abdul Qadeer wrote:
> 
> I am running a single node Hadoop.  If I try to debug
> org.apache.hadoop.streaming.TestMultipleCachefiles.testMultipleCachefiles,
> the following exception tells that
> I haven't put webapps on the classpath.  I have in fact put src/webapps on
> the classpath.
> So I was wondering what is wrong.
> 
> 2008-09-06 20:03:28,183 INFO  fs.FSNamesystem
> (FSNamesystem.java:setConfigurationParameters(409)) -
> fsOwner=personalpc\admin,None,root,Administrators,Users
> 2008-09-06 20:03:28,191 INFO  fs.FSNamesystem
> (FSNamesystem.java:setConfigurationParameters(413)) -
> supergroup=supergroup
> 2008-09-06 20:03:28,191 INFO  fs.FSNamesystem
> (FSNamesystem.java:setConfigurationParameters(414)) -
> isPermissionEnabled=true
> 2008-09-06 20:03:28,440 INFO  common.Storage
> (FSImage.java:saveFSImage(897))
> - Image file of size 90 saved in 0 seconds.
> 2008-09-06 20:03:28,510 INFO  common.Storage (FSImage.java:format(952)) -
> Storage directory dfs\name1 has been successfully formatted.
> 2008-09-06 20:03:28,538 INFO  common.Storage
> (FSImage.java:saveFSImage(897))
> - Image file of size 90 saved in 0 seconds.
> 2008-09-06 20:03:28,590 INFO  common.Storage (FSImage.java:format(952)) -
> Storage directory dfs\name2 has been successfully formatted.
> 2008-09-06 20:03:28,904 INFO  metrics.RpcMetrics
> (RpcMetrics.java:(56)) - Initializing RPC Metrics with
> hostName=NameNode, port=57467
> 2008-09-06 20:03:29,246 INFO  namenode.NameNode
> (NameNode.java:initialize(156)) - Namenode up at:
> 127.0.0.1/127.0.0.1:57467
> 2008-09-06 20:03:29,290 INFO  jvm.JvmMetrics (JvmMetrics.java:init(67)) -
> Initializing JVM Metrics with processName=NameNode, sessionId=null
> 2008-09-06 20:03:29,352 INFO  metrics.NameNodeMetrics
> (NameNodeMetrics.java:(85)) - Initializing NameNodeMeterics using
> context object:org.apache.hadoop.metrics.spi.NullContext
> 2008-09-06 20:03:29,417 INFO  fs.FSNamesystem
> (FSNamesystem.java:setConfigurationParameters(409)) -
> fsOwner=personalpc\admin,None,root,Administrators,Users
> 2008-09-06 20:03:29,418 INFO  fs.FSNamesystem
> (FSNamesystem.java:setConfigurationParameters(413)) -
> supergroup=supergroup
> 2008-09-06 20:03:29,418 INFO  fs.FSNamesystem
> (FSNamesystem.java:setConfigurationParameters(414)) -
> isPermissionEnabled=true
> 2008-09-06 20:03:29,442 INFO  metrics.FSNamesystemMetrics
> (FSNamesystemMetrics.java:(60)) - Initializing FSNamesystemMeterics
> using context object:org.apache.hadoop.metrics.spi.NullContext
> 2008-09-06 20:03:29,444 INFO  fs.FSNamesystem
> (FSNamesystem.java:registerMBean(4274)) - Registered
> FSNamesystemStatusMBean
> 2008-09-06 20:03:29,519 INFO  common.Storage
> (FSImage.java:loadFSImage(748))
> - Number of files = 0
> 2008-09-06 20:03:29,519 INFO  common.Storage
> (FSImage.java:loadFilesUnderConstruction(1052)) - Number of files under
> construction = 0
> 2008-09-06 20:03:29,520 INFO  common.Storage
> (FSImage.java:loadFSImage(683))
> - Image file of size 90 loaded in 0 seconds.
> 2008-09-06 20:03:29,521 INFO  common.Storage
> (FSEditLog.java:loadFSEdits(667)) - Edits file edits of size 4 edits # 0
> loaded in 0 seconds.
> 2008-09-06 20:03:29,541 INFO  common.Storage
> (FSImage.java:saveFSImage(897))
> - Image file of size 90 saved in 0 seconds.
> 2008-09-06 20:03:29,581 INFO  common.Storage
> (FSImage.java:saveFSImage(897))
> - Image file of size 90 saved in 0 seconds.
> 2008-09-06 20:03:29,727 INFO  fs.FSNamesystem
> (FSNamesystem.java:initialize(307)) - Finished loading FSImage in 373
> msecs
> 2008-09-06 20:03:29,761 INFO  hdfs.StateChange
> (FSNamesystem.java:leave(3795)) - STATE* Leaving safe mode after 0 secs.
> 2008-09-06 20:03:29,762 INFO  hdfs.StateChange
> (FSNamesystem.java:leave(3804)) - STATE* Network topology has 0 racks and
> 0
> datanodes
> 2008-09-06 20:03:29,762 INFO  hdfs.StateChange
> (FSNamesystem.java:leave(3807)) - STATE* UnderReplicatedBlocks has 0
> blocks
> 2008-09-06 20:03:30,173 ERROR fs.FSNamesystem
> (FSNamesystem.java:(286)) - FSNamesystem initialization failed.
> java.io.IOException: webapps not found in CLASSPATH
> at
> org.apache.hadoop.http.HttpServer.getWebAppsPath(HttpServer.java:157)
> at org.apache.hadoop.http.HttpServer.(HttpServer.java:74)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:347)
> at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.(FSNamesystem.java:284)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:160)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:205)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode.(NameNode.java:191)
> at
> org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:849)
> at
> org.apache.hadoop.hdfs.MiniDFSCluster.(MiniDFSCl

Re: specifying number of nodes for job

2008-09-07 Thread Mafish Liu
On Mon, Sep 8, 2008 at 2:25 AM, Sandy <[EMAIL PROTECTED]> wrote:

> Hi,
>
> This may be a silly question, but I'm strangely having trouble finding an
> answer for it (perhaps I'm looking in the wrong places?).
>
> Suppose I have a cluster with n nodes each with m processors.
>
> I wish to test the performance of, say,  the wordcount program on k
> processors, where k is varied from k = 1 ... nm.


You can specify the maximum number of tasks for each node in your
hadoop-site.xml file.
So you can vary k as k = n, 2n, ..., m*n instead of k = 1...nm.


> How would I do this? I'm having trouble finding the proper command line
> option in the commands manual (
> http://hadoop.apache.org/core/docs/current/commands_manual.html)
>
>
>
> Thank you very much for your time.
>
> -SM
>



-- 
[EMAIL PROTECTED]
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.


Getting started questions

2008-09-07 Thread John Howland
I've been reading up on Hadoop for a while now and I'm excited that I'm
finally getting my feet wet with the examples + my own variations. If anyone
could answer any of the following questions, I'd greatly appreciate it.

1. I'm processing document collections, with the number of documents ranging
from 10,000 - 10,000,000. What is the best way to store this data for
effective processing?

 - The bodies of the documents usually range from 1K-100KB in size, but some
outliers can be as big as 4-5GB.
 - I will also need to store some metadata for each document which I figure
could be stored as JSON or XML.
 - I'll typically filter on the metadata and then do standard operations
on the bodies, like word frequency and searching.

Is there a canned FileInputFormat that makes sense? Should I roll my own?
How can I access the bodies as streams so I don't have to read them into RAM
all at once? Am I right in thinking that I should treat each document as a
record and map across them, or do I need to be more creative in what I'm
mapping across?

2. Some of the tasks I want to run are pure map operations (no reduction),
where I'm calculating new metadata fields on each document. To end up with a
good result set, I'll need to copy the entire input record + new fields into
another set of output files. Is there a better way? I haven't wanted to go
down the HBase road because it can't handle very large values (for the
bodies) and it seems to make the most sense to keep the document bodies
together with the metadata, to allow for the greatest locality of reference
on the datanodes.

3. I'm sure this is not a new idea, but I haven't seen anything regarding
it... I'll need to run several MR jobs as a pipeline... is there any way for
the map tasks in a subsequent stage to begin processing data from previous
stage's reduce task before that reducer has fully finished?

Whatever insight folks could lend me would be a big help in crossing the
chasm from the Word Count and associated examples to something more "real".
A whole heap of thanks in advance,

John


task assignment management.

2008-09-07 Thread Dmitry Pushkarev
Dear Hadoop users,

 

Is it possible, without using Java, to manage task assignment to implement
some simple rules? Like: do not launch more than one instance of a crawling
task on a machine, do not run data-intensive tasks on remote machines, and do
not run computationally intensive tasks on single-core machines, etc.

 

Right now this is done by failing tasks that end up on the wrong machine, but I
hope to find some solution on the jobtracker side.

 

---

Dmitry



Failing MR jobs!

2008-09-07 Thread Erik Holstad
Hi!
I'm trying to run a MR job, but it keeps on failing and I can't understand
why.
Sometimes it shows output at 66% and sometimes at 98% or so.
I had a couple of exceptions earlier that I didn't catch, which made the job
fail.


The log file from the task can be found at:
http://pastebin.com/m4414d369


and the code looks like:
//Java
import java.io.*;
import java.util.*;
import java.net.*;

//Hadoop
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

//HBase
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.mapred.*;
import org.apache.hadoop.hbase.io.*;
import org.apache.hadoop.hbase.client.*;
// org.apache.hadoop.hbase.client.HTable

//Extra
import org.apache.commons.cli.ParseException;

import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
import org.apache.commons.httpclient.*;
import org.apache.commons.httpclient.methods.*;
import org.apache.commons.httpclient.params.HttpMethodParams;


public class SerpentMR1 extends TableMap implements Mapper, Tool {

//Setting DebugLevel
private static final int DL = 0;

//Setting up the variables for the MR job
private static final String NAME = "SerpentMR1";
private static final String INPUTTABLE = "sources";
private final String[] COLS = {"content:feedurl", "content:ttl",
"content:updated"};


private Configuration conf;

public JobConf createSubmittableJob(String[] args) throws IOException{
JobConf c = new JobConf(getConf(), SerpentMR1.class);
String jar = "/home/hbase/SerpentMR/" +NAME+".jar";
c.setJar(jar);
c.setJobName(NAME);

int mapTasks = 4;
int reduceTasks = 20;

c.setNumMapTasks(mapTasks);
c.setNumReduceTasks(reduceTasks);

String inputCols = "";
for (int i=0; i ttl){
url = res.get(COLS[0].getBytes()).getValue();//url
output.collect(new Text(Base64.encode_strip(res.getRow())), new
BytesWritable(url) );
}
}



public static class MyReducer implements Reducer{
//org.apache.hadoop.mapred.Reducer{


private int timeout = 1000; //Sets the connection timeout time ms;

public void reduce(Object key, Iterator values, OutputCollector
output, Reporter rep)
throws IOException {
HttpClient client = new HttpClient();//new
MultiThreadedHttpConnectionManager());
client.getHttpConnectionManager().
getParams().setConnectionTimeout(timeout);

GetMethod method = null;

int stat = 0;
String content = "";
byte[] colFam = "select".getBytes();
byte[] column = "lastupdate".getBytes();
byte[] currTime = null;

HBaseRef hbref = new HBaseRef();
JerlType sendjerl = null; //new JerlType();
ArrayList jd = new ArrayList();

InputStream is = null;

while(values.hasNext()){
BytesWritable bw = (BytesWritable)values.next();

String address = new String(bw.get());
try{
System.out.println(address);

method = new GetMethod(address);
method.setFollowRedirects(true);

} catch (Exception e){
System.err.println("Invalid Address");
e.printStackTrace();
}

if (method != null){
try {
// Execute the method.
stat = client.executeMethod(method);

if(stat == 200){
content = "";
is =
(InputStream)(method.getResponseBodyAsStream());

//Write to HBase new stamp select:lastupdate
currTime =
StreamyUtil.LongToBytes(time.GetTimeInMillis() );
jd.add(new JerlData(INPUTTABLE.getBytes(),
((Text)key).getBytes(), colFam, column,
currTime));

if (is !=
null){output.collect(((Text)key).getBytes(), is);}
}
else if(stat == 302){
System.err.println("302 not complete in reader!!
New url = ");
}
else {
System.err.println("Method failed: " +
method.getStatusLine());
currTime  =
StreamyUtil.LongToBytes(Long.MAX_VALUE);
jd.add(new JerlData(INPUTTABLE.getBytes(),
((Text)key).getBytes(), colFam, column,
currTime));
//Set select:lastupdate to Long.MAX_VALUE()
}

 

Basic code organization questions + scheduling

2008-09-07 Thread Tarjei Huse
Hi, I'm planning to use Hadoop for a set of typical crawler/indexer 
tasks. The basic flow is
tasks. The basic flow is


input:array of urls
actions:  |
1.  get pages
  |
2.  extract new urls from pages -> start new job
extract text  -> index / filter (as new jobs)

What I'm considering is how I should build this application to fit into 
the map/reduce context. I'm thinking that step 1 and 2 should be 
separate map/reduce tasks that then pipe things on to the next step.


This is where I am a bit at a loss to see how it is smart to organize the 
code in logical units and also how to spawn new tasks when an old one is 
over.


Is the usual way to control the flow of a set of tasks to have an 
external application running that listens to jobs ending via the 
endNotificationUri and then spawns new tasks or should the job itself 
contain code to create new jobs? Would it be a good idea to use 
Cascading here?
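
(For context, by the notification hook I mean the job.end.notification.url
property, if I read the docs right; roughly:)

import org.apache.hadoop.mapred.JobConf;

public class NotifyingJob {
  public static JobConf withNotification(JobConf conf) {
    // An external scheduler would listen on this URL and kick off the next job.
    // The property name and the $jobId / $jobStatus placeholders are my reading
    // of the docs; please correct me if that is wrong.
    conf.set("job.end.notification.url",
             "http://scheduler.example.com/hadoop-done?jobid=$jobId&status=$jobStatus");
    return conf;
  }
}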


I'm also considering how I should do job scheduling (I have a lot of 
recurring tasks). Has anyone found a good framework for job control of 
recurring tasks, or should I plan to build my own using Quartz?


Any tips/best practices with regard to the issues described above are 
most welcome. Feel free to ask further questions if you find my 
descriptions of the issues lacking.


Kind regards,
Tarjei




specifying number of nodes for job

2008-09-07 Thread Sandy
Hi,

This may be a silly question, but I'm strangely having trouble finding an
answer for it (perhaps I'm looking in the wrong places?).

Suppose I have a cluster with n nodes each with m processors.

I wish to test the performance of, say,  the wordcount program on k
processors, where k is varied from k = 1 ... nm.

How would I do this? I'm having trouble finding the proper command line
option in the commands manual (
http://hadoop.apache.org/core/docs/current/commands_manual.html)



Thank you very much for your time.

-SM


Re: Hadoop custom readers and writers

2008-09-07 Thread Dennis Kubes
It seems the stuff in Nutch trunk is older; I have an updated version. I 
have sent it to you directly.


Amit K Singh wrote:

Thanks Dennis,
I get it; you mean that the big ARC file was not split and there was one map
per ARC file.


In the new code a single file can be split into multiple maps.


Also, I noted that in ArcInputFormat getSplits is not overridden, so how do you
make sure that the ARC file is not split? (The number-of-maps property in the
config?)


It really shouldn't matter unless you need all the map output from a given
input file in the same part-x output file and a partitioner wouldn't
work.  You actually want it to be able to be broken up so it can scale
properly.




Also, any pointers on the other two questions:
1) getSplits for TextInputFormat splits at arbitrary bytes. That might
lead to a truncated line for two mappers. How and where in the source code is
that dealt with? Any pointers would be of great help.


The new code finds gzip boundaries and splits at those.  It will
actually start scanning forward to find the next record at a split;
anything before that is handled by a different map task, which scans a
little past its own split's end.
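
The line-oriented case follows the same convention; stripped down to plain Java
it is essentially this (an illustration only, not Hadoop's actual
LineRecordReader code):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

public class SplitDemo {
  // Convention: a split skips its first, possibly partial, line (unless it
  // starts at byte 0) and keeps reading one line past its end offset, so every
  // line lands in exactly one split and nothing is truncated or lost.
  static List<String> readSplit(String file, long start, long end) throws IOException {
    List<String> lines = new ArrayList<String>();
    RandomAccessFile in = new RandomAccessFile(file, "r");
    try {
      if (start != 0) {
        in.seek(start - 1);
        int b;
        while ((b = in.read()) != -1 && b != '\n') {
          // discard the tail of the line owned by the previous split
        }
      } else {
        in.seek(0);
      }
      long pos = in.getFilePointer();
      String line;
      while (pos < end && (line = in.readLine()) != null) {
        lines.add(line);               // this line may run past 'end'; that is fine
        pos = in.getFilePointer();
      }
    } finally {
      in.close();
    }
    return lines;
  }
}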



2) What is the Record class used for?


Record or RecordReader?

Dennis





Dennis Kubes-2 wrote:
We did something similar with the ARC format where each record (webpage) 
is gzipped and then appended.  It is not exactly the same but it may 
help.  Take a look at the following classes, they are in the Nutch trunk:


org.apache.nutch.tools.arc.ArcInputFormat
org.apache.nutch.tools.arc.ArcRecordReader

The way we did it though was to create an InputFormat and RecordReader 
that extended FileInputFormat and would read and uncompress the records 
on the fly. Unless your files are small I would recommend going that

route.

Dennis

Amit Simgh wrote:

Hi,

I have thousands of webpages, each represented as a serialized tree object,
compressed (zlib) together into files of 2.5 GB to 4.5 GB.

I have to do some heavy text processing on these pages.

What is the best way to read/access these pages?

Method1
***
1) Write a custom splitter that
   1. uncompresses the file (2.5 GB to 4 GB) and then parses it (time:
around 10 minutes)
   2. splits the binary data into 10-20 parts
2) Implement specific readers to read a page and present it to the mapper

OR.

Method -2
***
Read the entire file w/o splitting: one Map task per file.
Implement specific readers to read a page and present it to mapper

Slight detour:
I was browsing through the code in FileInputFormat and TextInputFormat.  In
the getSplits method the file is broken at arbitrary byte boundaries.
So in the case of TextInputFormat, what if the last line seen by a mapper is
truncated (an incomplete byte sequence)?  What happens: is the truncated data
lost or recovered?
Can someone explain and give pointers to where and how in the code this
recovery happens?


I also saw classes like Records.  What are these used for?


Regds
Amit S






Re: no output from job run on cluster

2008-09-07 Thread 叶双明
Are you sure there isn't any error or exception in the logs?

2008/9/5, Shirley Cohen <[EMAIL PROTECTED]>:
>
> Hi Dmitry,
>
> Thanks for your suggestion. I checked and the other systems on the cluster
> do seem to have java installed. I was also able to run the job in single
> mode on the cluster. However, as soon as I add the others 15 nodes to the
> slaves file and re-run the job, the problem appears (i.e. there is zero
> output).
>
> I guess I was going to wait to see if anyone else have seen this problem
> before submitting a bug report.
>
> Shirley
>
> On Sep 4, 2008, at 1:37 PM, Dmitry Pushkarev wrote:
>
> Hi,
>>
>> I'd check the Java version installed; that was the problem in my case, and
>> surprisingly there was no output from Hadoop. If it helps, can you submit a
>> bug request? :)
>>
>> -Original Message-
>> From: Shirley Cohen [mailto:[EMAIL PROTECTED]
>> Sent: Thursday, September 04, 2008 10:07 AM
>> To: core-user@hadoop.apache.org
>> Subject: no output from job run on cluster
>>
>> Hi,
>>
>> I'm running on hadoop-0.18.0. I have a m-r job that executes
>> correctly in standalone mode. However, when run on a cluster, the
>> same job produces zero output. It is very bizarre. I looked in the
>> logs and couldn't find anything unusual. All I see are the usual
>> deprecated filesystem name warnings. Has this ever happened to
>> anyone? Do you have any suggestions on how I might go about
>> diagnosing the problem?
>>
>> Thanks,
>>
>> Shirley
>>
>>
>


-- 
Sorry for my English!! 明


RE: OutOfMemoryError with map jobs

2008-09-07 Thread Leon Mergen
Hello Chris,

>  From the stack trace you provided, your OOM is probably due to
> HADOOP-3931, which is fixed in 0.17.2. It occurs when the deserialized
> key in an outputted record exactly fills the serialization buffer that
> collects map outputs, causing an allocation as large as the size of
> that buffer. It causes an extra spill, an OOM exception if the task
> JVM has a max heap size too small to mask the bug, and will miss the
> combiner if you've defined one, but it won't drop records.

Ok thanks for that information. I guess that means I will have to upgrade. :-)

> > However, I was wondering: are these hard architectural limits? Say
> > that I wanted to emit 25,000 maps for a single input record, would
> > that mean that I will require huge amounts of (virtual) memory? In
> > other words, what exactly is the reason that increasing the number
> > of emitted maps per input record causes an OutOfMemoryError ?
>
> Do you mean the number of output records per input record in the map?
> The memory allocated for collecting records out of the map is (mostly)
> fixed at the size defined in io.sort.mb. The ratio of input records to
> output records does not affect the collection and sort. The number of
> output records can sometimes influence the memory requirements, but
> not significantly. -C

Ok, so I should not have to worry about this too much! Thanks for the reply and 
information!
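
For anyone else hitting this, the knobs involved are roughly these (a sketch;
the property names are my understanding for this Hadoop version, so
double-check them):

import org.apache.hadoop.mapred.JobConf;

public class SortBufferConf {
  public static JobConf tune(JobConf conf) {
    // Size in MB of the in-memory buffer that collects map output before spilling.
    conf.setInt("io.sort.mb", 200);
    // Give the task JVMs enough heap headroom above io.sort.mb.
    conf.set("mapred.child.java.opts", "-Xmx512m");
    return conf;
  }
}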

Regards,

Leon Mergen



Re: Hadoop custom readers and writers

2008-09-07 Thread Amit K Singh

Thanks Dennis,
I get it; you mean that the big ARC file was not split and there was one map
per ARC file.
Also, I noted that in ArcInputFormat getSplits is not overridden, so how do you
make sure that the ARC file is not split? (The number-of-maps property in the
config?)

Also, any pointers on the other two questions:
1) getSplits for TextInputFormat splits at arbitrary bytes. That might
lead to a truncated line for two mappers. How and where in the source code is
that dealt with? Any pointers would be of great help.
2) What is the Record class used for?



Dennis Kubes-2 wrote:
> 
> We did something similar with the ARC format where each record (webpage) 
> is gzipped and then appended.  It is not exactly the same but it may 
> help.  Take a look at the following classes, they are in the Nutch trunk:
> 
> org.apache.nutch.tools.arc.ArcInputFormat
> org.apache.nutch.tools.arc.ArcRecordReader
> 
> The way we did it though was to create an InputFormat and RecordReader 
> that extended FileInputFormat and would read and uncompress the records 
> on the fly. Unless your files are small I would recommend going that
> route.
> 
> Dennis
> 
> Amit Simgh wrote:
>> Hi,
>> 
>> I have thousands of webpages, each represented as a serialized tree object,
>> compressed (zlib) together into files of 2.5 GB to 4.5 GB.
>> I have to do some heavy text processing on these pages.
>> 
>> What is the best way to read/access these pages?
>> 
>> Method1
>> ***
>> 1) Write a custom splitter that
>>   1. uncompresses the file (2.5 GB to 4 GB) and then parses it (time:
>> around 10 minutes)
>>   2. splits the binary data into 10-20 parts
>> 2) Implement specific readers to read a page and present it to the mapper
>> 
>> OR.
>> 
>> Method -2
>> ***
>> Read the entire file w/o splitting: one Map task per file.
>> Implement specific readers to read a page and present it to mapper
>> 
>> Slight detour:
>> I was browsing through the code in FileInputFormat and TextInputFormat.  In
>> the getSplits method the file is broken at arbitrary byte boundaries.
>> So in the case of TextInputFormat, what if the last line seen by a mapper is
>> truncated (an incomplete byte sequence)?  What happens: is the truncated data
>> lost or recovered?
>> Can someone explain and give pointers to where and how in the code this
>> recovery happens?
>> 
>> I also saw classes like Records.  What are these used for?
>> 
>> 
>> Regds
>> Amit S
> 
> 

-- 
View this message in context: 
http://www.nabble.com/Hadoop-custom-readers-and-writers-tp19350341p19356648.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Could not obtain block: blk_-2634319951074439134_1129 file=/user/root/crawl_debug/segments/20080825053518/content/part-00002/data

2008-09-07 Thread Dhruba Borthakur
The DFS errors might have been caused by

http://issues.apache.org/jira/browse/HADOOP-4040

thanks,
dhruba

On Sat, Sep 6, 2008 at 6:59 AM, Devaraj Das <[EMAIL PROTECTED]> wrote:
> These exceptions are apparently coming from the dfs side of things. Could
> someone from the dfs side please look at these?
>
>
> On 9/5/08 3:04 PM, "Espen Amble Kolstad" <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> Thanks!
>> The patch applies without change to hadoop-0.18.0, and should be
>> included in a 0.18.1.
>>
>> However, I'm still seeing:
>> in hadoop.log:
>> 2008-09-05 11:13:54,805 WARN  dfs.DFSClient - Exception while reading
>> from blk_3428404120239503595_2664 of
>> /user/trank/segments/20080905102650/crawl_generate/part-00010 from
>> somehost:50010: java.io.IOException: Premeture EOF from inputStream
>>
>> in datanode.log:
>> 2008-09-05 11:15:09,554 WARN  dfs.DataNode -
>> DatanodeRegistration(somehost:50010,
>> storageID=DS-751763840-somehost-50010-1219931304453, infoPort=50075,
>> ipcPort=50020):Got exception while serving
>> blk_-4682098638573619471_2662 to
>> /somehost:
>> java.net.SocketTimeoutException: 48 millis timeout while waiting
>> for channel to be ready for write. ch :
>> java.nio.channels.SocketChannel[connected local=/somehost:50010
>> remote=/somehost:45244]
>>
>> These entries in datanode.log happen a few minutes apart repeatedly.
>> I've reduced # map-tasks so load on this node is below 1.0 with 5GB of
>> free memory (so it's not resource starvation).
>>
>> Espen
>>
>> On Thu, Sep 4, 2008 at 3:33 PM, Devaraj Das <[EMAIL PROTECTED]> wrote:
 I started a profile of the reduce-task. I've attached the profiling output.
 It seems from the samples that ramManager.waitForDataToMerge() doesn't
 actually wait.
 Has anybody seen this behavior?
>>>
>>> This has been fixed in HADOOP-3940
>>>
>>>
>>> On 9/4/08 6:36 PM, "Espen Amble Kolstad" <[EMAIL PROTECTED]> wrote:
>>>
 I have the same problem on our cluster.

 It seems the reducer-tasks are using all cpu, long before there's anything
 to
 shuffle.

 I started a profile of the reduce-task. I've attached the profiling output.
 It seems from the samples that ramManager.waitForDataToMerge() doesn't
 actually wait.
 Has anybody seen this behavior?

 Espen

 On Thursday 28 August 2008 06:11:42 wangxu wrote:
> Hi,all
> I am using hadoop-0.18.0-core.jar and nutch-2008-08-18_04-01-55.jar,
> and running hadoop on one namenode and 4 slaves.
> attached is my hadoop-site.xml, and I didn't change the file
> hadoop-default.xml
>
> when data in segments are large, this kind of error occurs:
>
> java.io.IOException: Could not obtain block: blk_-2634319951074439134_1129
> file=/user/root/crawl_debug/segments/20080825053518/content/part-2/data
> at
> org.apache.hadoop.dfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.jav
> a:1462) at
> org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1
> 312) at
> org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1417) 
> at
> java.io.DataInputStream.readFully(DataInputStream.java:178)
> at
> org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:64
> ) at 
> org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:102)
> at
> org.apache.hadoop.io.SequenceFile$Reader.readBuffer(SequenceFile.java:1646)
> at
> org.apache.hadoop.io.SequenceFile$Reader.seekToCurrentValue(SequenceFile.ja
> va:1712) at
> org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:
> 1787) at
> org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceF
> ileRecordReader.java:104) at
> org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordRe
> ader.java:79) at
> org.apache.hadoop.mapred.join.WrappedRecordReader.next(WrappedRecordReader.
> java:112) at
> org.apache.hadoop.mapred.join.WrappedRecordReader.accept(WrappedRecordReade
> r.java:130) at
> org.apache.hadoop.mapred.join.CompositeRecordReader.fillJoinCollector(Compo
> siteRecordReader.java:398) at
> org.apache.hadoop.mapred.join.JoinRecordReader.next(JoinRecordReader.java:5
> 6) at
> org.apache.hadoop.mapred.join.JoinRecordReader.next(JoinRecordReader.java:3
> 3) at
> org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:165)
> at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:45)
> at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
> at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)
>
>
> how can I correct this?
> thanks.
> Xu

>>>
>>>
>>>
>
>
>