RE: OutOfMemoryError with map jobs

2008-09-07 Thread Leon Mergen
Hello Chris,

  From the stack trace you provided, your OOM is probably due to
 HADOOP-3931, which is fixed in 0.17.2. It occurs when the deserialized
 key in an outputted record exactly fills the serialization buffer that
 collects map outputs, causing an allocation as large as the size of
 that buffer. It causes an extra spill, an OOM exception if the task
 JVM has a max heap size too small to mask the bug, and will miss the
 combiner if you've defined one, but it won't drop records.

Ok thanks for that information. I guess that means I will have to upgrade. :-)

  However, I was wondering: are these hard architectural limits? Say
  that I wanted to emit 25,000 maps for a single input record, would
  that mean that I would require huge amounts of (virtual) memory? In
  other words, what exactly is the reason that increasing the number
  of emitted maps per input record causes an OutOfMemoryError?

 Do you mean the number of output records per input record in the map?
 The memory allocated for collecting records out of the map is (mostly)
 fixed at the size defined in io.sort.mb. The ratio of input records to
 output records does not affect the collection and sort. The number of
 output records can sometimes influence the memory requirements, but
 not significantly. -C
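 For reference, a rough sketch of bumping that buffer and the task heap
 from a job driver, using the 0.17-era JobConf API; the class name and
 the values here are only placeholders:

 import org.apache.hadoop.mapred.JobConf;

 public class TuneSortBuffer {
   public static JobConf configure() {
     JobConf conf = new JobConf(TuneSortBuffer.class);
     // Size (in MB) of the buffer that collects serialized map output before spilling.
     conf.setInt("io.sort.mb", 100);
     // Give the task JVM enough heap to hold that buffer plus working space.
     conf.set("mapred.child.java.opts", "-Xmx512m");
     return conf;
   }
 }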

Ok, so I should not have to worry about this too much! Thanks for the reply and 
information!

Regards,

Leon Mergen



Re: no output from job run on cluster

2008-09-07 Thread 叶双明
Are you sure there isn't any error or exception in the logs?

2008/9/5, Shirley Cohen [EMAIL PROTECTED]:

 Hi Dmitry,

 Thanks for your suggestion. I checked and the other systems on the cluster
 do seem to have java installed. I was also able to run the job in single
 mode on the cluster. However, as soon as I add the other 15 nodes to the
 slaves file and re-run the job, the problem appears (i.e. there is zero
 output).

 I guess I was going to wait to see if anyone else has seen this problem
 before submitting a bug report.

 Shirley

 On Sep 4, 2008, at 1:37 PM, Dmitry Pushkarev wrote:

 Hi,

 I'd check the Java version installed; that was the problem in my case, and
 surprisingly there was no output from Hadoop. If it helps - can you submit a
 bug report? :)

 -Original Message-
 From: Shirley Cohen [mailto:[EMAIL PROTECTED]
 Sent: Thursday, September 04, 2008 10:07 AM
 To: core-user@hadoop.apache.org
 Subject: no output from job run on cluster

 Hi,

 I'm running on hadoop-0.18.0. I have a m-r job that executes
 correctly in standalone mode. However, when run on a cluster, the
 same job produces zero output. It is very bizarre. I looked in the
 logs and couldn't find anything unusual. All I see are the usual
 deprecated filesystem name warnings. Has this ever happened to
 anyone? Do you have any suggestions on how I might go about
 diagnosing the problem?

 Thanks,

 Shirley





-- 
Sorry for my English!! 明


Re: Hadoop custom readers and writers

2008-09-07 Thread Dennis Kubes
It seems the stuff in the Nutch trunk is older; I have an updated version. I
have sent it to you directly.


Amit K Singh wrote:

Thanks Dennis,
I get it: you mean that the big ARC file was not split and there was one map
per ARC file.


In the new code a single file can be split into multiple maps.


Also, I noted that in ArcInputFormat getSplits is not overridden, so how do you
make sure that the ARC file is not split? (The number-of-maps property in the
config?)


It really shouldn't matter unless you need all the map output from a given
input file in the same part-x output file and a partitioner wouldn't
work.  You actually want the file to be able to be broken up so it can scale
properly.
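
If you ever did need to force one map per file, the usual trick with the
0.17/0.18 mapred API is simply to refuse to split the file. A minimal sketch,
using TextInputFormat only as an example base class:

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Hypothetical variant that forces one map task per input file.
public class NonSplittingTextInputFormat extends TextInputFormat {
  @Override
  protected boolean isSplitable(FileSystem fs, Path file) {
    return false;
  }
}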




Also, any pointers on the other two questions?
1) getSplits for TextInputFormat splits at arbitrary byte offsets. Now that
might leave a truncated line shared between two mappers. How and where in the
source code is that dealt with? Any pointers would be of great help.


The new code finds gzip boundaries and splits at those.  It will
actually start scanning forward to find the next record at a split;
anything before that is handled by a different map task, which scans a
little past its own split end.
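
As a rough illustration of that convention, simplified here to
newline-delimited records (the ARC reader scans for gzip boundaries instead);
the class and helper names are made up for the sketch:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.JobConf;

// Sketch of the split-boundary convention: skip the partial record at the
// start of a split, and let the last record run past the split's end.
public class BoundaryAwareReader {
  private final FSDataInputStream in;
  private final long end;
  private long pos;

  public BoundaryAwareReader(FileSplit split, JobConf job) throws IOException {
    long start = split.getStart();
    end = start + split.getLength();
    FileSystem fs = split.getPath().getFileSystem(job);
    in = fs.open(split.getPath());
    in.seek(start);
    pos = start;
    if (start != 0) {
      // Skip the partial record at the head of this split; the reader for
      // the previous split reads past its own end and picks that record up.
      skipToNextDelimiter();
    }
  }

  // Emit every record that starts before 'end'; the last one may run past
  // the split boundary, which is expected.
  public boolean next(Text value) throws IOException {
    if (pos >= end) {
      return false;
    }
    value.clear();
    boolean readAnything = false;
    int b;
    while ((b = in.read()) != -1) {
      pos++;
      readAnything = true;
      if (b == '\n') {
        break;
      }
      value.append(new byte[] { (byte) b }, 0, 1);
    }
    return readAnything;
  }

  private void skipToNextDelimiter() throws IOException {
    int b;
    while ((b = in.read()) != -1) {
      pos++;
      if (b == '\n') {
        break;
      }
    }
  }
}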



2) What is the Record class used for?


Record or RecordReader?

Dennis





Dennis Kubes-2 wrote:
We did something similar with the ARC format, where each record (webpage)
is gzipped and then appended.  It is not exactly the same but it may
help.  Take a look at the following classes; they are in the Nutch trunk:


org.apache.nutch.tools.arc.ArcInputFormat
org.apache.nutch.tools.arc.ArcRecordReader

The way we did it, though, was to create an InputFormat and RecordReader
that extended FileInputFormat and would read and uncompress the records
on the fly. Unless your files are small, I would recommend going that
route.

Dennis

Amit Simgh wrote:

Hi,

I have thousands of webpages, each represented as a serialized tree object,
compressed (ZLIB) together (file sizes varying from 2.5 GB to 4.5 GB).

I have to do some heavy text processing on these pages.

What is the best way to read/access these pages?

Method 1
***
1) Write a custom splitter that
   1. uncompresses the file (2.5 GB to 4 GB) and then parses it (time:
      around 10 minutes)
   2. splits the binary data into 10-20 parts
2) Implement specific readers to read a page and present it to the mapper

OR.

Method 2
***
Read the entire file without splitting: one map task per file.
Implement specific readers to read a page and present it to the mapper.

Slight detour:
I was browsing through the code in FileInputFormat and TextInputFormat. In
the getSplits method the file is broken at arbitrary byte boundaries.
So in the case of TextInputFormat, what if the last line in a mapper's split
is truncated (an incomplete byte sequence)? What happens? Is the truncated
data lost or recovered?
Can someone explain and give pointers to where and how this recovery happens
in the code?


I also saw classes like Record. What are these used for?


Regds
Amit S






specifying number of nodes for job

2008-09-07 Thread Sandy
Hi,

This may be a silly question, but I'm strangely having trouble finding an
answer for it (perhaps I'm looking in the wrong places?).

Suppose I have a cluster with n nodes each with m processors.

I wish to test the performance of, say,  the wordcount program on k
processors, where k is varied from k = 1 ... nm.

How would I do this? I'm having trouble finding the proper command line
option in the commands manual (
http://hadoop.apache.org/core/docs/current/commands_manual.html)



Thank you very much for your time.

-SM


Basic code organization questions + scheduling

2008-09-07 Thread Tarjei Huse
Hi, I'm planning to use Hadoop for a set of typical crawler/indexer
tasks. The basic flow is:


input:   array of urls
actions:     |
1.       get pages
             |
2.       extract new urls from pages -> start new job
         extract text -> index / filter (as new jobs)

What I'm considering is how I should build this application to fit into
the map/reduce context. I'm thinking that steps 1 and 2 should be
separate map/reduce jobs that then pipe things on to the next step.


This is where I am a bit at a loss as to how to organize the code into
logical units, and also how to spawn new jobs when an old one is
finished.


Is the usual way to control the flow of a set of tasks to have an
external application running that listens for jobs ending via the
endNotificationUri and then spawns new tasks, or should the job itself
contain code to create new jobs? Would it be a good idea to use
Cascading here?


I'm also considering how I should do job scheduling (I have a lot of
recurring tasks). Has anyone found a good framework for job control of
recurring tasks, or should I plan to build my own using Quartz?


Any tips/best practices with regard to the issues described above are 
most welcome. Feel free to ask further questions if you find my 
descriptions of the issues lacking.


Kind regards,
Tarjei




Failing MR jobs!

2008-09-07 Thread Erik Holstad
Hi!
I'm trying to run a MR job, but it keeps on failing and I can't understand
why.
Sometimes it shows output at 66% and sometimes 98% or so.
I had a couple of exceptions before that I didn't catch that made the job
fail.


The log file from the task can be found at:
http://pastebin.com/m4414d369


and the code looks like:
//Java
import java.io.*;
import java.util.*;
import java.net.*;

//Hadoop
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

//HBase
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.mapred.*;
import org.apache.hadoop.hbase.io.*;
import org.apache.hadoop.hbase.client.*;
// org.apache.hadoop.hbase.client.HTable

//Extra
import org.apache.commons.cli.ParseException;

import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
import org.apache.commons.httpclient.*;
import org.apache.commons.httpclient.methods.*;
import org.apache.commons.httpclient.params.HttpMethodParams;


public class SerpentMR1 extends TableMap implements Mapper, Tool {

//Setting DebugLevel
private static final int DL = 0;

//Setting up the variables for the MR job
private static final String NAME = "SerpentMR1";
private static final String INPUTTABLE = "sources";
private final String[] COLS = {"content:feedurl", "content:ttl",
"content:updated"};


private Configuration conf;

public JobConf createSubmittableJob(String[] args) throws IOException{
JobConf c = new JobConf(getConf(), SerpentMR1.class);
String jar = "/home/hbase/SerpentMR/" + NAME + ".jar";
c.setJar(jar);
c.setJobName(NAME);

int mapTasks = 4;
int reduceTasks = 20;

c.setNumMapTasks(mapTasks);
c.setNumReduceTasks(reduceTasks);

String inputCols = "";
for (int i = 0; i < COLS.length; i++){ inputCols += COLS[i] + " "; }

TableMap.initJob(INPUTTABLE, inputCols, this.getClass(), Text.class,
BytesWritable.class, c);
//Classes between:

c.setOutputFormat(TextOutputFormat.class);
Path path = new Path("users"); //inserting into a temp table
FileOutputFormat.setOutputPath(c, path);

c.setReducerClass(MyReducer.class);
return c;
}

public void map(ImmutableBytesWritable key, RowResult res,
OutputCollector output, Reporter reporter)
throws IOException {
Cell cellLast= res.get(COLS[2].getBytes());//lastupdate

long oldTime = cellLast.getTimestamp();

Cell cell_ttl= res.get(COLS[1].getBytes());//ttl
long ttl = StreamyUtil.BytesToLong(cell_ttl.getValue() );
byte[] url = null;

long currTime = time.GetTimeInMillis();

if(currTime - oldTime > ttl){ // '>' assumed: fetch again once the ttl has elapsed
url = res.get(COLS[0].getBytes()).getValue();//url
output.collect(new Text(Base64.encode_strip(res.getRow())), new
BytesWritable(url) );
}
}



public static class MyReducer implements Reducer{
//org.apache.hadoop.mapred.Reducer{


private int timeout = 1000; //Sets the connection timeout time ms;

public void reduce(Object key, Iterator values, OutputCollector
output, Reporter rep)
throws IOException {
HttpClient client = new HttpClient();//new
MultiThreadedHttpConnectionManager());
client.getHttpConnectionManager().
getParams().setConnectionTimeout(timeout);

GetMethod method = null;

int stat = 0;
String content = "";
byte[] colFam = "select".getBytes();
byte[] column = "lastupdate".getBytes();
byte[] currTime = null;

HBaseRef hbref = new HBaseRef();
JerlType sendjerl = null; //new JerlType();
ArrayList jd = new ArrayList();

InputStream is = null;

while(values.hasNext()){
BytesWritable bw = (BytesWritable)values.next();

String address = new String(bw.get());
try{
System.out.println(address);

method = new GetMethod(address);
method.setFollowRedirects(true);

} catch (Exception e){
System.err.println("Invalid Address");
e.printStackTrace();
}

if (method != null){
try {
// Execute the method.
stat = client.executeMethod(method);

if(stat == 200){
content = "";
is =
(InputStream)(method.getResponseBodyAsStream());

//Write to HBase new stamp select:lastupdate
currTime =
StreamyUtil.LongToBytes(time.GetTimeInMillis() );
jd.add(new 

task assignment management

2008-09-07 Thread Dmitry Pushkarev
Dear Hadoop users,

 

Is it possible, without using Java, to manage task assignment so as to
implement some simple rules? For example: do not launch more than one instance
of a crawling task on a machine, do not run data-intensive tasks on remote
machines, and do not run computationally intensive tasks on single-core
machines, etc.

 

Right now it's done by failing tasks that decided to run on the wrong machine,
but I hope to find some solution on the jobtracker side.

 

---

Dmitry



Getting started questions

2008-09-07 Thread John Howland
I've been reading up on Hadoop for a while now and I'm excited that I'm
finally getting my feet wet with the examples + my own variations. If anyone
could answer any of the following questions, I'd greatly appreciate it.

1. I'm processing document collections, with the number of documents ranging
from 10,000 - 10,000,000. What is the best way to store this data for
effective processing?

 - The bodies of the documents usually range from 1K-100KB in size, but some
outliers can be as big as 4-5GB.
 - I will also need to store some metadata for each document which I figure
could be stored as JSON or XML.
 - I'll typically filter on the metadata and then do standard operations
on the bodies, like word frequency and searching.

Is there a canned FileInputFormat that makes sense? Should I roll my own?
How can I access the bodies as streams so I don't have to read them into RAM
all at once? Am I right in thinking that I should treat each document as a
record and map across them, or do I need to be more creative in what I'm
mapping across?

2. Some of the tasks I want to run are pure map operations (no reduction),
where I'm calculating new metadata fields on each document. To end up with a
good result set, I'll need to copy the entire input record + new fields into
another set of output files. Is there a better way? I haven't wanted to go
down the HBase road because it can't handle very large values (for the
bodies) and it seems to make the most sense to keep the document bodies
together with the metadata, to allow for the greatest locality of reference
on the datanodes.

3. I'm sure this is not a new idea, but I haven't seen anything regarding
it... I'll need to run several MR jobs as a pipeline... is there any way for
the map tasks in a subsequent stage to begin processing data from previous
stage's reduce task before that reducer has fully finished?

Whatever insight folks could lend me would be a big help in crossing the
chasm from the Word Count and associated examples to something more real.
A whole heap of thanks in advance,

John


Re: specifying number of nodes for job

2008-09-07 Thread Mafish Liu
On Mon, Sep 8, 2008 at 2:25 AM, Sandy [EMAIL PROTECTED] wrote:

 Hi,

 This may be a silly question, but I'm strangely having trouble finding an
 answer for it (perhaps I'm looking in the wrong places?).

 Suppose I have a cluster with n nodes each with m processors.

 I wish to test the performance of, say,  the wordcount program on k
 processors, where k is varied from k = 1 ... nm.


You can specify the number of map and reduce task slots for each node in your
hadoop-site.xml file (mapred.tasktracker.map.tasks.maximum and
mapred.tasktracker.reduce.tasks.maximum).
So you can vary k in steps of n (k = n, 2*n, ... m*n) instead of k = 1 ... nm.
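
For example, in hadoop-site.xml on each slave (the values here are only
examples):

<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>
</property>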


 How would I do this? I'm having trouble finding the proper command line
 option in the commands manual (
 http://hadoop.apache.org/core/docs/current/commands_manual.html)



 Thank you very much for your time.

 -SM




-- 
[EMAIL PROTECTED]
Institute of Computing Technology, Chinese Academy of Sciences, Beijing.


Re: How to debug org.apache.hadoop.streaming.TestMultipleCachefiles.testMultipleCachefiles

2008-09-07 Thread Shihua Ma


You can put src on your classpath, because the HttpServer scans for the
webapps folder on your classpath to start the Jetty web server.

Josh Ma

http://www.hadoop.org.cn http://www.hadoop.org.cn 


Abdul Qadeer wrote:
 
 I am running a single-node Hadoop.  If I try to debug
 org.apache.hadoop.streaming.TestMultipleCachefiles.testMultipleCachefiles,
 the following exception says that
 I haven't put webapps on the classpath.  I have in fact put src/webapps on
 the classpath.
 So I was wondering what is wrong.
 
 2008-09-06 20:03:28,183 INFO  fs.FSNamesystem
 (FSNamesystem.java:setConfigurationParameters(409)) -
 fsOwner=personalpc\admin,None,root,Administrators,Users
 2008-09-06 20:03:28,191 INFO  fs.FSNamesystem
 (FSNamesystem.java:setConfigurationParameters(413)) -
 supergroup=supergroup
 2008-09-06 20:03:28,191 INFO  fs.FSNamesystem
 (FSNamesystem.java:setConfigurationParameters(414)) -
 isPermissionEnabled=true
 2008-09-06 20:03:28,440 INFO  common.Storage
 (FSImage.java:saveFSImage(897))
 - Image file of size 90 saved in 0 seconds.
 2008-09-06 20:03:28,510 INFO  common.Storage (FSImage.java:format(952)) -
 Storage directory dfs\name1 has been successfully formatted.
 2008-09-06 20:03:28,538 INFO  common.Storage
 (FSImage.java:saveFSImage(897))
 - Image file of size 90 saved in 0 seconds.
 2008-09-06 20:03:28,590 INFO  common.Storage (FSImage.java:format(952)) -
 Storage directory dfs\name2 has been successfully formatted.
 2008-09-06 20:03:28,904 INFO  metrics.RpcMetrics
 (RpcMetrics.java:init(56)) - Initializing RPC Metrics with
 hostName=NameNode, port=57467
 2008-09-06 20:03:29,246 INFO  namenode.NameNode
 (NameNode.java:initialize(156)) - Namenode up at:
 127.0.0.1/127.0.0.1:57467
 2008-09-06 20:03:29,290 INFO  jvm.JvmMetrics (JvmMetrics.java:init(67)) -
 Initializing JVM Metrics with processName=NameNode, sessionId=null
 2008-09-06 20:03:29,352 INFO  metrics.NameNodeMetrics
 (NameNodeMetrics.java:init(85)) - Initializing NameNodeMeterics using
 context object:org.apache.hadoop.metrics.spi.NullContext
 2008-09-06 20:03:29,417 INFO  fs.FSNamesystem
 (FSNamesystem.java:setConfigurationParameters(409)) -
 fsOwner=personalpc\admin,None,root,Administrators,Users
 2008-09-06 20:03:29,418 INFO  fs.FSNamesystem
 (FSNamesystem.java:setConfigurationParameters(413)) -
 supergroup=supergroup
 2008-09-06 20:03:29,418 INFO  fs.FSNamesystem
 (FSNamesystem.java:setConfigurationParameters(414)) -
 isPermissionEnabled=true
 2008-09-06 20:03:29,442 INFO  metrics.FSNamesystemMetrics
 (FSNamesystemMetrics.java:init(60)) - Initializing FSNamesystemMeterics
 using context object:org.apache.hadoop.metrics.spi.NullContext
 2008-09-06 20:03:29,444 INFO  fs.FSNamesystem
 (FSNamesystem.java:registerMBean(4274)) - Registered
 FSNamesystemStatusMBean
 2008-09-06 20:03:29,519 INFO  common.Storage
 (FSImage.java:loadFSImage(748))
 - Number of files = 0
 2008-09-06 20:03:29,519 INFO  common.Storage
 (FSImage.java:loadFilesUnderConstruction(1052)) - Number of files under
 construction = 0
 2008-09-06 20:03:29,520 INFO  common.Storage
 (FSImage.java:loadFSImage(683))
 - Image file of size 90 loaded in 0 seconds.
 2008-09-06 20:03:29,521 INFO  common.Storage
 (FSEditLog.java:loadFSEdits(667)) - Edits file edits of size 4 edits # 0
 loaded in 0 seconds.
 2008-09-06 20:03:29,541 INFO  common.Storage
 (FSImage.java:saveFSImage(897))
 - Image file of size 90 saved in 0 seconds.
 2008-09-06 20:03:29,581 INFO  common.Storage
 (FSImage.java:saveFSImage(897))
 - Image file of size 90 saved in 0 seconds.
 2008-09-06 20:03:29,727 INFO  fs.FSNamesystem
 (FSNamesystem.java:initialize(307)) - Finished loading FSImage in 373
 msecs
 2008-09-06 20:03:29,761 INFO  hdfs.StateChange
 (FSNamesystem.java:leave(3795)) - STATE* Leaving safe mode after 0 secs.
 2008-09-06 20:03:29,762 INFO  hdfs.StateChange
 (FSNamesystem.java:leave(3804)) - STATE* Network topology has 0 racks and
 0
 datanodes
 2008-09-06 20:03:29,762 INFO  hdfs.StateChange
 (FSNamesystem.java:leave(3807)) - STATE* UnderReplicatedBlocks has 0
 blocks
 2008-09-06 20:03:30,173 ERROR fs.FSNamesystem
 (FSNamesystem.java:init(286)) - FSNamesystem initialization failed.
 java.io.IOException: webapps not found in CLASSPATH
 at
 org.apache.hadoop.http.HttpServer.getWebAppsPath(HttpServer.java:157)
 at org.apache.hadoop.http.HttpServer.init(HttpServer.java:74)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.initialize(FSNamesystem.java:347)
 at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.init(FSNamesystem.java:284)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:160)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:205)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.init(NameNode.java:191)
 at
 org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:849)
 at
 org.apache.hadoop.hdfs.MiniDFSCluster.init(MiniDFSCluster.java:270)
 at
 

Re: Basic code organization questions + scheduling

2008-09-07 Thread Alex Loddengaard
Hi Tarjei,

You should take a look at Nutch.  It's a search engine built on Lucene,
and it can be set up on top of Hadoop.  Take a look:

http://lucene.apache.org/nutch/
-and-
http://wiki.apache.org/nutch/NutchHadoopTutorial

Hope this helps!

Alex

On Mon, Sep 8, 2008 at 2:54 AM, Tarjei Huse [EMAIL PROTECTED] wrote:

 Hi, I'm planning to use Hadoop for a set of typical crawler/indexer
 tasks. The basic flow is:

 input:   array of urls
 actions:     |
 1.       get pages
              |
 2.       extract new urls from pages -> start new job
          extract text -> index / filter (as new jobs)

 What I'm considering is how I should build this application to fit into the
 map/reduce context. I'm thinking that steps 1 and 2 should be separate
 map/reduce jobs that then pipe things on to the next step.

 This is where I am a bit at a loss as to how to organize the code into
 logical units, and also how to spawn new jobs when an old one is
 finished.

 Is the usual way to control the flow of a set of tasks to have an external
 application running that listens for jobs ending via the endNotificationUri
 and then spawns new tasks, or should the job itself contain code to create
 new jobs? Would it be a good idea to use Cascading here?

 I'm also considering how I should do job scheduling (I have a lot of
 recurring tasks). Has anyone found a good framework for job control of
 recurring tasks, or should I plan to build my own using Quartz?

 Any tips/best practices with regard to the issues described above are most
 welcome. Feel free to ask further questions if you find my descriptions of
 the issues lacking.

 Kind regards,
 Tarjei





distcp failing

2008-09-07 Thread Michael Di Domenico
I'm attempting to load data into Hadoop (version 0.17.1) from a
non-datanode machine in the cluster.  I can run jobs and copyFromLocal works
fine, but when I try to use distcp I get the error below.  I don't understand
the error; can anyone help?
Thanks

blue:hadoop-0.17.1 mdidomenico$ time bin/hadoop distcp -overwrite
file:///Users/mdidomenico/hadoop/1gTestfile /user/mdidomenico/1gTestfile
08/09/07 23:56:06 INFO util.CopyFiles:
srcPaths=[file:/Users/mdidomenico/hadoop/1gTestfile]
08/09/07 23:56:06 INFO util.CopyFiles:
destPath=/user/mdidomenico/1gTestfile1
08/09/07 23:56:07 INFO util.CopyFiles: srcCount=1
With failures, global counters are inaccurate; consider running with -i
Copy failed: org.apache.hadoop.ipc.RemoteException: java.io.IOException:
/tmp/hadoop-hadoop/mapred/system/job_200809072254_0005/job.xml: No such file
or directory
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:215)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:149)
at
org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1155)
at
org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1136)
at
org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:175)
at
org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:1755)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896)

at org.apache.hadoop.ipc.Client.call(Client.java:557)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212)
at $Proxy1.submitJob(Unknown Source)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
at
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
at $Proxy1.submitJob(Unknown Source)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:758)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973)
at org.apache.hadoop.util.CopyFiles.copy(CopyFiles.java:604)
at org.apache.hadoop.util.CopyFiles.run(CopyFiles.java:743)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.util.CopyFiles.main(CopyFiles.java:763)


Re: Getting started questions

2008-09-07 Thread Dennis Kubes



John Howland wrote:

I've been reading up on Hadoop for a while now and I'm excited that I'm
finally getting my feet wet with the examples + my own variations. If anyone
could answer any of the following questions, I'd greatly appreciate it.

1. I'm processing document collections, with the number of documents ranging
from 10,000 - 10,000,000. What is the best way to store this data for
effective processing?


AFAIK hadoop doesn't do well with, although it can handle, a large 
number of small files.  So it would be better to read in the documents 
and store them in SequenceFile or MapFile format.  This would be similar 
to the way the Fetcher works in Nutch.  10M documents in a sequence/map 
file on DFS is comparatively small and can be handled efficiently.
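
For example, a rough sketch of packing raw documents into a SequenceFile
keyed by document id; the output path and the loadDocument() helper are
placeholders, not anything that exists in Hadoop or Nutch:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// Pack many small documents into one SequenceFile on DFS.
public class DocPacker {
  public static void pack(String[] docIds) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("docs/docs.seq");  // placeholder output path
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
    try {
      for (String id : docIds) {
        byte[] body = loadDocument(id);  // placeholder: fetch the raw document body
        writer.append(new Text(id), new BytesWritable(body));
      }
    } finally {
      writer.close();
    }
  }

  private static byte[] loadDocument(String id) throws IOException {
    // Placeholder: read the document from wherever it lives today.
    return id.getBytes("UTF-8");
  }
}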




 - The bodies of the documents usually range from 1K-100KB in size, but some
outliers can be as big as 4-5GB.


I would say store your document objects as Text objects; I'm not sure if
Text has a max size.  I think it does, but I'm not sure what it is.  If it
does, you can always store them as a BytesWritable, which is just an array of
bytes.  But you are going to have memory issues reading in and writing
out records that large.



 - I will also need to store some metadata for each document which I figure
could be stored as JSON or XML.
 - I'll typically filter on the metadata and then do standard operations
on the bodies, like word frequency and searching.


It is possible to create an OutputFormat that writes out multiple files. 
 You could also use a MapWritable as the value to store the document 
and associated metadata.




Is there a canned FileInputFormat that makes sense? Should I roll my own?
How can I access the bodies as streams so I don't have to read them into RAM


A writable is read into RAM so even treating it like a stream doesn't 
get around that.


One thing you might want to consider is to  tar up say X documents at a 
time and store that as a file in DFS.  You would have many of these 
files.  Then have an index that has the offsets of the files and their 
keys (document ids).  That index can be passed as input into a MR job 
that can then go to DFS and stream out the file as you need it.  The job 
will be slower because you are doing it this way but it is a solution to 
handling such large documents as streams.
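
Something like this on the reading side; the archive path, offset, and length
would come from that index, and the class here is only a sketch:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Seek to a known offset in a large DFS file and process the document as a
// stream instead of pulling it into memory in one piece.
public class DocStreamer {
  public static void streamDocument(Configuration conf, Path archive,
                                    long offset, long length) throws IOException {
    FileSystem fs = archive.getFileSystem(conf);
    FSDataInputStream in = fs.open(archive);
    try {
      in.seek(offset);
      byte[] buf = new byte[64 * 1024];
      long remaining = length;
      while (remaining > 0) {
        int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
        if (n < 0) {
          break;
        }
        // Process buf[0..n) here: word counts, searching, etc.
        remaining -= n;
      }
    } finally {
      in.close();
    }
  }
}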



all at once? Am I right in thinking that I should treat each document as a
record and map across them, or do I need to be more creative in what I'm
mapping across?

2. Some of the tasks I want to run are pure map operations (no reduction),
where I'm calculating new metadata fields on each document. To end up with a
good result set, I'll need to copy the entire input record + new fields into
another set of output files. Is there a better way? I haven't wanted to go
down the HBase road because it can't handle very large values (for the
bodies) and it seems to make the most sense to keep the document bodies
together with the metadata, to allow for the greatest locality of reference
on the datanodes.


If you don't specify a reducer, the IdentityReducer is run, which simply
passes the output through.




3. I'm sure this is not a new idea, but I haven't seen anything regarding
it... I'll need to run several MR jobs as a pipeline... is there any way for
the map tasks in a subsequent stage to begin processing data from previous
stage's reduce task before that reducer has fully finished?


Yup, just use FileOutputFormat.getOutputPath(previousJobConf);
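
Roughly, with the 0.18 mapred API (the two JobConf objects here stand in for
whatever your real pipeline stages are):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

// Chain two jobs: the second reads whatever the first wrote.
public class Pipeline {
  public static void run(JobConf first, JobConf second) throws IOException {
    // Run the first job to completion.
    JobClient.runJob(first);
    // Feed its output directory straight into the second job.
    Path intermediate = FileOutputFormat.getOutputPath(first);
    FileInputFormat.setInputPaths(second, intermediate);
    JobClient.runJob(second);
  }
}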

Dennis


Whatever insight folks could lend me would be a big help in crossing the
chasm from the Word Count and associated examples to something more real.
A whole heap of thanks in advance,

John



Parhely (ORM for HBase) released!

2008-09-07 Thread Marcus Herou
Hi guys.

Finally I released the first draft of an ORM for HBase named Parhely.
Check it out at http://dev.tailsweep.com/

Kindly

//Marcus


Re: task assignment management

2008-09-07 Thread Devaraj Das
No, that is not possible today. However, you might want to look at the
TaskScheduler to see if you can implement a scheduler that provides this kind
of task scheduling.

In the current Hadoop, one point regarding computationally intensive tasks is
that if a machine is not able to keep up with the rest of the machines
(and the task on that machine is running slower than the others), speculative
execution, if enabled, can help a lot. Also, implicitly, faster/better
machines get more work than the slower machines.


On 9/8/08 3:27 AM, Dmitry Pushkarev [EMAIL PROTECTED] wrote:

 Dear Hadoop users,
 
  
 
 Is it possible, without using Java, to manage task assignment so as to
 implement some simple rules? For example: do not launch more than one
 instance of a crawling task on a machine, do not run data-intensive tasks on
 remote machines, and do not run computationally intensive tasks on
 single-core machines, etc.
 
  
 
 Right now it's done by failing tasks that decided to run on the wrong
 machine, but I hope to find some solution on the jobtracker side.
 
  
 
 ---
 
 Dmitry