Re: Simultaneous jobs and map/reduce sharing

2008-04-24 Thread Ted Dunning

There is a JIRA out for better scheduling.

In the meantime, you can hack this by increasing the number of tasks per
node to more than the number of cores and setting the maximum number of maps
and reduces to a value less than the number of task slots available.  This
lets you piggyback on the Linux scheduler a little bit.  The result will be
that your long job will run for a long time (but no longer than before) and
your small job will steal resources as needed.

This hack only applies (obviously) to situations where you understand your
load (you do) and where the tasks will fit on a node (I think yours will).
Once things become large, it becomes hopelessly complex to do this.
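
As a rough sketch of what this looks like in config terms (numbers are made up
for a 4-node cluster, assuming roughly 4 cores per node; the slot maximums
would normally live in hadoop-site.xml on each TaskTracker rather than be set
in code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;

public class SchedulingHack {
  public static void main(String[] args) {
    // Cluster side: give each node more task slots than cores so the Linux
    // scheduler can time-share the cores between the two jobs.
    Configuration clusterConf = new Configuration();
    clusterConf.setInt("mapred.tasktracker.map.tasks.maximum", 6);
    clusterConf.setInt("mapred.tasktracker.reduce.tasks.maximum", 3);

    // Job side: keep the long job's task counts below the enlarged capacity
    // (4 nodes * 6 = 24 map slots) so the short job always finds a free slot.
    JobConf longJob = new JobConf(clusterConf);
    longJob.setJobName("long-running-job");   // name is hypothetical
    longJob.setNumMapTasks(20);
    longJob.setNumReduceTasks(10);
  }
}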


On 4/24/08 5:10 PM, "Otis Gospodnetic" <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> I'm trying to run multiple jobs on the same cluster and get them to run
> simultaneously.  I have them running simultaneously "somewhat", but have some
> questions (didn't find answers in FAQ nor Wiki).
> 
> Problem:
> I start 2 jobs with a short (10 sec) pause between them. Job 1 quickly grabs
> all available map tasks and "hogs" them.  Consequently, Job 2 has all its map
> tasks in pending mode until Job 1 gets closer to the end and starts freeing up
> map tasks.
> 
> Example:
> Cluster size = 4 nodes
> Cluster Map Task Capacity = 16
> Cluster Reduce Task Capacity = 8
> mapred.tasktracker.map.tasks.maximum = 4
> mapred.tasktracker.reduce.tasks.maximum = 2
> mapred.map.tasks = 23
> mapred.reduce.tasks = 11
> mapred.speculative.execution = false
> 
> Job 1:
> Map Total = 21
> Reduce Total = 11
> 
> Job 2:
> Map Total = 63
> Reduce Total = 23
> 
> When Job 1 starts, it quickly grabs all 16 map tasks (the Cluster Map Task
> Capacity) and only several hours later, when it completes 6 of its 21 tasks
> (21-6=15, which is < 16), it starts freeing up map slots for Job 2.  The same
> thing happens in the reduce phase.
> 
> What I'd like is to find a way to control how much each job gets and thus
> schedule them better.  I believe I could change the number of "Map Total" for
> Job 1, so that it is < Cluster Map Task Capacity, so that Job 2 can get at
> least one map slot right away, but then Job 1 will take longer.
> 
> 
> If it matters, Job 1 and Job 2 are very different - Job 1 is network intensive
> (Nutch fetcher) and Job 2 is CPU and disk IO intensive (Nutch generate job).
> If I start them separately with the whole cluster dedicated to a single running
> job, then Job 1 finishes in about 10 hours and Job 2 finishes in about 1.5
> hours.  I was hoping to start the slow Job 1 and, while it's running, maximize
> the use of the CPU by running and completing several Job 2 instances.
> 
> Are there other, better options?
> 
> 
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 



Simultaneous jobs and map/reduce sharing

2008-04-24 Thread Otis Gospodnetic
Hi,

I'm trying to run multiple jobs on the same cluster and get them to run 
simultaneously.  I have them running simultaneously "somewhat", but have some 
questions (didn't find answers in FAQ nor Wiki).

Problem:
I start 2 jobs with a short (10 sec) pause between them. Job 1 quickly grabs 
all available map tasks and "hogs" them.  Consequently, Job 2 has all its map 
tasks in pending mode until Job 1 gets closer to the end and starts freeing up 
map tasks.

Example:
Cluster size = 4 nodes
Cluster Map Task Capacity = 16
Cluster Reduce Task Capacity = 8
mapred.tasktracker.map.tasks.maximum = 4
mapred.tasktracker.reduce.tasks.maximum = 2
mapred.map.tasks = 23
mapred.reduce.tasks = 11
mapred.speculative.execution = false

Job 1:
Map Total = 21
Reduce Total = 11

Job 2:
Map Total = 63
Reduce Total = 23

When Job 1 starts, it quickly grabs all 16 map tasks (the Cluster Map Task
Capacity) and only several hours later, when it completes 6 of its 21 tasks 
(21-6=15, which is < 16), it starts freeing up map slots for Job 2.  The same 
thing happens in the reduce phase.

What I'd like is to find a way to control how much each job gets and thus 
schedule them better.  I believe I could change the number of "Map Total" for 
Job 1, so that it is < Cluster Map Task Capacity, so that Job 2 can get at 
least one map slot right away, but then Job 1 will take longer.


If it matters, Job 1 and Job 2 are very different - Job 1 is network intensive 
(Nutch fetcher) and Job 2 is CPU and disk IO intensive (Nutch generate job).  
If I start them separately with the whole cluster dedicated to a single running
job, then Job 1 finishes in about 10 hours and Job 2 finishes in about 1.5
hours.  I was hoping to start the slow Job 1 and, while it's running, maximize 
the use of the CPU by running and completing several Job 2 instances.

Are there other, better options?


Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch



RE: Please Help: Namenode Safemode

2008-04-24 Thread dhruba Borthakur
Ok, cool. The random delay is used to ensure that the Namenode does not
have to process a large number of simultaneous block reports; otherwise
the situation becomes really bad when the Namenode restarts and all
Datanodes send their block reports at the same time. This gets worse
as the number of Datanodes grows.

 

-dhruba

 



From: Cagdas Gerede [mailto:[EMAIL PROTECTED] 
Sent: Thursday, April 24, 2008 11:56 AM
To: dhruba Borthakur
Cc: core-user@hadoop.apache.org
Subject: Re: Please Help: Namenode Safemode

 

Hi Dhruba,
Thanks for your answer. But I think you missed what I mentioned: the
extension is already 0 in my configuration file.

After spending quite some time on the code, I found the reason:
dfs.blockreport.initialDelay.
If you do not set this in your config file, it defaults to 60,000. Each
datanode picks a random number between 0 and 60,000 and delays that many
milliseconds before sending its block report when it registers with the
namenode. As a result, the delay can be as much as 1 minute. If you want
your namenode to start more quickly, set a smaller value for
dfs.blockreport.initialDelay.

When I set it to 0, the namenode now starts up in 1-2 seconds.


-- 

Best Regards, Cagdas Evren Gerede
Home Page: http://cagdasgerede.info 



On Wed, Apr 23, 2008 at 4:44 PM, dhruba Borthakur <[EMAIL PROTECTED]>
wrote:

By default, there is a variable called dfs.safemode.extension set in
hadoop-default.xml that is set to 30 seconds. This means that once the
Namenode has one replica of every block, it still waits for 30 more
seconds before exiting Safemode.

 

dhruba

 



From: Cagdas Gerede [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, April 23, 2008 4:37 PM
To: core-user@hadoop.apache.org
Cc: dhruba Borthakur
Subject: Please Help: Namenode Safemode

 

I have a hadoop distributed file system with 3 datanodes. I only have
150 blocks in each datanode. It takes a little more than a minute for
namenode to start and pass safemode phase.

The steps for namenode start, as much as I understand, are:
1) Datanode sends a heartbeat to namenode. Namenode tells datanode to
send a blockreport as a piggyback on the heartbeat.
2) Datanode computes the block report.
3) Datanode sends it to Namenode.
4) Namenode processes the block report.
5) Namenode safe mode monitor thread checks for exiting, and the namenode
exits safemode if the threshold is reached and the extension time has passed.

Here are my numbers:
Step 1) Datanodes send heartbeats every 3 seconds. 
Step 2) Datanode computes the block report. (this takes about 20
milliseconds - as shown in the datanodes' logs)
Step 3) No idea? (Depends on the size of blockreport. I suspect this
should not be more than a couple of seconds).
Step 4) No idea? Shouldn't be more than a couple of seconds.
Step 5) Thread checks every second. The extension value in my
configuration is 0. So there is no wait if threshold is achieved.

Given these numbers, can anybody explain where the one minute comes from?
Shouldn't these steps take 10-20 seconds?
Please help. I am very confused.



-- 

Best Regards, Cagdas Evren Gerede
Home Page: http://cagdasgerede.info 







Re: Where can I purchase blog data

2008-04-24 Thread Otis Gospodnetic
I'd go to Technorati or BuzzLogic or http://spinn3r.com/


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


- Original Message 
> From: Alan Ho <[EMAIL PROTECTED]>
> To: Hadoop 
> Sent: Thursday, April 24, 2008 10:06:46 AM
> Subject: Where can I purchase blog data
> 
> I'd like to do some historical analysis on blogs. Does anyone know where I 
> can 
> buy blog data ?
> 
> Regards,
> Alan Ho


Re: Hadoop and retrieving data from HDFS

2008-04-24 Thread Leon Mergen
Hello Ted,

On Thu, Apr 24, 2008 at 9:10 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

>
> Locality aware computing is definitely available in hadoop when you are
> processing native files.  The poster was referring to the situation with
> hbase data.


Ok, thank you for the information!

Regards,

Leon Mergen


Re: Hadoop and retrieving data from HDFS

2008-04-24 Thread Ted Dunning

Locality aware computing is definitely available in hadoop when you are
processing native files.  The poster was referring to the situation with
hbase data.


On 4/24/08 11:46 AM, "Leon Mergen" <[EMAIL PROTECTED]> wrote:

> On Thu, Apr 24, 2008 at 8:41 PM, Bryan Duxbury <[EMAIL PROTECTED]> wrote:
> 
>> I think what you're saying is that you are mostly interested in data
>> locality. I don't think it's done yet, but it would be pretty easy to make
>> HBase provide start keys as well as region locations for splits for a
>> MapReduce job. In theory, that would give you all the pieces you need to run
>> locality-aware processing.
> 
> 
> Aha, so that is the definition: locality aware processing. Yeah, that sounds
> exactly like what I'm looking for.
> 
> So, as far as I understand it, it is (technically) possible, but it's just
> not available within hadoop yet?



Re: Hadoop and retrieving data from HDFS

2008-04-24 Thread Ted Dunning

Correct.


On 4/24/08 11:44 AM, "Leon Mergen" <[EMAIL PROTECTED]> wrote:

> 
>> It really depends on what you want to do.
>> 
>> For reference, we push hundreds of millions to billions of log events per
>> day (= many thousands per second) through our logging system, into hadoop
>> and then into Oracle.  Hadoop barely sweats with this load and we are using
>> a tiny cluster.
> 
> 
> But I assume these log events are stored in raw format in HDFS, not HBase ?



Re: Help: When is it safe to discard a block in the application layer

2008-04-24 Thread Cagdas Gerede
In DataStreamer class (in DFSClient.java), there is a line in run() method
like this:

if (progress != null) { progress.progress(); }

I think progress is called only once the block has been replicated at least
the minimum number of times.

I can pass in my progress object and wait until this method is invoked before
deleting my application cache.
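
Something like this is what I have in mind (just a sketch; the class name is
made up, and whether a progress() callback really implies the block has
reached the minimum replication is exactly what I am not sure about):

import java.util.concurrent.CountDownLatch;
import org.apache.hadoop.util.Progressable;

// Hypothetical: a Progressable handed to fs.create() that lets the caller
// block until the DFSClient's DataStreamer has invoked progress() at least once.
public class BlockAckProgress implements Progressable {
  private final CountDownLatch firstCall = new CountDownLatch(1);

  public void progress() {
    firstCall.countDown();   // invoked from DataStreamer.run()
  }

  public void awaitFirstCall() throws InterruptedException {
    firstCall.await();       // the application cache would be discarded after this returns
  }
}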


Does this seem right? Am I missing something?

Thanks for your answer.

-- 

Best Regards, Cagdas Evren Gerede
Home Page: http://cagdasgerede.info


On Thu, Apr 17, 2008 at 10:21 AM, dhruba Borthakur <[EMAIL PROTECTED]>
wrote:

>  The DFSClient caches small packets (e.g. 64K write buffers) and they are
> lazily flushed to the datanodes in the pipeline. So, when an application
> completes an out.write() call, it is definitely not guaranteed that data is
> sent to even one datanode.
>
>
>
> One option would be to retrieve cache hints from the Namenode and determine
> if the block has three locations.
>
>
>
> Thanks,
>
> dhruba
>
>
>  --
>
> *From:* Cagdas Gerede [mailto:[EMAIL PROTECTED]
> *Sent:* Wednesday, April 16, 2008 7:40 PM
> *To:* core-user@hadoop.apache.org
> *Subject:* Help: When is it safe to discard a block in the application
> layer
>
>
>
> I am working on an application on top of Hadoop Distributed File System.
>
> The high-level flow goes like this: a user data block arrives at the application
> server. The application server uses Hadoop's DistributedFileSystem API to
> write the data block to the file system. Once the block is replicated three
> times, the application server notifies the user so that the user can get
> rid of the data, since it is now in persistent, fault-tolerant storage.
>
> I couldn't figure out the following. Let's say, this is my sample program
> to write a block.
>
> byte data[] = new byte[blockSize];
> out.write(data, 0, data.length);
> ...
>
> where out is
> out = fs.create(outFile, FsPermission.getDefault(), true, 4096,
> (short)replicationCount, blockSize, progress);
>
>
> My application writes the data to the stream and then it goes to the next
> line. At this point, I believe I cannot be sure that the block is replicated
> at least, say 3, times. Possibly, under the hood, the DFSClient is still
> trying to push this data to others.
>
> Given this, how is my application going to know that the data block is
> replicated 3 times and it is safe to discard this data?
>
> There are a couple of things you might think:
> 1) Set the minimum replica property to 3: Even if you do this, the
> application still goes to the next line before the data is actually replicated
> 3 times.
> 2) Right after you write, you continuously get cache hints from master and
> check if master is aware of 3 replicas of this block: My problem with this
> approach is that the application will wait for a while for every block it
> needs to store since it will take some time for datanodes to report and
> master to process the blockreports. What is worse, if some datanode in the
> pipeline fails, we have no way of knowing the error.
>
> To sum up, I am not sure when it is safe to discard a block of data
> with the guarantee that it has been replicated a certain number of times.
>
> Please help,
>
> Thanks,
> Cagdas
>


Re: Please Help: Namenode Safemode

2008-04-24 Thread Cagdas Gerede
Hi Dhruba,
Thanks for your answer. But I think you missed what I mentioned: the extension
is already 0 in my configuration file.

After spending quite some time on the code, I found the reason:
dfs.blockreport.initialDelay.
If you do not set this in your config file, it defaults to 60,000. Each
datanode picks a random number between 0 and 60,000 and delays that many
milliseconds before sending its block report when it registers with the
namenode. As a result, the delay can be as much as 1 minute. If you want your
namenode to start more quickly, set a smaller value for
dfs.blockreport.initialDelay.

When I set it to 0, the namenode now starts up in 1-2 seconds.
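
For reference, the two properties involved, written out as a sketch (in a real
deployment they belong in hadoop-site.xml on the nodes, not in client code):

import org.apache.hadoop.conf.Configuration;

public class FastSafemodeExit {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Send the first block report immediately instead of after a random 0-60s delay.
    conf.setInt("dfs.blockreport.initialDelay", 0);
    // Leave safemode as soon as the replica threshold is reached (no extension wait).
    conf.setInt("dfs.safemode.extension", 0);
  }
}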


-- 

Best Regards, Cagdas Evren Gerede
Home Page: http://cagdasgerede.info


On Wed, Apr 23, 2008 at 4:44 PM, dhruba Borthakur <[EMAIL PROTECTED]>
wrote:

>  By default, there is a variable called dfs.safemode.extension set in
> hadoop-default.xml that is set to 30 seconds. This means that once the
> Namenode has one replica of every block, it still waits for 30 more seconds
> before exiting Safemode.
>
>
>
> dhruba
>
>
>  --
>
> *From:* Cagdas Gerede [mailto:[EMAIL PROTECTED]
> *Sent:* Wednesday, April 23, 2008 4:37 PM
> *To:* core-user@hadoop.apache.org
> *Cc:* dhruba Borthakur
> *Subject:* Please Help: Namenode Safemode
>
>
>
> I have a hadoop distributed file system with 3 datanodes. I only have 150
> blocks in each datanode. It takes a little more than a minute for namenode
> to start and pass safemode phase.
>
> The steps for namenode start, as much as I understand, are:
> 1) Datanode sends a heartbeat to namenode. Namenode tells datanode to send a
> blockreport as a piggyback on the heartbeat.
> 2) Datanode computes the block report.
> 3) Datanode sends it to Namenode.
> 4) Namenode processes the block report.
> 5) Namenode safe mode monitor thread checks for exiting, and the namenode exits
> safemode if the threshold is reached and the extension time has passed.
>
> Here are my numbers:
> Step 1) Datanodes send heartbeats every 3 seconds.
> Step 2) Datanode computes the block report. (this takes about 20
> milliseconds - as shown in the datanodes' logs)
> Step 3) No idea? (Depends on the size of blockreport. I suspect this should
> not be more than a couple of seconds).
> Step 4) No idea? Shouldn't be more than a couple of seconds.
> Step 5) Thread checks every second. The extension value in my configuration
> is 0. So there is no wait if threshold is achieved.
>
> Given these numbers, can anybody explain where the one minute comes from?
> Shouldn't these steps take 10-20 seconds?
> Please help. I am very confused.
>
>
>
> --
> 
> Best Regards, Cagdas Evren Gerede
> Home Page: http://cagdasgerede.info
>


Re: Hadoop and retrieving data from HDFS

2008-04-24 Thread Leon Mergen
On Thu, Apr 24, 2008 at 8:41 PM, Bryan Duxbury <[EMAIL PROTECTED]> wrote:

> I think what you're saying is that you are mostly interested in data
> locality. I don't think it's done yet, but it would be pretty easy to make
> HBase provide start keys as well as region locations for splits for a
> MapReduce job. In theory, that would give you all the pieces you need to run
> locality-aware processing.


Aha, so that is the definition: locality aware processing. Yeah, that sounds
exactly like what I'm looking for.

So, as far as I understand it, it is (technically) possible, but it's just
not available within hadoop yet?

-- 
Leon Mergen
http://www.solatis.com


RE: Best practices for handling many small files

2008-04-24 Thread Joydeep Sen Sarma
Wouldn't it be possible to take the recordreader class as a config
variable and then have a concrete implementation that instantiates the
configured record reader? (like streaminputformat)


What I meant about the splits wasn't so much the number of
maps - but how the allocation of files to each map job is done.
Currently the logic (in MultiFileInputFormat) doesn't take the location
into account. All we need to do is sort the files by location in the
getSplits() method and then do the binning. That way - files in the same
split will be co-located. (ok - there are multiple locations for each
file - but I would think choosing _a_ location and binning based on that
would be better than doing so randomly).
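
Roughly something like this (a sketch only - locationOf() is a hypothetical
stand-in for looking up a file's location via the namenode's cache hints, and
the real change would live inside MultiFileInputFormat.getSplits()):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import org.apache.hadoop.fs.Path;

public class LocationBinning {

  // Hypothetical stand-in for "pick _a_ location for this file".
  static String locationOf(Path file) {
    return "host-for-" + file.getName();
  }

  // Sort files by their chosen location, then cut the sorted list into
  // numSplits contiguous bins so files in the same bin tend to be co-located.
  static List<List<Path>> bin(Path[] files, int numSplits) {
    Path[] sorted = files.clone();
    Arrays.sort(sorted, new Comparator<Path>() {
      public int compare(Path a, Path b) {
        return locationOf(a).compareTo(locationOf(b));
      }
    });
    List<List<Path>> bins = new ArrayList<List<Path>>();
    int perBin = (sorted.length + numSplits - 1) / numSplits;
    for (int start = 0; start < sorted.length; start += perBin) {
      int end = Math.min(start + perBin, sorted.length);
      bins.add(new ArrayList<Path>(Arrays.asList(sorted).subList(start, end)));
    }
    return bins;
  }
}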

-Original Message-
From: Enis Soztutar [mailto:[EMAIL PROTECTED] 
Sent: Thursday, April 24, 2008 12:27 AM
To: core-user@hadoop.apache.org
Subject: Re: Best practices for handling many small files

A shameless attempt to defend MultiFileInputFormat :

A shameless attempt to defend MultiFileInputFormat: a concrete implementation
of MultiFileInputFormat is not needed, since every InputFormat relying on
MultiFileInputFormat is expected to have its own custom RecordReader
implementation, and thus needs to override getRecordReader(). An
implementation which returns (sort of) a LineRecordReader is under
src/examples/.../MultiFileWordCount. However, we may include one if any
generic (for example returning a SequenceFileRecordReader) implementation pops up.

An InputFormat returns its splits from getSplits(JobConf job, int numSplits),
where numSplits is the number of maps, not the number of
machines in the cluster.

Last of all, MultiFileSplit class implements getLocations() method, 
which returns the files' locations. Thus it's the JT's job to assign 
tasks to leverage local processing.

Coming to the original question, I think #2 is better, if the 
construction of the sequence file is not a bottleneck. You may, for 
example, create several sequence files in parallel and use all of them 
as input w/o merging.


Joydeep Sen Sarma wrote:
> million map processes are horrible. aside from overhead - don't do it
if you share the cluster with other jobs (all other jobs will get killed
whenever the million map job is finished - see
https://issues.apache.org/jira/browse/HADOOP-2393)
>
> well - even for #2 - it begs the question of how the packing itself
will be parallelized ..
>
> There's a MultiFileInputFormat that can be extended - that allows
processing of multiple files in a single map job. it needs improvement.
For one - it's an abstract class - and a concrete implementation for (at
least)  text files would help. also - the splitting logic is not very
smart (from what i last saw). ideally - it should take the million files
and form it into N groups (say N is size of your cluster) where each
group has files local to the Nth machine and then process them on that
machine. currently it doesn't do this (the groups are arbitrary). But
it's still the way to go ..
>
>
> -Original Message-
> From: [EMAIL PROTECTED] on behalf of Stuart Sierra
> Sent: Wed 4/23/2008 8:55 AM
> To: core-user@hadoop.apache.org
> Subject: Best practices for handling many small files
>  
> Hello all, Hadoop newbie here, asking: what's the preferred way to
> handle large (~1 million) collections of small files (10 to 100KB) in
> which each file is a single "record"?
>
> 1. Ignore it, let Hadoop create a million Map processes;
> 2. Pack all the files into a single SequenceFile; or
> 3. Something else?
>
> I started writing code to do #2, transforming a big tar.bz2 into a
> BLOCK-compressed SequenceFile, with the file names as keys.  Will that
> work?
>
> Thanks,
> -Stuart, altlaw.org
>
>
>   


Re: Hadoop and retrieving data from HDFS

2008-04-24 Thread Leon Mergen
On Thu, Apr 24, 2008 at 8:18 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

>
> Another option is to just store the data in Hadoop and use a tool like Pig,
> Jaql (or grool).  Using these tools to perform ETL for reporting purposes
> and then use these tools or ad hoc map-reduce programs for data-mining
> works
> well.


Yeah, I've read about PIG in some hadoop summit notes; it sounds like a
great development! However, in my case, there are only 2 or 3 different ways
we want to filter the data out of the servers, so it wouldn't be such a big
deal to write custom applications for them.



> It really depends on what you want to do.
>
> For reference, we push hundreds of millions to billions of log events per
> day (= many thousands per second) through our logging system, into hadoop
> and then into Oracle.  Hadoop barely sweats with this load and we are using
> a tiny cluster.


But I assume these log events are stored in raw format in HDFS, not HBase ?

Regards,

Leon Mergen


Re: Hadoop and retrieving data from HDFS

2008-04-24 Thread Leon Mergen
Hello Peeyush,

On Thu, Apr 24, 2008 at 8:12 PM, Peeyush Bishnoi <[EMAIL PROTECTED]>
wrote:

> Yes, you can very well store your data in tabular format in HBase by
> applying a Map-Reduce job to the access logs that have been stored on HDFS.
> When you initially copy the data into HDFS, your data blocks will be
> created and stored on the Datanodes. After the data is processed, it
> will be stored in HBase HRegions. So your unprocessed data on HDFS and
> processed data in HBase will both be distributed across machines.


Ah yes, I also understood this from reading the BigTable paper and the HBase
architecture docs; HBase uses regions of about 256MB in size, which are
stored on top of HDFS.

But now I am wondering: after that data has been stored inside HBase, is it
possible to process this data without moving it to a different machine ? Say
that I want to data mine around 100TB of data; if all that data had to be
moved around the cluster before it could be processed, it would be a bit
inefficient. Isn't it a good idea to just process those log files on the
servers they are physically stored on, and, perhaps, allow striping of
multiple MapReduce jobs on the same data by making use of the replication ?

Or is this a bad idea ? I've always understood that moving processing
to the servers that the data is stored on is cheaper than moving the data to
the servers it can be processed on.

Regards,

Leon Mergen


Re: Hadoop and retrieving data from HDFS

2008-04-24 Thread Bryan Duxbury
I think what you're saying is that you are mostly interested in data  
locality. I don't think it's done yet, but it would be pretty easy to  
make HBase provide start keys as well as region locations for splits  
for a MapReduce job. In theory, that would give you all the pieces  
you need to run locality-aware processing.
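
Roughly the shape such a split could take (a sketch; the class and field names
are made up, and the old org.apache.hadoop.mapred.InputSplit interface is
assumed):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.mapred.InputSplit;

public class RegionAwareSplit implements InputSplit {
  private String startKey = "";
  private String regionHost = "";

  public RegionAwareSplit() {}                        // required for Writable
  public RegionAwareSplit(String startKey, String regionHost) {
    this.startKey = startKey;
    this.regionHost = regionHost;
  }

  public String getStartKey() { return startKey; }

  // Only a scheduling hint; a byte or row-count estimate would do.
  public long getLength() throws IOException { return 0; }

  // This is the piece that makes locality-aware scheduling possible.
  public String[] getLocations() throws IOException {
    return new String[] { regionHost };
  }

  public void write(DataOutput out) throws IOException {
    out.writeUTF(startKey);
    out.writeUTF(regionHost);
  }

  public void readFields(DataInput in) throws IOException {
    startKey = in.readUTF();
    regionHost = in.readUTF();
  }
}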


-Bryan

On Apr 24, 2008, at 10:16 AM, Leon Mergen wrote:


Hello,

I'm sorry if a question like this has been asked before, but I was unable to
find an answer for this anywhere on google; if it is off-topic, I apologize
in advance.

I'm trying to look a bit into the future, and predict scalability problems
for the company I work for: we're using PostgreSQL, and processing many
writes/second (access logs, currently around 250, but this will only
increase significantly in the future). Furthermore, we perform data mining
on this data, and ideally, need to have this data stored in a structured
form (the data is searched in various ways). In other words: a very
interesting problem.

Now, I'm trying to understand a bit of the hadoop/hbase architecture: as I
understand it, HDFS, MapReduce and HBase are sufficiently decoupled that the
use case I was hoping for is not available; however, I'm still going to ask:

Is it possible to store this data in hbase, and thus have all access logs
distributed amongst many different servers, and start MapReduce jobs on
those actual servers, which process all the data on those servers ? In other
words, the data never leaves the actual servers ?

If this isn't possible, is this because someone simply never took the time
to implement such a thing, or is it hard to fit in the design (for example,
that the JobTracker needs to be aware of the physical locations of all the
data, since you don't want to analyze the same (replicated) data twice) ?

From what I understand by playing with hadoop for the past few days, the
idea is that you fetch your MapReduce data from HDFS rather than BigTable,
or am I mistaken ?

Thanks for your time!

Regards,

Leon Mergen




Re: Hadoop and retrieving data from HDFS

2008-04-24 Thread Ted Dunning

Another option is to just store the data in Hadoop and use a tool like Pig,
Jaql (or grool).  Using these tools to perform ETL for reporting purposes
and then use these tools or ad hoc map-reduce programs for data-mining works
well.

It really depends on what you want to do.

For reference, we push hundreds of millions to billions of log events per
day (= many thousands per second) through our logging system, into hadoop
and then into Oracle.  Hadoop barely sweats with this load and we are using
a tiny cluster.

On 4/24/08 11:12 AM, "Peeyush Bishnoi" <[EMAIL PROTECTED]> wrote:

> Hello Leon ,
> 
> Yes, you can very well store your data in tabular format in HBase by applying
> a Map-Reduce job to the access logs that have been stored on HDFS. When you
> initially copy the data into HDFS, your data blocks will be created and
> stored on the Datanodes. After the data is processed, it will be stored in
> HBase HRegions. So your unprocessed data on HDFS and processed data in HBase
> will both be distributed across machines.
>   
> Also you can retrieve the data or mine the data by applying the Map-Reduce job
> on data stored in Hbase.
> 
> 
> Thanks
> 
> 
> ---
> Peeyush
> 
>  
> 
> 
> -Original Message-
> From: Leon Mergen [mailto:[EMAIL PROTECTED]
> Sent: Thu 4/24/2008 10:16 AM
> To: core-user@hadoop.apache.org
> Subject: Hadoop and retrieving data from HDFS
>  
> Hello,
> 
> I'm sorry if a question like this has been asked before, but I was unable to
> find an answer for this anywhere on google; if it is off-topic, I apologize
> in advance.
> 
> I'm trying to look a bit into the future, and predict scalability problems
> for the company I work for: we're using PostgreSQL, and processing many
> writes/second (access logs, currently around 250, but this will only
> increase significantly in the future). Furthermore, we perform data mining
> on this data, and ideally, need to have this data stored in a structured
> form (the data is searched in various ways). In other words: a very
> interesting problem.
> 
> Now, I'm trying to understand a bit of the hadoop/hbase architecture: as I
> understand it, HDFS, MapReduce and HBase are sufficiently decoupled that the
> use case I was hoping for is not available; however, I'm still going to ask:
> 
> 
> Is it possible to store this data in hbase, and thus have all access logs
> distributed amongst many different servers, and start MapReduce jobs on
> those actual servers, which process all the data on those servers ? In other
> words, the data never leaves the actual servers ?
> 
> If this isn't possible, is this because someone simply never took the time
> to implement such a thing, or is it hard to fit in the design (for example,
> that the JobTracker needs to be aware of the physical locations of all the
> data, since you don't want to analyze the same (replicated) data twice) ?
> 
> From what I understand by playing with hadoop for the past few days, the
> idea is that you fetch your MapReduce data from HDFS rather than BigTable,
> or am I mistaken ?
> 
> Thanks for your time!
> 
> Regards,
> 
> Leon Mergen
> 



RE: Hadoop and retrieving data from HDFS

2008-04-24 Thread Peeyush Bishnoi
Hello Leon ,

Yes, you can very well store your data in tabular format in HBase by applying
a Map-Reduce job to the access logs that have been stored on HDFS. When you
initially copy the data into HDFS, your data blocks will be created and
stored on the Datanodes. After the data is processed, it will be stored in
HBase HRegions. So your unprocessed data on HDFS and processed data in HBase
will both be distributed across machines.
  
Also you can retrieve the data or mine the data by applying the Map-Reduce job 
on data stored in Hbase.


Thanks


---
Peeyush

 


-Original Message-
From: Leon Mergen [mailto:[EMAIL PROTECTED]
Sent: Thu 4/24/2008 10:16 AM
To: core-user@hadoop.apache.org
Subject: Hadoop and retrieving data from HDFS
 
Hello,

I'm sorry if a question like this has been asked before, but I was unable to
find an answer for this anywhere on google; if it is off-topic, I apologize
in advance.

I'm trying to look a bit into the future, and predict scalability problems
for the company I work for: we're using PostgreSQL, and processing many
writes/second (access logs, currently around 250, but this will only
increase significantly in the future). Furthermore, we perform data mining
on this data, and ideally, need to have this data stored in a structured
form (the data is searched in various ways). In other words: a very
interesting problem.

Now, I'm trying to understand a bit of the hadoop/hbase architecture: as I
understand it, HDFS, MapReduce and HBase are sufficiently decoupled that the
use case I was hoping for is not available; however, I'm still going to ask:


Is it possible to store this data in hbase, and thus have all access logs
distributed amongst many different servers, and start MapReduce jobs on
those actual servers, which process all the data on those servers ? In other
words, the data never leaves the actual servers ?

If this isn't possible, is this because someone simply never took the time
to implement such a thing, or is it hard to fit in the design (for example,
that the JobTracker needs to be aware of the physical locations of all the
data, since you don't want to analyze the same (replicated) data twice) ?

From what I understand by playing with hadoop for the past few days, the
idea is that you fetch your MapReduce data from HDFS rather than BigTable,
or am I mistaken ?

Thanks for your time!

Regards,

Leon Mergen



Hadoop and retrieving data from HDFS

2008-04-24 Thread Leon Mergen
Hello,

I'm sorry if a question like this has been asked before, but I was unable to
find an answer for this anywhere on google; if it is off-topic, I apologize
in advance.

I'm trying to look a bit into the future, and predict scalability problems
for the company I work for: we're using PostgreSQL, and processing many
writes/second (access logs, currently around 250, but this will only
increase significantly in the future). Furthermore, we perform data mining
on this data, and ideally, need to have this data stored in a structured
form (the data is searched in various ways). In other words: a very
interesting problem.

Now, I'm trying to understand a bit of the hadoop/hbase architecture: as I
understand it, HDFS, MapReduce and HBase are sufficiently decoupled that the
use case I was hoping for is not available; however, I'm still going to ask:


Is it possible to store this data in hbase, and thus have all access logs
distributed amongst many different servers, and start MapReduce jobs on
those actual servers, which process all the data on those servers ? In other
words, the data never leaves the actual servers ?

If this isn't possible, is this because someone simply never took the time
to implement such a thing, or is it hard to fit in the design (for example,
that the JobTracker needs to be aware of the physical locations of all the
data, since you don't want to analyze the same (replicated) data twice) ?

From what I understand by playing with hadoop for the past few days, the
idea is that you fetch your MapReduce data from HDFS rather than BigTable,
or am I mistaken ?

Thanks for your time!

Regards,

Leon Mergen


Re: hadoop and deprecation

2008-04-24 Thread Doug Cutting

Karl Wettin wrote:

When are deprecated methods removed from the API? At every new minor release?


http://wiki.apache.org/hadoop/Roadmap

Note the remark: "Prior to 1.0, minor releases follow the rules for 
major releases, except they are still made every few months."


So, since we're still pre-1.0, we try to remove deprecated features in 
the next minor release after they're deprecated.  We don't always manage 
to do this.  Some deprecated features have survived a long time, 
especially when they're widely used internally.


When deprecated code is removed the change is described in the 
"incompatible" section of the release notes.


Doug


Re: hadoop and deprecation

2008-04-24 Thread lohit
If a method is deprecated in version 0.14, it could be removed in version 0.15
at the earliest; it might be removed any time starting with 0.15.

- Original Message 
From: Karl Wettin <[EMAIL PROTECTED]>
To: core-user@hadoop.apache.org
Sent: Thursday, April 24, 2008 4:07:48 AM
Subject: hadoop and deprecation

When are deprecated methods removed from the API? At every new minor release?


 karl





Re: reducer outofmemoryerror

2008-04-24 Thread Apurva Jadhav


I made two changes:
1) Increased mapred.child.java.opts to 768m
2) Coalesced the files into a smaller number of larger files.

This has resolved my problem and reduced the running time by a factor of 3.
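
In configuration terms, the first change is just this (a sketch; the same
property can equally be set in hadoop-site.xml):

import org.apache.hadoop.mapred.JobConf;

public class ChildHeapSetting {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Give each map/reduce child JVM a 768 MB heap instead of the 200 MB default.
    conf.set("mapred.child.java.opts", "-Xmx768m");
  }
}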

Thanks for all the suggestions.


Ted Dunning wrote:

If these files are small, you will have a significant (but not massive) hit
in performance due to having so many files.


On 4/24/08 12:07 AM, "Arun C Murthy" <[EMAIL PROTECTED]> wrote:

  

On Apr 23, 2008, at 7:51 AM, Apurva Jadhav wrote:



There are six reducers and 24000 mappers because there are 24000
files.
The number of tasks per node is 2.
mapred.child.java.opts is at the default value of 200m. What is a good
value for this? My mappers and reducers are fairly simple and do
not make large allocations.
  

Try upping that to 512M.

Arun



Regards,
aj

Amar Kamat wrote:
  

Apurva Jadhav wrote:


Hi,
 I have a 4 node hadoop 0.15.3 cluster. I am using the default
config files. I am running a map reduce job to process 40 GB log
data.
  

How many maps and reducers are there? Make sure that there is a
sufficient number of reducers. Look at conf/hadoop-default.xml
(see mapred.child.java.opts parameter) to change the heap settings.
Amar


Some reduce tasks are failing with the following errors:
1)
stderr
Exception in thread "org.apache.hadoop.io.ObjectWritable
Connection Culler" Exception in thread
"[EMAIL PROTECTED]"
java.lang.OutOfMemoryError: Java heap space
Exception in thread "IPC Client connection to /127.0.0.1:34691"
java.lang.OutOfMemoryError: Java heap space
Exception in thread "main" java.lang.OutOfMemoryError: Java heap
space

2)
stderr
Exception in thread "org.apache.hadoop.io.ObjectWritable
Connection Culler" java.lang.OutOfMemoryError: Java heap space

syslog:
2008-04-22 19:32:50,784 INFO org.apache.hadoop.mapred.ReduceTask:
task_200804212359_0007_r_04_0 Merge of the 19 files in
InMemoryFileSystem complete. Local file is /data/hadoop-im2/
mapred/loca
l/task_200804212359_0007_r_04_0/map_22600.out
2008-04-22 20:34:16,012 INFO org.apache.hadoop.ipc.Client:
java.net.SocketException: Socket closed
   at java.net.SocketInputStream.read(SocketInputStream.java:
162)
   at java.io.FilterInputStream.read(FilterInputStream.java:111)
   at org.apache.hadoop.ipc.Client$Connection$1.read
(Client.java:181)
   at java.io.BufferedInputStream.fill
(BufferedInputStream.java:218)
   at java.io.BufferedInputStream.read
(BufferedInputStream.java:235)
   at java.io.DataInputStream.readInt(DataInputStream.java:353)
   at org.apache.hadoop.ipc.Client$Connection.run(Client.java:
258)

2008-04-22 20:34:16,032 WARN
org.apache.hadoop.mapred.TaskTracker: Error running child
java.lang.OutOfMemoryError: Java heap space
2008-04-22 20:34:16,031 INFO org.apache.hadoop.mapred.TaskRunner:
Communication exception: java.lang.OutOfMemoryError: Java heap space

Has anyone experienced similar problem ?   Is there any
configuration change that can help resolve this issue.

Regards,
aj



  




  




Re: Best practices for handling many small files

2008-04-24 Thread Ted Dunning

Stuart,

This will have the (slightly) desirable side-effect of making your total
disk footprint smaller.  I don't suppose that matters all that much any
more, but it is still a nice thought.


On 4/24/08 8:28 AM, "Stuart Sierra" <[EMAIL PROTECTED]> wrote:

> Thanks for the advice, everyone.  I'm going to go with #2, packing my
> million files into a small number of SequenceFiles.  This is slow, but
> only has to be done once.  My "datacenter" is Amazon Web Services :),
> so storing a few large, compressed files is the easiest way to go.
> 
> My code, if anyone's interested, is here:
> http://stuartsierra.com/2008/04/24/a-million-little-files
> 
> -Stuart
> altlaw.org
> 
> 
> On Wed, Apr 23, 2008 at 11:55 AM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
>> Hello all, Hadoop newbie here, asking: what's the preferred way to
>>  handle large (~1 million) collections of small files (10 to 100KB) in
>>  which each file is a single "record"?
>> 
>>  1. Ignore it, let Hadoop create a million Map processes;
>>  2. Pack all the files into a single SequenceFile; or
>>  3. Something else?
>> 
>>  I started writing code to do #2, transforming a big tar.bz2 into a
>>  BLOCK-compressed SequenceFile, with the file names as keys.  Will that
>>  work?
>> 
>>  Thanks,
>>  -Stuart, altlaw.org
>> 



Re: reducer outofmemoryerror

2008-04-24 Thread Ted Dunning

If these files are small, you will have a significant (but not massive) hit
in performance due to having so many files.


On 4/24/08 12:07 AM, "Arun C Murthy" <[EMAIL PROTECTED]> wrote:

> 
> On Apr 23, 2008, at 7:51 AM, Apurva Jadhav wrote:
> 
>> There are six reducers and 24000 mappers because there are 24000
>> files.
>> The number of tasks per node is 2.
>> mapred.child.java.opts is at the default value of 200m. What is a good
>> value for this? My mappers and reducers are fairly simple and do
>> not make large allocations.
> 
> Try upping that to 512M.
> 
> Arun
> 
>> Regards,
>> aj
>> 
>> Amar Kamat wrote:
>>> Apurva Jadhav wrote:
 Hi,
  I have a 4 node hadoop 0.15.3 cluster. I am using the default
 config files. I am running a map reduce job to process 40 GB log
 data.
>>> How many maps and reducers are there? Make sure that there is a
>>> sufficient number of reducers. Look at conf/hadoop-default.xml
>>> (see mapred.child.java.opts parameter) to change the heap settings.
>>> Amar
 Some reduce tasks are failing with the following errors:
 1)
 stderr
 Exception in thread "org.apache.hadoop.io.ObjectWritable
 Connection Culler" Exception in thread
 "[EMAIL PROTECTED]"
 java.lang.OutOfMemoryError: Java heap space
 Exception in thread "IPC Client connection to /127.0.0.1:34691"
 java.lang.OutOfMemoryError: Java heap space
 Exception in thread "main" java.lang.OutOfMemoryError: Java heap
 space
 
 2)
 stderr
 Exception in thread "org.apache.hadoop.io.ObjectWritable
 Connection Culler" java.lang.OutOfMemoryError: Java heap space
 
 syslog:
 2008-04-22 19:32:50,784 INFO org.apache.hadoop.mapred.ReduceTask:
 task_200804212359_0007_r_04_0 Merge of the 19 files in
 InMemoryFileSystem complete. Local file is /data/hadoop-im2/
 mapred/loca
 l/task_200804212359_0007_r_04_0/map_22600.out
 2008-04-22 20:34:16,012 INFO org.apache.hadoop.ipc.Client:
 java.net.SocketException: Socket closed
at java.net.SocketInputStream.read(SocketInputStream.java:
 162)
at java.io.FilterInputStream.read(FilterInputStream.java:111)
at org.apache.hadoop.ipc.Client$Connection$1.read
 (Client.java:181)
at java.io.BufferedInputStream.fill
 (BufferedInputStream.java:218)
at java.io.BufferedInputStream.read
 (BufferedInputStream.java:235)
at java.io.DataInputStream.readInt(DataInputStream.java:353)
at org.apache.hadoop.ipc.Client$Connection.run(Client.java:
 258)
 
 2008-04-22 20:34:16,032 WARN
 org.apache.hadoop.mapred.TaskTracker: Error running child
 java.lang.OutOfMemoryError: Java heap space
 2008-04-22 20:34:16,031 INFO org.apache.hadoop.mapred.TaskRunner:
 Communication exception: java.lang.OutOfMemoryError: Java heap space
 
 Has anyone experienced similar problem ?   Is there any
 configuration change that can help resolve this issue.
 
 Regards,
 aj
 
 
 
>>> 
>>> 
>> 
> 



Re: Best practices for handling many small files

2008-04-24 Thread Stuart Sierra
Thanks for the advice, everyone.  I'm going to go with #2, packing my
million files into a small number of SequenceFiles.  This is slow, but
only has to be done once.  My "datacenter" is Amazon Web Services :),
so storing a few large, compressed files is the easiest way to go.

My code, if anyone's interested, is here:
http://stuartsierra.com/2008/04/24/a-million-little-files
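
The rough shape of such a packer (a sketch against the SequenceFile API of the
time, not the exact code at that link; the input directory and output path are
placeholders, and the tar.bz2 is assumed to have been extracted already):

import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("packed/part-00000");            // placeholder output path

    // BLOCK compression groups many small records into compressed blocks.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, Text.class, BytesWritable.class,
        SequenceFile.CompressionType.BLOCK);
    try {
      for (File f : new File("extracted-tar").listFiles()) {   // placeholder input dir
        byte[] bytes = new byte[(int) f.length()];
        DataInputStream in = new DataInputStream(new FileInputStream(f));
        try {
          in.readFully(bytes);
        } finally {
          in.close();
        }
        // Key = original file name, value = raw file contents.
        writer.append(new Text(f.getName()), new BytesWritable(bytes));
      }
    } finally {
      writer.close();
    }
  }
}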

-Stuart
altlaw.org


On Wed, Apr 23, 2008 at 11:55 AM, Stuart Sierra <[EMAIL PROTECTED]> wrote:
> Hello all, Hadoop newbie here, asking: what's the preferred way to
>  handle large (~1 million) collections of small files (10 to 100KB) in
>  which each file is a single "record"?
>
>  1. Ignore it, let Hadoop create a million Map processes;
>  2. Pack all the files into a single SequenceFile; or
>  3. Something else?
>
>  I started writing code to do #2, transforming a big tar.bz2 into a
>  BLOCK-compressed SequenceFile, with the file names as keys.  Will that
>  work?
>
>  Thanks,
>  -Stuart, altlaw.org
>


RE: Where can I purchase blog data

2008-04-24 Thread Dejan Diklic
Alan,

Can you ping me offline. We have all sorts of data you might be interested
in.

Regards,
Dejan


Attributor.com
Publish with confidence
- And yes, we are still hiring




-Original Message-
From: Alan Ho [mailto:[EMAIL PROTECTED] 
Sent: Thursday, April 24, 2008 7:07 AM
To: Hadoop
Subject: Where can I purchase blog data

I'd like to do some historical analysis on blogs. Does anyone know where I
can buy blog data ?

Regards,
Alan Ho






Where can I purchase blog data

2008-04-24 Thread Alan Ho
I'd like to do some historical analysis on blogs. Does anyone know where I can 
buy blog data ?

Regards,
Alan Ho





hadoop and deprecation

2008-04-24 Thread Karl Wettin

When are deprecated methods removed from the API? At every new minor release?


karl


Re: User accounts in Master and Slaves

2008-04-24 Thread Sridhar Raman
I tried following the instructions for a single-node cluster (as mentioned
in the link).  I am facing a strange roadblock.

In the hadoop-site.xml, I have set the value of hadoop.tmp.dir to
/WORK/temp/hadoop/workspace/hadoop-${user.name}.

After doing this, I run bin/hadoop namenode -format, and this now creates a
hadoop-sridhar folder under workspace.  This is fine as the user I've
logged on as is "sridhar".

Then I start my cluster by running bin/start-all.sh.  When I do this, the
output I get is as mentioned in this link, but a new folder called
hadoop-SYSTEM is created under workspace.

And then when I run bin/stop-all.sh, all that I get is a "no tasktracker to
stop, no datanode to stop, ...".  Any idea why this can happen?

Another point is that after starting the cluster, I did a netstat.  I found
multiple entries of localhost:9000 and all had LISTENING.  Is this also
expected behaviour?

On Wed, Apr 23, 2008 at 9:13 PM, Norbert Burger <[EMAIL PROTECTED]>
wrote:

> Yes, this is the suggested configuration.  Hadoop relies on password-less
> SSH to be able to start tasks on slave machines.  You can find
> instructions
> on creating/transferring the SSH keys here:
>
>
> http://www.michael-noll.com/wiki/Running_Hadoop_On_Ubuntu_Linux_%28Multi-Node_Cluster%29
>
> On Wed, Apr 23, 2008 at 4:39 AM, Sridhar Raman <[EMAIL PROTECTED]>
> wrote:
>
> > Ok, what about the issue regarding the users?  Do all the machines need
> to
> > be under the same user?
> >
> > On Wed, Apr 23, 2008 at 12:43 PM, Harish Mallipeddi <
> > [EMAIL PROTECTED]> wrote:
> >
> > > On Wed, Apr 23, 2008 at 3:03 PM, Sridhar Raman <
> [EMAIL PROTECTED]>
> > > wrote:
> > >
> > > > After trying out Hadoop in a single machine, I decided to run a
> > > MapReduce
> > > > across multiple machines.  This is the approach I followed:
> > > > 1 Master
> > > > 1 Slave
> > > >
> > > > (A doubt here:  Can my Master also be used to execute the Map/Reduce
> > > > functions?)
> > > >
> > >
> > > If you add the master node to the list of slaves (conf/slaves), then the
> > > master node will also run a TaskTracker.
> > >
> > >
> > > >
> > > > To do this, I set up the masters and slaves files in the conf directory.
> > > > Following the instructions in this page -
> > > > http://hadoop.apache.org/core/docs/current/cluster_setup.html, I had set
> > > > up sshd in both the machines, and was able to ssh from one to the other.
> > > >
> > > > I tried to run bin/start-dfs.sh.  Unfortunately, this asked for a password
> > > > for [EMAIL PROTECTED], while on the slave there was only user2.  While on
> > > > the master, user1 was the logged-on user.  How do I resolve this?  Should the
> > > > user accounts be present on all the machines?  Or can I specify this
> > > > somewhere?
> > > >
> > >
> > >
> > >
> > > --
> > > Harish Mallipeddi
> > > circos.com : poundbang.in/blog/
> > >
> >
>


Re: Best practices for handling many small files

2008-04-24 Thread Enis Soztutar

A shameless attempt to defend MultiFileInputFormat :

A concrete implementation of MultiFileInputFormat is not needed, since
every InputFormat relying on MultiFileInputFormat is expected to have
its own custom RecordReader implementation, and thus needs to override
getRecordReader(). An implementation which returns (sort of) a
LineRecordReader is under src/examples/.../MultiFileWordCount. However,
we may include one if any generic (for example returning a
SequenceFileRecordReader) implementation pops up.


An InputFormat returns its splits from getSplits(JobConf job, int numSplits),
where numSplits is the number of maps, not the number of
machines in the cluster.


Last of all, MultiFileSplit class implements getLocations() method, 
which returns the files' locations. Thus it's the JT's job to assign 
tasks to leverage local processing.


Coming to the original question, I think #2 is better, if the 
construction of the sequence file is not a bottleneck. You may, for 
example, create several sequence files in parallel and use all of them 
as input w/o merging.



Joydeep Sen Sarma wrote:

million map processes are horrible. aside from overhead - don't do it if you
share the cluster with other jobs (all other jobs will get killed whenever the 
million map job is finished - see 
https://issues.apache.org/jira/browse/HADOOP-2393)

well - even for #2 - it begs the question of how the packing itself will be 
parallelized ..

There's a MultiFileInputFormat that can be extended - that allows processing of 
multiple files in a single map job. it needs improvement. For one - it's an 
abstract class - and a concrete implementation for (at least)  text files would 
help. also - the splitting logic is not very smart (from what i last saw). 
ideally - it should take the million files and form it into N groups (say N is 
size of your cluster) where each group has files local to the Nth machine and 
then process them on that machine. currently it doesn't do this (the groups are 
arbitrary). But it's still the way to go ..


-Original Message-
From: [EMAIL PROTECTED] on behalf of Stuart Sierra
Sent: Wed 4/23/2008 8:55 AM
To: core-user@hadoop.apache.org
Subject: Best practices for handling many small files
 
Hello all, Hadoop newbie here, asking: what's the preferred way to

handle large (~1 million) collections of small files (10 to 100KB) in
which each file is a single "record"?

1. Ignore it, let Hadoop create a million Map processes;
2. Pack all the files into a single SequenceFile; or
3. Something else?

I started writing code to do #2, transforming a big tar.bz2 into a
BLOCK-compressed SequenceFile, with the file names as keys.  Will that
work?

Thanks,
-Stuart, altlaw.org


  


Re: reducer outofmemoryerror

2008-04-24 Thread Arun C Murthy


On Apr 23, 2008, at 7:51 AM, Apurva Jadhav wrote:

There are six reducers and 24000 mappers because there are 24000  
files.

The number of tasks per node is 2.
mapred.child.java.opts is at the default value of 200m. What is a good
value for this? My mappers and reducers are fairly simple and do
not make large allocations.


Also, how much RAM do you have on each machine? If you have  
sufficient RAM, 512M is a good value (if you can afford 512*4 i.e. 2  
maps and 2 reduces).
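
(Worked out: four child slots per node at -Xmx512m is 4 x 512 MB = 2 GB of
child heap alone, before counting the TaskTracker and DataNode daemons, so a
node needs comfortably more than 2 GB of RAM for that setting to be safe.)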


Arun


Regards,
aj

Amar Kamat wrote:

Apurva Jadhav wrote:

Hi,
 I have a 4 node hadoop 0.15.3 cluster. I am using the default  
config files. I am running a map reduce job to process 40 GB log  
data.
How many maps and reducers are there? Make sure that there is a
sufficient number of reducers. Look at conf/hadoop-default.xml
(see mapred.child.java.opts parameter) to change the heap settings.

Amar

Some reduce tasks are failing with the following errors:
1)
stderr
Exception in thread "org.apache.hadoop.io.ObjectWritable  
Connection Culler" Exception in thread  
"[EMAIL PROTECTED]"  
java.lang.OutOfMemoryError: Java heap space
Exception in thread "IPC Client connection to /127.0.0.1:34691"  
java.lang.OutOfMemoryError: Java heap space
Exception in thread "main" java.lang.OutOfMemoryError: Java heap  
space


2)
stderr
Exception in thread "org.apache.hadoop.io.ObjectWritable  
Connection Culler" java.lang.OutOfMemoryError: Java heap space


syslog:
2008-04-22 19:32:50,784 INFO org.apache.hadoop.mapred.ReduceTask:  
task_200804212359_0007_r_04_0 Merge of the 19 files in  
InMemoryFileSystem complete. Local file is /data/hadoop-im2/ 
mapred/loca

l/task_200804212359_0007_r_04_0/map_22600.out
2008-04-22 20:34:16,012 INFO org.apache.hadoop.ipc.Client:  
java.net.SocketException: Socket closed
   at java.net.SocketInputStream.read(SocketInputStream.java: 
162)

   at java.io.FilterInputStream.read(FilterInputStream.java:111)
   at org.apache.hadoop.ipc.Client$Connection$1.read 
(Client.java:181)
   at java.io.BufferedInputStream.fill 
(BufferedInputStream.java:218)
   at java.io.BufferedInputStream.read 
(BufferedInputStream.java:235)

   at java.io.DataInputStream.readInt(DataInputStream.java:353)
   at org.apache.hadoop.ipc.Client$Connection.run(Client.java: 
258)


2008-04-22 20:34:16,032 WARN  
org.apache.hadoop.mapred.TaskTracker: Error running child

java.lang.OutOfMemoryError: Java heap space
2008-04-22 20:34:16,031 INFO org.apache.hadoop.mapred.TaskRunner:  
Communication exception: java.lang.OutOfMemoryError: Java heap space


Has anyone experienced similar problem ?   Is there any  
configuration change that can help resolve this issue.


Regards,
aj












Re: reducer outofmemoryerror

2008-04-24 Thread Arun C Murthy


On Apr 23, 2008, at 7:51 AM, Apurva Jadhav wrote:

There are six reducers and 24000 mappers because there are 24000  
files.

The number of tasks per node is 2.
mapred.child.java.opts is at the default value of 200m. What is a good
value for this? My mappers and reducers are fairly simple and do
not make large allocations.


Try upping that to 512M.

Arun


Regards,
aj

Amar Kamat wrote:

Apurva Jadhav wrote:

Hi,
 I have a 4 node hadoop 0.15.3 cluster. I am using the default  
config files. I am running a map reduce job to process 40 GB log  
data.
How many maps and reducers are there? Make sure that there is a
sufficient number of reducers. Look at conf/hadoop-default.xml
(see mapred.child.java.opts parameter) to change the heap settings.

Amar

Some reduce tasks are failing with the following errors:
1)
stderr
Exception in thread "org.apache.hadoop.io.ObjectWritable  
Connection Culler" Exception in thread  
"[EMAIL PROTECTED]"  
java.lang.OutOfMemoryError: Java heap space
Exception in thread "IPC Client connection to /127.0.0.1:34691"  
java.lang.OutOfMemoryError: Java heap space
Exception in thread "main" java.lang.OutOfMemoryError: Java heap  
space


2)
stderr
Exception in thread "org.apache.hadoop.io.ObjectWritable  
Connection Culler" java.lang.OutOfMemoryError: Java heap space


syslog:
2008-04-22 19:32:50,784 INFO org.apache.hadoop.mapred.ReduceTask:  
task_200804212359_0007_r_04_0 Merge of the 19 files in  
InMemoryFileSystem complete. Local file is /data/hadoop-im2/ 
mapred/loca

l/task_200804212359_0007_r_04_0/map_22600.out
2008-04-22 20:34:16,012 INFO org.apache.hadoop.ipc.Client:  
java.net.SocketException: Socket closed
   at java.net.SocketInputStream.read(SocketInputStream.java: 
162)

   at java.io.FilterInputStream.read(FilterInputStream.java:111)
   at org.apache.hadoop.ipc.Client$Connection$1.read 
(Client.java:181)
   at java.io.BufferedInputStream.fill 
(BufferedInputStream.java:218)
   at java.io.BufferedInputStream.read 
(BufferedInputStream.java:235)

   at java.io.DataInputStream.readInt(DataInputStream.java:353)
   at org.apache.hadoop.ipc.Client$Connection.run(Client.java: 
258)


2008-04-22 20:34:16,032 WARN  
org.apache.hadoop.mapred.TaskTracker: Error running child

java.lang.OutOfMemoryError: Java heap space
2008-04-22 20:34:16,031 INFO org.apache.hadoop.mapred.TaskRunner:  
Communication exception: java.lang.OutOfMemoryError: Java heap space


Has anyone experienced similar problem ?   Is there any  
configuration change that can help resolve this issue.


Regards,
aj