Re: When does Reduce job start

2011-01-04 Thread sagar naik
That is what I was looking for.
Thanks a million, Harsh.
Cool, now that I have a starting point, I will check it in Hadoop 0.18.

-Sagar

On Tue, Jan 4, 2011 at 7:23 PM, Harsh J  wrote:
> Hello Sagar,
>
> On Wed, Jan 5, 2011 at 6:44 AM, sagar naik  wrote:
>> What is the configuration param to change this behavior?
>
> mapred.reduce.slowstart.completed.maps is a property (0.20.x) that
> controls "when" the ReduceTasks have to start getting scheduled. Your
> job would still need free reduce slots for it to begin.
>
> --
> Harsh J
> www.harshj.com
>


Re: monit? daemontools? jsvc? something else?

2011-01-04 Thread Otis Gospodnetic
Ah, more manual work! :(

You guys never have a JVM die "just because"?  I just had a DN's JVM die the
other day "just because and with no obvious cause".  Restarting it brought it
back to life, and everything recovered smoothly.  Had some automated tool done
the restart for me, I'd have been even happier.

But I'll have to take your advice. :(

Does anyone else have a different opinion?
Actually, is anyone using any such tools and *not* seeing problems when they
kick in and do their job of restarting dead processes?

Thanks,
Otis

Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Hadoop ecosystem search :: http://search-hadoop.com/



- Original Message 
> From: Brian Bockelman 
> To: common-user@hadoop.apache.org
> Sent: Tue, January 4, 2011 8:43:46 AM
> Subject: Re: monit? daemontools? jsvc? something else?
> 
> I'll second this opinion.  Although there are some tools in life that need
> to be actively managed like this (and even then, sometimes management tools
> can be set to be too aggressive, making a bad situation terrible), HDFS is
> not one.
> 
> If the JVM dies, you likely need a human brain to log in and figure out
> what's wrong - or just keep that node dead.
> 
> Brian
> 
> On Jan 3, 2011, at 10:40 PM, Allen Wittenauer wrote:
> 
> > 
> > On Jan 3, 2011, at 2:22 AM, Otis Gospodnetic wrote:
> >> I see over on http://search-hadoop.com/?q=monit+daemontools that people *do*
> >> use tools like monit and daemontools (and a few other ones) to revive their
> >> Hadoop processes when they die.
> >> 
> > 
> > I'm not a fan of doing this for Hadoop processes, even TaskTrackers and
> > DataNodes.  The processes generally die for a reason, usually indicating
> > that something is wrong with the box.  Restarting those processes may
> > potentially hide issues.
> 
> 


How to create hadoop-0.21.0-core.jar ?

2011-01-04 Thread magicor

How do I create hadoop-0.21.0-core.jar from the source code? Currently, when I
compile the code, I need three or more jar files: common, hdfs, and mapred. I
want to build a single hadoop-0.21.0-core.jar to run a Hadoop program. Can
anyone help?
-- 
View this message in context: 
http://old.nabble.com/How-to-create-hadoop-0.21.0-core.jar---tp30593204p30593204.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.



Re: Data for Testing in Hadoop

2011-01-04 Thread Dave Viner
Also, Amazon offers free public data sets at:

http://aws.amazon.com/datasets?_encoding=UTF8&jiveRedirect=1




On Tue, Jan 4, 2011 at 7:28 PM, Lance Norskog  wrote:

> https://cwiki.apache.org/confluence/display/MAHOUT/Collections
>
> All the collections you can imagine.
>
> On Tue, Jan 4, 2011 at 12:28 AM, Harsh J  wrote:
> > You can use MR to generate the data itself. Check out GridMix in
> > Hadoop, or PigMix from Pig, for examples of general load tests.
> >
> > On Tue, Jan 4, 2011 at 1:01 PM, Adarsh Sharma  wrote:
> >> Dear all,
> >>
> >> Designing the architecture is very important for Hadoop in production
> >> clusters.
> >>
> >> We are researching running Hadoop both on individual nodes and in a
> >> cloud environment (VMs).
> >>
> >> For this, I require some data for testing. Would anyone send me some
> >> links to data sets of different sizes (10 GB, 20 GB, 30 GB, 50 GB)?
> >> I shall be grateful for this kindness.
> >>
> >>
> >> Thanks & Regards
> >>
> >> Adarsh Sharma
> >>
> >>
> >
> >
> >
> > --
> > Harsh J
> > www.harshj.com
> >
>
>
>
> --
> Lance Norskog
> goks...@gmail.com
>


Re: Data for Testing in Hadoop

2011-01-04 Thread Lance Norskog
https://cwiki.apache.org/confluence/display/MAHOUT/Collections

All the collections you can imagine.

On Tue, Jan 4, 2011 at 12:28 AM, Harsh J  wrote:
> You can use MR to generate the data itself. Check out GridMix in
> Hadoop, or PigMix from Pig, for examples of general load tests.
>
> On Tue, Jan 4, 2011 at 1:01 PM, Adarsh Sharma  wrote:
>> Dear all,
>>
>> Designing the architecture is very important for Hadoop in production
>> clusters.
>>
>> We are researching running Hadoop both on individual nodes and in a cloud
>> environment (VMs).
>>
>> For this, I require some data for testing. Would anyone send me some links
>> to data sets of different sizes (10 GB, 20 GB, 30 GB, 50 GB)?
>> I shall be grateful for this kindness.
>>
>>
>> Thanks & Regards
>>
>> Adarsh Sharma
>>
>>
>
>
>
> --
> Harsh J
> www.harshj.com
>



-- 
Lance Norskog
goks...@gmail.com


Re: When does Reduce job start

2011-01-04 Thread Harsh J
Hello Sagar,

On Wed, Jan 5, 2011 at 6:44 AM, sagar naik  wrote:
>>> What is the configuration param to change this behavior?

mapred.reduce.slowstart.completed.maps is a property (0.20.x) that
controls "when" the ReduceTasks have to start getting scheduled. Your
job would still need free reduce slots for it to begin.
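
For reference, here is a minimal sketch (old 0.20.x mapred API; the class name
and the 0.05 value are just illustrative) of setting that property from job
code; the same key can also be set in mapred-site.xml:

import org.apache.hadoop.mapred.JobConf;

public class SlowStartExample {
  public static void main(String[] args) {
    JobConf conf = new JobConf(SlowStartExample.class);
    // Fraction of map tasks that must complete before ReduceTasks are
    // scheduled: lower it to start reducers earlier, raise it toward 1.0f
    // to hold them back so they do not sit on reduce slots.
    conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.05f);
    // ... set mapper/reducer classes and input/output paths, then submit
    // the job with JobClient.runJob(conf).
  }
}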

-- 
Harsh J
www.harshj.com


Re: When does Reduce job start

2011-01-04 Thread James Seigel
As the other gentleman said, the reduce task needs to know that all
the data is available before doing its work.

By design.

Cheers
James

Sent from my mobile. Please excuse the typos.

On 2011-01-04, at 6:14 PM, sagar naik  wrote:

> Hi Jeff,
>
> To be clear on my end, I'm not talking about the reduce() function call but
> the spawning of the reduce process/task itself.
> To rephrase:
>   The reduce process/task is not started until 90% of the map tasks are done.
>
>
> -Sagar
> On Tue, Jan 4, 2011 at 3:14 PM, Jeff Bean  wrote:
>> It's part of the design that reduce() does not get called until the map
>> phase is complete. You're seeing reduce report as started when the map is at
>> 90% complete because Hadoop is shuffling data from the mappers that have
>> completed. As currently designed, you can't prematurely start reduce()
>> because there is no way to guarantee you have all the values for any key
>> until all the mappers are done. reduce() requires a key and all the values
>> for that key in order to execute.
>>
>> Jeff
>>
>>
>> On Tue, Jan 4, 2011 at 10:53 AM, sagar naik  wrote:
>>
>>> Hi All,
>>>
>>> number of map tasks: 1000s
>>> number of reduce tasks: single digit
>>>
>>> In such cases the reduce task won't start even when a few map tasks are
>>> completed.
>>> Example:
>>> In my observation of a sample run of bin/hadoop jar
>>> hadoop-*examples*.jar pi 1 10, the reduce did not start until 90%
>>> of the map tasks were complete.
>>>
>>> The only reason I can think of for not starting a reduce task is to
>>> avoid the unnecessary transfer of map output data in case of
>>> failures.
>>>
>>>
>>> Is there a way to quickly start the reduce task in such a case?
>>> What is the configuration param to change this behavior?
>>>
>>>
>>>
>>> Thanks,
>>> Sagar
>>>
>>


Re: When does Reduce job start

2011-01-04 Thread sagar naik
Hi Jeff,

To be clear on my end, I'm not talking about the reduce() function call but
the spawning of the reduce process/task itself.
To rephrase:
   The reduce process/task is not started until 90% of the map tasks are done.


-Sagar
On Tue, Jan 4, 2011 at 3:14 PM, Jeff Bean  wrote:
> It's part of the design that reduce() does not get called until the map
> phase is complete. You're seeing reduce report as started when the map is at
> 90% complete because Hadoop is shuffling data from the mappers that have
> completed. As currently designed, you can't prematurely start reduce()
> because there is no way to guarantee you have all the values for any key
> until all the mappers are done. reduce() requires a key and all the values
> for that key in order to execute.
>
> Jeff
>
>
> On Tue, Jan 4, 2011 at 10:53 AM, sagar naik  wrote:
>
>> Hi All,
>>
>> number of map tasks: 1000s
>> number of reduce tasks: single digit
>>
>> In such cases the reduce task won't start even when a few map tasks are
>> completed.
>> Example:
>> In my observation of a sample run of bin/hadoop jar
>> hadoop-*examples*.jar pi 1 10, the reduce did not start until 90%
>> of the map tasks were complete.
>>
>> The only reason I can think of for not starting a reduce task is to
>> avoid the unnecessary transfer of map output data in case of
>> failures.
>>
>>
>> Is there a way to quickly start the reduce task in such a case?
>> What is the configuration param to change this behavior?
>>
>>
>>
>> Thanks,
>> Sagar
>>
>


Re: When does Reduce job start

2011-01-04 Thread Jeff Bean
It's part of the design that reduce() does not get called until the map
phase is complete. You're seeing reduce report as started when the map is at
90% complete because Hadoop is shuffling data from the mappers that have
completed. As currently designed, you can't prematurely start reduce()
because there is no way to guarantee you have all the values for any key
until all the mappers are done. reduce() requires a key and all the values
for that key in order to execute.
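
To make that last point concrete, here is a minimal reducer sketch (old
0.20.x mapred API; the word-count-style types and class name are just
illustrative). The framework can only hand reduce() the complete value
iterator for a key once every map task has finished:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class SumReducer extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
      OutputCollector<Text, IntWritable> output, Reporter reporter)
      throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      // Every value emitted for this key, from every mapper, is visible here.
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}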

Jeff


On Tue, Jan 4, 2011 at 10:53 AM, sagar naik  wrote:

> Hi All,
>
> number of map tasks: 1000s
> number of reduce tasks: single digit
>
> In such cases the reduce task won't start even when a few map tasks are
> completed.
> Example:
> In my observation of a sample run of bin/hadoop jar
> hadoop-*examples*.jar pi 1 10, the reduce did not start until 90%
> of the map tasks were complete.
>
> The only reason I can think of for not starting a reduce task is to
> avoid the unnecessary transfer of map output data in case of
> failures.
>
>
> Is there a way to quickly start the reduce task in such a case?
> What is the configuration param to change this behavior?
>
>
>
> Thanks,
> Sagar
>


Re: Rngd

2011-01-04 Thread Ted Dunning
As it normally stands, rngd will only help (it appears) if you have a
hardware RNG.

You need to cheat and use entropy you don't really have.  If you don't mind
hacking your system, you could even do this:

# mv /dev/random /dev/random.orig
# ln /dev/urandom /dev/random

This makes /dev/random behave as if it were /dev/urandom (which it, strictly
speaking, is after you do this).

Don't let your sysadmin see you do this, of course.
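
If you want to confirm that entropy starvation really is the problem before
resorting to a trick like that, a tiny diagnostic such as this sketch (plain
Java, nothing Hadoop-specific; the class name is made up) times a blocking
read from /dev/random. If it takes seconds or minutes, the pool is starved:

import java.io.FileInputStream;

public class EntropyCheck {
  public static void main(String[] args) throws Exception {
    byte[] buf = new byte[32];
    long start = System.nanoTime();
    FileInputStream in = new FileInputStream("/dev/random");
    try {
      int read = 0;
      while (read < buf.length) {
        // Blocks until the kernel has enough entropy to hand out more bytes.
        int n = in.read(buf, read, buf.length - read);
        if (n < 0) {
          break;
        }
        read += n;
      }
      System.out.println("Read " + read + " bytes from /dev/random in "
          + ((System.nanoTime() - start) / 1000000) + " ms");
    } finally {
      in.close();
    }
  }
}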

On Tue, Jan 4, 2011 at 12:00 PM, Jon Lederman  wrote:

> Hi,
>
> I am trying to locate the source for rngd to build on my embedded processor
> in order to test whether my Hadoop setup is stalled due to low entropy. Do
> you know where I can find it? I thought it was part of rng-tools, but it's
> not.
>
> Thanks
>
> Jon
>
> Sent from my iPhone
> 


Re: When does Reduce job start

2011-01-04 Thread Allen Wittenauer

On Jan 4, 2011, at 10:53 AM, sagar naik wrote:
> 
> The only reason I can think of for not starting a reduce task is to
> avoid the unnecessary transfer of map output data in case of
> failures.

Reduce tasks also eat slots while they are still just copying map output. On
shared grids, this can be extremely bad behavior.

> Is there a way to quickly start the reduce task in such a case?
> What is the configuration param to change this behavior?

mapred.reduce.slowstart.completed.maps

See http://wiki.apache.org/hadoop/LimitingTaskSlotUsage (from the FAQ 2.12/2.13 
questions).



RE: Rngd

2011-01-04 Thread Black, Michael (IS)
http://sourceforge.net/projects/gkernel/files/rng-tools
 
 
rngd is in there.
 
Michael D. Black
Senior Scientist
Advanced Analytics Directorate
Northrop Grumman Information Systems
 



From: Jon Lederman [mailto:jon2...@mac.com]
Sent: Tue 1/4/2011 2:00 PM
To: common-user@hadoop.apache.org
Subject: EXTERNAL:Rngd



Hi,

I am trying to locate the source for rngd to build on my embedded processor in
order to test whether my Hadoop setup is stalled due to low entropy. Do you know
where I can find it? I thought it was part of rng-tools, but it's not.

Thanks

Jon

Sent from my iPhone





Rngd

2011-01-04 Thread Jon Lederman
Hi,

I am trying to locate the source for rngd to build on my embedded processor in
order to test whether my Hadoop setup is stalled due to low entropy. Do you know
where I can find it? I thought it was part of rng-tools, but it's not.

Thanks

Jon

Sent from my iPhone



When does Reduce job start

2011-01-04 Thread sagar naik
Hi All,

number of map tasks: 1000s
number of reduce tasks: single digit

In such cases the reduce task won't start even when a few map tasks are
completed.
Example:
In my observation of a sample run of bin/hadoop jar
hadoop-*examples*.jar pi 1 10, the reduce did not start until 90%
of the map tasks were complete.

The only reason I can think of for not starting a reduce task is to
avoid the unnecessary transfer of map output data in case of
failures.


Is there a way to quickly start the reduce task in such a case?
What is the configuration param to change this behavior?



Thanks,
Sagar


Re: SequenceFiles and streaming or hdfs thrift api

2011-01-04 Thread Owen O'Malley
On Tue, Jan 4, 2011 at 10:02 AM, Marc Sturlese wrote:

> The thing is I want this file to be a SequenceFile, where the key should be
> a Text and the value a Thrift serialized object. Is it possible to reach
> that goal?
>

I've done the work to support that in Java. See my patch in HADOOP-6685. It
also adds seamless support for ProtocolBuffers and Avro in SequenceFiles
with arbitrary combinations of keys and values using different
serializations.
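
Separately from that patch, a common workaround on the Java side is to store
the Thrift-serialized bytes in a BytesWritable value. A minimal sketch (the
output path, key, and byte array are just placeholders; in real use the bytes
would come from a Thrift TSerializer or from the PHP client over the wire):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ThriftBytesWriter {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/tmp/thrift-values.seq");   // placeholder path

    // Text keys, opaque Thrift-serialized bytes as values.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, out, Text.class, BytesWritable.class);
    try {
      byte[] thriftBytes = new byte[] { 0x0b, 0x00, 0x01 };   // placeholder bytes
      writer.append(new Text("some-key"), new BytesWritable(thriftBytes));
    } finally {
      writer.close();
    }
  }
}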

-- Owen


SequenceFiles and streaming or hdfs thrift api

2011-01-04 Thread Marc Sturlese

Hey there,
I need to write a file to an HDFS cluster from PHP. I know I can do
that with the HDFS Thrift API:
http://wiki.apache.org/hadoop/HDFS-APIs

The thing is I want this file to be a SequenceFile, where the key should be
a Text and the value a Thrift serialized object. Is it possible to reach
that goal?
Thanks in advance
-- 
View this message in context: 
http://lucene.472066.n3.nabble.com/SequenceFiles-and-streaming-or-hdfs-thrift-api-tp2193101p2193101.html
Sent from the Hadoop lucene-users mailing list archive at Nabble.com.


Re: monit? daemontools? jsvc? something else?

2011-01-04 Thread Brian Bockelman
I'll second this opinion.  Although there are some tools in life that need to 
be actively managed like this (and even then, sometimes management tools can be 
set to be too aggressive, making a bad situation terrible), HDFS is not one.

If the JVM dies, you likely need a human brain to log in and figure out what's 
wrong - or just keep that node dead.

Brian

On Jan 3, 2011, at 10:40 PM, Allen Wittenauer wrote:

> 
> On Jan 3, 2011, at 2:22 AM, Otis Gospodnetic wrote:
>> I see over on http://search-hadoop.com/?q=monit+daemontools that people *do*
>> use tools like monit and daemontools (and a few other ones) to revive their
>> Hadoop processes when they die.
>> 
> 
>   I'm not a fan of doing this for Hadoop processes, even TaskTrackers and 
> DataNodes.  The processes generally die for a reason, usually indicating that 
> something is wrong with the box.  Restarting those processes may potentially 
> hide issues.





Output is null why?

2011-01-04 Thread Cavus,M.,Fa. Post Direkt
My output is null. Why? Here is my Java code:
import java.io.IOException;

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;


public class FirstHBaseClientRead {
    public static void main(String[] args) throws IOException {

        HBaseConfiguration config = new HBaseConfiguration();

        HTable table = new HTable(config, "Table");

        Get get = new Get(Bytes.toBytes("FirstRowKey"));

        Result result = table.get(get);
        byte[] value = result.getValue(Bytes.toBytes("F1"),
                Bytes.toBytes("FirstColumn"));
        System.out.println(Bytes.toString(value));
    }
}


This is my Test Table:
hbase(main):013:0> scan 'Table'
ROW           COLUMN+CELL
 FirstRowKey  column=F1:Firstcolumn, timestamp=1294132718775, value=First Value
 FirstRowKey1 column=F1:Firstcolumn, timestamp=1294134178724, value=First Value1
 FirstRowKey1 column=F1:Firstcolumn1, timestamp=1294134197574, value=First Value1
2 row(s) in 0.1030 seconds



Re: Hadoop example

2011-01-04 Thread Esteban Gutierrez Moguel
Hi,

It seems that you need to add your hostname/IP pair to /etc/hosts on both
nodes. It also looks like you need to set up your configuration files
correctly.

These guides can be helpful:

http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html
http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html
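
As a side note, the "Wrong FS" exception below comes from FileSystem.checkPath()
comparing the URI of the path you pass in against the URI of the filesystem
(taken from fs.default.name); referring to the same NameNode by IP in one place
and by hostname in the other trips it. A minimal sketch of the mismatch
(hypothetical IP and path, and it assumes a NameNode is actually reachable
under the hostname form):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WrongFsDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The default filesystem is identified by the hostname form...
    conf.set("fs.default.name", "hdfs://hostname:54310/");
    FileSystem fs = FileSystem.get(conf);
    // ...so referring to the same file by IP fails checkPath() with "Wrong FS".
    fs.getFileStatus(new Path("hdfs://10.0.0.1:54310/tmp/some-file"));
  }
}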

cheers,
esteban.


On Tue, Jan 4, 2011 at 02:38, haiyan  wrote:

> I have a two-node Hadoop test setup. When I set fs.default.name to
> hdfs://hostname:54310/ in core-site.xml and mapred.job.tracker to
> hdfs://hostname:54311 in mapred-site.xml,
> I received the following error when I started it with
> start-all.sh.
>
> org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
> /home/hadoop/tmp/mapred/system/jobtracker.info could only be replicated to
> 0
> nodes, instead of 1
>at
>
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1271)
>  at
> org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
>at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
>at
>
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>at java.lang.reflect.Method.invoke(Method.java:597)
>at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
>at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
>at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
>at java.security.AccessController.doPrivileged(Native Method)
>at javax.security.auth.Subject.doAs(Subject.java:396)
>at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
> ...
> Then I changed hdfs://hostname:54310/ to hdfs://ipAddress:54310/ and
> hdfs://hostname:54311 to hdfs://ipAddress:54311, and it started fine with
> start-all.sh.
> However, when I ran the wordcount example, I got the following error message.
>
> java.lang.IllegalArgumentException: Wrong FS:
>
> hdfs://ipAddress:54310/home/hadoop/tmp/mapred/system/job_201101041628_0005/job.xml,
> expected: hdfs://hostname:54310
>at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310)
>at
>
> org.apache.hadoop.hdfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:99)
>at
>
> org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:155)
>at
>
> org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
>at
> org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:745)
>at
> org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1664)
>at
> org.apache.hadoop.mapred.TaskTracker.access$1200(TaskTracker.java:97)
>at
>
> org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:1629)
>
> From the message above, it seems that hdfs://hostname:port is not suitable
> for running the example. What should I do?
>
> Note: ipAddress means the IP address I used; hostname means the host name I used.
>


Hadoop example

2011-01-04 Thread haiyan
I have a two-node Hadoop test setup. When I set fs.default.name to
hdfs://hostname:54310/ in core-site.xml and mapred.job.tracker to
hdfs://hostname:54311 in mapred-site.xml,
I received the following error when I started it with start-all.sh.

org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
/home/hadoop/tmp/mapred/system/jobtracker.info could only be replicated to 0
nodes, instead of 1
at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1271)
 at
org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422)
at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953)
...
Then I changed hdfs://hostname:54310/ to hdfs://ipAddress:54310/ and
hdfs://hostname:54311 to hdfs://ipAddress:54311, and it started fine with
start-all.sh.
However, when I ran the wordcount example, I got the following error message.

java.lang.IllegalArgumentException: Wrong FS:
hdfs://ipAddress:54310/home/hadoop/tmp/mapred/system/job_201101041628_0005/job.xml,
expected: hdfs://hostname:54310
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:310)
at
org.apache.hadoop.hdfs.DistributedFileSystem.checkPath(DistributedFileSystem.java:99)
at
org.apache.hadoop.hdfs.DistributedFileSystem.getPathName(DistributedFileSystem.java:155)
at
org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:453)
at
org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:745)
at
org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1664)
at
org.apache.hadoop.mapred.TaskTracker.access$1200(TaskTracker.java:97)
at
org.apache.hadoop.mapred.TaskTracker$TaskLauncher.run(TaskTracker.java:1629)

From the message above, it seems that hdfs://hostname:port is not suitable for
running the example. What should I do?

Note: ipAddress means the IP address I used; hostname means the host name I used.


Re: Data for Testing in Hadoop

2011-01-04 Thread Harsh J
You can use MR to generate the data itself. Check out GridMix in
Hadoop, or PigMix from Pig, for examples of general load tests.
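
For example, the stock examples jar ships a RandomWriter job that writes bulk
random data straight into HDFS. A minimal driver sketch (it assumes the
hadoop-*-examples jar is on your classpath, and the output directory is a
placeholder; the per-map and per-node size knobs are configuration properties
documented in that class):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.examples.RandomWriter;
import org.apache.hadoop.util.ToolRunner;

public class GenerateTestData {
  public static void main(String[] args) throws Exception {
    // Runs the RandomWriter MapReduce job, which fills the given HDFS
    // directory with SequenceFiles of random data.
    int exitCode = ToolRunner.run(new Configuration(), new RandomWriter(),
        new String[] { "/user/test/random-data" });
    System.exit(exitCode);
  }
}

The same job can also be launched from the shell with
bin/hadoop jar hadoop-*examples*.jar randomwriter <output dir>.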

On Tue, Jan 4, 2011 at 1:01 PM, Adarsh Sharma  wrote:
> Dear all,
>
> Designing the architecture is very important for Hadoop in production
> clusters.
>
> We are researching running Hadoop both on individual nodes and in a cloud
> environment (VMs).
>
> For this, I require some data for testing. Would anyone send me some links
> to data sets of different sizes (10 GB, 20 GB, 30 GB, 50 GB)?
> I shall be grateful for this kindness.
>
>
> Thanks & Regards
>
> Adarsh Sharma
>
>



-- 
Harsh J
www.harshj.com