Re: hadoop under cygwin issue

2010-02-03 Thread Alex Kozlov
Can you try a simpler job (just to make sure your setup works):

$ hadoop jar $HADOOP_INSTALL/hadoop-*-examples.jar pi 2 2

Alex K


Re: hadoop under cygwin issue

2010-02-03 Thread Brian Wolf

Alex, thanks for the help,  it seems to start now, however


$ bin/hadoop jar hadoop-*-examples.jar grep -fs local input output 
'dfs[a-z.]+'
10/02/03 20:02:41 WARN fs.FileSystem: "local" is a deprecated filesystem 
name. Use "file:///" instead.
10/02/03 20:02:43 INFO mapred.FileInputFormat: Total input paths to 
process : 3

10/02/03 20:02:44 INFO mapred.JobClient: Running job: job_201002031354_0013
10/02/03 20:02:45 INFO mapred.JobClient:  map 0% reduce 0%



it hangs here (is the pseudo-distributed cluster supposed to work?)


these are bottom of various log files

conf log file

fs.s3.impl = org.apache.hadoop.fs.s3.S3FileSystem
mapred.input.dir = file:/C:/OpenSSH/usr/local/hadoop-0.19.2/input
mapred.job.tracker.http.address = 0.0.0.0:50030
io.file.buffer.size = 4096
mapred.jobtracker.restart.recover = false
io.serializations = org.apache.hadoop.io.serializer.WritableSerialization
dfs.datanode.handler.count = 3
mapred.reduce.copy.backoff = 300
mapred.task.profile = false
dfs.replication.considerLoad = true
jobclient.output.filter = FAILED
mapred.tasktracker.map.tasks.maximum = 2
io.compression.codecs = org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec
fs.checkpoint.size = 67108864


bottom
namenode log

added to blk_6520091160827873550_1036 size 570
2010-02-03 20:02:43,826 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: 
ugi=brian,None,Administrators,Users ip=/127.0.0.1 cmd=create
src=/cygwin/tmp/hadoop-brian/mapred/system/job_201002031354_0013/job.xml
dst=null perm=brian:supergroup:rw-r--r--
2010-02-03 20:02:43,866 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: 
ugi=brian,None,Administrators,Users ip=/127.0.0.1 cmd=setPermission
src=/cygwin/tmp/hadoop-brian/mapred/system/job_201002031354_0013/job.xml
dst=null perm=brian:supergroup:rw-r--r--
2010-02-03 20:02:44,026 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
NameSystem.allocateBlock: 
/cygwin/tmp/hadoop-brian/mapred/system/job_201002031354_0013/job.xml. 
blk_517844159758473296_1037
2010-02-03 20:02:44,076 INFO org.apache.hadoop.hdfs.StateChange: BLOCK* 
NameSystem.addStoredBlock: blockMap updated: 127.0.0.1:50010 is added to 
blk_517844159758473296_1037 size 16238
2010-02-03 20:02:44,257 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: 
ugi=brian,None,Administrators,Users ip=/127.0.0.1 cmd=open
src=/cygwin/tmp/hadoop-brian/mapred/system/job_201002031354_0013/job.xml
dst=null perm=null
2010-02-03 20:02:44,527 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: 
ugi=brian,None,Administrators,Users ip=/127.0.0.1 cmd=open
src=/cygwin/tmp/hadoop-brian/mapred/system/job_201002031354_0013/job.jar
dst=null perm=null
2010-02-03 20:02:45,258 INFO 
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit: 
ugi=brian,None,Administrators,Users ip=/127.0.0.1 cmd=open
src=/cygwin/tmp/hadoop-brian/mapred/system/job_201002031354_0013/job.split
dst=null perm=null



bottom
datanode log

2010-02-03 20:02:44,046 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving block 
blk_517844159758473296_1037 src: /127.0.0.1:4069 dest: /127.0.0.1:50010
2010-02-03 20:02:44,076 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/127.0.0.1:4069, dest: /127.0.0.1:50010, bytes: 16238, op: HDFS_WRITE, 
cliID: DFSClient_-1424524646, srvID: 
DS-1812377383-192.168.1.5-50010-1265088397104, blockid: 
blk_517844159758473296_1037
2010-02-03 20:02:44,086 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode: PacketResponder 0 for 
block blk_517844159758473296_1037 terminating
2010-02-03 20:02:44,457 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/127.0.0.1:50010, dest: /127.0.0.1:4075, bytes: 16366, op: HDFS_READ, 
cliID: DFSClient_-548531246, srvID: 
DS-1812377383-192.168.1.5-50010-1265088397104, blockid: 
blk_517844159758473296_1037
2010-02-03 20:02:44,677 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/127.0.0.1:50010, dest: /127.0.0.1:4076, bytes: 135168, op: HDFS_READ, 
cliID: DFSClient_-548531246, srvID: 
DS-1812377383-192.168.1.5-50010-1265088397104, blockid: 
blk_-2806977820057440405_1035
2010-02-03 20:02:45,278 INFO 
org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: 
/127.0.0.1:50010, dest: /127.0.0.1:4077, bytes: 578, op: HDFS_READ, 
cliID: DFSClient_-548531246, srvID: 
DS-1812377383-192.168.1.5-50010-1265088397104, blockid: 
blk_6520091160827873550_1036
2010-02-03 20:04:10,451 INFO 
org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification 
succeeded for blk_3301977249866081256_1031
2010-02-03 20:09:35,658 INFO 
org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification 
succeeded for blk_9116729021606317943_1025
2010-02-03 20:09:44,671 INFO 
org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification 
succeeded for blk_8602436668984954947_1026






jobtracker log

Input size for jo

Re: Re: Inverse of a matrix using Map - Reduce

2010-02-03 Thread aa225
Hi,
   Any idea how this method will scale for dense matrices? The kind of matrices I
am going to be working with are 500,000 x 500,000. Will this be a problem? Also,
have you used this patch?

Best Regards from Buffalo

Abhishek Agrawal

SUNY- Buffalo
(716-435-7122)

On Wed 02/03/10  1:41 AM , Ganesh Swami gan...@iamganesh.com sent:
> What about the Moore-Penrose inverse?
> 
> http://en.wikipedia.org/wiki/Moore-Penrose_pseudoinverse
> 
> The pseudo-inverse coincides with the regular inverse when the matrix
> is non-singular. Moreover, it can be computed using the SVD.
> 
> Here's a patch for a MapReduce version of the SVD:
> https://issues.apache.org/jira/browse/MAHOUT-180
> Ganesh
> 
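For reference, the SVD route described above, written out (this is just the
standard definition, not anything specific to the MAHOUT patch): if

    A = U \Sigma V^T

is the singular value decomposition, then the Moore-Penrose pseudo-inverse is

    A^+ = V \Sigma^+ U^T

where \Sigma^+ replaces each non-zero singular value by its reciprocal (and
transposes). For a square, non-singular A this reduces to the ordinary inverse
A^{-1}.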
> On Tue, Feb 2, 2010 at 10:11 PM,  lo.edu> wrote:
> > Hello People,
> >      My name is Abhishek Agrawal. For the last few days I have been trying
> > to figure out how to calculate the inverse of a matrix using Map Reduce.
> > Matrix inversion has 2 common approaches: Gaussian-Jordan and the cofactor
> > of transpose method. But both of them don't seem to be suited too well for
> > Map-Reduce. Gaussian-Jordan involves blocking; cofactoring a matrix requires
> > repeated calculation of determinants.
> >
> > Can someone give me any pointers as to how to solve this problem?
> >
> > Best Regards from Buffalo
> >
> > Abhishek Agrawal
> >
> > SUNY- Buffalo
> > (716-435-7122)
> >
> 
> 
> 
> 
> 



Re: hadoop under cygwin issue

2010-02-03 Thread Brian Wolf
Thanks for the insight, Ed.  That's actually a pretty big "gestalt" for 
me; I have to process it a bit (I had read about it, of course).


Brian


Ed Mazur wrote:

Brian,

It looks like you're confusing your local file system with HDFS. HDFS
sits on top of your file system and is where data for (non-standalone)
Hadoop jobs comes from. You can poll it with "fs -ls ...", so do
something like "hadoop fs -lsr /" to see everything in HDFS. This will
probably shed some light on why your first attempt failed.
/user/brian/input should be a directory with several xml files.

Ed

On Wed, Feb 3, 2010 at 5:17 PM, Brian Wolf  wrote:
  

Alex Kozlov wrote:


Live Nodes  :   0

Your datanode is dead.  Look at the logs in the $HADOOP_HOME/logs directory
(or where your logs are) and check the errors.

Alex K

On Mon, Feb 1, 2010 at 1:59 PM, Brian Wolf  wrote:


  


Thanks for your help, Alex,

I managed to get past that problem, now I have this problem:

However, when I try to run this example as stated on the quickstart webpage:

bin/hadoop jar hadoop-*-examples.jar grep input  output 'dfs[a-z.]+'

I get this error;
=
java.io.IOException:   Not a file:
hdfs://localhost:9000/user/brian/input/conf
=
so it seems to default to my home directory when looking for "input"; it
apparently needs an absolute file path. However, when I run it that way:

$ bin/hadoop jar hadoop-*-examples.jar grep /usr/local/hadoop-0.19.2/input
 output 'dfs[a-z.]+'

==
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
hdfs://localhost:9000/usr/local/hadoop-0.19.2/input
==
It still isn't happy although this part -> /usr/local/hadoop-0.19.2/input
 <-  does exist


Aaron,

Thanks for your help. I carefully went through the steps again a couple
times , and ran

after this
bin/hadoop namenode -format

(by the way, it asks if I want to reformat, I've tried it both ways)


then


bin/start-dfs.sh

and

bin/start-all.sh


and then
bin/hadoop fs -put conf input

now the return for this seemed cryptic:


put: Target input/conf is a directory

(??)

 and when I tried

bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'

It says something about 0 nodes

(from log file)

2010-02-01 13:26:29,874 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit:
ugi=brian,None,Administrators,Users ip=/127.0.0.1 cmd=create

  src=/cygwin/tmp/hadoop-SYSTEM/mapred/system/job_201002011323_0001/job.jar
  dst=null perm=brian:supergroup:rw-r--r--
2010-02-01 13:26:30,045 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 3 on 9000, call

addBlock(/cygwin/tmp/hadoop-SYSTEM/mapred/system/job_201002011323_0001/job.jar,
DFSClient_725490811) from 127.0.0.1:3003: error: java.io.IOException:
File
/cygwin/tmp/hadoop-SYSTEM/mapred/system/job_201002011323_0001/job.jar
could
only be replicated to 0 nodes, instead of 1
java.io.IOException: File
/cygwin/tmp/hadoop-SYSTEM/mapred/system/job_201002011323_0001/job.jar
could
only be replicated to 0 nodes, instead of 1
 at

org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1287)
 at

org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at

sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 at

sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)




To maybe rule out something regarding ports or ssh , when I run netstat:

 TCP    127.0.0.1:9000         0.0.0.0:0              LISTENING
 TCP    127.0.0.1:9001         0.0.0.0:0              LISTENING


and when I browse to http://localhost:50070/


   Cluster Summary

21 files and directories, 0 blocks = 21 total. Heap Size is 8.01 MB /
992.31 MB (0%)
Configured Capacity :   0 KB
DFS Used:   0 KB
Non DFS Used:   0 KB
DFS Remaining   :   0 KB
DFS Used%   :   100 %
DFS Remaining%  :   0 %
Live Nodes  :   0
Dead Nodes  :   0


so I'm a bit still in the dark, I guess.

Thanks
Brian




Aaron Kimball wrote:




Brian, it looks like you missed a step in the instructions. You'll need
to
format the hdfs filesystem instance before starting the NameNode server:

You need to run:

$ bin/hadoop namenode -format

.. then you can do bin/start-dfs.sh
Hope this helps,
- Aaron


On Sat, Jan 30, 2010 at 12:27 AM, Brian Wolf  wrote:




  

Hi,

I am trying to run Hadoop 0.19.2 under cygwin as per directions on the
hadoop "quickstart" web page.

I know sshd is running and I can "ssh localhost" without a password.

Hadoop User Group (Bay Area) - Feb 17th at Yahoo!

2010-02-03 Thread Dekel Tankel

Hi all,

RSVP is open for the next monthly Bay Area Hadoop user group at the Yahoo! 
Sunnyvale Campus, Wednesday, Feb 17th, 6PM

Registration and Agenda are available here
http://www.meetup.com/hadoop/calendar/12497904/

Looking forward to seeing you there!

Dekel



Re: hadoop under cygwin issue

2010-02-03 Thread Alex Kozlov
Try

$ bin/hadoop jar hadoop-*-examples.jar grep
file:///usr/local/hadoop-0.19.2/input output 'dfs[a-z.]+'

file:/// is a magical prefix to force hadoop to look for the file in the
local FS

You can also force it to look into local FS by giving '-fs local' or '-fs
file:///' option to the hadoop executable

These options basically overwrite the *fs.default.name* configuration
setting, which should be in your core-site.xml file

You can also copy the content of the input directory to HDFS by executing

$ bin/hadoop fs -mkdir input
$ bin/hadoop fs -copyFromLocal input/* input

Hope this helps

Alex K
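
Since fs.default.name keeps coming up in this thread, this is roughly what the
entry looks like (a minimal sketch; on 0.19.x it normally lives in
conf/hadoop-site.xml, on 0.20 in core-site.xml, and the localhost value below
is just the usual pseudo-distributed setting):

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>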

On Wed, Feb 3, 2010 at 2:17 PM, Brian Wolf  wrote:

> Alex Kozlov wrote:
>
>> Live Nodes  :   0
>>
>> Your datanode is dead.  Look at the logs in the $HADOOP_HOME/logs directory
>> (or where your logs are) and check the errors.
>>
>> Alex K
>>
>> On Mon, Feb 1, 2010 at 1:59 PM, Brian Wolf  wrote:
>>
>>
>>
>
>
>
> Thanks for your help, Alex,
>
> I managed to get past that problem, now I have this problem:
>
> However, when I try to run this example as stated on the quickstart
> webpage:
>
>
> bin/hadoop jar hadoop-*-examples.jar grep input  output 'dfs[a-z.]+'
>
> I get this error;
> =
> java.io.IOException:   Not a file:
> hdfs://localhost:9000/user/brian/input/conf
> =
> so it seems to default to my home directory looking for "input" it
> apparently  needs an absolute filepath, however, when I  run that way:
>
> $ bin/hadoop jar hadoop-*-examples.jar grep /usr/local/hadoop-0.19.2/input
>  output 'dfs[a-z.]+'
>
> ==
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> hdfs://localhost:9000/usr/local/hadoop-0.19.2/input
> ==
> It still isn't happy although this part -> /usr/local/hadoop-0.19.2/input
>  <-  does exist
>
>  Aaron,
>>>
>>> Thanks for your help. I carefully went through the steps again a couple
>>> times , and ran
>>>
>>> after this
>>> bin/hadoop namenode -format
>>>
>>> (by the way, it asks if I want to reformat, I've tried it both ways)
>>>
>>>
>>> then
>>>
>>>
>>> bin/start-dfs.sh
>>>
>>> and
>>>
>>> bin/start-all.sh
>>>
>>>
>>> and then
>>> bin/hadoop fs -put conf input
>>>
>>> now the return for this seemed cryptic:
>>>
>>>
>>> put: Target input/conf is a directory
>>>
>>> (??)
>>>
>>>  and when I tried
>>>
>>> bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
>>>
>>> It says something about 0 nodes
>>>
>>> (from log file)
>>>
>>> 2010-02-01 13:26:29,874 INFO
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit:
>>> ugi=brian,None,Administrators,Users ip=/127.0.0.1 cmd=create
>>>
>>>  src=/cygwin/tmp/hadoop-SYSTEM/mapred/system/job_201002011323_0001/job.jar
>>>  dst=null perm=brian:supergroup:rw-r--r--
>>> 2010-02-01 13:26:30,045 INFO org.apache.hadoop.ipc.Server: IPC Server
>>> handler 3 on 9000, call
>>>
>>> addBlock(/cygwin/tmp/hadoop-SYSTEM/mapred/system/job_201002011323_0001/job.jar,
>>> DFSClient_725490811) from 127.0.0.1:3003: error: java.io.IOException:
>>> File
>>> /cygwin/tmp/hadoop-SYSTEM/mapred/system/job_201002011323_0001/job.jar
>>> could
>>> only be replicated to 0 nodes, instead of 1
>>> java.io.IOException: File
>>> /cygwin/tmp/hadoop-SYSTEM/mapred/system/job_201002011323_0001/job.jar
>>> could
>>> only be replicated to 0 nodes, instead of 1
>>>  at
>>>
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1287)
>>>  at
>>>
>>> org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
>>>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>  at
>>>
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>  at
>>>
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>
>>>
>>>
>>>
>>> To maybe rule out something regarding ports or ssh , when I run netstat:
>>>
>>>  TCP    127.0.0.1:9000         0.0.0.0:0              LISTENING
>>>  TCP    127.0.0.1:9001         0.0.0.0:0              LISTENING
>>>
>>>
>>> and when I browse to http://localhost:50070/
>>>
>>>
>>>Cluster Summary
>>>
>>> * * * 21 files and directories, 0 blocks = 21 total. Heap Size is 8.01 MB
>>> /
>>> 992.31 MB (0%)
>>> *
>>> Configured Capacity :   0 KB
>>> DFS Used:   0 KB
>>> Non DFS Used:   0 KB
>>> DFS Remaining   :   0 KB
>>> DFS Used%   :   100 %
>>> DFS Remaining%  :   0 %
>>> Live Nodes  :
>>> 0
>>> Dead Nodes  :
>>> 0
>>>
>>>
>>> so I'm a bit still in the dark, I guess.
>>>
>>> Thanks
>>> Brian
>>>
>>>
>>>
>>>
>>> Aaron Kimball wrote:
>>>
>>>
>>>
 Brian, it looks li

Re: setup cluster with cloudera repo

2010-02-03 Thread Jim Kusznir
These are physical machines, not EC2.

--Jim

On Wed, Feb 3, 2010 at 11:11 AM, zaki rahaman  wrote:
> Are these on physical machines or are you by chance running on EC2?
>
> On Wed, Feb 3, 2010 at 2:07 PM, Jim Kusznir  wrote:
>
>> Hi all:
>>
>> I need to set up a hadoop cluster.  The cluster is based on CentOS
>> 5.4, and I already have all the base OSes installed.
>>
>> I saw that Cloudera had a repo for hadoop CentOS, so I set up that
>> repo, and installed hadoop via yum.  Unfortunately, I'm now at the
>> "now what?" question.  Cloudera's website has many links to "confugre
>> your cluster" or "continue", but that takes one to a page saying
>> "we're redoing it, come back later".  This leaves me with no
>> documentation to follow to actually make this cluster work.
>>
>> How do I proceed?
>>
>> Thanks!
>> --Jim
>>
>
>
>
> --
> Zaki Rahaman
>


Re: hadoop under cygwin issue

2010-02-03 Thread Ed Mazur
Brian,

It looks like you're confusing your local file system with HDFS. HDFS
sits on top of your file system and is where data for (non-standalone)
Hadoop jobs comes from. You can poll it with "fs -ls ...", so do
something like "hadoop fs -lsr /" to see everything in HDFS. This will
probably shed some light on why your first attempt failed.
/user/brian/input should be a directory with several xml files.

Ed
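
A couple of concrete forms of the commands Ed mentions (standard FsShell usage;
the second path is the one from Brian's earlier error message):

$ bin/hadoop fs -lsr /                    # recursively list everything actually in HDFS
$ bin/hadoop fs -ls /user/brian/input     # the directory the failing grep job was reading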

On Wed, Feb 3, 2010 at 5:17 PM, Brian Wolf  wrote:
> Alex Kozlov wrote:
>>
>> Live Nodes      :       0
>>
>> Your datanode is dead.  Look at the logs in the $HADOOP_HOME/logs directory
>> (or where your logs are) and check the errors.
>>
>> Alex K
>>
>> On Mon, Feb 1, 2010 at 1:59 PM, Brian Wolf  wrote:
>>
>>
>
>
>
> Thanks for your help, Alex,
>
> I managed to get past that problem, now I have this problem:
>
> However, when I try to run this example as stated on the quickstart webpage:
>
> bin/hadoop jar hadoop-*-examples.jar grep input  output 'dfs[a-z.]+'
>
> I get this error;
> =
> java.io.IOException:       Not a file:
> hdfs://localhost:9000/user/brian/input/conf
> =
> so it seems to default to my home directory looking for "input" it
> apparently  needs an absolute filepath, however, when I  run that way:
>
> $ bin/hadoop jar hadoop-*-examples.jar grep /usr/local/hadoop-0.19.2/input
>  output 'dfs[a-z.]+'
>
> ==
> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist:
> hdfs://localhost:9000/usr/local/hadoop-0.19.2/input
> ==
> It still isn't happy although this part -> /usr/local/hadoop-0.19.2/input
>  <-  does exist
>>>
>>> Aaron,
>>>
>>> Thanks for your help. I carefully went through the steps again a couple
>>> times , and ran
>>>
>>> after this
>>> bin/hadoop namenode -format
>>>
>>> (by the way, it asks if I want to reformat, I've tried it both ways)
>>>
>>>
>>> then
>>>
>>>
>>> bin/start-dfs.sh
>>>
>>> and
>>>
>>> bin/start-all.sh
>>>
>>>
>>> and then
>>> bin/hadoop fs -put conf input
>>>
>>> now the return for this seemed cryptic:
>>>
>>>
>>> put: Target input/conf is a directory
>>>
>>> (??)
>>>
>>>  and when I tried
>>>
>>> bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
>>>
>>> It says something about 0 nodes
>>>
>>> (from log file)
>>>
>>> 2010-02-01 13:26:29,874 INFO
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit:
>>> ugi=brian,None,Administrators,Users    ip=/127.0.0.1    cmd=create
>>>
>>>  src=/cygwin/tmp/hadoop-SYSTEM/mapred/system/job_201002011323_0001/job.jar
>>>  dst=null    perm=brian:supergroup:rw-r--r--
>>> 2010-02-01 13:26:30,045 INFO org.apache.hadoop.ipc.Server: IPC Server
>>> handler 3 on 9000, call
>>>
>>> addBlock(/cygwin/tmp/hadoop-SYSTEM/mapred/system/job_201002011323_0001/job.jar,
>>> DFSClient_725490811) from 127.0.0.1:3003: error: java.io.IOException:
>>> File
>>> /cygwin/tmp/hadoop-SYSTEM/mapred/system/job_201002011323_0001/job.jar
>>> could
>>> only be replicated to 0 nodes, instead of 1
>>> java.io.IOException: File
>>> /cygwin/tmp/hadoop-SYSTEM/mapred/system/job_201002011323_0001/job.jar
>>> could
>>> only be replicated to 0 nodes, instead of 1
>>>  at
>>>
>>> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1287)
>>>  at
>>>
>>> org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
>>>  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>  at
>>>
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>  at
>>>
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>
>>>
>>>
>>>
>>> To maybe rule out something regarding ports or ssh , when I run netstat:
>>>
>>>  TCP    127.0.0.1:9000         0.0.0.0:0              LISTENING
>>>  TCP    127.0.0.1:9001         0.0.0.0:0              LISTENING
>>>
>>>
>>> and when I browse to http://localhost:50070/
>>>
>>>
>>>    Cluster Summary
>>>
>>> * * * 21 files and directories, 0 blocks = 21 total. Heap Size is 8.01 MB
>>> /
>>> 992.31 MB (0%)
>>> *
>>> Configured Capacity     :       0 KB
>>> DFS Used        :       0 KB
>>> Non DFS Used    :       0 KB
>>> DFS Remaining   :       0 KB
>>> DFS Used%       :       100 %
>>> DFS Remaining%  :       0 %
>>> Live Nodes      :       0
>>> Dead Nodes      :       0
>>>
>>>
>>> so I'm a bit still in the dark, I guess.
>>>
>>> Thanks
>>> Brian
>>>
>>>
>>>
>>>
>>> Aaron Kimball wrote:
>>>
>>>

 Brian, it looks like you missed a step in the instructions. You'll need
 to
 format the hdfs filesystem instance before starting the NameNode server:

 You need to run:

 $ bin/hadoop na

Re: hadoop under cygwin issue

2010-02-03 Thread Brian Wolf

Alex Kozlov wrote:

Live Nodes  :   0

Your datanode is dead.  Look at the logs in the $HADOOP_HOME/logs directory
(or where your logs are) and check the errors.

Alex K

On Mon, Feb 1, 2010 at 1:59 PM, Brian Wolf  wrote:

  




Thanks for your help, Alex,

I managed to get past that problem, now I have this problem:

However, when I try to run this example as stated on the quickstart webpage:

bin/hadoop jar hadoop-*-examples.jar grep input  output 'dfs[a-z.]+'

I get this error;
=
java.io.IOException:   Not a file: 
hdfs://localhost:9000/user/brian/input/conf

=
so it seems to default to my home directory when looking for "input"; it 
apparently needs an absolute file path. However, when I run it that way:


$ bin/hadoop jar hadoop-*-examples.jar grep 
/usr/local/hadoop-0.19.2/input  output 'dfs[a-z.]+'


==
org.apache.hadoop.mapred.InvalidInputException: Input path does not 
exist: hdfs://localhost:9000/usr/local/hadoop-0.19.2/input

==
It still isn't happy although this part -> 
/usr/local/hadoop-0.19.2/input<-  does exist

Aaron,

Thanks for your help. I carefully went through the steps again a couple
times , and ran

after this
bin/hadoop namenode -format

(by the way, it asks if I want to reformat, I've tried it both ways)


then


bin/start-dfs.sh

and

bin/start-all.sh


and then
bin/hadoop fs -put conf input

now the return for this seemed cryptic:


put: Target input/conf is a directory

(??)

 and when I tried

bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'

It says something about 0 nodes

(from log file)

2010-02-01 13:26:29,874 INFO
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.audit:
ugi=brian,None,Administrators,Users ip=/127.0.0.1 cmd=create
 src=/cygwin/tmp/hadoop-SYSTEM/mapred/system/job_201002011323_0001/job.jar
 dst=null perm=brian:supergroup:rw-r--r--
2010-02-01 13:26:30,045 INFO org.apache.hadoop.ipc.Server: IPC Server
handler 3 on 9000, call
addBlock(/cygwin/tmp/hadoop-SYSTEM/mapred/system/job_201002011323_0001/job.jar,
DFSClient_725490811) from 127.0.0.1:3003: error: java.io.IOException: File
/cygwin/tmp/hadoop-SYSTEM/mapred/system/job_201002011323_0001/job.jar could
only be replicated to 0 nodes, instead of 1
java.io.IOException: File
/cygwin/tmp/hadoop-SYSTEM/mapred/system/job_201002011323_0001/job.jar could
only be replicated to 0 nodes, instead of 1
  at
org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1287)
  at
org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:351)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)




To maybe rule out something regarding ports or ssh , when I run netstat:

 TCP    127.0.0.1:9000         0.0.0.0:0              LISTENING
 TCP    127.0.0.1:9001         0.0.0.0:0              LISTENING


and when I browse to http://localhost:50070/


Cluster Summary

21 files and directories, 0 blocks = 21 total. Heap Size is 8.01 MB /
992.31 MB (0%)
Configured Capacity :   0 KB
DFS Used:   0 KB
Non DFS Used:   0 KB
DFS Remaining   :   0 KB
DFS Used%   :   100 %
DFS Remaining%  :   0 %
Live Nodes  :   0
Dead Nodes  :   0


so I'm a bit still in the dark, I guess.

Thanks
Brian
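
Two standard commands report the same thing as the web UI, i.e. whether any
datanode ever registered (a sketch; the log file name follows the default
hadoop-<user>-datanode-<host>.log pattern):

$ bin/hadoop dfsadmin -report                  # configured capacity plus a list of live/dead datanodes
$ tail -n 50 logs/hadoop-*-datanode-*.log      # the datanode log usually says why it failed to start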




Aaron Kimball wrote:



Brian, it looks like you missed a step in the instructions. You'll need to
format the hdfs filesystem instance before starting the NameNode server:

You need to run:

$ bin/hadoop namenode -format

.. then you can do bin/start-dfs.sh
Hope this helps,
- Aaron


On Sat, Jan 30, 2010 at 12:27 AM, Brian Wolf  wrote:



  

Hi,

I am trying to run Hadoop 0.19.2 under cygwin as per directions on the
hadoop "quickstart" web page.

I know sshd is running and I can "ssh localhost" without a password.

This is from my hadoop-site.xml



<configuration>
<property><name>hadoop.tmp.dir</name><value>/cygwin/tmp/hadoop-${user.name}</value></property>
<property><name>fs.default.name</name><value>hdfs://localhost:9000</value></property>
<property><name>mapred.job.tracker</name><value>localhost:9001</value></property>
<property><name>mapred.job.reuse.jvm.num.tasks</name><value>-1</value></property>
<property><name>dfs.replication</name><value>1</value></property>
<property><name>dfs.permissions</name><value>false</value></property>
<property><name>webinterface.private.actions</name><value>true</value></property>
</configuration>



These are errors from my log files:


2010-01-30 00:03:33,091 INFO org.apache.hadoop.ipc.metrics.RpcMetrics:
Initializing RPC Metrics with hostName=NameNode, port=9000
2010-01-30 00:03:33,121 INFO
org.apache.hadoop.hdfs.server.namenode.NameNode: Namenode up at:
localhost/
127.0.0.1:9000
2010-01-30 00:03:33,161 INFO org.apache.hadoop.metrics.jvm.JvmMetrics:
Init

configuration file

2010-02-03 Thread Gang Luo
Hi,
I am writing a script to run a whole bunch of jobs automatically. But the 
configuration file doesn't seem to be working. I think there is something wrong in 
my command. 

The command in my script is like:
bin/hadoop jar myJarFile myClass -conf myConfigurationFile.xml  arg1  arg2 

I use conf.get() to show the value of some parameters. But the values are not 
what I define in that xml file.  Is there something wrong? 

Thanks.
-Gang
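
One common cause (an assumption, since the driver code isn't shown): -conf is
handled by GenericOptionsParser, which only runs when the job is launched
through ToolRunner; a main() that builds a JobConf directly never sees the
file. A minimal sketch of a driver wired up that way (class name and job setup
are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MyDriver extends Configured implements Tool {

  public int run(String[] args) throws Exception {
    // getConf() already contains whatever -conf / -D passed on the command line
    JobConf job = new JobConf(getConf(), MyDriver.class);
    // ... set mapper/reducer and input/output paths from args here ...
    JobClient.runJob(job);
    return 0;
  }

  public static void main(String[] args) throws Exception {
    // ToolRunner runs GenericOptionsParser, strips -conf/-D, and passes the rest to run()
    System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
  }
}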





Maven and Mini MR Cluster

2010-02-03 Thread Michael Basnight
I'm using Maven to run all my unit tests, and I have a unit test that creates a 
mini MR cluster. When I create this cluster, I get ClassNotFound errors for 
the core hadoop libs (Caused by: java.lang.ClassNotFoundException: 
org.apache.hadoop.mapred.Child). When I run the same test w/o creating the mini 
cluster, well.. it works fine. My HADOOP_HOME is set to the same version as my 
mvn repo, and points to a valid installation of hadoop. When I validate the 
classpath thru maven (dependency:build-classpath), it says that the core libs 
are on the classpath as well (sourced from my .m2 repository). I just can't 
figure out why hadoop's mini cluster can't find those jars. Running hadoop 
0.20.0. 

Any suggestions?


Re: EOFException and BadLink, but file descriptors number is ok?

2010-02-03 Thread Meng Mao
Also, which ulimit is the important one: the one for the user who is
running the job, or the hadoop user that owns the Hadoop processes?

On Tue, Feb 2, 2010 at 7:29 PM, Meng Mao  wrote:

> I've been trying to run a fairly small input file (300MB) on Cloudera
> Hadoop 0.20.1. The job I'm using probably writes to on the order of over
> 1000 part-files at once, across the whole grid. The grid has 33 nodes in it.
> I get the following exception in the run logs:
>
> 10/01/30 17:24:25 INFO mapred.JobClient:  map 100% reduce 12%
> 10/01/30 17:24:25 INFO mapred.JobClient: Task Id :
> attempt_201001261532_1137_r_13_0, Status : FAILED
> java.io.EOFException
> at java.io.DataInputStream.readByte(DataInputStream.java:250)
> at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
> at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
> at org.apache.hadoop.io.Text.readString(Text.java:400)
> at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2869)
> at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794)
> at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077)
> at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263)
>
> lots of EOFExceptions
>
> 10/01/30 17:24:25 INFO mapred.JobClient: Task Id :
> attempt_201001261532_1137_r_19_0, Status : FAILED
> java.io.IOException: Bad connect ack with firstBadLink 10.2.19.1:50010
> at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2871)
> at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794)
>  at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077)
> at
> org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263)
>
> 10/01/30 17:24:36 INFO mapred.JobClient:  map 100% reduce 11%
> 10/01/30 17:24:42 INFO mapred.JobClient:  map 100% reduce 12%
> 10/01/30 17:24:49 INFO mapred.JobClient:  map 100% reduce 13%
> 10/01/30 17:24:55 INFO mapred.JobClient:  map 100% reduce 14%
> 10/01/30 17:25:00 INFO mapred.JobClient:  map 100% reduce 15%
>
> From searching around, it seems like the most common cause of BadLink and
> EOFExceptions is when the nodes don't have enough file descriptors set. But
> across all the grid machines, the file-max has been set to 1573039.
> Furthermore, we set ulimit -n to 65536 using hadoop-env.sh.
>
> Where else should I be looking for what's causing this?
>
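
On the ulimit question: the limit that generally matters here is the one
inherited by the daemons themselves (DataNode/TaskTracker and the task JVMs
they spawn), i.e. the account the Hadoop processes run as, rather than the user
submitting the job. A quick way to see what a running daemon actually got
(plain Linux; the pid below is a placeholder, and /proc/<pid>/limits needs a
reasonably recent kernel, otherwise run ulimit -n as that user):

$ ps -eo user,pid,cmd | grep [D]ataNode          # which account owns the daemon, and its pid
$ grep 'open files' /proc/<datanode-pid>/limits  # the nofile limit the process is really running with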


Hadoop pipes and custom imput format

2010-02-03 Thread Mayra Mendoza
Hi,

I'm trying to run a C++ program with Pipes (it opens and saves images). I'm using
the WholeFileInputFormat; this is my execution line:

*Hadoop pipes -conf EscGris.xml -libjars pipe.jar -input input -output
output*

pipe.jar has WholeFileInputFormat, WholeFileRecordReader and
NullOutputFormat classes

*My configuration file is:*


  
<configuration>
<property><name>mapred.reduce.tasks</name><value>0</value></property>
<property><name>hadoop.pipes.executable</name><value>EscGris</value></property>
<property><name>mapred.input.format.class</name><value>pipe.WholeFileInputFormat</value></property>
<property><name>mapred.output.format.class</name><value>pipe.NullOutputFormat</value></property>
<property><name>hadoop.pipes.java.recordreader</name><value>false</value></property>
<property><name>hadoop.pipes.java.mapper</name><value>false</value></property>
<property><name>hadoop.pipes.java.reducer</name><value>false</value></property>
<property><name>keep.failed.task.files</name><value>true</value></property>
<property><name>mapred.system.dir</name><value>/tmp/hadoop/mapred/system</value></property>
<property><name>tmpjars</name><value>home/training/ProyectoGrado/03_ImgEscGris/c++/pipe.jar</value></property>
</configuration>



*But I get the following error:*

attempt_201002030521_0016_m_00_0:* Hadoop Pipes Exception: RecordReader
not defined* at
/home/oom/work/eclipse/hadoop-20/src/c++/pipes/impl/HadoopPipes.cc:691 in
virtual void HadoopPipes::TaskContextImpl::runMap(std::string, int, bool)
10/02/03 12:01:07 INFO mapred.JobClient: Task Id :
attempt_201002030521_0016_m_01_0, Status : FAILED
java.io.IOException: pipe child exception
at
org.apache.hadoop.mapred.pipes.Application.abort(Application.java:151)
at
org.apache.hadoop.mapred.pipes.PipesMapRunner.run(PipesMapRunner.java:101)
at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.io.EOFException
at java.io.DataInputStream.readByte(DataInputStream.java:250)
at org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
at
org.apache.hadoop.mapred.pipes.BinaryProtocol$UplinkReaderThread.run(BinaryProtocol.java:114)


*What am I doing wrong???*

Thank's a lot =0)
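
For what it's worth, the "RecordReader not defined" error is raised on the C++
side when hadoop.pipes.java.recordreader is false and the C++ application does
not register a RecordReader of its own. If the intent is for the Java-side
pipe.WholeFileInputFormat above to produce the records, one thing to try (a
guess, not verified against this job) is flipping that flag in the same
configuration file:

<property>
  <name>hadoop.pipes.java.recordreader</name>
  <value>true</value>
</property>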


Encountered NullPointerException with the WordCount example, hadoop common v0.20.1

2010-02-03 Thread Frank Du
Dear All,

Please help with the NullPointerException in the WordCount example. Sorry it's 
just simple code, but I am new to Hadoop. :-)

I am running v0.20.1 on Ubuntu 9.10.  The map tasks work perfectly, but there 
is an NPE in the reduce task. Is there anything wrong with the configuration or 
the Reduce.java code?

Below are the error messages from job tracker, and the attached are the source 
code. Thank you so much!




Error: java.lang.NullPointerException

at 
java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)

at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2683)

at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2605)



Error: java.lang.NullPointerException

at 
java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)

at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2683)

at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2605)



Error: java.lang.NullPointerException

at 
java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)

at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2683)

at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2605)



Error: java.lang.NullPointerException

at 
java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)

at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2683)

at 
org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2605)




- Frank





Re: sort at reduce side

2010-02-03 Thread Gang Luo
Thanks for the reply, Sriguru.
So, after the shuffle at the reduce side, are the spills actually stored as map files?

Why I ask is based on the following observations. On a 16-node cluster, when I do a 
map-side join, it takes 3 and a half minutes. When I do a reduce-side join on nearly 
the same amount of data, it takes 8 minutes before the map phase completes. I am sure 
the computation (map function) cannot cause that much difference, so the extra 4 
minutes can only be spent on sorting at the map side for the reduce-side join. I also 
notice that the sort time at the reduce side is only 30 sec (I cannot access the 
online jobtracker; the 30 sec is actually the time reduce takes to go from 33% 
complete to 66% complete).  The number of reduce tasks is much smaller than the number 
of map tasks, which means each reduce task sorts more data than each map task (I use 
the hash partitioner and the data is uniformly distributed).  The only explanation I 
can come up with for the big difference between the sort at the map side and at the 
reduce side is that the two sorts behave differently. 

Does anybody have ideas why the map phase takes so much longer for the reduce-side 
join than for the map-side join, and why there is such a big difference between the 
sort at the map side and at the reduce side?

P.S. I join a 7.5G file with a 100M file. The sort buffer at the reduce side is 
slightly larger than at the map side.


-Gang



----- Original Message -----
From: Srigurunath Chakravarthi 
To: "common-user@hadoop.apache.org" 
Sent: 2010/2/3 (Wed) 12:50:08 AM
Subject: RE: sort at reduce side

Hi Gang,

>kept in map file. If so, in order to efficiently sort the data, reducer
>actually only read the index part of each spill (which is a map file) and
>sort the keys, instead of reading whole records from disk and sort them. 

afaik, no. Reducers always fetch map output data and not indexes (even if the 
data is from the local node, where an index might be sufficient).

Regards,
Sriguru

>-Original Message-
>From: Gang Luo [mailto:lgpub...@yahoo.com.cn]
>Sent: Wednesday, February 03, 2010 10:40 AM
>To: common-user@hadoop.apache.org
>Subject: sort at reduce side
>
>Hi all,
>I want to know some more details about the sorting at the reduce side.
>
>The intermediate result generated at the map side is stored as map file
>which actually consists of two sub-files, namely index file and data file.
>The index file stores the keys and it could point to corresponding record
>stored in the data file.  What I think is that when intermediate result
>(even only part of it for each mapper) is shuffled to reducer, it is still
>kept in map file. If so, in order to efficiently sort the data, reducer
>actually only read the index part of each spill (which is a map file) and
>sort the keys, instead of reading whole records from disk and sort them.
>
>Does reducer actually do as what I expect?
>
>-Gang
>
>




Re: Example for using DistributedCache class

2010-02-03 Thread Udaya Lakshmi
Thanks nick. Its working.

Udaya

On 2/4/10, Jones, Nick  wrote:
> The files for the DC need to be on HDFS.
>
> Nick Jones
> Sent by radiation.
>
> On Feb 3, 2010, at 12:32 PM, "Udaya Lakshmi"  wrote:
>
>> Hi Nick,
>>  I am not able to start the following job. I have the file that has
>> to be
>> passed to distributedcache in the local filesystem of the task
>> tracker.
>>
>> Can you tell me if I am missing something?
>>
>> import org.apache.hadoop.fs.*;
>> import org.apache.hadoop.conf.*;
>> import org.apache.hadoop.mapred.*;
>> import org.apache.hadoop.io.*;
>> import org.apache.hadoop.mapred.*;
>> import org.apache.hadoop.mapred.lib.*;
>> import org.apache.hadoop.util.*;
>> import org.apache.hadoop.filecache.*;
>>
>> import java.io.*;
>> import java.util.*;
>> import java.text.SimpleDateFormat;
>> import java.net.*;
>>
>> public class Test extends Configured
>> {
>>
>>
>>
>>  public static class MapClass extends MapReduceBase implements
>> Mapper
>>  {
>>
>>   private FileSystem fs;
>>   private Path[] localFiles;
>>   private String str;
>>   public void configure(JobConf job)
>>   {
>> try{
>>   fs = FileSystem.getLocal(new Configuration());
>>   localFiles = DistributedCache.getLocalCacheFiles(job);
>>}
>> catch(IOException e){
>>   System.out.println("Exception while getting cached files");
>> }//catch(IOException e){
>>
>>   }//public void configure(JobConf job)
>>
>>public void map(LongWritable Key,Text
>> value,OutputCollector
>> output,Reporter reporter) throws IOException
>>{
>> BufferedReader readBuffer = new BufferedReader(new
>> FileReader(localFiles[0].toString()));
>> str = readBuffer.readLine();
>> output.collect(new Text(str),new Text(str));
>>}//public void map(LongWritable Key,Text
>> value,OutputCollector output,Reporter reporter) throws
>> IOException
>>
>>public void close() throws IOException
>>{
>>  fs.close();
>>}//public void close() throws IOException
>>
>>
>> }//public static class MapClass extends MapReduceBase implements
>> Mapper<>
>>
>>
>>
>>  public static class ReduceClass extends MapReduceBase implements
>> Reducer
>>  {
>>public void reduce(Text key,Iterator
>> values,OutputCollector output,Reporter reporter) throws
>> IOException
>>{
>>}//public void reduce(Text key,Iterator
>> values,OutputCollector output,Reporter reporter) throws
>> IOException
>>  }//public static class ReduceClass extends MapReduceBase implements
>> Reducer
>>
>>  public static void main(String[] args)
>>  {
>>JobConf conf = new JobConf(Test.class);
>>JobClient client = new JobClient();
>>conf.setMapperClass(Test.MapClass.class);
>>conf.setReducerClass(IdentityReducer.class);
>>conf.setOutputKeyClass(Text.class);
>>conf.setOutputValueClass(Text.class);
>>conf.setInputPath(new Path("input"));
>>conf.setOutputPath(new Path("output"));
>>try{
>>  DistributedCache.addCacheFile(new
>> URI("/home/udaya/hadoop-0.18.3/file_to_distribute"), conf);
>> }
>> catch(URISyntaxException e)
>> {}
>> try{
>> JobClient.runJob(conf);
>>   }
>> catch(Exception e)
>> {
>>  System.out.println("Error starting the job");
>> }
>>  }//public static void main(String[] args)
>> }//public class Test extends Configured implements Tools
>>
>> On Wed, Feb 3, 2010 at 7:27 PM, Nick Jones  wrote:
>>
>>> Hi Udaya,
>>> The following code uses already existing cache files as part of the
>>> map to
>>> process incoming data.  I apologize on the naming conventions, but
>>> the code
>>> had to be stripped.  I also removed several variable assignments,
>>> etc..
>>>
>>> public class MySpecialJob {
>>> public static class MyMapper extends MapReduceBase implements
>>> Mapper>> BigIntegerWritable> {
>>>
>>>   private Path[] dcfiles;
>>>   ...
>>>
>>>   public void configure(JobConf job) {
>>> // Load cached files
>>> dcfiles = new Path[0];
>>> try {
>>>   dcfiles = DistributedCache.getLocalCacheFiles(job);
>>> } catch (IOException ioe) {
>>>   System.err.println("Caught exception while getting cached
>>> files: " +
>>> StringUtils.stringifyException(ioe));
>>> }
>>>   }
>>>
>>>   public void map(LongWritable key, MyMapInputValueClass value,
>>> OutputCollector output,
>>> Reporter reporter) throws IOException {
>>> ...
>>> for (Path dcfile : dcfiles) {
>>>   if(dcfile.getName().equalsIgnoreCase(file_match)) {
>>> readbuffer = new BufferedReader(
>>>   new FileReader(dcfile.toString()));
>>> ...
>>> while((raw_line = readbuffer.readLine()) != null) {
>>>   ...
>>>
>>> public static void main(String[] args) throws Exception {
>>>   JobConf conf = new JobConf(MySpecialJob.class);
>>>   ...
>>>   DistributedCache.addCacheFile(new URI("/path/to/file1.txt"), conf);
>>>   DistributedCache.addCacheFile(new URI("/path/to/file2.txt"), conf);
>>>   DistributedCache.addCacheFile(new URI("/path/to/file3.txt

Re: setup cluster with cloudera repo

2010-02-03 Thread Todd Lipcon
Hi Jim,

Sorry about the broken links. We just launched a new website a couple days
ago and a few of the pages are still in transition.

This link should help you get started:

http://archive.cloudera.com/docs/cdh2-pseudo-distributed.html

Thanks
-Todd

On Wed, Feb 3, 2010 at 11:07 AM, Jim Kusznir  wrote:

> Hi all:
>
> I need to set up a hadoop cluster.  The cluster is based on CentOS
> 5.4, and I already have all the base OSes installed.
>
> I saw that Cloudera had a repo for hadoop CentOS, so I set up that
> repo, and installed hadoop via yum.  Unfortunately, I'm now at the
> "now what?" question.  Cloudera's website has many links to "confugre
> your cluster" or "continue", but that takes one to a page saying
> "we're redoing it, come back later".  This leaves me with no
> documentation to follow to actually make this cluster work.
>
> How do I proceed?
>
> Thanks!
> --Jim
>


RE: setup cluster with cloudera repo

2010-02-03 Thread Bill Habermaas
So you have hadoop installed and not configured/running.
I suggest you visit the hadoop website and review the QuickStart guide. 
You need to understand how to configure the system and then extrapolate to
your situation. 

Bill

-Original Message-
From: Jim Kusznir [mailto:jkusz...@gmail.com] 
Sent: Wednesday, February 03, 2010 2:08 PM
To: common-user
Subject: setup cluster with cloudera repo

Hi all:

I need to set up a hadoop cluster.  The cluster is based on CentOS
5.4, and I already have all the base OSes installed.

I saw that Cloudera had a repo for hadoop CentOS, so I set up that
repo, and installed hadoop via yum.  Unfortunately, I'm now at the
"now what?" question.  Cloudera's website has many links to "confugre
your cluster" or "continue", but that takes one to a page saying
"we're redoing it, come back later".  This leaves me with no
documentation to follow to actually make this cluster work.

How do I proceed?

Thanks!
--Jim




Re: setup cluster with cloudera repo

2010-02-03 Thread zaki rahaman
Are these on physical machines or are you by chance running on EC2?

On Wed, Feb 3, 2010 at 2:07 PM, Jim Kusznir  wrote:

> Hi all:
>
> I need to set up a hadoop cluster.  The cluster is based on CentOS
> 5.4, and I already have all the base OSes installed.
>
> I saw that Cloudera had a repo for hadoop CentOS, so I set up that
> repo, and installed hadoop via yum.  Unfortunately, I'm now at the
> "now what?" question.  Cloudera's website has many links to "confugre
> your cluster" or "continue", but that takes one to a page saying
> "we're redoing it, come back later".  This leaves me with no
> documentation to follow to actually make this cluster work.
>
> How do I proceed?
>
> Thanks!
> --Jim
>



-- 
Zaki Rahaman


setup cluster with cloudera repo

2010-02-03 Thread Jim Kusznir
Hi all:

I need to set up a hadoop cluster.  The cluster is based on CentOS
5.4, and I already have all the base OSes installed.

I saw that Cloudera had a repo for hadoop CentOS, so I set up that
repo, and installed hadoop via yum.  Unfortunately, I'm now at the
"now what?" question.  Cloudera's website has many links to "confugre
your cluster" or "continue", but that takes one to a page saying
"we're redoing it, come back later".  This leaves me with no
documentation to follow to actually make this cluster work.

How do I proceed?

Thanks!
--Jim


Re: Example for using DistributedCache class

2010-02-03 Thread Jones, Nick
The files for the DC need to be on HDFS.

Nick Jones
Sent by radiation.
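
A minimal sketch of what that looks like for the file in the code below (the
HDFS target path is just an example):

$ bin/hadoop fs -put /home/udaya/hadoop-0.18.3/file_to_distribute /user/udaya/file_to_distribute

// then register the HDFS path, not the local one:
DistributedCache.addCacheFile(new URI("/user/udaya/file_to_distribute"), conf);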

On Feb 3, 2010, at 12:32 PM, "Udaya Lakshmi"  wrote:

> Hi Nick,
>  I am not able to start the following job. I have the file that has  
> to be
> passed to distributedcache in the local filesystem of the task  
> tracker.
>
> Can you tell me if I am missing something?
>
> import org.apache.hadoop.fs.*;
> import org.apache.hadoop.conf.*;
> import org.apache.hadoop.mapred.*;
> import org.apache.hadoop.io.*;
> import org.apache.hadoop.mapred.*;
> import org.apache.hadoop.mapred.lib.*;
> import org.apache.hadoop.util.*;
> import org.apache.hadoop.filecache.*;
>
> import java.io.*;
> import java.util.*;
> import java.text.SimpleDateFormat;
> import java.net.*;
>
> public class Test extends Configured
> {
>
>
>
>  public static class MapClass extends MapReduceBase implements
> Mapper
>  {
>
>   private FileSystem fs;
>   private Path[] localFiles;
>   private String str;
>   public void configure(JobConf job)
>   {
> try{
>   fs = FileSystem.getLocal(new Configuration());
>   localFiles = DistributedCache.getLocalCacheFiles(job);
>}
> catch(IOException e){
>   System.out.println("Exception while getting cached files");
> }//catch(IOException e){
>
>   }//public void configure(JobConf job)
>
>public void map(LongWritable Key,Text  
> value,OutputCollector
> output,Reporter reporter) throws IOException
>{
> BufferedReader readBuffer = new BufferedReader(new
> FileReader(localFiles[0].toString()));
> str = readBuffer.readLine();
> output.collect(new Text(str),new Text(str));
>}//public void map(LongWritable Key,Text
> value,OutputCollector output,Reporter reporter) throws
> IOException
>
>public void close() throws IOException
>{
>  fs.close();
>}//public void close() throws IOException
>
>
> }//public static class MapClass extends MapReduceBase implements  
> Mapper<>
>
>
>
>  public static class ReduceClass extends MapReduceBase implements
> Reducer
>  {
>public void reduce(Text key,Iterator
> values,OutputCollector output,Reporter reporter) throws
> IOException
>{
>}//public void reduce(Text key,Iterator
> values,OutputCollector output,Reporter reporter) throws
> IOException
>  }//public static class ReduceClass extends MapReduceBase implements
> Reducer
>
>  public static void main(String[] args)
>  {
>JobConf conf = new JobConf(Test.class);
>JobClient client = new JobClient();
>conf.setMapperClass(Test.MapClass.class);
>conf.setReducerClass(IdentityReducer.class);
>conf.setOutputKeyClass(Text.class);
>conf.setOutputValueClass(Text.class);
>conf.setInputPath(new Path("input"));
>conf.setOutputPath(new Path("output"));
>try{
>  DistributedCache.addCacheFile(new
> URI("/home/udaya/hadoop-0.18.3/file_to_distribute"), conf);
> }
> catch(URISyntaxException e)
> {}
> try{
> JobClient.runJob(conf);
>   }
> catch(Exception e)
> {
>  System.out.println("Error starting the job");
> }
>  }//public static void main(String[] args)
> }//public class Test extends Configured implements Tools
>
> On Wed, Feb 3, 2010 at 7:27 PM, Nick Jones  wrote:
>
>> Hi Udaya,
>> The following code uses already existing cache files as part of the  
>> map to
>> process incoming data.  I apologize on the naming conventions, but  
>> the code
>> had to be stripped.  I also removed several variable assignments,  
>> etc..
>>
>> public class MySpecialJob {
>> public static class MyMapper extends MapReduceBase implements
>> Mapper> BigIntegerWritable> {
>>
>>   private Path[] dcfiles;
>>   ...
>>
>>   public void configure(JobConf job) {
>> // Load cached files
>> dcfiles = new Path[0];
>> try {
>>   dcfiles = DistributedCache.getLocalCacheFiles(job);
>> } catch (IOException ioe) {
>>   System.err.println("Caught exception while getting cached  
>> files: " +
>> StringUtils.stringifyException(ioe));
>> }
>>   }
>>
>>   public void map(LongWritable key, MyMapInputValueClass value,
>> OutputCollector output,
>> Reporter reporter) throws IOException {
>> ...
>> for (Path dcfile : dcfiles) {
>>   if(dcfile.getName().equalsIgnoreCase(file_match)) {
>> readbuffer = new BufferedReader(
>>   new FileReader(dcfile.toString()));
>> ...
>> while((raw_line = readbuffer.readLine()) != null) {
>>   ...
>>
>> public static void main(String[] args) throws Exception {
>>   JobConf conf = new JobConf(MySpecialJob.class);
>>   ...
>>   DistributedCache.addCacheFile(new URI("/path/to/file1.txt"), conf);
>>   DistributedCache.addCacheFile(new URI("/path/to/file2.txt"), conf);
>>   DistributedCache.addCacheFile(new URI("/path/to/file3.txt"), conf);
>>   ...
>> }
>> }
>>
>> Nick Jones
>>
>>
>>
>> Udaya Lakshmi wrote:
>>
>>> Hi,
>>>  As a newbie to hadoop, I am not able to figure out how to use
>>> DistributedCache class. Can someone give me a small co

Re: Example for using DistributedCache class

2010-02-03 Thread Udaya Lakshmi
Hi Nick,
  I am not able to start the following job. I have the file that has to be
passed to distributedcache in the local filesystem of the task tracker.

 Can you tell me if I am missing something?

import org.apache.hadoop.fs.*;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.*;
import org.apache.hadoop.util.*;
import org.apache.hadoop.filecache.*;

import java.io.*;
import java.util.*;
import java.text.SimpleDateFormat;
import java.net.*;

public class Test extends Configured
{



  public static class MapClass extends MapReduceBase implements
Mapper<LongWritable, Text, Text, Text>
  {

   private FileSystem fs;
   private Path[] localFiles;
   private String str;
   public void configure(JobConf job)
   {
 try{
   fs = FileSystem.getLocal(new Configuration());
   localFiles = DistributedCache.getLocalCacheFiles(job);
}
 catch(IOException e){
   System.out.println("Exception while getting cached files");
 }//catch(IOException e){

   }//public void configure(JobConf job)

public void map(LongWritable key, Text value, OutputCollector<Text, Text> output,
Reporter reporter) throws IOException
{
 BufferedReader readBuffer = new BufferedReader(new
FileReader(localFiles[0].toString()));
 str = readBuffer.readLine();
 output.collect(new Text(str), new Text(str));
}//public void map(...)

public void close() throws IOException
{
  fs.close();
}//public void close() throws IOException


}//public static class MapClass



  public static class ReduceClass extends MapReduceBase implements
Reducer<Text, Text, Text, Text>
  {
public void reduce(Text key, Iterator<Text> values,
OutputCollector<Text, Text> output, Reporter reporter) throws IOException
{
}//public void reduce(...)
  }//public static class ReduceClass

  public static void main(String[] args)
  {
JobConf conf = new JobConf(Test.class);
JobClient client = new JobClient();
conf.setMapperClass(Test.MapClass.class);
conf.setReducerClass(IdentityReducer.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setInputPath(new Path("input"));
conf.setOutputPath(new Path("output"));
try{
  DistributedCache.addCacheFile(new
URI("/home/udaya/hadoop-0.18.3/file_to_distribute"), conf);
 }
 catch(URISyntaxException e)
 {}
 try{
 JobClient.runJob(conf);
   }
 catch(Exception e)
 {
  System.out.println("Error starting the job");
 }
  }//public static void main(String[] args)
}//public class Test extends Configured

On Wed, Feb 3, 2010 at 7:27 PM, Nick Jones  wrote:

> Hi Udaya,
> The following code uses already existing cache files as part of the map to
> process incoming data. I apologize for the naming conventions, but the code
> had to be stripped. I also removed several variable assignments, etc.
>
> public class MySpecialJob {
>  public static class MyMapper extends MapReduceBase implements
> Mapper<LongWritable, MyMapInputValueClass, ..., BigIntegerWritable> {
>
>private Path[] dcfiles;
>...
>
>public void configure(JobConf job) {
>  // Load cached files
>  dcfiles = new Path[0];
>  try {
>dcfiles = DistributedCache.getLocalCacheFiles(job);
>  } catch (IOException ioe) {
>System.err.println("Caught exception while getting cached files: " +
> StringUtils.stringifyException(ioe));
>  }
>}
>
>public void map(LongWritable key, MyMapInputValueClass value,
>  OutputCollector output,
>  Reporter reporter) throws IOException {
>  ...
>  for (Path dcfile : dcfiles) {
>if(dcfile.getName().equalsIgnoreCase(file_match)) {
>  readbuffer = new BufferedReader(
>new FileReader(dcfile.toString()));
>  ...
>  while((raw_line = readbuffer.readLine()) != null) {
>...
>
>  public static void main(String[] args) throws Exception {
>JobConf conf = new JobConf(MySpecialJob.class);
>...
>DistributedCache.addCacheFile(new URI("/path/to/file1.txt"), conf);
>DistributedCache.addCacheFile(new URI("/path/to/file2.txt"), conf);
>DistributedCache.addCacheFile(new URI("/path/to/file3.txt"), conf);
>...
>  }
> }
>
> Nick Jones
>
>
>
> Udaya Lakshmi wrote:
>
>> Hi,
>>   As a newbie to hadoop, I am not able to figure out how to use
>> DistributedCache class. Can someone give me a small piece of code which
>> distributes a file to the cluster and then shows how to open and use the
>> file in the map or reduce task.
>> Thanks,
>> Udaya
>>
>>
>
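
For reference, here is a minimal end-to-end sketch of the pattern discussed in
this thread, written against the old org.apache.hadoop.mapred API used above
(0.18/0.19 era). The class name, input/output paths and the cached file path
(/user/hadoop/lookup.txt) are placeholders rather than anything taken from the
thread. The one assumption that matters is that the cached file already sits in
HDFS: a URI without a scheme is resolved against the default filesystem, and the
framework then copies the file down to each tasktracker's local cache, so a file
that only exists on a tasktracker's local disk will not be picked up.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class CacheExample {

  public static class CacheMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {

    private String cachedLine = "";

    // configure() runs once per task, so read the localized copy of the
    // cached file here rather than on every call to map().
    public void configure(JobConf job) {
      try {
        Path[] cached = DistributedCache.getLocalCacheFiles(job);
        if (cached != null && cached.length > 0) {
          BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
          String line = in.readLine();
          if (line != null) {
            cachedLine = line;
          }
          in.close();
        }
      } catch (IOException e) {
        // Fail loudly instead of swallowing the problem in an empty catch block.
        throw new RuntimeException("Could not read cached file", e);
      }
    }

    public void map(LongWritable key, Text value,
        OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      // Tag every input line with the cached line, just to prove the cache works.
      output.collect(new Text(cachedLine), value);
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CacheExample.class);
    conf.setJobName("distributed-cache-sketch");
    conf.setMapperClass(CacheMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileInputFormat.setInputPaths(conf, new Path("input"));
    FileOutputFormat.setOutputPath(conf, new Path("output"));

    // Resolved against the default filesystem (HDFS in a normal setup), so copy
    // the file there first, e.g. hadoop fs -put lookup.txt /user/hadoop/lookup.txt
    DistributedCache.addCacheFile(new URI("/user/hadoop/lookup.txt"), conf);

    JobClient.runJob(conf);
  }
}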


Re: Hadoop cluster setup

2010-02-03 Thread Kay Kay


On 2/3/10 6:12 AM, janani venkat wrote:

Thank you.
And can you give me a tutorial or something similar for connecting them over a LAN?


Start with buying the cable.

The people on the list are happy to assist you with hadoop-related
questions.


Feel free to use a search engine of your choice to get the relevant
background as pointed to in a given thread. You've got to do your own
homework.





Because I am very new to this, it would be greatly useful to me.

On Wed, Feb 3, 2010 at 7:28 PM, Habermaas, William<
william.haberm...@fatwire.com>  wrote:

   

You can set up the machines and configure them without being connected
over a network. But once you want to start up the services, all machines
have to be active and reachable on the LAN.

Bill


-Original Message-
From: janani venkat [mailto:janani.cs...@gmail.com]
Sent: Wednesday, February 03, 2010 8:51 AM
To: common-user@hadoop.apache.org
Subject: Hadoop cluster setup

Hi
I'm a beginner working with Hadoop. I want to know if we have to
physically connect the machines using a LAN cable before setting up the
cluster.
I urgently need to clarify this so I can start my work.
Regards
Janani

 
   




RE: aws

2010-02-03 Thread Sirota, Peter
Elastic MapReduce uses Hadoop 0.18.3 with several patches that improve S3N
performance/reliability.  
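
For anyone wiring this up on their own cluster rather than on EMR, the S3N
filesystem mentioned above is addressed with s3n:// URIs. A small sketch
follows; the bucket name, paths and credential values are placeholders, and the
job itself is just an identity pass-through to keep the example short.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class S3nInputSketch {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(S3nInputSketch.class);
    conf.setJobName("s3n-input-sketch");

    // Credentials for the native S3 filesystem; these can also be placed in
    // hadoop-site.xml instead of being set in code.
    conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
    conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");

    // Identity job: default TextInputFormat keys/values pass straight through.
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setOutputKeyClass(LongWritable.class);
    conf.setOutputValueClass(Text.class);

    // Read directly from a bucket, write the result back into HDFS.
    FileInputFormat.setInputPaths(conf, new Path("s3n://my-bucket/input/"));
    FileOutputFormat.setOutputPath(conf, new Path("output-from-s3"));

    JobClient.runJob(conf);
  }
}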



-Original Message-
From: Kay Kay [mailto:kaykay.uni...@gmail.com] 
Sent: Wednesday, February 03, 2010 9:43 AM
To: common-user@hadoop.apache.org
Subject: Re: aws

Peter,
   Out of curiosity - what versions of Hadoop DFS and M-R are being used
behind the scenes?


On 2/2/10 11:26 PM, Sirota, Peter wrote:
> Hi Brian,
>
> AWS has Elastic MapReduce service where you can run Hadoop starting at
> 10 cents per hour.  Check it out at
> http://aws.amazon.com/elasticmapreduce
>
> Disclaimer: I work at AWS
>
>
> Sent from my phone
>
> On Feb 2, 2010, at 11:09 PM, "Brian Wolf"  wrote:
>
>
>> Hi,
>>
>> Can anybody tell me if AWS/Amazon has any kind of Hadoop sandbox
>> to play in for free?
>>
>> Thanks
>>
>> Brian
>>
>>
>>  



Re: aws

2010-02-03 Thread Kay Kay

Peter,
  Out of curiosity - what versions of Hadoop DFS and M-R are being used
behind the scenes?



On 2/2/10 11:26 PM, Sirota, Peter wrote:

Hi Brian,

AWS has Elastic MapReduce service where you can run Hadoop starting at
10 cents per hour.  Check it out at
http://aws.amazon.com/elasticmapreduce

Disclaimer: I work at AWS


Sent from my phone

On Feb 2, 2010, at 11:09 PM, "Brian Wolf"  wrote:

   

Hi,

Can anybody tell me if AWS/Amazon has any kind of Hadoop sandbox
to play in for free?

Thanks

Brian


 




Re: Hadoop cluster setup

2010-02-03 Thread Edward Capriolo
On Wed, Feb 3, 2010 at 9:12 AM, janani venkat  wrote:
> Thank you.
> And can you give me a tutorial or something similar for connecting them over
> a LAN? Because I am very new to this, it would be greatly useful to me.
>
> On Wed, Feb 3, 2010 at 7:28 PM, Habermaas, William <
> william.haberm...@fatwire.com> wrote:
>
>> You can set up the machines and configure them without being connected
>> over a network. But once you want to start up the services, all machines
>> have to be active and reachable on the LAN.
>>
>> Bill
>>
>>
>> -Original Message-
>> From: janani venkat [mailto:janani.cs...@gmail.com]
>> Sent: Wednesday, February 03, 2010 8:51 AM
>> To: common-user@hadoop.apache.org
>> Subject: Hadoop cluster setup
>>
>> Hi
>> I'm a beginner working with Hadoop. I want to know if we have to
>> physically connect the machines using a LAN cable before setting up the
>> cluster.
>> I urgently need to clarify this so I can start my work.
>> Regards
>> Janani
>>
>

You should connect them the way you would connect any computers.
Hadoop setup documents assume you are familiar with computer networking
(IP) fundamentals. If you are not proficient in this, you should
contact your network administrator for help.


Re: sort at reduce side

2010-02-03 Thread Edward Capriolo
2010/2/3 Srigurunath Chakravarthi :
> Hi Gang,
>
>>kept in map file. If so, in order to efficiently sort the data, reducer
>>actually only read the index part of each spill (which is a map file) and
>>sort the keys, instead of reading whole records from disk and sort them.
>
>  afaik, no. Reducers always fetch map output data and not indexes (even if
> the data is from the local node, where an index may be sufficient).
>
> Regards,
> Sriguru
>
>>-Original Message-
>>From: Gang Luo [mailto:lgpub...@yahoo.com.cn]
>>Sent: Wednesday, February 03, 2010 10:40 AM
>>To: common-user@hadoop.apache.org
>>Subject: sort at reduce side
>>
>>Hi all,
>>I want to know some more details about the sorting at the reduce side.
>>
>>The intermediate result generated at the map side is stored as a map file,
>>which actually consists of two sub-files, namely an index file and a data file.
>>The index file stores the keys and can point to the corresponding records
>>stored in the data file. What I think is that when the intermediate result
>>(even only part of it for each mapper) is shuffled to a reducer, it is still
>>kept as a map file. If so, in order to efficiently sort the data, the reducer
>>actually only reads the index part of each spill (which is a map file) and
>>sorts the keys, instead of reading whole records from disk and sorting them.
>>
>>Does reducer actually do as what I expect?
>>
>>-Gang
>>
>>
>

With 0.20 and the TotalOrderPartitioner, isn't reduce-side sorting
possible now? Is that support we can/should add to Hive?
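
For anyone wanting to try it, 0.19/0.20 ship org.apache.hadoop.mapred.lib.TotalOrderPartitioner
together with InputSampler in the old API. Below is a rough sketch of how they
appear to be wired together; the paths, sampling parameters and the use of Text
keys are assumptions, and the method names are from memory, so please check it
against the Javadoc before relying on it.

// Rough sketch (not verified): globally sorted output across reducers using
// TotalOrderPartitioner with the old mapred API. Assumes the input is readable
// by KeyValueTextInputFormat, i.e. the map input key type matches the map
// output key type (Text), which the sampler requires.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.hadoop.mapred.lib.InputSampler;
import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

public class TotalSortSketch {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(TotalSortSketch.class);
    conf.setJobName("total-order-sort-sketch");

    conf.setInputFormat(KeyValueTextInputFormat.class);
    conf.setMapperClass(IdentityMapper.class);
    conf.setReducerClass(IdentityReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setNumReduceTasks(4);  // one key range per reducer, ordered across reducers
    conf.setPartitionerClass(TotalOrderPartitioner.class);

    FileInputFormat.setInputPaths(conf, new Path("input"));
    FileOutputFormat.setOutputPath(conf, new Path("sorted-output"));

    // Sample the input keys, write the partition boundaries to a file in HDFS,
    // and point the partitioner at it. The reducer count must be set before
    // sampling, since one boundary is written per partition.
    Path partitionFile = new Path("_sort_partitions");  // hypothetical location
    TotalOrderPartitioner.setPartitionFile(conf, partitionFile);
    InputSampler.Sampler<Text, Text> sampler =
        new InputSampler.RandomSampler<Text, Text>(0.1, 1000, 10);
    InputSampler.writePartitionFile(conf, sampler);

    JobClient.runJob(conf);
  }
}

As far as I can tell, this gives one sorted key range per reducer, so
concatenating part-00000 through part-00003 yields a fully sorted dataset.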


Re: Hadoop cluster setup

2010-02-03 Thread janani venkat
Thank you.
And can you give me a tutorial or something similar for connecting them over
a LAN? Because I am very new to this, it would be greatly useful to me.

On Wed, Feb 3, 2010 at 7:28 PM, Habermaas, William <
william.haberm...@fatwire.com> wrote:

> You can set up the machines and configure them without being connected
> over a network. But once you want to start up the services, all machines
> have to be active and reachable on the LAN.
>
> Bill
>
>
> -Original Message-
> From: janani venkat [mailto:janani.cs...@gmail.com]
> Sent: Wednesday, February 03, 2010 8:51 AM
> To: common-user@hadoop.apache.org
> Subject: Hadoop cluster setup
>
> Hi
> I'm a beginner working with Hadoop. I want to know if we have to
> physically connect the machines using a LAN cable before setting up the
> cluster.
> I urgently need to clarify this so I can start my work.
> Regards
> Janani
>


Re: Example for using DistributedCache class

2010-02-03 Thread Nick Jones

Hi Udaya,
The following code uses already existing cache files as part of the map 
to process incoming data. I apologize for the naming conventions, but
the code had to be stripped. I also removed several variable
assignments, etc.


public class MySpecialJob {
  public static class MyMapper extends MapReduceBase implements
Mapper<LongWritable, MyMapInputValueClass, ..., BigIntegerWritable> {


private Path[] dcfiles;
...

public void configure(JobConf job) {
  // Load cached files
  dcfiles = new Path[0];
  try {
dcfiles = DistributedCache.getLocalCacheFiles(job);
  } catch (IOException ioe) {
System.err.println("Caught exception while getting cached 
files: " + StringUtils.stringifyException(ioe));

  }
}

public void map(LongWritable key, MyMapInputValueClass value,
  OutputCollector output,
  Reporter reporter) throws IOException {
  ...
  for (Path dcfile : dcfiles) {
if(dcfile.getName().equalsIgnoreCase(file_match)) {
  readbuffer = new BufferedReader(
new FileReader(dcfile.toString()));
  ...
  while((raw_line = readbuffer.readLine()) != null) {
...

  public static void main(String[] args) throws Exception {
JobConf conf = new JobConf(MySpecialJob.class);
... 
DistributedCache.addCacheFile(new URI("/path/to/file1.txt"), conf);
DistributedCache.addCacheFile(new URI("/path/to/file2.txt"), conf);
DistributedCache.addCacheFile(new URI("/path/to/file3.txt"), conf);
...
  }
}

Nick Jones


Udaya Lakshmi wrote:

Hi,
   As a newbie to hadoop, I am not able to figure out how to use
DistributedCache class. Can someone give me a small piece of code which
distributes a file to the cluster and then shows how to open and use the
file in the map or reduce task.
Thanks,
Udaya





RE: Hadoop cluster setup

2010-02-03 Thread Habermaas, William
You can set up the machines and configure them without being connected
over a network. But once you want to start up the services, all machines
have to be active and reachable on the LAN.

Bill


-Original Message-
From: janani venkat [mailto:janani.cs...@gmail.com] 
Sent: Wednesday, February 03, 2010 8:51 AM
To: common-user@hadoop.apache.org
Subject: Hadoop cluster setup

Hi
I'm a beginner working with Hadoop. I want to know if we have to
physically connect the machines using a LAN cable before setting up the
cluster.
I urgently need to clarify this so I can start my work.
Regards
Janani


Hadoop cluster setup

2010-02-03 Thread janani venkat
Hi
I'm a beginner working with Hadoop. I want to know if we have to
physically connect the machines using a LAN cable before setting up the
cluster.
I urgently need to clarify this so I can start my work.
Regards
Janani


Example for using DistributedCache class

2010-02-03 Thread Udaya Lakshmi
Hi,
   As a newbie to hadoop, I am not able to figure out how to use
DistributedCache class. Can someone give me a small piece of code which
distributes a file to the cluster and then shows how to open and use the
file in the map or reduce task.
Thanks,
Udaya


Re: aws

2010-02-03 Thread 松柳
I also know there's a way to run Hadoop on EC2 using the scripts provided in
the Hadoop package.

Here is the method.

http://wiki.apache.org/hadoop/AmazonEC2

2010/2/3 Brian Wolf 

> Now there's a deal! Thanks
>
> Sirota, Peter wrote:
>
>> Hi Brian,
>>
>> AWS has Elastic MapReduce service where you can run Hadoop starting at 10
>> cents per hour.  Check it out at
>> http://aws.amazon.com/elasticmapreduce
>>
>>
>> Disclaimer: I work at AWS
>>
>>
>> Sent from my phone
>>
>> On Feb 2, 2010, at 11:09 PM, "Brian Wolf"  wrote:
>>
>>
>>
>>> Hi,
>>>
>>> Can anybody tell me if AWS/Amazon has any kind of Hadoop sandbox
>>> to play in for free?
>>>
>>> Thanks
>>>
>>> Brian
>>>
>>>
>>>
>>>
>>
>


Re: [RFH][Announce] hadoop on its way into Debian

2010-02-03 Thread stephen mulcahy

Thomas Koch wrote:

Will this depend on sun java since that's the only java recommended by
the Hadoop team?
It depends on a Debian-specific meta package, java6-runtime, which by default
resolves to openjdk-6. But everybody is free to install sun-java instead,
which also provides java6-runtime.
If there were a hard dependency on sun-java, then hadoop could not enter
the main repository of Debian, since sun-java is not free.


This makes sense - thanks for your efforts on this.

-stephen

--
Stephen Mulcahy, DI2, Digital Enterprise Research Institute,
NUI Galway, IDA Business Park, Lower Dangan, Galway, Ireland
http://di2.deri.iehttp://webstar.deri.iehttp://sindice.com


RE: Yahoo! presents First India Hadoop Summit on Feb 28, 2010 in Bangalore

2010-02-03 Thread Preeti Priyadarshini
Happy to announce that registration is open for India Hadoop Summit, Bangalore. 
Below is the URL of the event:

http://www.cloudcamp.org/Bangalore  (venue details will be updated soon)

Please publicize it through your blogs and on Twitter
(#indiahadoopsummit2010).


From: Preeti Priyadarshini
Sent: Tuesday, January 26, 2010 2:31 PM
To: 'common-...@hadoop.apache.org'; 'common-user@hadoop.apache.org'
Subject: Yahoo! presents First India Hadoop Summit on Feb 28, 2010 in Bangalore


Yahoo! India invites you to join the first India Hadoop Summit on Feb 28, 2010 
in Bangalore.

This day-long event is going to be co-hosted with CloudCamp Bangalore 2010. 
Hadoop will be a dedicated track in this session.

In this event you will find participants from the Yahoo! India Hadoop team,
industry experts, and leading universities. This event brings together leaders
from the Hadoop developer and user communities.

Speakers will cover a rich variety of topics, including the current state of
Hadoop development and deployment, Pig, performance optimization of Hadoop
clusters, testing in Hadoop, real-world case studies, and Hadoop in academic
research.

Please watch out for more updates including registration and day's agenda!!

Thanks & Regards
_
Preeti Priyadarshini | Program Manager | Grid Computing
Yahoo! India R&D, Torrey Pines, EGL Park, Bangalore-560071
Direct: + 91-80-30774957
Email: pree...@yahoo-inc.com | Web 
www.yahoo.com
Board: + 91-80-3077, Fax: +91-80-30774455