Re: Re: Inverse of a matrix using Map - Reduce

2010-02-04 Thread Tran Son
Hi,
As far as I know, matrix inversion needs many iterative passes, which Hadoop
MapReduce does not support well. Hadoop MapReduce works well with block
algorithms, especially for simple operations such as addition, transposition
and possibly multiplication. For inversion, however, I have not yet found an
algorithm that supports blocking, i.e. working on small parts of the matrix and
combining them into the final result. There are several algorithms such as
Gaussian elimination (as you said) or Csanky's, but I think you will need a more
complex implementation with ChainMapper/ChainReducer and/or multiple chained
jobs.
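
For what it is worth, here is a rough sketch of what the ChainMapper/ChainReducer
wiring looks like with the old mapred API (org.apache.hadoop.mapred.lib). AMap,
BMap and XReduce are placeholder classes, not anything from this thread, and note
this still gives only one map/reduce pass per job, so the outer iterations still
have to be driven by multiple chained jobs:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

public class ChainedJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(ChainedJob.class);
    conf.setJobName("chained-step");
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Map phase: AMap then BMap run back to back on every record.
    ChainMapper.addMapper(conf, AMap.class, LongWritable.class, Text.class,
        Text.class, Text.class, true, new JobConf(false));
    ChainMapper.addMapper(conf, BMap.class, Text.class, Text.class,
        Text.class, Text.class, true, new JobConf(false));

    // Reduce phase: a single reducer; further mappers could follow it
    // via ChainReducer.addMapper().
    ChainReducer.setReducer(conf, XReduce.class, Text.class, Text.class,
        Text.class, Text.class, true, new JobConf(false));

    JobClient.runJob(conf);
  }
}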
However, I don't think that is very effective or convenient. So I am developing
another version of MapReduce which supports staging of reducers: 1 job = Mapper
Reducer*. I have tested it with Csanky's algorithm and it works quite well, but
I am still improving the scheduling mechanism.






From: aa...@buffalo.edu aa...@buffalo.edu
To: common-user@hadoop.apache.org; aa...@buffalo.edu; Ganesh Swami 
gan...@iamganesh.com
Sent: Thursday, February 4, 2010 3:57:39
Subject: Re: Re: Inverse of a matrix using Map - Reduce

Hi,
   Any idea how this method will scale for dense matrices? The kind of matrices I
am going to be working with are 500,000*500,000. Will this be a problem? Also,
have you used this patch?

Best Regards from Buffalo

Abhishek Agrawal

SUNY- Buffalo
(716-435-7122)

On Wed 02/03/10  1:41 AM , Ganesh Swami gan...@iamganesh.com sent:
 What about the Moore-Penrose inverse?
 
 http://en.wikipedia.org/wiki/Moore-Penrose_pseudoinverse
 
 The pseudo-inverse coincides with the regular inverse when the matrix
 is non-singular. Moreover, it can be computed using the SVD.
 
 Here's a patch for a MapReduce version of the SVD:
 https://issues.apache.org/jira/browse/MAHOUT-180
 Ganesh
 
 On Tue, Feb 2, 2010 at 10:11 PM, aa...@buffalo.edu wrote:
  Hello People,
  My name is Abhishek Agrawal. For the last few days I have been trying to
 figure out how to calculate the inverse of a matrix using Map Reduce. Matrix
 inversion has 2 common approaches: Gaussian-Jordan and the cofactor of
 transpose method. But neither of them seems well suited for Map-Reduce.
 Gaussian-Jordan involves blocking; cofactoring a matrix requires repeated
 calculation of determinants.

  Can someone give me any pointers as to how to solve this problem?
  Best Regards from Buffalo
 
  Abhishek Agrawal
 
  SUNY- Buffalo
  (716-435-7122)
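
For reference, the SVD route mentioned above is standard linear algebra (nothing
specific to the MAHOUT-180 patch): if

A = U \Sigma V^{T}

is the singular value decomposition, then the pseudo-inverse is

A^{+} = V \Sigma^{+} U^{T},

where \Sigma^{+} is obtained by transposing \Sigma and replacing each non-zero
singular value \sigma_i with 1/\sigma_i. When A is square and non-singular this
reduces to the ordinary inverse A^{-1}.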
 
 
 
 
 
 
 
 
 



Re: Job Tracker questions

2010-02-04 Thread Mark N
Yes, currently I am using JobClient to read these counters.

But we are not able to use *web services* because the jar which reads the
counters from the running Hadoop job is itself a Hadoop program.

If we had a pure Java API which could run without the hadoop command, then we
could return the counter values through a web service and show them in the UI.

Any help or technique to show these counters in the UI would be
appreciated (not necessarily using a web service).
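
For what it's worth, a rough sketch of reading counters with the plain JobClient
API (0.20 mapred API; the job name and counter group/name below are placeholders,
not from this thread):

import org.apache.hadoop.mapred.*;

public class CounterReader {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf();           // picks up *-site.xml from the classpath
    JobClient client = new JobClient(conf); // talks to the JobTracker over RPC

    // 1) list all running jobs, 2) pick the one we care about by name
    for (JobStatus status : client.jobsToComplete()) {
      RunningJob job = client.getJob(status.getJobID());
      if (job == null || !"my-job-name".equals(job.getJobName())) continue;

      // 3) read its counters (works while the job is still running)
      Counters counters = job.getCounters();
      long done = counters.findCounter("MyGroup", "ProcessedDocs").getCounter();
      System.out.println(job.getJobName() + " processed=" + done);
    }
  }
}

This is the same JobConf/JobClient pair the bin/hadoop job command uses, so it
can be wrapped in a web service as long as the Hadoop jars and the *-site.xml
files are on that service's classpath.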


I am using web services because I have a .NET VB client.

thanks



On Wed, Feb 3, 2010 at 8:33 PM, Jeff Zhang zjf...@gmail.com wrote:

 I think you can use JobClient to get the counters in your web service.
 If you look at the shell script bin/hadoop, you will find that this shell
 actually uses JobClient to get the counters.



 On Wed, Feb 3, 2010 at 4:34 AM, Mark N nipen.m...@gmail.com wrote:

  We have a Hadoop job running and have used custom counters to track a few
  counters (like the number of successfully processed documents matching
  certain conditions).

  Since we need to get these counters even while the Hadoop job is running,
  we wrote another Java program to read them.

  The *counter reader program* does the following:

  1) List all the running jobs.
  2) Get the running job using the job name.
  3) Get all the counters for that individual running job.
  4) Set these counters in variables.

  We could successfully read these counters, but since we need to show them
  in a custom UI, how can we do that? We looked into the following options:

   1. Dump the counters to a database; however, this may be overhead.
   2. Write a web service, and have the UI invoke functions from that
  service (however, since we need to run the *counter reader program* with
  the hadoop command, it might not be feasible to write a web service?).

  So the question is: can we read the counters using simple Java APIs?
  Does anyone have an idea how the default JobTracker JSP works? We wanted
  to build something similar to it.
 
  thanks
 
 
 
  --
  Nipen Mark
 



 --
 Best Regards

 Jeff Zhang




-- 
Nipen Mark


Re: Job Tracker questions

2010-02-04 Thread Jeff Zhang
Well, you can create a proxy of the JobTracker on the client side, and then you
can use the JobTracker API to get information about jobs. The proxy takes care
of the communication with the master node. Reading the source code of JobClient
will help you.






-- 
Best Regards

Jeff Zhang


Re: configuration file

2010-02-04 Thread Amogh Vasekar
Hi,
A shot in the dark, is the conf file in your classpath? If yes, are the 
parameters you are trying to override marked final?

Amogh


On 2/4/10 3:18 AM, Gang Luo lgpub...@yahoo.com.cn wrote:

Hi,
I am writing a script to run a whole bunch of jobs automatically. But the
configuration file doesn't seem to be working. I think there is something wrong
in my command.

The command in my script is like:
bin/hadoop jar myJarFile myClass -conf myConfigurationFilr.xml arg1 arg2

I use conf.get() to show the value of some parameters. But the values are not
what I define in that XML file. Is there something wrong?

Thanks.
-Gang




Re: Job Tracker questions

2010-02-04 Thread Mark N
Could you please elaborate on this (a hint to get started, as I am very new
to Hadoop)?
So far I could successfully read all the default and custom counters.

Currently we have a .NET client.

Thanks in advance.






-- 
Nipen Mark


Re: Job Tracker questions

2010-02-04 Thread Jeff Zhang
Do you mean you want to connect to the JobTracker using .NET? If so, I'm afraid
I have no idea how to do this. The RPC of Hadoop is language dependent.







-- 
Best Regards

Jeff Zhang


Re: Job Tracker questions

2010-02-04 Thread Jeff Zhang
I think you can create a web service using Java, and then have .NET call that
web service to display the result.






-- 
Best Regards

Jeff Zhang


Re: Job Tracker questions

2010-02-04 Thread Jeff Zhang
You can use org.apache.hadoop.ipc.RPC.getProxy() to initialize the proxy of the
JobTracker.
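
Very roughly, the pattern JobClient uses internally looks like the sketch below
(written from memory, not taken from this thread; note that JobSubmissionProtocol
is not a public interface in 0.20.x, so in practice client code usually goes
through JobClient instead of building this proxy itself):

// Sketch only. Classes come from org.apache.hadoop.ipc and org.apache.hadoop.mapred.
JobConf conf = new JobConf();
InetSocketAddress addr = JobTracker.getAddress(conf);   // from mapred.job.tracker

JobSubmissionProtocol jobTracker =
    (JobSubmissionProtocol) RPC.getProxy(JobSubmissionProtocol.class,
        JobSubmissionProtocol.versionID, addr, conf);

// e.g. jobTracker.jobsToComplete(), jobTracker.getJobCounters(jobId), ...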







-- 
Best Regards

Jeff Zhang


Re: Job Tracker questions

2010-02-04 Thread Mark N
Yes, we can create a web service in Java which would be called by .NET to
display these counters.

But since the Java code that reads these counters needs to use Hadoop APIs
(JobClient), I am not sure we can create a web service to read the counters.

The question is: how does the default Hadoop JobTracker display counter
information in its JSP pages? Does it read from XML files?

thanks,





-- 
Nipen Mark


Re: Job Tracker questions

2010-02-04 Thread Jeff Zhang
I looked at the source code, and it seems the JobTracker web UI also uses the
proxy of the JobTracker to get the counter information, rather than an XML file.






-- 
Best Regards

Jeff Zhang


Re: Inverse of a matrix using Map - Reduce

2010-02-04 Thread Brian Bockelman
Hey Abhishek,

Why would you want to fully invert a matrix that large?

How is it preconditioned?  What is the condition number of the matrix?

Why not just use ScaLAPACK?  It's a hairy beast, but you should definitely 
consider it.

Brian





Re: Maven and Mini MR Cluster

2010-02-04 Thread Michael Basnight
Ya, with the HADOOP_HOME stuff I was grasping at straws. My mini MR cluster has 
a valid classpath, I assume, since my entire test runs (through 3 MapReduce jobs 
via the local runner) before it gets to the mini MR cluster portion. Is it 
possible to print out the classpath through the JVMManager or anything else like 
that, for debugging purposes?

mb

On Feb 4, 2010, at 5:55 AM, Steve Loughran wrote:

 Michael Basnight wrote:
 Im using maven to run all my unit tests, and i have a unit test that creates 
 a mini mr cluster. When i create this cluster, i get classdefnotfound errors 
 for the core hadoop libs (Caused by: java.lang.ClassNotFoundException: 
 org.apache.hadoop.mapred.Child). When i run the same test w/o creating the 
 mini cluster, well.. it works fine. My HADOOP_HOME is set to the same 
 version as my mvn repo, and points to a valid installation of hadoop. When i 
 validate the classpath thru maven, (dependency:build-classpath), it says 
 that the core libs are on the classpath as well (sourced from my .m2 
 repository). I just cant figure out why hadoop's mini cluster cant find 
 those jars. Running hadoop 0.20.0. Any suggestions?
 
 the miniMR cluster does everything in memory, and doesn't look at HADOOP_HOME, 
 which is only for the shell scripts.
 
 It sounds like you need hadoop-mapreduce on your classpath. The Child class is 
 the entry point used when creating new task JVMs, and it is that classpath that 
 isn't right; the task runs in a JVM forked from the one the MiniMRCluster was 
 created in.
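
One low-tech way to see what the forked task JVMs actually get (a sketch, not
something from this thread; the class name is made up): run a trivial map that
dumps java.class.path into the task output and compare it with what the
in-process parts of the test see.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Dumps the child JVM's classpath into the task output so it shows up in the
// test logs; useful when MiniMRCluster tasks throw ClassNotFoundException.
public class ClasspathDumpMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    out.collect(new Text("java.class.path"),
        new Text(System.getProperty("java.class.path")));
  }
}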



Re: configuration file

2010-02-04 Thread Gang Luo
I give the path to that XML file in the command. Do I need to add that path to 
the classpath? I tried giving a wrong path, and no error was reported.

Aren't those parameters all configurable, like io.sort.mb, mapred.reduce.tasks, 
io.sort.factor, etc.?

Thanks.
-Gang











Re: Maven and Mini MR Cluster

2010-02-04 Thread Steve Loughran

Michael Basnight wrote:

Ya with the hadoop_home stuff i was grasping at straws. My mini MR Cluster has 
a valid classpath i assume, since my entire test runs (thru 3 mapreduce jobs 
via the localrunner) before it gets to the mini MR cluster portion. Is it 
possible to print out the classpath thru the JVMManager or anything else like 
that for debugging purposes?



probably, though I don't know what.



Re: Maven and Mini MR Cluster

2010-02-04 Thread Edward Capriolo
On Thu, Feb 4, 2010 at 12:12 PM, Steve Loughran ste...@apache.org wrote:
 Michael Basnight wrote:

 Ya with the hadoop_home stuff i was grasping at straws. My mini MR Cluster
 has a valid classpath i assume, since my entire test runs (thru 3 mapreduce
 jobs via the localrunner) before it gets to the mini MR cluster portion. Is
 it possible to print out the classpath thru the JVMManager or anything else
 like that for debugging purposes?


 probably, though I don't know what.



Normally, from a shell script, I do something like this to ensure I pick up
hadoop.jar, hadoop-test.jar, and their dependencies.
CPATH=
for f in /opt/hadoop/lib/*.jar ; do
  CPATH=${CPATH}:$f
done
for f in /opt/hadoop/*.jar ; do
  CPATH=${CPATH}:$f
done
java -cp $CPATH

If you are in the build phase, you should refer to build.xml and build-common
and try to emulate that classpath, adding what you need.


[ANNOUNCE] Katta 0.6 released

2010-02-04 Thread Johannes Zillmann
Release 0.6 of Katta is now available.
Katta - Lucene (or Hadoop Mapfiles or any content which can be split into 
shards) in the cloud.
http://katta.sourceforge.net


The key changes of the 0.6 release among dozens of bug fixes:
- upgrade lucene to 3.0
- upgrade zookeeper to  3.2.2
- upgrade hadoop to 0.20.1
- generalize katta for serving shard-able content (lucene is one 
implementation, hadoop mapfiles another one)
- basic lucene field sort capability
- more robust zookeeper session expiration handling
- throttling of shard deployment (kb/sec configurable) to have a stable search 
while deploying
- load test facility
- monitoring facility
- alpha version of web-gui


The changes from 0.6.rc1 release:
 KATTA-120, fix listIndices for wrong file paths
 KATTA-117, add command line option to print stacktrace on error
 KATTA-116, fix distribution of shards does not take currently deploying shards 
into account 
 KATTA-107, fix katta execution on cygwin 
 KATTA-112, ship build.xml in core distribution 
 KATTA-110, use a released 0.1 version of zkclient instead of the snapshot


See full list of changes at
http://oss.101tec.com/jira/secure/ReleaseNote.jspa?projectId=1styleName=Htmlversion=10010

Binary distribution is available at
https://sourceforge.net/projects/katta/

The Katta Team

Mapper Process Duration

2010-02-04 Thread Navraj S. Chohan
Hello,
I have a question about mapred.Child processes. Even though a mapper is
finished I see that the process (from ps) stays around longer than reported
on the hadoop MR webpage.
What is the mapper process doing after it has reported that it is finished?
To illustrate my question: I see that one mapper reports that it finished in 9
seconds, but from logging ps output every second, I see it lasts for 24
seconds before exiting. I essentially see this for each mapper.

Lastly, where can I find information on how exactly the MapReduce framework
reuses JVMs? The reason I'm asking is that I see that with reuse on
(mapred.job.reuse.jvm.num.tasks set to -1), the PIDs change for each new
mapper. How can this be without starting a new JVM?
Thanks!
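
For reference, a minimal sketch of how JVM reuse is switched on through the old
mapred API (this just sets the mapred.job.reuse.jvm.num.tasks property mentioned
above; 1 is the default, -1 means no limit):

JobConf conf = new JobConf();
// Reuse one task JVM for an unlimited number of tasks of the same job on a node.
conf.setNumTasksToExecutePerJvm(-1);   // equivalent to mapred.job.reuse.jvm.num.tasks=-1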

-- 
Navraj S. Chohan
nlak...@gmail.com


Re: EOFException and BadLink, but file descriptors number is ok?

2010-02-04 Thread Meng Mao
I wrote a hadoop job that checks for ulimits across the nodes, and every
node is reporting:
core file size  (blocks, -c) 0
data seg size   (kbytes, -d) unlimited
scheduling priority (-e) 0
file size   (blocks, -f) unlimited
pending signals (-i) 139264
max locked memory   (kbytes, -l) 32
max memory size (kbytes, -m) unlimited
open files  (-n) 65536
pipe size(512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority  (-r) 0
stack size  (kbytes, -s) 10240
cpu time   (seconds, -t) unlimited
max user processes  (-u) 139264
virtual memory  (kbytes, -v) unlimited
file locks  (-x) unlimited


Is anything in there telling about file number limits? From what I
understand, a high open files limit like 65536 should be enough. I estimate
only a couple thousand part-files on HDFS being written to at once, and
around 200 on the filesystem per node.
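
(For anyone wondering what such a check can look like, here is a rough sketch of
a map task that shells out to ulimit -a and emits the result keyed by hostname;
the class name is made up and this is not the exact job used here. With enough
input splits it covers every node that runs tasks.)

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.InetAddress;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

// Map-only check: every map task runs `ulimit -a` and emits the output keyed
// by the host it ran on, so you see the limits of the actual task JVMs.
public class UlimitCheckMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> out, Reporter reporter) throws IOException {
    String host = InetAddress.getLocalHost().getHostName();
    Process p = Runtime.getRuntime().exec(new String[] {"bash", "-c", "ulimit -a"});
    BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()));
    String line;
    while ((line = r.readLine()) != null) {
      out.collect(new Text(host), new Text(line));
    }
    r.close();
  }
}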

On Wed, Feb 3, 2010 at 4:04 PM, Meng Mao meng...@gmail.com wrote:

 also, which is the ulimit that's important, the one for the user who is
 running the job, or the hadoop user that owns the Hadoop processes?


 On Tue, Feb 2, 2010 at 7:29 PM, Meng Mao meng...@gmail.com wrote:

 I've been trying to run a fairly small input file (300MB) on Cloudera
 Hadoop 0.20.1. The job I'm using probably writes to on the order of over
 1000 part-files at once, across the whole grid. The grid has 33 nodes in it.
 I get the following exception in the run logs:

 10/01/30 17:24:25 INFO mapred.JobClient:  map 100% reduce 12%
 10/01/30 17:24:25 INFO mapred.JobClient: Task Id :
 attempt_201001261532_1137_r_13_0, Status : FAILED
 java.io.EOFException
 at java.io.DataInputStream.readByte(DataInputStream.java:250)
 at
 org.apache.hadoop.io.WritableUtils.readVLong(WritableUtils.java:298)
 at org.apache.hadoop.io.WritableUtils.readVInt(WritableUtils.java:319)
 at org.apache.hadoop.io.Text.readString(Text.java:400)
 at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2869)
 at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794)
 at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077)
 at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263)

 lots of EOFExceptions

 10/01/30 17:24:25 INFO mapred.JobClient: Task Id :
 attempt_201001261532_1137_r_19_0, Status : FAILED
 java.io.IOException: Bad connect ack with firstBadLink 10.2.19.1:50010
 at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.createBlockOutputStream(DFSClient.java:2871)
 at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2794)
  at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2077)
 at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2263)

 10/01/30 17:24:36 INFO mapred.JobClient:  map 100% reduce 11%
 10/01/30 17:24:42 INFO mapred.JobClient:  map 100% reduce 12%
 10/01/30 17:24:49 INFO mapred.JobClient:  map 100% reduce 13%
 10/01/30 17:24:55 INFO mapred.JobClient:  map 100% reduce 14%
 10/01/30 17:25:00 INFO mapred.JobClient:  map 100% reduce 15%

 From searching around, it seems like the most common cause of BadLink and
 EOFExceptions is when the nodes don't have enough file descriptors set. But
 across all the grid machines, the file-max has been set to 1573039.
 Furthermore, we set ulimit -n to 65536 using hadoop-env.sh.

 Where else should I be looking for what's causing this?





What framework Hadoop uses for daemonizing?

2010-02-04 Thread Stas Oskin
Hi.

Just wondering - does anyone know what framework Hadoop uses for
daemonizing?

Any chance it's jsvc from Apache?

Regards.


Re: What framework Hadoop uses for daemonizing?

2010-02-04 Thread Todd Lipcon
Hi Stas,

Hadoop doesn't daemonize itself. The shell scripts use nohup and a lot
of bash code to achieve a similar idea.
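
Roughly, the daemon scripts boil down to a pattern like this simplified sketch
(not the literal contents of hadoop-daemon.sh; LOG_DIR and PID_DIR stand in for
whatever locations you use):

# detach from the terminal, capture output, remember the PID
nohup "$HADOOP_HOME/bin/hadoop" namenode > "$LOG_DIR/hadoop-namenode.out" 2>&1 < /dev/null &
echo $! > "$PID_DIR/hadoop-namenode.pid"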

-Todd

On Thu, Feb 4, 2010 at 1:03 PM, Stas Oskin stas.os...@gmail.com wrote:
 Hi.

 Just wondering - does anyone know what framework Hadoop uses for
 daemonizing?

 Any chance it's jsvc from Apache?

 Regards.



Re: What framework Hadoop uses for daemonizing?

2010-02-04 Thread Stas Oskin
Hi Todd.

Hadoop doesn't daemonize itself. The shell scripts use nohup and a lot
 of bash code to achieve a similar idea.


Was there any design decision behind this approach?

I remember that I had to do the same, as any wrapper just caused the daemon
to run at a lower priority than it should.

Regards.


Re: What framework Hadoop uses for daemonizing?

2010-02-04 Thread Allen Wittenauer



On 2/4/10 1:21 PM, Stas Oskin stas.os...@gmail.com wrote:

 Was there any design decision behind this approach?

Likely KISS.

 I remember that I had to do the same, as any wrapper just caused the daemon
 to run in lower priority then it should.

...which is also easily dealt with from the shell and gives you the
flexibility to use OS-specific constructs.

The other big benefit is that this also means you don't need to UNdaemonize
code for those users that use something besides just pure init rc scripts.
(djbtools, smf, launchd, whatever)



Re: What framework Hadoop uses for daemonizing?

2010-02-04 Thread Todd Lipcon
On Thu, Feb 4, 2010 at 1:21 PM, Stas Oskin stas.os...@gmail.com wrote:
 Hi Todd.

 Hadoop doesn't daemonize itself. The shell scripts use nohup and a lot
 of bash code to achieve a similar idea.


 Was there any design decision behind this approach?


It long predates my involvement in the project. In fact, it predates
Hadoop itself - it got inherited from Nutch long ago.

I vaguely recall a JIRA about using jsvc for Hadoop - if you search
around I bet you can turn it up.

-Todd


Re: What framework Hadoop uses for daemonizing?

2010-02-04 Thread Stas Oskin
I actually asked this because I'm looking for a good alternative to the current
bunch of scripts and lsb-redhat dependencies I have today in my own Hadoop
client, which runs as a daemon. So I kind of hoped there is some sauce behind
Hadoop I could borrow.

While this might not be the most appropriate list, I'd appreciate it if someone
can say whether jsvc can keep the right priorities, or suggest an alternative
daemon framework.

Thanks again.


It long predates my involvement in the project. In fact, it predates
 Hadoop itself - it got inherited from Nutch long ago.

 I vaguely recall a JIRA about using jsvc for Hadoop - if you search
 around I bet you can turn it up.

 -Todd



Re: What framework Hadoop uses for daemonizing?

2010-02-04 Thread Edward Capriolo


Stas,


Daemonizing is one of those native bits Java does not do well by default. jsvc
is an option. I have never had a problem with nohup as you have, although it is
a bit hackish.

Some concepts I was considering:
1) daemontools - manages processes run in the foreground (handles restarts), no
need to daemonize
2) linux-ha - much like init scripts, but with fancy cluster management capabilities

Personally, I am pretty happy with the Cloudera LSB scripts. They are missing
'status', but ps -ef or jps deals with that.

Do you just have general problems with nohup, or have you unearthed a specific
Hadoop nohup issue?


Re: What framework Hadoop uses for daemonizing?

2010-02-04 Thread Stas Oskin
Hi Edward.

Do you just have general problems with 'nohup' or have you unearthed a
 specific hadoop nohup issue?


Just to clarify: we are talking about my own Java Hadoop connector here, not
about Hadoop itself, which works just great (with some added pepper from monit
for potential crashes).

I don't like the fact that for a simple init script I need to add the full
redhat-lsb package, which means having a lot of extra packages installed.

If the Cloudera LSB scripts are self-contained, and (most importantly) can
generate PID files, I will be happy to give them a look.

Thanks.


Re: Mapper Process Duration

2010-02-04 Thread Navraj S. Chohan
Nevermind,
I had set the reuse to the wrong value. It seems that setting the reuse to 0
acts the same way as setting it to -1.






-- 
Navraj S. Chohan
nlak...@gmail.com


Re: configuration file

2010-02-04 Thread Eric Arenas
Hi Gang,

You have to load the XML config file in your M/R code.

Something like this:
FSDataInputStream inS = fs.open(in);
conf.addResource(inS); 

 
Where conf is your Configuration.

This will in effect read all the parameters from that XML and override anything 
that you have previously set with:
conf.set(parameter,parameterValue);
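
A slightly fuller sketch of that idea (file names and parameter names are
placeholders; Configuration.addResource() also accepts a Path or a classpath
resource name directly):

// Classes come from org.apache.hadoop.conf and org.apache.hadoop.fs.
Configuration conf = new Configuration();

// Option 1: from a stream, as above (works for a file stored on HDFS too).
FileSystem fs = FileSystem.get(conf);
FSDataInputStream inS = fs.open(new Path("myjobconf.xml"));
conf.addResource(inS);

// Option 2: hand a local file Path straight to addResource.
// conf.addResource(new Path("/path/to/myjobconf.xml"));

System.out.println(conf.get("mapred.reduce.tasks"));   // sanity check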

regards,
Eric Arenas






heap memory

2010-02-04 Thread Gang Luo
Hi all,
I suppose it is only the map function that consumes the heap memory assigned to
each map task. Since the default heap size is 200 MB, I would guess most of that
memory is wasted for a simple map function (e.g. IdentityMapper).

So I tried to make use of this memory by buffering the output records, or
maintaining a large data structure in memory, but it doesn't work as I expect.
For example, I want to build a hash table over a 100 MB table in memory during
the lifetime of that map task, but it fails due to lack of heap memory. Don't I
get 200 MB of heap? What else eats my heap memory?

Thanks.
-Gang
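
One thing that commonly accounts for this (a general note, not a diagnosis of
your particular job): when the job has a reduce phase, the map-side sort buffer
(io.sort.mb, 100 MB by default) is allocated inside the same child heap, so a
good part of the 200 MB is spoken for before your own data structures. If you
need a large in-memory table you can raise the child heap and/or shrink the sort
buffer, for example:

JobConf conf = new JobConf();
conf.set("mapred.child.java.opts", "-Xmx512m");   // heap for each task JVM (default -Xmx200m)
conf.setInt("io.sort.mb", 50);                    // map-side sort buffer (default 100)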





Is it possible to write each key-value pair emitted by the reducer to a different output file

2010-02-04 Thread Udaya Lakshmi
Hi,
  I was wondering if it is possible to write each key-value pair produced by
the reduce function to a different file. How could I open a new file in the
reduce function of the reducer? I know it's possible in the configure function,
but then all the output of that reducer would go to that one file.
Thanks,
Udaya.


Re: Is it possible to write each key-value pair emitted by the reducer to a different output file

2010-02-04 Thread Amareshwari Sri Ramadasu
See MultipleOutputs at 
http://hadoop.apache.org/common/docs/r0.20.0/api/org/apache/hadoop/mapred/lib/MultipleOutputs.html

-Amareshwari
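
A rough sketch of that route with the old (0.20 mapred) API; the class and
output names below are placeholders, and note that the per-key part of the
output name must be alphanumeric:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class PerKeyReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {
  private MultipleOutputs mos;

  public void configure(JobConf job) {
    // Driver side (once, before submitting): MultipleOutputs.addMultiNamedOutput(
    //     job, "bykey", TextOutputFormat.class, Text.class, Text.class);
    mos = new MultipleOutputs(job);
  }

  @SuppressWarnings("unchecked")
  public void reduce(Text key, Iterator<Text> values,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    while (values.hasNext()) {
      // Routes output to a separate file per key (file name derived from
      // "bykey" and the key); the key part must be alphanumeric.
      mos.getCollector("bykey", key.toString(), reporter)
         .collect(key, values.next());
    }
  }

  public void close() throws IOException {
    mos.close();
  }
}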




Re: What framework Hadoop uses for daemonizing?

2010-02-04 Thread Leen Toelen
Hi,

these are the most used tools:

- JSVC: http://commons.apache.org/daemon/jsvc.html
- Java Service Wrapper: http://wrapper.tanukisoftware.org/

Windows only:
- JNA windows service: http://wrapper.tanukisoftware.org/doc/english/download.jsp
- Windows service wrapper:
http://weblogs.java.net/blog/2008/09/29/winsw-windows-service-wrapper-less-restrictive-license

Regards,
Leen




Re: EOFException and BadLink, but file descriptors number is ok?

2010-02-04 Thread Meng Mao
Not sure what else I could be checking to see where the problem lies. Should I
be looking in the datanode logs? I looked briefly in there and didn't see
anything from around the time the exceptions started getting reported.
lsof during the job execution? The number of open threads?

I'm at a loss here.
