Re: OutOfMemoryError of PIG job (UDF loads big file)

2010-02-23 Thread Ankur C. Goel
Yeah! Wait till you stumble across the need to adjust shuffle/reduce buffers, 
reuse JVMs, sort factor, copier threads .
:-)

On 2/23/10 8:06 AM, jiang licht licht_ji...@yahoo.com wrote:

Thanks Jeff. I also just found this one and solved my problem. BTW, so many 
settings to play with :)


Michael

--- On Mon, 2/22/10, Jeff Zhang zjf...@gmail.com wrote:

From: Jeff Zhang zjf...@gmail.com
Subject: Re: OutOfMemoryError of PIG job (UDF loads big file)
To: common-user@hadoop.apache.org
Date: Monday, February 22, 2010, 8:13 PM

Hi Jiang,

you should set the property *mapred.child.java.opts* in mapred-site.xml to
increase the memory,
as follows:

 <property>
   <name>mapred.child.java.opts</name>
   <value>-Xmx1024m</value>
 </property>

 and then restart your hadoop cluster
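
As a minimal sketch (not part of the original exchange), the same property can also be
set per job from driver code instead of cluster-wide in mapred-site.xml; the class name
below is illustrative:

import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class ChildHeapDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ChildHeapDriver.class);
        // Same knob as in mapred-site.xml, but scoped to this job only:
        // each map/reduce child JVM gets a 1 GB heap.
        conf.set("mapred.child.java.opts", "-Xmx1024m");
        // ... set mapper, reducer, input and output paths here ...
        JobClient.runJob(conf);
    }
}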




On Tue, Feb 23, 2010 at 9:43 AM, jiang licht licht_ji...@yahoo.com wrote:

 I am running a hadoop job written in PIG. It fails with an out-of-memory error
 because a UDF function consumes a lot of memory: it loads a big file. What
 are the settings to avoid the following OutOfMemoryError? I guess simply
 giving PIG a big heap (java -XmxBIGmemory org.apache.pig.Main ...) won't
 work.

 Error message ---

 java.lang.OutOfMemoryError: Java heap space
at java.util.regex.Pattern.compile(Pattern.java:1451)
at java.util.regex.Pattern.<init>(Pattern.java:1133)
at java.util.regex.Pattern.compile(Pattern.java:823)
at java.lang.String.split(String.java:2293)
at java.lang.String.split(String.java:2335)
at UDF.load(Unknown Source)
at UDF.load(Unknown Source)
at UDF.exec(Unknown Source)
at UDF.exec(Unknown Source)
at
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:201)
at
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:287)
at
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:278)
at
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
at
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
at
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:240)
at
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:249)
at
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
at
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
at org.apache.hadoop.mapred.Child.main(Child.java:155)

 Thanks!
 Michael








--
Best Regards

Jeff Zhang







Re: why not zookeeper for the namenode

2010-02-23 Thread Eli Collins
 From what I read, I thought that BookKeeper would be the ideal enhancement
 for the namenode, to make it distributed and therefore finally highly available.

Being distributed doesn't imply high availability. Availability is
about minimizing downtime. For example, a primary that can fail over
to a secondary (and back) may be more available than a distributed
system that needs to be restarted when its software or dependencies
are upgraded. A distributed system that can only handle x ops/second
may be less available than a non-distributed system that can handle 2x
ops/second. A large distributed system may be less available than n
smaller systems, depending on consistency requirements. Implementation
and management complexity often result in more downtime. Etc.
Decreasing the time it takes to restart and upgrade an HDFS cluster
would significantly improve its availability for many users (there
are jiras for these).

 - Why hasn't zookeeper(-bookkeeper) been chosen?

Persisting NN metadata over a set of servers is only part of the
problem. You might be interested in checking out HDFS-976.

Thanks,
Eli


Re: why not zookeeper for the namenode

2010-02-23 Thread Steve Loughran

E. Sammer wrote:

On 2/22/10 12:53 PM, zlatin.balev...@barclayscapital.com wrote:
My 2 cents: If the NN stores all state behind a javax.Cache façade it 
will be possible to use all kinds of products (commercial, open 
source, facades to ZK) for redundancy, load balancing, etc.


This would be pretty interesting from a deployment / scale point of view 
in that many jcache providers support flushing to disk and the concept 
of distribution and near / far data. Now, all of that said, this also 
removes some certainty from the name node contract, namely:


If jcache was used we:
 - couldn't promise data would be in memory (both a plus and a minus).
 - couldn't promise data is on the same machine.
 - can't make guarantees about consistency (flushes, PITR features, etc.).

It may be too general an abstraction layer for this type of application. 
It's a good avenue to explore, just playing devil's advocate.


Regards.


I know nothing about HA, though I work with people who do. They normally 
start their conversations with "Steve, you idiot, you don't understand ...", 
because HA is all or nothing: either you are HA or you aren't. 
It's also somewhere you need to reach for the mathematics to prove it works 
in theory, then test in the field in interesting situations. Even then, 
they have horror stories.


We all know that NN's have limits, but most of those limits are known 
and documented:

 -doesn't like full disks
 -a corrupted editlog doesn't replay
 -if the 2ary NN isn't live, the NN will stay up (It's been discussed 
having it somehow react if there isn't any secondary around and you say 
you require one)


There's also an open issue that may be related to UTF-8 translation that 
is confusing restarts, under discussion in -hdfs right now.


What is best is this: everyone has the same code base, one that is 
tested at scale.


If you switch to multiple back ends, then nobody other than the people 
who have the same back end as you will be able to replicate the problem. 
You lose the "Yahoo! and Facebook run on petabyte-sized clusters, so my 
40 TB is noise" argument, replacing it with "Yahoo! and Facebook run on 
one cache back end; I use something else and am on my own when it fails."


I don't want to go there
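
As an illustration only (an editor's sketch, not something that exists in HDFS), the
cache-facade idea discussed above reduces to an interface like the following; every
name here is hypothetical:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical facade: the NN would read and write namespace metadata through
// this interface, so an in-memory map, a JCache provider, or a ZooKeeper-backed
// store could be plugged in behind it.
interface MetadataStore<K, V> {
    V get(K key);
    void put(K key, V value);
    void remove(K key);
}

// Trivial in-memory implementation, roughly what the NN effectively does today.
class InMemoryMetadataStore<K, V> implements MetadataStore<K, V> {
    private final ConcurrentMap<K, V> map = new ConcurrentHashMap<K, V>();
    public V get(K key) { return map.get(key); }
    public void put(K key, V value) { map.put(key, value); }
    public void remove(K key) { map.remove(key); }
}

The objections in the thread apply to exactly this seam: once the implementation behind
the interface varies per site, behaviour and failure modes vary with it.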




Re: why not zookeeper for the namenode

2010-02-23 Thread Steve Loughran

Eli Collins wrote:

From what I read, I thought that BookKeeper would be the ideal enhancement
for the namenode, to make it distributed and therefore finally highly available.


Being distributed doesn't imply high availability. Availability is
about minimizing downtime. For example, a primary that can fail over
to a secondary (and back) may be more available than a distributed
system that needs to be restarted when its software or dependencies
are upgraded. A distributed system that can only handle x ops/second
may be less available than a non-distributed system that can handle 2x
ops/second. A large distributed system may be less available than n
smaller systems, depending on consistency requirements. Implementation
and management complexity often result in more downtime. Etc.
Decreasing the time it takes to restart and upgrade an HDFS cluster
would significantly improve its availability for many users (there
are jiras for these).



There's another kind of availability: engineering availability. What we have 
today is nice in that the HDFS engineering skills are distributed among 
a number of companies, the source is there for anyone else to learn, and 
rebuilding is fairly straightforward.


Don't dismiss that as unimportant. Engineering availability means that 
if you discover a new problem, you have the ability to patch your copy 
of the code, and keep going, while filing a bug report for others to 
deal with. That significantly reduces your downtime on a software 
problem compared to filing a bug report and hoping that a future release 
will have addressed it.


-steve



How are intermediate key/value pairs materialized between map and reduce?

2010-02-23 Thread Tim Kiefer

Hi there,

can anybody help me out on a (most likely) simple unclarity.

I am wondering how intermediate key/value pairs are materialized. I have 
a job where the map phase produces 600,000 records and map output bytes 
is ~300GB. What I thought (up to now) is that these 600,000 records, 
i.e., 300GB, are materialized locally by the mappers and that later on 
reducers pull these records (based on the key).
What I see (and cannot explain) is that the FILE_BYTES_WRITTEN counter 
is as high as ~900GB.


So - where does the factor 3 come from between Map output bytes and 
FILE_BYTES_WRITTEN??? I thought about the replication factor of 3 in the 
file system - but that should be HDFS only?!


Thanks
- tim


MultipleOutputs and new 20.1 API

2010-02-23 Thread Chris White
Does the code for MultipleOutputs work (as described in the Javadocs) with 
the new 20.1 API? This class expects a JobConf in most of its methods, 
which is deprecated in 20.1. If not, are there any plans to update 
MultipleOutputs to work with the new API?
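
For context, here is a hedged sketch (not from the thread) of how MultipleOutputs is
typically used with the old org.apache.hadoop.mapred API in 0.20, which is why its
methods take a JobConf; the named output "side" is a placeholder:

import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

public class SideOutputReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    private MultipleOutputs mos;

    @Override
    public void configure(JobConf job) {
        // The named output must have been registered at job setup time, e.g.:
        // MultipleOutputs.addNamedOutput(job, "side", TextOutputFormat.class,
        //                                Text.class, IntWritable.class);
        mos = new MultipleOutputs(job);
    }

    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
        // Also write the record to the extra "side" output file.
        mos.getCollector("side", reporter).collect(key, new IntWritable(sum));
    }

    @Override
    public void close() throws IOException {
        mos.close();
    }
}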


Re: Job Tracker questions

2010-02-23 Thread Mark N
I have to read counters, and I am using an InetSocketAddress and connecting
to the JobTracker JSP.
With its help I am able to read the counters when running in pseudo-distributed mode,

but when I try to run this on a cluster, I am not able to get any
counters. Are there any parameters that I need to change in the
configuration to read the counter information?


Thanks in advance


On Tue, Feb 9, 2010 at 8:49 PM, Jeff Zhang zjf...@gmail.com wrote:

 JobClient also use proxy of JobTracker.


 On Mon, Feb 8, 2010 at 11:19 PM, Mark N nipen.m...@gmail.com wrote:

  Did you check the jobClient source code?
 
 
  On Thu, Feb 4, 2010 at 5:21 PM, Jeff Zhang zjf...@gmail.com wrote:
 
   I look at the source code, it seems the job tracker web ui also use the
   proxy of JobTracker to get the counter information rather the xml file.
  
  
   On Thu, Feb 4, 2010 at 7:29 PM, Mark N nipen.m...@gmail.com wrote:
  
yes we can create a webservice in java which would be called by .net
 to
display these counters.
   
But since the java code to read these counters needs use hadoop APIs
  (
   job
client  ) ,  am not sure we can create a webservice to read the
  counters
   
Question is how does the default hadoop task tracker display counter
information in JSP pages ? does it read from the XML files ?
   
thanks,
   
On Thu, Feb 4, 2010 at 5:08 PM, Jeff Zhang zjf...@gmail.com wrote:
   
 I think you can create web service using Java, and then in .net
 using
   the
 web service to display the result.


 On Thu, Feb 4, 2010 at 7:21 PM, Jeff Zhang zjf...@gmail.com
 wrote:

  Do you mean want to connect the JobTracker using .Net ? If so,
 I'm
afraid
 I
  have no idea how to this. The rpc of hadoop is language
 dependent.
 
 
 
 
  On Thu, Feb 4, 2010 at 7:18 PM, Mark N nipen.m...@gmail.com
  wrote:
 
  could you please elaborate on this  ( * hint to get started  as
 am
very
  new
  to hadoop? )
  So far I could succesfully read all the default and custom
  counters.
 
  Currently we are having a .net client.
 
  thanks in advance.
 
 
  On Thu, Feb 4, 2010 at 4:53 PM, Jeff Zhang zjf...@gmail.com
   wrote:
 
   Well, you can create a proxy of JobTracker in client side, and
   then
 you
  can
   use the API of JobTracker to get the information of jobs. The
   Proxy
 take
   the
   responsibility of  communication with the Master Node.  Read
 the
 source
   code
   of JobClient can help you.
  
  
   On Thu, Feb 4, 2010 at 6:59 PM, Mark N nipen.m...@gmail.com
wrote:
  
Ye currently am using jobclient to read these counters.
   
But We are not able to use *webservices *because the jar
 which
   is
 used
  to
read the counters from  running hadoop job  is itself a
 Hadoop
 program
   
If we could have pure Java Api which is run without hadoop
   command
  then
   we
could return the counter variable into webservices and show
 in
   UI.
   
Any help  or technique to show thsese counters in the UI
 would
   be
appreciated  ( not necessarily using web service )
   
   
I am using webservices because I am having .net VB client
   
thanks
   
   
   
On Wed, Feb 3, 2010 at 8:33 PM, Jeff Zhang 
 zjf...@gmail.com
 wrote:
   
 I think you can use JobClient to get the counters in your
  web
  service.
 If you look at the shell script bin/hadoop, you will find
  that
  actually
 this
 shell use the JobClient to get the counters.



 On Wed, Feb 3, 2010 at 4:34 AM, Mark N 
  nipen.m...@gmail.com
  wrote:

  We have a hadoop job running and have used custom
 counters
   to
  track
 few
  counters ( like no of successfully processed documents
matching
   certain
  conditions)
 
 
  Since we need to get this counters even while the Hadoop
  job
is
   running
,
  we
  wrote another Java program to read these counters
 
 
  *Counter reader  program *will do the following :
 
 
  1)  List all the running jobs.
 
  2)   Get the running job using Job name
 
  2) Get all the counter for individual running jobs
 
  3)  Set this counters in variables.
 We could successfully read these counters  , but
   since
we
  need
to
  show these counters to custom UI , how can we show these
 counters?
 
 we looked into various options to read these
  counters
to
  show
   in
 UI
  as following :
 
   1. Dump these counters to 

Re: How are intermediate key/value pairs materialized between map and reduce?

2010-02-23 Thread Gang Luo
Hi Tim,
the intermediate data is materialized to the local file system. Before it is 
available for reducers, mappers will sort it. If the buffer (io.sort.mb) is 
too small for the intermediate data, multi-phase sorting happens, which means 
you read and write the same bytes more than once. 

Besides, are you looking at the statistics per mapper through the job tracker, 
or just the information output when a job finishes? If you look at the 
information given out at the end of the job, note that this is an overall 
statistic which includes sorting on the reduce side. It also includes the amount of 
data written to HDFS (I am not 100% sure).

And, FILE_BYTES_WRITTEN has nothing to do with the replication factor. I 
think if you change the factor to 6, FILE_BYTES_WRITTEN will still be the same.

 -Gang
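
As a minimal sketch (not from the thread) of the knobs Gang mentions, both can be set
per job from Java; the values shown are only examples:

import org.apache.hadoop.mapred.JobConf;

public class SortBufferExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf(SortBufferExample.class);
        // Size (in MB) of the in-memory buffer map output is collected into;
        // when it fills up, the map task spills a sorted run to local disk.
        conf.setInt("io.sort.mb", 200);
        // How many spill files are merged at once; with many spills and a small
        // factor, the merge needs several passes, re-reading and re-writing the
        // same intermediate bytes (which inflates FILE_BYTES_WRITTEN).
        conf.setInt("io.sort.factor", 25);
    }
}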


----- Original Message -----
From: Tim Kiefer tim-kie...@gmx.de
To: common-user@hadoop.apache.org common-user@hadoop.apache.org
Sent: 2010/2/23 (Tue) 6:44:28 AM
Subject: How are intermediate key/value pairs materialized between map and reduce?

Hi there,

can anybody help me out on a (most likely) simple unclarity.

I am wondering how intermediate key/value pairs are materialized. I have a job 
where the map phase produces 600,000 records and map output bytes is ~300GB. 
What I thought (up to now) is that these 600,000 records, i.e., 300GB, are 
materialized locally by the mappers and that later on reducers pull these 
records (based on the key).
What I see (and cannot explain) is that the FILE_BYTES_WRITTEN counter is as 
high as ~900GB.

So - where does the factor 3 come from between Map output bytes and 
FILE_BYTES_WRITTEN??? I thought about the replication factor of 3 in the file 
system - but that should be HDFS only?!

Thanks
- tim





Re: How are intermediate key/value pairs materialized between map and reduce?

2010-02-23 Thread Tim Kiefer
Hi Gang,

thanks for your reply.

To clarify: I look at the statistics through the job tracker. In the
web interface for my job I have columns for map, reduce and total. What I
was referring to is the map column - i.e., I see FILE_BYTES_WRITTEN = 3 * Map
Output Bytes in the map column.

About the replication factor: I would expect the exact same thing -
changing to 6 has no influence on FILE_BYTES_WRITTEN.

About the sorting: I have io.sort.mb = 100 and io.sort.factor = 10.
Furthermore, I have 40 mappers and map output data is ~300GB. I can't
see how that ends up in a factor 3?

- tim

Am 23.02.2010 14:39, schrieb Gang Luo:
 Hi Tim,
 the intermediate data is materialized to local file system. Before it is 
 available for reducers, mappers will sort them. If the buffer (io.sort.mb) is 
 too small for the intermediate data, multi-phase sorting happen, which means 
 you read and write the same bit more than one time. 

 Besides, are you looking at the statistics per mapper through the job 
 tracker, or just the information output when a job finish? If you look at the 
 information given out at the end of the job, note that this is an overall 
 statistics which include sorting at reduce side. It also include the amount 
 of data written to HDFS (I am not 100% sure).

 And, the FILE-BYTES_WRITTEN has nothing to do with the replication factor. I 
 think if you change the factor to 6, FILE_BYTES_WRITTEN is still the same.

  -Gang


 ----- Original Message -----
 From: Tim Kiefer tim-kie...@gmx.de
 To: common-user@hadoop.apache.org common-user@hadoop.apache.org
 Sent: 2010/2/23 (Tue) 6:44:28 AM
 Subject: How are intermediate key/value pairs materialized between map and 
 reduce?

 Hi there,

 can anybody help me out on a (most likely) simple unclarity.

 I am wondering how intermediate key/value pairs are materialized. I have a 
 job where the map phase produces 600,000 records and map output bytes is 
 ~300GB. What I thought (up to now) is that these 600,000 records, i.e., 
 300GB, are materialized locally by the mappers and that later on reducers 
 pull these records (based on the key).
 What I see (and cannot explain) is that the FILE_BYTES_WRITTEN counter is as 
 high as ~900GB.

 So - where does the factor 3 come from between Map output bytes and 
 FILE_BYTES_WRITTEN??? I thought about the replication factor of 3 in the file 
 system - but that should be HDFS only?!

 Thanks
 - tim





Re: Wrong FS

2010-02-23 Thread Edson Ramiro
Thanks Marc and Bill

I solved this Wrong FS problem by editing /etc/hosts as Marc said.

Now, the cluster is working ok  : ]

master01
127.0.0.1 localhost.localdomain localhost
10.0.0.101 master01
10.0.0.102 master02
10.0.0.200 slave00
10.0.0.201 slave01

master02
127.0.0.1 localhost.localdomain localhost
10.0.0.101 master01
10.0.0.102 master02
10.0.0.200 slave00
10.0.0.201 slave01

slave00
127.0.0.1 slave00 localhost.localdomain localhost
10.0.0.101 master01
10.0.0.102 master02
10.0.0.201 slave01

slave01
127.0.0.1 slave01 localhost.localdomain localhost
10.0.0.101 master01
10.0.0.102 master02
10.0.0.200 slave00


Edson Ramiro


On 22 February 2010 17:56, Marc Farnum Rendino mvg...@gmail.com wrote:

 Perhaps an /etc/hosts file is sufficient.

 However, FWIW, I didn't get it working til I moved to using all the real
 FQDNs.

 - Marc



Re: Native libraries on OS X

2010-02-23 Thread Derek Brown
Allen,
You mean compiling the native library? I did ant compile-native. My code
to read/write the files was built in Eclipse, but the same problem occurs
with testsequencefile in hadoop-0.21-test.jar.

What would you recommend regarding how to build, or tweak or regenerate the
configure scripts?

Thanks,
Derek



From: Allen Wittenauer awittena...@linkedin.com
To: common-user@hadoop.apache.org
Date: Mon, 22 Feb 2010 15:00:04 -0800
Subject: Re: Native libraries on OS X

Did you compile this outside of ant?

If so, you probably got bit by the fact that the configure scripts assume
that the bitness level is shared via an environment variable.

[I really need to finish cleaning up my configure script patches. :( ]


Re: How are intermediate key/value pairs materialized between map and reduce?

2010-02-23 Thread Ed Mazur
Hi Tim,

I'm guessing a lot of these writes are happening on the reduce side.
On the JT web interface, there are three columns: map, reduce,
overall. Is the 900GB figure from the overall column? The value in the
map column will probably be closer to what you were expecting. There
are writes on the reduce side too during the shuffle and multi-pass
merge.

Ed

2010/2/23 Tim Kiefer tim-kie...@gmx.de:
 Hi Gang,

 thanks for your reply.

 To clarify: I look at the statistics through the job tracker. In the
 webinterface for my job I have columns for map, reduce and total. What I
 was refering to is map - i.e. I see FILE_BYTES_WRITTEN = 3 * Map
 Output Bytes in the map column.

 About the replication factor: I would expect the exact same thing -
 changing to 6 has no influence on FILE_BYTES_WRITTEN.

 About the sorting: I have io.sort.mb = 100 and io.sort.factor = 10.
 Furthermore, I have 40 mappers and map output data is ~300GB. I can't
 see how that ends up in a factor 3?

 - tim

 Am 23.02.2010 14:39, schrieb Gang Luo:
 Hi Tim,
 the intermediate data is materialized to local file system. Before it is 
 available for reducers, mappers will sort them. If the buffer (io.sort.mb) 
 is too small for the intermediate data, multi-phase sorting happen, which 
 means you read and write the same bit more than one time.

 Besides, are you looking at the statistics per mapper through the job 
 tracker, or just the information output when a job finish? If you look at 
 the information given out at the end of the job, note that this is an 
 overall statistics which include sorting at reduce side. It also include the 
 amount of data written to HDFS (I am not 100% sure).

 And, the FILE-BYTES_WRITTEN has nothing to do with the replication factor. I 
 think if you change the factor to 6, FILE_BYTES_WRITTEN is still the same.

  -Gang


 ----- Original Message -----
 From: Tim Kiefer tim-kie...@gmx.de
 To: common-user@hadoop.apache.org common-user@hadoop.apache.org
 Sent: 2010/2/23 (Tue) 6:44:28 AM
 Subject: How are intermediate key/value pairs materialized between map and 
 reduce?

 Hi there,

 can anybody help me out on a (most likely) simple unclarity.

 I am wondering how intermediate key/value pairs are materialized. I have a 
 job where the map phase produces 600,000 records and map output bytes is 
 ~300GB. What I thought (up to now) is that these 600,000 records, i.e., 
 300GB, are materialized locally by the mappers and that later on reducers 
 pull these records (based on the key).
 What I see (and cannot explain) is that the FILE_BYTES_WRITTEN counter is as 
 high as ~900GB.

 So - where does the factor 3 come from between Map output bytes and 
 FILE_BYTES_WRITTEN??? I thought about the replication factor of 3 in the 
 file system - but that should be HDFS only?!

 Thanks
 - tim







Re: How are intermediate key/value pairs materialized between map and reduce?

2010-02-23 Thread Tim Kiefer
No... 900GB is in the map column. Reduce adds another ~70GB of
FILE_BYTES_WRITTEN and the total column consequently shows ~970GB.

Am 23.02.2010 16:11, schrieb Ed Mazur:
 Hi Tim,

 I'm guessing a lot of these writes are happening on the reduce side.
 On the JT web interface, there are three columns: map, reduce,
 overall. Is the 900GB figure from the overall column? The value in the
 map column will probably be closer to what you were expecting. There
 are writes on the reduce side too during the shuffle and multi-pass
 merge.

 Ed

 2010/2/23 Tim Kiefer tim-kie...@gmx.de:
   
 Hi Gang,

 thanks for your reply.

 To clarify: I look at the statistics through the job tracker. In the
 webinterface for my job I have columns for map, reduce and total. What I
 was refering to is map - i.e. I see FILE_BYTES_WRITTEN = 3 * Map
 Output Bytes in the map column.

 About the replication factor: I would expect the exact same thing -
 changing to 6 has no influence on FILE_BYTES_WRITTEN.

 About the sorting: I have io.sort.mb = 100 and io.sort.factor = 10.
 Furthermore, I have 40 mappers and map output data is ~300GB. I can't
 see how that ends up in a factor 3?

 - tim

 Am 23.02.2010 14:39, schrieb Gang Luo:
 
 Hi Tim,
 the intermediate data is materialized to local file system. Before it is 
 available for reducers, mappers will sort them. If the buffer (io.sort.mb) 
 is too small for the intermediate data, multi-phase sorting happen, which 
 means you read and write the same bit more than one time.

 Besides, are you looking at the statistics per mapper through the job 
 tracker, or just the information output when a job finish? If you look at 
 the information given out at the end of the job, note that this is an 
 overall statistics which include sorting at reduce side. It also include 
 the amount of data written to HDFS (I am not 100% sure).

 And, the FILE-BYTES_WRITTEN has nothing to do with the replication factor. 
 I think if you change the factor to 6, FILE_BYTES_WRITTEN is still the same.

  -Gang


 ----- Original Message -----
 From: Tim Kiefer tim-kie...@gmx.de
 To: common-user@hadoop.apache.org common-user@hadoop.apache.org
 Sent: 2010/2/23 (Tue) 6:44:28 AM
 Subject: How are intermediate key/value pairs materialized between map and 
 reduce?

 Hi there,

 can anybody help me out on a (most likely) simple unclarity.

 I am wondering how intermediate key/value pairs are materialized. I have a 
 job where the map phase produces 600,000 records and map output bytes is 
 ~300GB. What I thought (up to now) is that these 600,000 records, i.e., 
 300GB, are materialized locally by the mappers and that later on reducers 
 pull these records (based on the key).
 What I see (and cannot explain) is that the FILE_BYTES_WRITTEN counter is 
 as high as ~900GB.

 So - where does the factor 3 come from between Map output bytes and 
 FILE_BYTES_WRITTEN??? I thought about the replication factor of 3 in the 
 file system - but that should be HDFS only?!

 Thanks
 - tim




   
 


Re: New or old Map/Reduce API ?

2010-02-23 Thread Patrick Angeles
Hello Kai,

To answer your questions:

- Most of the missing stuff from the new API are convenience classes --
InputFormats, OutputFormats, etc. One very handy class that is missing from
the new API is MultipleOutputs which allows you to write multiple files in a
single pass.
- You cannot mix classes from the old API and the new API in the same
MapReduce job.
- However, you can run different jobs that use either the old or new API in
the same cluster, and their data inputs and outputs should be compatible.
- Figuring out how to do things right in the new versus the old API is
really a matter of having a template MapReduce job (e.g., WordCount) for
each API. The surrounding classes are different, but the way you write your
map and reduce functions is largely the same.
- Porting old code to the new API is not hard (especially with an IDE), but
it can get tedious. As an exercise, you could try porting WordCount from the
old API to the new one.
- It will probably take more than a year to fully remove the old APIs.
Today, the old API is marked as 'deprecated' which is really a misnomer
since the new API doesn't provide a full alternative to the old.

My recommendation would be to learn and use both. Personally, I find the new
API to be cleaner and more 'Java-like'. For someone just picking this stuff
up, the old APIs might be easier as they correlate exactly to the examples
in print.

Regards,

- Patrick
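
As an illustration of the porting exercise Patrick mentions (an editor's sketch, not
from the thread), this is what a WordCount-style mapper looks like in the new
org.apache.hadoop.mapreduce API:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// New-API mapper: a single Context object replaces the old
// OutputCollector/Reporter pair, and Mapper is a class, not an interface.
public class TokenCounterMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
        }
    }
}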

On Tue, Feb 23, 2010 at 10:00 AM, Kai Londenberg londenb...@nurago.com wrote:

 Hi ..

 I'm currently trying to get information for a decision on which version of
 the Map/Reduce API to use for our Map/Reduce Jobs.  We have existing
 (production) code that uses the old API and some new code (non-production)
 that uses the new API, and we have some developers who will definitely not
 have much time to dig into Hadoop Sources to figure out how to do things
 right in the new API (instead of being able to look it up in a book), so the
 state of documentation does matter.

 So far I got the following information:


 -  The old API is deprecated and will be removed, but that will
 probably take at least a year

 -  The new API does not provide really new functionality

 -  There's library and contrib. code that has not been ported to
 the new API yet

 -  Most of the existing thorough documentation (like Hadoop: the
 definitive Guide) covers the old API

 -  Porting to the new API will probably become easier in future
 versions of Hadoop when more of the lib code and docs have been ported

 So, what are your experiences with new vs. old API ?  Would you recommend
 to switch to the new API right now, or wait for a later release ?  Is it
 problematic to have applications using old and new API side by side ? How
 hard is it currently  to port old code to the new API ?

 If these questions have been covered by some other thread already, please
 point me to it. I could not find much of a discussion browsing the mailing
 list archives, though.

 Thanks in advance for any advice you can give,


 Kai Londenberg

 . . . . . . . . . . . . . . . . . . . . . . . .
 Software Developer

 nurago GmbH
 applied research technologies
 Kurt-Schumacher-Str. 24 . 30159 Hannover Tel. +49 511 213 866 . 0 Fax +49
 511 213 866 . 22

 londenb...@nurago.com . www.nurago.com

 Geschäftsführer: Thomas Knauer
 Amtsgericht Hannover: HRB 201817
 UID (Vat)-No: DE 2540 787 09




Custom InputFormat: LineRecordReader.LineReader reads 0 bytes

2010-02-23 Thread Alexey Tigarev
Hi All!

I am implementing a custom InputFormat.
Its custom RecordReader uses LineRecordReader.LineReader inside.

In some cases its read() method returns 0, i.e., it reads 0 bytes. This
happens also in a unit test where it reads from a regular file on a UNIX
filesystem.
What does it mean, and how should I handle or avoid it?

After examining the sources I can see that zero count of bytes read
comes from the value returned by FSDataInputStream.read().

In which situations can this happen? -1 would mean end of file; 0 can
mean that no data was available (but end of file was still not reached) -
I cannot understand how this can happen in the case of a local file.

Please give me advice :)

Regards,
Alexey Tigarev
ti...@nlp.od.ua Jabber: ti...@jabber.od.ua Skype: t__gra

How a programmer can become a freelancer and earn the first $1000 on oDesk:
http://freelance-start.com/earn-first-1000-on-odesk
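
[Editor's sketch, not from the thread] A return value of 0 from read(byte[], int, int)
only means that no bytes were copied on that call (typically because a zero-length read
was requested); only -1 signals end of stream. A defensive wrapper can simply retry
until it makes progress:

import java.io.IOException;
import java.io.InputStream;

public final class ReadUtil {

    private ReadUtil() {}

    // Keeps calling read() until at least one byte is read or EOF (-1) is hit.
    // Works for any InputStream, including FSDataInputStream.
    public static int readAtLeastOne(InputStream in, byte[] buf, int off, int len)
            throws IOException {
        if (len == 0) {
            return 0; // nothing requested, nothing read
        }
        int n;
        do {
            n = in.read(buf, off, len);
        } while (n == 0);
        return n; // number of bytes read, or -1 at end of stream
    }
}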


Re: java.io.IOException: Spill failed when using w/ GzipCodec for Map output

2010-02-23 Thread jiang licht
Thanks, Amogh. Good to know :)


Michael

--- On Tue, 2/23/10, Amogh Vasekar am...@yahoo-inc.com wrote:

From: Amogh Vasekar am...@yahoo-inc.com
Subject: Re: java.io.IOException: Spill failed when using w/ GzipCodec for Map 
output
To: common-user@hadoop.apache.org common-user@hadoop.apache.org
Date: Tuesday, February 23, 2010, 1:45 AM

Hi,
Certainly this might not be the cause of the issue. But:
the Hadoop native library is supported only on *nix platforms. Unfortunately
it is known not to work on Cygwin and Mac OS X and has mainly been used on
the GNU/Linux platform.

http://hadoop.apache.org/common/docs/current/native_libraries.html#Supported+Platforms

The mapper log would throw more light on this

Amogh
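
As a quick, hedged check (an editor's sketch, not part of the thread), whether the
native Hadoop library was picked up at all can be tested from Java:

import org.apache.hadoop.util.NativeCodeLoader;

public class NativeCheck {
    public static void main(String[] args) {
        // True only if libhadoop was found on java.library.path;
        // without it, native zlib/gzip compression support is unavailable.
        System.out.println("native hadoop loaded: "
                + NativeCodeLoader.isNativeCodeLoaded());
    }
}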


On 2/23/10 11:41 AM, jiang licht licht_ji...@yahoo.com wrote:

Thanks Amogh. The platform on which I got this error is Mac OS X with Hadoop 0.20.1. 
All native libraries are installed except lzo (which would report that the codec was not 
found). But I didn't see this error when I ran the same thing without the compression 
setting specified; in addition, I also ran something with the same compression setting on 
Fedora 8 and 0.19.1 without any problem. So, I think it might depend on some 
other settings (wrt what spill is about).

Thanks,

Michael

--- On Mon, 2/22/10, Amogh Vasekar am...@yahoo-inc.com wrote:

From: Amogh Vasekar am...@yahoo-inc.com
Subject: Re: java.io.IOException: Spill failed when using w/ GzipCodec for Map 
output
To: common-user@hadoop.apache.org common-user@hadoop.apache.org
Date: Monday, February 22, 2010, 11:27 PM

Hi,
Can you please let us know what platform you are running on your hadoop 
machines?
For gzip and lzo to work, you need supported hadoop native libraries ( I 
remember reading on this somewhere in hadoop wiki :) )

Amogh


On 2/23/10 8:16 AM, jiang licht licht_ji...@yahoo.com wrote:

I have a pig script. If I don't set any codec for map output for the hadoop 
cluster, there is no problem. Now I made the following compression settings, and the job 
failed; the error message is shown below. I guess there are some other 
settings that should be correctly set together with using the compression. I'm 
using 0.20.1. Any thoughts? Thanks for your help!

mapred-site.xml
        <property>
                <name>mapred.compress.map.output</name>
                <value>true</value>
        </property>
        <property>
                <name>mapred.map.output.compression.codec</name>
                <value>org.apache.hadoop.io.compress.GzipCodec</value>
        </property>

error message of failed map task---

java.io.IOException: Spill failed
        at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:822)
        at 
org.apache.hadoop.mapred.MapTask$OldOutputCollector.collect(MapTask.java:466)
        at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.collect(PigMapReduce.java:108)
        at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:251)
        at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
        at 
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:358)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)
Caused by: java.lang.NullPointerException
        at org.apache.hadoop.mapred.IFile$Writer.<init>(IFile.java:102)
        at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.sortAndSpill(MapTask.java:1198)
        at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer.access$1800(MapTask.java:648)
        at 
org.apache.hadoop.mapred.MapTask$MapOutputBuffer$SpillThread.run(MapTask.java:1135)



Thanks,

Michael












  

Re: OutOfMemoryError of PIG job (UDF loads big file)

2010-02-23 Thread jiang licht
Hmm, I'm wondering if there are case studies regarding how people handle 
memory-related issues posted somewhere as good references?

Thanks,

Michael

--- On Tue, 2/23/10, Ankur C. Goel gan...@yahoo-inc.com wrote:

From: Ankur C. Goel gan...@yahoo-inc.com
Subject: Re: OutOfMemoryError of PIG job (UDF loads big file)
To: common-user@hadoop.apache.org common-user@hadoop.apache.org
Date: Tuesday, February 23, 2010, 2:11 AM

Yeah! Wait till you stumble across the need to adjust shuffle/reduce buffers, 
reuse JVMs, sort factor, copier threads .
:-)

On 2/23/10 8:06 AM, jiang licht licht_ji...@yahoo.com wrote:

Thanks Jeff. I also just found this one and solved my problem. BTW, so many 
settings to play with :)


Michael

--- On Mon, 2/22/10, Jeff Zhang zjf...@gmail.com wrote:

From: Jeff Zhang zjf...@gmail.com
Subject: Re: OutOfMemoryError of PIG job (UDF loads big file)
To: common-user@hadoop.apache.org
Date: Monday, February 22, 2010, 8:13 PM

Hi Jiang,

you should set the property *mapred.child.java.opts* in mapred-site.xml to
increase the memory,
as follows:

 <property>
         <name>mapred.child.java.opts</name>
         <value>-Xmx1024m</value>
 </property>

 and then restart your hadoop cluster




On Tue, Feb 23, 2010 at 9:43 AM, jiang licht licht_ji...@yahoo.com wrote:

 I am running a hadoop job written in PIG. It fails from out of memory
 because a UDF function consumes a lot of memory, it loads a big file. What
 are the settings to avoid the following OutOfMemoryError? I guess by simply
 giving PIG big memory (java -XmxBIGmemory org.apache.pig.Main ...) won't
 work.

 Error message ---

 java.lang.OutOfMemoryError: Java heap space
        at java.util.regex.Pattern.compile(Pattern.java:1451)
        at java.util.regex.Pattern.<init>(Pattern.java:1133)
        at java.util.regex.Pattern.compile(Pattern.java:823)
        at java.lang.String.split(String.java:2293)
        at java.lang.String.split(String.java:2335)
        at UDF.load(Unknown Source)
        at UDF.load(Unknown Source)
        at UDF.exec(Unknown Source)
        at UDF.exec(Unknown Source)
        at
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:201)
        at
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc.getNext(POUserFunc.java:287)
        at
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:278)
        at
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
        at
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
        at
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:240)
        at
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:249)
        at
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
        at
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:332)
        at org.apache.hadoop.mapred.Child.main(Child.java:155)

 Thanks!
 Michael








--
Best Regards

Jeff Zhang








  

JobConf.setJobEndNotificationURI

2010-02-23 Thread Ted Yu
Hi,
I am looking for counterpart to JobConf.setJobEndNotificationURI() in
org.apache.hadoop.mapreduce

Please advise.

Thanks
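
[Editor's note, hedged] In 0.20 the old-API setter just writes a plain configuration
property, so the usual workaround with the new API is to set that property on the job's
Configuration directly. The property name below is what JobConf.setJobEndNotificationURI
appears to set in 0.20 and should be verified against your version:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class EndNotificationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed property name; $jobId and $jobStatus are substituted by the
        // framework when the notification URL is called. Verify before relying on it.
        conf.set("job.end.notification.url",
                 "http://example.com/notify?jobid=$jobId&status=$jobStatus");
        Job job = new Job(conf, "end-notification-example");
        // ... configure mapper/reducer, input and output, then job.waitForCompletion(true)
    }
}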


Re: Native libraries on OS X

2010-02-23 Thread Allen Wittenauer

Hmm. Weird then.

Maybe try doing the configure straight up then, setting JVM_DATA_MODEL=64
manually.  The base problem you are having is that -m64 isn't getting set to
build a 64-bit library.




On 2/23/10 7:06 AM, Derek Brown de...@media6degrees.com wrote:

 Allen,
 You mean compiling the native library? I did ant compile-native. My code
 to read/write the files was built in Eclipse, but the same problem occurs
 with testsequencefile in hadoop-0.21-test.jar.
 
 What would you recommend regarding how to build, or tweak or regenerate the
 configure scripts?
 
 Thanks,
 Derek
 
 
 
 From: Allen Wittenauer awittena...@linkedin.com
 To: common-user@hadoop.apache.org
 Date: Mon, 22 Feb 2010 15:00:04 -0800
 Subject: Re: Native libraries on OS X
 
 Did you compile this outside of ant?
 
 If so, you probably got bit by the fact that the configure scripts assume
 that the bitness level is shared via an environment variable.
 
 [I really need to finish cleaning up my configure script patches. :( ]



Re: On CDH2, (Cloudera EC2) No valid local directories in property: mapred.local.dir

2010-02-23 Thread Todd Lipcon
Hi Saptarshi,

Can you please ssh into the JobTracker node and check that this
directory is mounted, writable by the hadoop user, and not full?

-Todd

On Fri, Feb 19, 2010 at 2:13 PM, Saptarshi Guha
saptarshi.g...@gmail.com wrote:
 Hello,
 Not sure if i should post this here or on Cloudera's message board,
 but here goes.
 When I run EC2 using the latest CDH2 and Hadoop 0.20 (by setting the
 env variables as per hadoop-ec2),
 and launch a job
 hadoop jar ...

 I get the following error


 10/02/19 17:04:55 WARN mapred.JobClient: Use GenericOptionsParser for
 parsing the arguments. Applications should implement Tool for the
 same.
 org.apache.hadoop.ipc.RemoteException: java.io.IOException: No valid
 local directories in property: mapred.local.dir
        at 
 org.apache.hadoop.conf.Configuration.getLocalPath(Configuration.java:975)
        at org.apache.hadoop.mapred.JobConf.getLocalPath(JobConf.java:279)
        at 
 org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:256)
        at 
 org.apache.hadoop.mapred.JobInProgress.<init>(JobInProgress.java:240)
        at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3026)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:966)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:962)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:960)

        at org.apache.hadoop.ipc.Client.call(Client.java:740)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220)
        at org.apache.hadoop.mapred.$Proxy0.submitJob(Unknown Source)
        at 
 org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:841)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)

        at org.godhuli.f.RHMR.submitAndMonitorJob(RHMR.java:195)

 but the value  of mapred.local.dir is /mnt/hadoop/mapred/local

 Any ideas?



Is it possible to run multiple mapreduce jobs from within the same application

2010-02-23 Thread Raymond Jennings III
In other words: I have a situation where I want to feed the output from the 
first iteration of my mapreduce job to a second iteration and so on. I have a 
for loop in my main method to set up the job parameters and run it through 
all iterations, but on about the third run the Hadoop processes lose their 
association with the 'jps' command and then weird things start happening. I 
remember reading somewhere about chaining - is that what is needed? I'm not 
sure what causes jps to not report the hadoop processes even though they are 
still active, as can be seen with the ps command. Thanks. (This is on 
version 0.20.1)
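
[Editor's sketch, not from the thread] The usual pattern for this kind of iteration with
the 0.20 new API is one Job object and one output directory per pass, submitted with
waitForCompletion so each pass finishes before the next starts; the paths and the loop
bound are placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class IterativeDriver {
    public static void main(String[] args) throws Exception {
        String input = args[0];       // initial input directory
        String outputBase = args[1];  // base directory for per-iteration output
        for (int i = 0; i < 3; i++) {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "iteration-" + i);
            job.setJarByClass(IterativeDriver.class);
            // set mapper, reducer and key/value classes here
            FileInputFormat.addInputPath(job, new Path(input));
            String output = outputBase + "/iter" + i;
            FileOutputFormat.setOutputPath(job, new Path(output));
            // Block until this pass has finished before launching the next one.
            if (!job.waitForCompletion(true)) {
                System.exit(1);
            }
            input = output;           // the next pass reads this pass's output
        }
    }
}

This does not explain the jps symptom by itself, but it keeps each iteration a separate,
fully completed job, which is the chaining idea mentioned above.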


  


CDH2 or Apache Hadoop

2010-02-23 Thread Ananth Sarathy
Just wanted to get the group's general feeling on what the preferred distro
is and why? Obviously assuming one didn't have a service agreement with
Cloudera.


Ananth T Sarathy


applying MAPREDUCE-318 to 0.20.1

2010-02-23 Thread Massoud Mazar
Is there a patch that I can apply to my 0.20.1 cluster that fixes the issue 
reported by Scott Carey:

Comment out or delete the line:
break; //we have a map from this host
in ReduceOutput.java in ReduceCopier.fetchOutputs()
- line 1804 in 0.19.2-dev, 1911 on 0.20.1-dev and line 1954 currently on trunk.
Thanks



Re: Many child processes dont exit

2010-02-23 Thread Zheng Lv
Thank you Jason, your reply is helpful.

2010/2/23 Jason Venner jason.had...@gmail.com

  Someone is using a thread pool that does not have daemon
  threads, and that is not shut down before the main method returns.
  The non-daemon threads prevent the JVM from exiting.
  We had this problem for a while and modified Child.main to exit,
  rather than trying to work out and fix the third-party library that
  ran the thread pool. This technique does of course prevent JVM reuse.
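
[Editor's sketch, not from the thread] The daemon-thread fix Jason describes amounts to
giving the pool a ThreadFactory that marks its threads as daemons, so they cannot keep
the child JVM alive:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;

public class DaemonPool {
    // Threads created by this factory are daemons, so the JVM can exit when
    // main() returns even if the pool was never shut down explicitly.
    public static ExecutorService newDaemonPool(int size, final String name) {
        return Executors.newFixedThreadPool(size, new ThreadFactory() {
            private int count = 0;
            public synchronized Thread newThread(Runnable r) {
                Thread t = new Thread(r, name + "-" + count++);
                t.setDaemon(true);
                return t;
            }
        });
    }
}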

 On Sat, Feb 20, 2010 at 6:49 AM, Ted Yu yuzhih...@gmail.com wrote:
  Do you have System.exit() as the last line in your main() ?
 Job job = createSubmittableJob(conf, otherArgs);
 System.exit(job.waitForCompletion(true)? 0 : 1);
 
 
  On Sat, Feb 20, 2010 at 12:32 AM, Zheng Lv lvzheng19800...@gmail.com
 wrote:
 
  Hello Ted,
   Yes. Every hour a job will be created and started, and when it
 finished,
  it will maintain. The logs looks like normal, do you know what can lead
 to
  this happen?Thank you.
 LvZheng
  2010/2/20 Ted Yu yuzhih...@gmail.com
 
   Did the number of child processes increase over time ?
  
   On Friday, February 19, 2010, Zheng Lv lvzheng19800...@gmail.com
  wrote:
Hello Edson,
  Thank you for your reply. I don't want to kill them, I want to
 know
  why
these child processes don't exit, and to know how to make them exit
successfully when they finish. Any ideas? Thank you.
LvZheng
   
2010/2/18 Edson Ramiro erlfi...@gmail.com
   
Do you want to kill them ?
   
if yes, you can use
   
 ./bin/slaves.sh pkill java
   
but it will kill the datanode and tasktracker processes
in all slaves and you'll need to start these processes again.
   
Edson Ramiro
   
   
On 14 February 2010 22:09, Zheng Lv lvzheng19800...@gmail.com
  wrote:
   
 any idea?

 2010/2/11 Zheng Lv lvzheng19800...@gmail.com

  Hello Everyone,
We often find many child processes in datanodes, which have
   already
  finished for long time. And following are the jstack log:
Full thread dump Java HotSpot(TM) 64-Bit Server VM (14.3-b01
  mixed
 mode):
  DestroyJavaVM prio=10 tid=0x2aaac8019800 nid=0x2422
 waiting
  on
  condition [0x]
 java.lang.Thread.State: RUNNABLE
  NioProcessor-31 prio=10 tid=0x439fa000 nid=0x2826
  runnable
  [0x4100a000]
 java.lang.Thread.State: RUNNABLE
  at sun.nio.ch.EPollArrayWrapper.epollWait(Native
 Method)
  at
sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:215)
  at
 sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:65)
  at
sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
  - locked 0x2aaab9b5f6f8 (a sun.nio.ch.Util$1)
  - locked 0x2aaab9b5f710 (a
  java.util.Collections$UnmodifiableSet)
  - locked 0x2aaab9b5f680 (a
   sun.nio.ch.EPollSelectorImpl)
  at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
  at
 

   
  
 
 org.apache.mina.transport.socket.nio.NioProcessor.select(NioProcessor.java:65)
  at
 

   
  
 
 org.apache.mina.common.AbstractPollingIoProcessor$Worker.run(AbstractPollingIoProcessor.java:672)
  at
 

   
  
 
 org.apache.mina.util.NamePreservingRunnable.run(NamePreservingRunnable.java:51)
  at
 

   
  
 
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
  at
 

   
  
 
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
  at java.lang.Thread.run(Thread.java:619)
  pool-15-thread-1 prio=10 tid=0x2aaac802d000 nid=0x2825
  waiting
   on
  condition [0x41604000]
 java.lang.Thread.State: WAITING (parking)
  at sun.misc.Unsafe.park(Native Method)
  - parking to wait for  0x2aaab9b61620 (a
 
   java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
  at
 
 java.util.concurrent.locks.LockSupport.park(LockSupport.java:158)
  at
 

   
  
 
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1925)
  at
 

   
  
 
 java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:358)
  at
 

   
  
 
 java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:947)
  at
 

java.util.concurrent.ThreadPoolExecutor$Worker.run(Threa
  
 
 



 --
 Pro Hadoop, a book to guide you from beginner to hadoop mastery,
 http://www.amazon.com/dp/1430219424?tag=jewlerymall
 www.prohadoopbook.com a community for Hadoop Professionals



Re: Job Tracker questions

2010-02-23 Thread Jeff Zhang
This is sample code for getting the counters of one specified job (I have
tested it on my cluster). What you need to change is the jobtracker address
and jobID. Remember to put this class in package org.apache.hadoop.mapred,
because JobSubmissionProtocol is not public.

package org.apache.hadoop.mapred;

import java.io.IOException;
import java.net.InetSocketAddress;
import java.util.Iterator;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.ipc.RPC;
import org.apache.hadoop.mapred.Counters.Counter;
import org.apache.hadoop.mapred.Counters.Group;

public class MyServer {

    public static void main(String[] args) throws IOException {
        InetSocketAddress address = new InetSocketAddress("sha-cs-04", 9001);
        JobSubmissionProtocol jobClient = (JobSubmissionProtocol) RPC.getProxy(
                JobSubmissionProtocol.class,
                JobSubmissionProtocol.versionID,
                address, new Configuration());
        JobID jobID = new JobID("201002111947", 460);
        Counters counters = jobClient.getJobCounters(jobID);
        Iterator<Group> iter = counters.iterator();
        while (iter.hasNext()) {
            Group group = iter.next();
            System.out.println(group.getDisplayName());
            Iterator<Counter> cIter = group.iterator();
            while (cIter.hasNext()) {
                Counter counter = cIter.next();
                System.out.println("\t" + counter.getName() + ":" + counter.getValue());
            }
        }
    }
}

On Tue, Feb 23, 2010 at 9:38 PM, Mark N nipen.m...@gmail.com wrote:

 i have to find Counter and i am using inetsocketAddress and connecting
 into the jobtracker jsp
 with help of it am able to find the counter when it is running on pseudo
 mode

 but as i am trying to run this in cluster . i am not able to get any
 counters . SI there any parameters that I need to change in
 configuration to read the counter information ?


 Thanks in advance


 On Tue, Feb 9, 2010 at 8:49 PM, Jeff Zhang zjf...@gmail.com wrote:

  JobClient also use proxy of JobTracker.
 
 
  On Mon, Feb 8, 2010 at 11:19 PM, Mark N nipen.m...@gmail.com wrote:
 
   Did you check the jobClient source code?
  
  
   On Thu, Feb 4, 2010 at 5:21 PM, Jeff Zhang zjf...@gmail.com wrote:
  
I look at the source code, it seems the job tracker web ui also use
 the
proxy of JobTracker to get the counter information rather the xml
 file.
   
   
On Thu, Feb 4, 2010 at 7:29 PM, Mark N nipen.m...@gmail.com wrote:
   
 yes we can create a webservice in java which would be called by
 .net
  to
 display these counters.

 But since the java code to read these counters needs use hadoop
 APIs
   (
job
 client  ) ,  am not sure we can create a webservice to read the
   counters

 Question is how does the default hadoop task tracker display
 counter
 information in JSP pages ? does it read from the XML files ?

 thanks,

 On Thu, Feb 4, 2010 at 5:08 PM, Jeff Zhang zjf...@gmail.com
 wrote:

  I think you can create web service using Java, and then in .net
  using
the
  web service to display the result.
 
 
  On Thu, Feb 4, 2010 at 7:21 PM, Jeff Zhang zjf...@gmail.com
  wrote:
 
   Do you mean want to connect the JobTracker using .Net ? If so,
  I'm
 afraid
  I
   have no idea how to this. The rpc of hadoop is language
  dependent.
  
  
  
  
   On Thu, Feb 4, 2010 at 7:18 PM, Mark N nipen.m...@gmail.com
   wrote:
  
   could you please elaborate on this  ( * hint to get started
  as
  am
 very
   new
   to hadoop? )
   So far I could succesfully read all the default and custom
   counters.
  
   Currently we are having a .net client.
  
   thanks in advance.
  
  
   On Thu, Feb 4, 2010 at 4:53 PM, Jeff Zhang zjf...@gmail.com
wrote:
  
Well, you can create a proxy of JobTracker in client side,
 and
then
  you
   can
use the API of JobTracker to get the information of jobs.
 The
Proxy
  take
the
responsibility of  communication with the Master Node.  Read
  the
  source
code
of JobClient can help you.
   
   
On Thu, Feb 4, 2010 at 6:59 PM, Mark N 
 nipen.m...@gmail.com
 wrote:
   
 Ye currently am using jobclient to read these counters.

 But We are not able to use *webservices *because the jar
  which
is
  used
   to
 read the counters from  running hadoop job  is itself a
  Hadoop
  program

 If we could have pure Java Api which is run without hadoop
command
   then
we
 could return the counter variable into webservices and
 show
  in
UI.

 Any help  or technique to show thsese counters in the UI
  would
be
 appreciated  ( not necessarily using web service )


 I am using webservices because I am 

Re: MultipleOutputs and new 20.1 API

2010-02-23 Thread Amareshwari Sri Ramadasu
MultipleOutputs is ported to use the new API through 
http://issues.apache.org/jira/browse/MAPREDUCE-370. This change was done in 
branch 0.21.

Thanks
Amareshwari

On 2/23/10 5:42 PM, Chris White chriswhite...@googlemail.com wrote:

Does the code for MutipleOutputs work (as described in the Javadocs) for
the new 20.1 API? This object expects a JobConf to most of its methods,
which is deprecated in 20.1? If not then is there any plans to update
MultipleOutputs to work with the new API?



Re: How are intermediate key/value pairs materialized between map and reduce?

2010-02-23 Thread Amogh Vasekar
Hi,
Can you let us know what is the value for :
Map input records
Map spilled records
Map output bytes
Is there any side effect file written?

Thanks,
Amogh


On 2/23/10 8:57 PM, Tim Kiefer tim-kie...@gmx.de wrote:

No... 900GB is in the map column. Reduce adds another ~70GB of
FILE_BYTES_WRITTEN and the total column consequently shows ~970GB.

Am 23.02.2010 16:11, schrieb Ed Mazur:
 Hi Tim,

 I'm guessing a lot of these writes are happening on the reduce side.
 On the JT web interface, there are three columns: map, reduce,
 overall. Is the 900GB figure from the overall column? The value in the
 map column will probably be closer to what you were expecting. There
 are writes on the reduce side too during the shuffle and multi-pass
 merge.

 Ed

 2010/2/23 Tim Kiefer tim-kie...@gmx.de:

 Hi Gang,

 thanks for your reply.

 To clarify: I look at the statistics through the job tracker. In the
 webinterface for my job I have columns for map, reduce and total. What I
 was refering to is map - i.e. I see FILE_BYTES_WRITTEN = 3 * Map
 Output Bytes in the map column.

 About the replication factor: I would expect the exact same thing -
 changing to 6 has no influence on FILE_BYTES_WRITTEN.

 About the sorting: I have io.sort.mb = 100 and io.sort.factor = 10.
 Furthermore, I have 40 mappers and map output data is ~300GB. I can't
 see how that ends up in a factor 3?

 - tim

 Am 23.02.2010 14:39, schrieb Gang Luo:

 Hi Tim,
 the intermediate data is materialized to local file system. Before it is 
 available for reducers, mappers will sort them. If the buffer (io.sort.mb) 
 is too small for the intermediate data, multi-phase sorting happen, which 
 means you read and write the same bit more than one time.

 Besides, are you looking at the statistics per mapper through the job 
 tracker, or just the information output when a job finish? If you look at 
 the information given out at the end of the job, note that this is an 
 overall statistics which include sorting at reduce side. It also include 
 the amount of data written to HDFS (I am not 100% sure).

 And, the FILE-BYTES_WRITTEN has nothing to do with the replication factor. 
 I think if you change the factor to 6, FILE_BYTES_WRITTEN is still the same.

  -Gang


 Hi there,

 can anybody help me out on a (most likely) simple unclarity.

 I am wondering how intermediate key/value pairs are materialized. I have a 
 job where the map phase produces 600,000 records and map output bytes is 
 ~300GB. What I thought (up to now) is that these 600,000 records, i.e., 
 300GB, are materialized locally by the mappers and that later on reducers 
 pull these records (based on the key).
 What I see (and cannot explain) is that the FILE_BYTES_WRITTEN counter is 
 as high as ~900GB.

 So - where does the factor 3 come from between Map output bytes and 
 FILE_BYTES_WRITTEN??? I thought about the replication factor of 3 in the 
 file system - but that should be HDFS only?!

 Thanks
 - tim






Re: CDH2 or Apache Hadoop

2010-02-23 Thread Dali Kilani
I've had a great experience with CDH2 on various platforms (Ubuntu,
OpenSolaris). Worked as advertised.
My 2 cents.

On Tue, Feb 23, 2010 at 3:13 PM, Ananth Sarathy
ananth.t.sara...@gmail.com wrote:

 Just wanted to get the groups general feelings on what the preferred distro
 is and why? Obviously assuming one didn't have a service agreement with
 cloudera.


 Ananth T Sarathy




-- 
Dali Kilani
===
Twitter :  @dadicool
Phone :  (650) 492-5921 (Google Voice)
E-Fax  :  (775) 552-2982


Re: CDH2 or Apache Hadoop

2010-02-23 Thread Zak Stone
I have been very grateful for the Cloudera Hadoop distributions. They
are well-documented and current, and the Cloudera folks are quick to
answer questions both in the issue-tracking system for their customers
and on the public mailing lists.

Zak


On Tue, Feb 23, 2010 at 11:59 PM, Eric Sammer e...@lifeless.net wrote:
 On 2/23/10 6:13 PM, Ananth Sarathy wrote:
 Just wanted to get the groups general feelings on what the preferred distro
 is and why? Obviously assuming one didn't have a service agreement with
 cloudera.

 Ananth:

 CDH pros:

 - Nice init scripts.
 - Integration with the alternatives management system.
 - Nice back ported patches.
 - Yum / Apt repository integration / easy updates.
 - Frequent bug fixes released.

 CDH cons:

 - Filing bugs with ASF is a bit weird because you may have slightly
 different code due to patches. This can be mitigated by filing bugs with
 Cloudera and letting them go upstream.

 I'm not entirely neutral, but CDH has little in the way of cons to me.

 Hope that helps.
 --
 Eric Sammer
 e...@lifeless.net
 http://esammer.blogspot.com



Re: How are intermediate key/value pairs materialized between map and reduce?

2010-02-23 Thread Tim Kiefer

Sure,
I see:
Map input records: 10,000
Map output records: 600,000
Map output bytes: 307,216,800,000 (each record is about 500 KB - that 
fits the application and is to be expected)


Map spilled records: 1,802,965 (ahhh... now that you ask for it - here 
there also is a factor of 3 between output and spilled).


So - the question now is: why are three times as many records spilled as are 
actually produced by the mappers?


In my map function, I do not perform any additional file writing besides 
the context.write() for the intermediate records.


Thanks, Tim
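
[Editor's note, a back-of-the-envelope check using only the numbers given in this
thread] 307,216,800,000 bytes over 40 mappers is roughly 7.7 GB of map output per
mapper. With io.sort.mb = 100 each mapper produces on the order of a hundred spill
files, and with io.sort.factor = 10 those spills are first merged ten at a time into
intermediate files, which are then merged once more into the single final map output
file. Each byte is therefore written about three times (initial spill, intermediate
merge, final merge), which would account for both FILE_BYTES_WRITTEN being about
3 x map output bytes and spilled records being about 3 x map output records.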

Am 24.02.2010 05:28, schrieb Amogh Vasekar:

Hi,
Can you let us know what is the value for :
Map input records
Map spilled records
Map output bytes
Is there any side effect file written?

Thanks,
Amogh


On 2/23/10 8:57 PM, Tim Kiefertim-kie...@gmx.de  wrote:

No... 900GB is in the map column. Reduce adds another ~70GB of
FILE_BYTES_WRITTEN and the total column consequently shows ~970GB.

Am 23.02.2010 16:11, schrieb Ed Mazur:
   

Hi Tim,

I'm guessing a lot of these writes are happening on the reduce side.
On the JT web interface, there are three columns: map, reduce,
overall. Is the 900GB figure from the overall column? The value in the
map column will probably be closer to what you were expecting. There
are writes on the reduce side too during the shuffle and multi-pass
merge.

Ed

2010/2/23 Tim Kiefertim-kie...@gmx.de:

 

Hi Gang,

thanks for your reply.

To clarify: I look at the statistics through the job tracker. In the
webinterface for my job I have columns for map, reduce and total. What I
was refering to is map - i.e. I see FILE_BYTES_WRITTEN = 3 * Map
Output Bytes in the map column.

About the replication factor: I would expect the exact same thing -
changing to 6 has no influence on FILE_BYTES_WRITTEN.

About the sorting: I have io.sort.mb = 100 and io.sort.factor = 10.
Furthermore, I have 40 mappers and map output data is ~300GB. I can't
see how that ends up in a factor 3?

- tim

Am 23.02.2010 14:39, schrieb Gang Luo:

   

Hi Tim,
the intermediate data is materialized to local file system. Before it is 
available for reducers, mappers will sort them. If the buffer (io.sort.mb) is 
too small for the intermediate data, multi-phase sorting happen, which means 
you read and write the same bit more than one time.

Besides, are you looking at the statistics per mapper through the job tracker, 
or just the information output when a job finish? If you look at the 
information given out at the end of the job, note that this is an overall 
statistics which include sorting at reduce side. It also include the amount of 
data written to HDFS (I am not 100% sure).

And, the FILE-BYTES_WRITTEN has nothing to do with the replication factor. I 
think if you change the factor to 6, FILE_BYTES_WRITTEN is still the same.

  -Gang


Hi there,

can anybody help me out on a (most likely) simple unclarity.

I am wondering how intermediate key/value pairs are materialized. I have a job 
where the map phase produces 600,000 records and map output bytes is ~300GB. 
What I thought (up to now) is that these 600,000 records, i.e., 300GB, are 
materialized locally by the mappers and that later on reducers pull these 
records (based on the key).
What I see (and cannot explain) is that the FILE_BYTES_WRITTEN counter is as 
high as ~900GB.

So - where does the factor 3 come from between Map output bytes and 
FILE_BYTES_WRITTEN??? I thought about the replication factor of 3 in the file 
system - but that should be HDFS only?!

Thanks
- tim