Re: input file order

2012-04-02 Thread madhu phatak
Hi,
 Mappers run in parallel, so without a reducer it is not possible to ensure the
output sequence.
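
If you really need the output in input order, one option is to key each record by
its byte offset and force a single reducer. A rough sketch (not from this thread;
it assumes a single text input file and the new mapreduce API; with many input
files you would need a composite key of file name + offset):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OrderedPassThrough {

  // Key each line by its byte offset in the input file.
  public static class OffsetMapper
      extends Mapper<LongWritable, Text, LongWritable, Text> {
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(offset, line);
    }
  }

  // A single reducer sees keys in sorted (offset) order, restoring the input
  // sequence. Note the offset ends up as a prefix in the text output.
  public static class OrderedReducer
      extends Reducer<LongWritable, Text, LongWritable, Text> {
    protected void reduce(LongWritable offset, Iterable<Text> lines, Context ctx)
        throws IOException, InterruptedException {
      for (Text line : lines) {
        ctx.write(offset, line);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "ordered pass-through");
    job.setJarByClass(OrderedPassThrough.class);
    job.setMapperClass(OffsetMapper.class);
    job.setReducerClass(OrderedReducer.class);
    job.setNumReduceTasks(1);                 // one reducer => one globally ordered file
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}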

On Fri, Jan 20, 2012 at 2:32 AM, Mapred Learn mapred.le...@gmail.comwrote:

 This is my question too.
 What if I want output to be in same order as input without using reducers.

 Thanks,
 JJ

 Sent from my iPhone

 On Jan 19, 2012, at 12:19 PM, Ronald Petty ronald.pe...@gmail.com wrote:

  Daniel,
 
  Can you provide a concrete example of what you mean by output to be in
 an
  orderly manner?
 
  Also, what are the file sizes and types?
 
  Ron
 
  On Thu, Jan 19, 2012 at 11:19 AM, Daniel Yehdego
  dtyehd...@miners.utep.eduwrote:
 
 
  Hi,
  I have 100 .txt input files and I want my mapper output to be in an
  orderly manner. I am not using any reducer. Any idea?
 
  Regards,
 
 
 




-- 
https://github.com/zinnia-phatak-dev/Nectar


Re: Multiple linear Regression on Hadoop

2012-04-02 Thread madhu phatak
Hi,
 Nectar has already implemented Multiple Linear Regression. You can look at
the code here: https://github.com/zinnia-phatak-dev/Nectar .

On Fri, Jan 13, 2012 at 11:24 AM, Saurabh Bajaj
saurabh.ba...@mu-sigma.comwrote:

 Hi All,

 Could someone guide me how we can do a multiple linear regression on
 Hadoop.
 Mahout doesn't yet support Multiple Linear Regression.

 Saurabh Bajaj | Senior Business Analyst | +91 9986588089 |
 www.mu-sigma.comhttp://www.mu-sigma.com/ |
 ---Your problem isn't motivation, but execution - Peter Bregman---


 
 This email message may contain proprietary, private and confidential
 information. The information transmitted is intended only for the person(s)
 or entities to which it is addressed. Any review, retransmission,
 dissemination or other use of, or taking of any action in reliance upon,
 this information by persons or entities other than the intended recipient
 is prohibited and may be illegal. If you received this in error, please
 contact the sender and delete the message from your system.

 Mu Sigma takes all reasonable steps to ensure that its electronic
 communications are free from viruses. However, given Internet
 accessibility, the Company cannot accept liability for any virus introduced
 by this e-mail or any attachment and you are advised to use up-to-date
 virus checking software.




-- 
https://github.com/zinnia-phatak-dev/Nectar


Re: 0 tasktrackers in jobtracker but all datanodes present

2012-04-02 Thread madhu phatak
Hi,
1. Stop the jobtracker and tasktrackers - bin/stop-mapred.sh

2. Take the namenode out of safemode - bin/hadoop dfsadmin -safemode leave

3. Start the jobtracker and tasktrackers again - bin/start-mapred.sh

On Fri, Jan 13, 2012 at 5:20 AM, Ravi Prakash ravihad...@gmail.com wrote:

 Courtesy Kihwal and Bobby

 Have you tried increasing the max heap size with -Xmx? and make sure that
 you have swap enabled.

 On Wed, Jan 11, 2012 at 6:59 PM, Gaurav Bagga gbagg...@gmail.com wrote:

  Hi
 
  hadoop-0.19
  I have a working hadoop cluster which has been running perfectly for
  months.
  But today after restarting the cluster, at jobtracker UI its showing
 state
  INITIALIZING for a long time and is staying on the same state.
  The nodes in jobtracker are zero whereas all the nodes are present on the
  dfs.
  It says Safe mode is on.
  grep'ed on slaves and I see the tasktrackers running.
 
  In namenode logs i get the following error
 
 
  2012-01-11 16:50:57,195 WARN  ipc.Server - Out of Memory in server select
  java.lang.OutOfMemoryError: Java heap space
 at java.nio.HeapByteBuffer.init(HeapByteBuffer.java:39)
 at java.nio.ByteBuffer.allocate(ByteBuffer.java:312)
 at
  org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:804)
 at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:400)
 at org.apache.hadoop.ipc.Server$Listener.run(Server.java:309)
 
  Not sure why the cluster is not coming up
  -G
 




-- 
https://github.com/zinnia-phatak-dev/Nectar


Re: 0 tasktrackers in jobtracker but all datanodes present

2012-04-02 Thread Bejoy Ks
Gaurav
   The NN memory might have hit its upper bound. As a benchmark, for every
1 million files/blocks/directories 1GB of memory is required on the NN. The
number of files in your cluster might have grown beyond this threshold. So
the options left for you would be
- If there are a large number of small files, use HAR or SequenceFile to
group them (a rough sketch follows below)
- Increase the NN heap
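
For the small-files case, here is a rough sketch of packing a local directory of
small files into a single SequenceFile keyed by file name (the paths, argument
handling and packing logic are illustrative, not a tested tool):

import java.io.File;
import java.io.FileInputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path(args[1]);                       // e.g. /user/hadoop/packed.seq
    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
    try {
      // One record per small file: key = file name, value = file contents.
      for (File f : new File(args[0]).listFiles()) {    // local directory of small files
        byte[] data = new byte[(int) f.length()];
        FileInputStream in = new FileInputStream(f);
        try {
          IOUtils.readFully(in, data, 0, data.length);
        } finally {
          in.close();
        }
        writer.append(new Text(f.getName()), new BytesWritable(data));
      }
    } finally {
      writer.close();
    }
  }
}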

Regards
Bejoy KS

On Mon, Apr 2, 2012 at 12:08 PM, madhu phatak phatak@gmail.com wrote:

 Hi,
 1. Stop the job tracker and task trackers.  - bin/stop-mapred.sh

  2. Disable namenode safemode - bin/hadoop dfsadmin -safemode leave

 3. Start the job tracker and tasktrackers again - bin/start-mapred.sh

 On Fri, Jan 13, 2012 at 5:20 AM, Ravi Prakash ravihad...@gmail.com
 wrote:

  Courtesy Kihwal and Bobby
 
  Have you tried increasing the max heap size with -Xmx? and make sure that
  you have swap enabled.
 
  On Wed, Jan 11, 2012 at 6:59 PM, Gaurav Bagga gbagg...@gmail.com
 wrote:
 
   Hi
  
   hadoop-0.19
   I have a working hadoop cluster which has been running perfectly for
   months.
   But today after restarting the cluster, at jobtracker UI its showing
  state
   INITIALIZING for a long time and is staying on the same state.
   The nodes in jobtracker are zero whereas all the nodes are present on
 the
   dfs.
   It says Safe mode is on.
   grep'ed on slaves and I see the tasktrackers running.
  
   In namenode logs i get the following error
  
  
   2012-01-11 16:50:57,195 WARN  ipc.Server - Out of Memory in server
 select
   java.lang.OutOfMemoryError: Java heap space
  at java.nio.HeapByteBuffer.init(HeapByteBuffer.java:39)
  at java.nio.ByteBuffer.allocate(ByteBuffer.java:312)
  at
   org.apache.hadoop.ipc.Server$Connection.readAndProcess(Server.java:804)
  at org.apache.hadoop.ipc.Server$Listener.doRead(Server.java:400)
  at org.apache.hadoop.ipc.Server$Listener.run(Server.java:309)
  
   Not sure why the cluster is not coming up
   -G
  
 



 --
 https://github.com/zinnia-phatak-dev/Nectar



mapred.child.java.opts and mapreduce.reduce.java.opts

2012-04-02 Thread Juan Pino
Hello,

I have a job that requires a bit more memory than the default for the
reducer (not for the mapper).
So for this I have this property in my configuration file:

mapreduce.reduce.java.opts=-Xmx4000m

When I run the job, I can see its configuration in the web interface and I
see that indeed I have mapreduce.reduce.java.opts set to -Xmx4000m
but I also have mapred.child.java.opts set to -Xmx200m, and when I check the
java process with ps -ef, it is using -Xmx200m.

So to make my job work I had to set mapred.child.java.opts=-Xmx4000m in my
configuration file.
However I don't need that much memory for the mapper.
How can I set more memory only for the reducer? Is the only solution to set
mapred.child.java.opts to -Xmx4000m, mapreduce.reduce.java.opts to -Xmx4000m
and mapreduce.map.java.opts to -Xmx200m?

I am using hadoop 1.0.1.

Thank you very much,

Juan


Image Processing in Hadoop

2012-04-02 Thread Shreya.Pal


Hi,



Can someone point me to some info on Image processing using Hadoop?



Regards,

Shreya


This e-mail and any files transmitted with it are for the sole use of the 
intended recipient(s) and may contain confidential and privileged information.
If you are not the intended recipient, please contact the sender by reply 
e-mail and destroy all copies of the original message.
Any unauthorized review, use, disclosure, dissemination, forwarding, printing 
or copying of this email or any action taken in reliance on this e-mail is 
strictly prohibited and may be unlawful.


Re: Working with MapFiles

2012-04-02 Thread Ioan Eugen Stan

Hi Ondrej,

On 30.03.2012 14:30, Ondřej Klimpera wrote:

And one more question, is it even possible to add a MapFile (as it
consits of index and data file) to Distributed cache?
Thanks


Should be no problem, they are just two files.


On 03/30/2012 01:15 PM, Ondřej Klimpera wrote:

Hello,

I'm not sure what you mean by using map reduce setup()?

If the file is that small you could load it all in memory to avoid
network IO. Do that in the setup() method of the map reduce job.

Can you please explain little bit more?



Check the javadocs [1]: setup() is called once per task, so you can read the
file from HDFS there or perform other initializations.


[1] 
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Mapper.html 



Reading 20 MB into RAM should not be a problem and is preferred if you
need to make many requests against that data. It really depends on your
use case, so think carefully or just go ahead and test it.
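
A minimal sketch of that approach (new mapreduce API; the HDFS path and the
key<TAB>value line format are placeholders):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class InMemoryLookupMapper
    extends Mapper<LongWritable, Text, Text, Text> {

  private final Map<String, String> lookup = new HashMap<String, String>();

  @Override
  protected void setup(Context ctx) throws IOException, InterruptedException {
    // Read the whole (small) HDFS file once per task and keep it in memory.
    Path small = new Path("/user/hadoop/lookup.txt");          // placeholder path
    FileSystem fs = small.getFileSystem(ctx.getConfiguration());
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(small)));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        String[] kv = line.split("\t", 2);                     // key<TAB>value lines
        if (kv.length == 2) {
          lookup.put(kv[0], kv[1]);
        }
      }
    } finally {
      in.close();
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context ctx)
      throws IOException, InterruptedException {
    String hit = lookup.get(value.toString());
    if (hit != null) {
      ctx.write(value, new Text(hit));
    }
  }
}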




Thanks


On 03/30/2012 12:49 PM, Ioan Eugen Stan wrote:

Hello Ondrej,


On 29.03.2012 18:05, Ondřej Klimpera wrote:

Hello,

I have a MapFile as a product of MapReduce job, and what I need to
do is:

1. If MapReduce produced more than one split as output, merge them into a single
file.

2. Copy this merged MapFile to another HDFS location and use it as a
Distributed cache file for another MapReduce job.
I'm wondering if it is even possible to merge MapFiles according to
their nature and use them as Distributed cache file.


A MapFile is actually two files [1]: one SequenceFile (with sorted
keys) and a small index for that file. The map file does a version of
binary search to find your key and performs seek() to go to the byte
offset in the file.


What I'm trying to achieve is repeatedly fast search in this file
during
another MapReduce job.
If my idea is absolute wrong, can you give me any tip how to do it?

The file is supposed to be 20MB large.
I'm using Hadoop 0.20.203.


If the file is that small you could load it all in memory to avoid
network IO. Do that in the setup() method of the map reduce job.

The distributed cache will also use HDFS [2] and I don't think it
will provide you with any benefits.


Thanks for your reply:)

Ondrej Klimpera


[1]
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html

[2]
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html








--
Ioan Eugen Stan
http://ieugen.blogspot.com


Re: Working with MapFiles

2012-04-02 Thread Ondřej Klimpera

Ok, thanks.

I missed the setup() method because I'm using an older version of Hadoop, so I
suppose the configure() method does the same in Hadoop 0.20.203.


Now I'm able to load a MapFile inside the configure() method into a
MapFile.Reader instance held as a private field, and all works fine. I'm
just wondering whether the MapFile is replicated on HDFS and the data are read
locally, or whether reading from this file will increase the network
bandwidth because its data are fetched from another node in the
hadoop cluster.
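
Roughly the pattern I use (a simplified sketch with a placeholder path and
key/value types, not my actual code):

import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class LookupMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private MapFile.Reader reader;

  public void configure(JobConf job) {
    try {
      FileSystem fs = FileSystem.get(job);
      // Directory of the MapFile (holds the "data" and "index" parts); the
      // key/value classes must match those the MapFile was written with.
      reader = new MapFile.Reader(fs, "/user/hadoop/lookup.map", job);
    } catch (IOException e) {
      throw new RuntimeException("Could not open MapFile", e);
    }
  }

  public void map(LongWritable key, Text line,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    Text found = new Text();
    if (reader.get(line, found) != null) {   // binary search on the index + seek
      out.collect(line, found);
    }
  }

  public void close() throws IOException {
    reader.close();
  }
}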


Hopefully the last question to bother you with: is reading files from
the DistributedCache (a normal text file) limited to a particular job?
Before running a job I add a file to the DistributedCache. When getting the file in
the Reducer implementation, can it access DistributedCache files from other jobs?

In other words, what will this code list:

//Reducer impl.
public void configure(JobConf job) {

 URI[] distCacheFileUris = DistributedCache.getCacheFiles(job);

}

will the distCacheFileUris variable contain only URIs for this job, or
for any job running on the Hadoop cluster?


Hope it's understandable.
Thanks.

On 04/02/2012 11:34 AM, Ioan Eugen Stan wrote:

Hi Ondrej,

On 30.03.2012 14:30, Ondřej Klimpera wrote:

And one more question, is it even possible to add a MapFile (as it
consits of index and data file) to Distributed cache?
Thanks


Should be no problem, they are just two files.


On 03/30/2012 01:15 PM, Ondřej Klimpera wrote:

Hello,

I'm not sure what you mean by using map reduce setup()?

If the file is that small you could load it all in memory to avoid
network IO. Do that in the setup() method of the map reduce job.

Can you please explain little bit more?



Check the javadocs[1]: setup is called once per task so you can read 
the file from HDFS then or perform other initializations.


[1] 
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Mapper.html 



Reading 20 MB in ram should not be a problem and is preferred if you 
need to make many requests against that data. It really depends on 
your use case so think carefully or just go ahead and test it.




Thanks


On 03/30/2012 12:49 PM, Ioan Eugen Stan wrote:

Hello Ondrej,


On 29.03.2012 18:05, Ondřej Klimpera wrote:

Hello,

I have a MapFile as a product of MapReduce job, and what I need to
do is:

1. If MapReduce produced more than one split as output, merge them into a single
file.

2. Copy this merged MapFile to another HDFS location and use it as a
Distributed cache file for another MapReduce job.
I'm wondering if it is even possible to merge MapFiles according to
their nature and use them as Distributed cache file.


A MapFile is actually two files [1]: one SequenceFile (with sorted
keys) and a small index for that file. The map file does a version of
binary search to find your key and performs seek() to go to the byte
offset in the file.


What I'm trying to achieve is repeatedly fast search in this file
during
another MapReduce job.
If my idea is absolute wrong, can you give me any tip how to do it?

The file is supposed to be 20MB large.
I'm using Hadoop 0.20.203.


If the file is that small you could load it all in memory to avoid
network IO. Do that in the setup() method of the map reduce job.

The distributed cache will also use HDFS [2] and I don't think it
will provide you with any benefits.


Thanks for your reply:)

Ondrej Klimpera


[1]
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/io/MapFile.html 



[2]
http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/filecache/DistributedCache.html 














Re: Image Processing in Hadoop

2012-04-02 Thread Sujit Dhamale
Shreya, can you please explain your scenario?


On Mon, Apr 2, 2012 at 3:02 PM, shreya@cognizant.com wrote:



 Hi,



 Can someone point me to some info on Image processing using Hadoop?



 Regards,

 Shreya


 This e-mail and any files transmitted with it are for the sole use of the
 intended recipient(s) and may contain confidential and privileged
 information.
 If you are not the intended recipient, please contact the sender by reply
 e-mail and destroy all copies of the original message.
 Any unauthorized review, use, disclosure, dissemination, forwarding,
 printing or copying of this email or any action taken in reliance on this
 e-mail is strictly prohibited and may be unlawful.



Yuan Jin is out of the office.

2012-04-02 Thread Yuan Jin

I will be out of the office starting  04/02/2012 and will not return until
04/05/2012.

I am out of office, and will reply you when I am back.


Re: Image Processing in Hadoop

2012-04-02 Thread madhu phatak
Hi Shreya,
 Image files are binary files. Use the SequenceFile format to store the images in
HDFS and SequenceFileInputFormat to read the bytes. You can use TwoDArrayWritable
to store a matrix for an image.
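
A rough sketch of the reading side (a map-only job over a SequenceFile whose key
is assumed to be the image name and whose value is the raw bytes; the class names
and the per-image processing are placeholders):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ImageJob {

  // Each record is one image: key = image name, value = raw image bytes.
  public static class ImageMapper
      extends Mapper<Text, BytesWritable, Text, IntWritable> {
    @Override
    protected void map(Text name, BytesWritable image, Context ctx)
        throws IOException, InterruptedException {
      // image.getBytes()/getLength() give the raw bytes to decode and process;
      // here we just emit the image size as a stand-in for real processing.
      ctx.write(name, new IntWritable(image.getLength()));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "image processing");
    job.setJarByClass(ImageJob.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setMapperClass(ImageMapper.class);
    job.setNumReduceTasks(0);                                // map-only job
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // SequenceFile of images
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
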
On Mon, Apr 2, 2012 at 3:36 PM, Sujit Dhamale sujitdhamal...@gmail.comwrote:

 Shreya  can u please Explain your scenario .


 On Mon, Apr 2, 2012 at 3:02 PM, shreya@cognizant.com wrote:

 
 
  Hi,
 
 
 
  Can someone point me to some info on Image processing using Hadoop?
 
 
 
  Regards,
 
  Shreya
 
 
  This e-mail and any files transmitted with it are for the sole use of the
  intended recipient(s) and may contain confidential and privileged
  information.
  If you are not the intended recipient, please contact the sender by reply
  e-mail and destroy all copies of the original message.
  Any unauthorized review, use, disclosure, dissemination, forwarding,
  printing or copying of this email or any action taken in reliance on this
  e-mail is strictly prohibited and may be unlawful.
 




-- 
https://github.com/zinnia-phatak-dev/Nectar


RE: Image Processing in Hadoop

2012-04-02 Thread Shreya.Pal
Hi,

My scenario is:
There are some images of structures (building plans etc.) that have to be
stored in HDFS. If the user clicks on a door of that building, I want to
use MapReduce to display the corresponding door image stored in HDFS
and all the information related to it. In a nutshell, an image has to be
displayed and, based on the user's click, we need to drill down into the image.

Thanks and Regards,
Shreya Pal

-Original Message-
From: Sujit Dhamale [mailto:sujitdhamal...@gmail.com]
Sent: Monday, April 02, 2012 3:36 PM
To: common-user@hadoop.apache.org
Subject: Re: Image Processing in Hadoop

Shreya  can u please Explain your scenario .


On Mon, Apr 2, 2012 at 3:02 PM, shreya@cognizant.com wrote:



 Hi,



 Can someone point me to some info on Image processing using Hadoop?



 Regards,

 Shreya


 This e-mail and any files transmitted with it are for the sole use of
 the intended recipient(s) and may contain confidential and privileged
 information.
 If you are not the intended recipient, please contact the sender by
 reply e-mail and destroy all copies of the original message.
 Any unauthorized review, use, disclosure, dissemination, forwarding,
 printing or copying of this email or any action taken in reliance on
 this e-mail is strictly prohibited and may be unlawful.

This e-mail and any files transmitted with it are for the sole use of the 
intended recipient(s) and may contain confidential and privileged information.
If you are not the intended recipient, please contact the sender by reply 
e-mail and destroy all copies of the original message.
Any unauthorized review, use, disclosure, dissemination, forwarding, printing 
or copying of this email or any action taken in reliance on this e-mail is 
strictly prohibited and may be unlawful.



Re: Working with MapFiles

2012-04-02 Thread Ioan Eugen Stan

Hi Ondrej,

On 02.04.2012 13:00, Ondřej Klimpera wrote:

Ok, thanks.

I missed setup() method because of using older version of hadoop, so I
suppose that method configure() does the same in hadoop 0.20.203.


Aha, if it's possible, try upgrading. I don't know how good support is for
versions older than the hadoop 0.20 branch.



Now I'm able to load a map file inside configure() method to
MapFile.Reader instance as a class private variable, all works fine,
just wondering if the MapFile is replicated on HDFS and data are read
locally, or if reading from this file will increase the network
bandwidth because of getting it's data from another computer node in the
hadoop cluster.



You could use a method variable instead of a private field if you load
the file. If the MapFile is written to HDFS then yes, it is replicated, and
you can configure the replication factor at file creation (and maybe
later). If you use DistributedCache then the files are not written to
HDFS, but to the mapred.local.dir [1] folder on every node.
The folder size is configurable, so it's possible that the data will be
available there for the next MR job, but don't rely on this.


Please read the docs, I may get things wrong. RTFM will save your life ;).

[1] http://developer.yahoo.com/hadoop/tutorial/module5.html#auxdata
[2] https://forums.aws.amazon.com/message.jspa?messageID=152538


Hopefully last question to bother you is, if reading files from
DistributedCache (normal text file) is limited to particular job.
Before running a job I add a file to DistCache. When getting the file in
Reducer implementation, can it access DistCache files from another jobs?
In another words what will list this command:

//Reducer impl.
public void configure(JobConf job) {

URI[] distCacheFileUris = DistributedCache.getCacheFiles(job);

}

will the distCacheFileUris variable contain only URIs for this job, or
for any job running on Hadoop cluster?

Hope it's understandable.
Thanks.



It's

--
Ioan Eugen Stan
http://ieugen.blogspot.com


RE: Image Processing in Hadoop

2012-04-02 Thread Darren Govoni
This doesn't sound like a MapReduce [1] sort of problem. Now, of course,
you can store files in HDFS and retrieve them. But it's up to your
application to interpret them. MapReduce cannot display the
corresponding door image; it is a computation scheme and performs
calculations that you provide.

[1] http://en.wikipedia.org/wiki/MapReduce

On Mon, 2012-04-02 at 15:52 +0530, shreya@cognizant.com wrote:
 Hi,
 
 My scenario is:
 There are some images of Structures (Building plans etc) that have to be
 stored in HDFS, If the user click on a door of that building, I want to
 use mapreduce to display the corresponding  door image stored in HDFS
 and all the information related to it. In a nut shell an image has to be
 displayed and based on user click, need to drill down into the image
 
 Thanks and Regards,
 Shreya Pal
 
 -Original Message-
 From: Sujit Dhamale [mailto:sujitdhamal...@gmail.com] 
 Sent: Monday, April 02, 2012 3:36 PM
 To: common-user@hadoop.apache.org
 Subject: Re: Image Processing in Hadoop
 
 Shreya  can u please Explain your scenario .
 
 
 On Mon, Apr 2, 2012 at 3:02 PM, shreya@cognizant.com wrote:
 
 
 
  Hi,
 
 
 
  Can someone point me to some info on Image processing using Hadoop?
 
 
 
  Regards,
 
  Shreya
 
 
  This e-mail and any files transmitted with it are for the sole use of 
  the intended recipient(s) and may contain confidential and privileged 
  information.
  If you are not the intended recipient, please contact the sender by 
  reply e-mail and destroy all copies of the original message.
  Any unauthorized review, use, disclosure, dissemination, forwarding, 
  printing or copying of this email or any action taken in reliance on 
  this e-mail is strictly prohibited and may be unlawful.
 
 This e-mail and any files transmitted with it are for the sole use of the 
 intended recipient(s) and may contain confidential and privileged information.
 If you are not the intended recipient, please contact the sender by reply 
 e-mail and destroy all copies of the original message.
 Any unauthorized review, use, disclosure, dissemination, forwarding, printing 
 or copying of this email or any action taken in reliance on this e-mail is 
 strictly prohibited and may be unlawful.
 




RE: Image Processing in Hadoop

2012-04-02 Thread Shreya.Pal
Yes, I understand that we need to write the processing logic. What I want to know
is whether there are any APIs that can be used for image processing.
I was reading about HIPI; is this the right API, or should WebGL be used?
Any other suggestions are welcome.

Thanks and Regards,
Shreya 

-Original Message-
From: Darren Govoni [mailto:dar...@ontrenet.com] 
Sent: Monday, April 02, 2012 4:47 PM
To: common-user@hadoop.apache.org
Subject: RE: Image Processing in Hadoop

This doesn't sound like a mapreduce[1] sort of problem. Now, of course, you can 
store files in HDFS and retrieve them. But its up to your application to 
interpret them. MapReduce cannot display the corresponding door image, it is 
a computation scheme and performs calculations that you provide.

[1] http://en.wikipedia.org/wiki/MapReduce

On Mon, 2012-04-02 at 15:52 +0530, shreya@cognizant.com wrote:
 Hi,
 
 My scenario is:
 There are some images of Structures (Building plans etc) that have to 
 be stored in HDFS, If the user click on a door of that building, I 
 want to use mapreduce to display the corresponding  door image stored 
 in HDFS and all the information related to it. In a nut shell an image 
 has to be displayed and based on user click, need to drill down into 
 the image
 
 Thanks and Regards,
 Shreya Pal
 
 -Original Message-
 From: Sujit Dhamale [mailto:sujitdhamal...@gmail.com]
 Sent: Monday, April 02, 2012 3:36 PM
 To: common-user@hadoop.apache.org
 Subject: Re: Image Processing in Hadoop
 
 Shreya  can u please Explain your scenario .
 
 
 On Mon, Apr 2, 2012 at 3:02 PM, shreya@cognizant.com wrote:
 
 
 
  Hi,
 
 
 
  Can someone point me to some info on Image processing using Hadoop?
 
 
 
  Regards,
 
  Shreya
 
 
  This e-mail and any files transmitted with it are for the sole use 
  of the intended recipient(s) and may contain confidential and 
  privileged information.
  If you are not the intended recipient, please contact the sender by 
  reply e-mail and destroy all copies of the original message.
  Any unauthorized review, use, disclosure, dissemination, forwarding, 
  printing or copying of this email or any action taken in reliance on 
  this e-mail is strictly prohibited and may be unlawful.
 
 This e-mail and any files transmitted with it are for the sole use of the 
 intended recipient(s) and may contain confidential and privileged information.
 If you are not the intended recipient, please contact the sender by reply 
 e-mail and destroy all copies of the original message.
 Any unauthorized review, use, disclosure, dissemination, forwarding, printing 
 or copying of this email or any action taken in reliance on this e-mail is 
 strictly prohibited and may be unlawful.
 



This e-mail and any files transmitted with it are for the sole use of the 
intended recipient(s) and may contain confidential and privileged information.
If you are not the intended recipient, please contact the sender by reply 
e-mail and destroy all copies of the original message.
Any unauthorized review, use, disclosure, dissemination, forwarding, printing 
or copying of this email or any action taken in reliance on this e-mail is 
strictly prohibited and may be unlawful.

Re: mapred.child.java.opts and mapreduce.reduce.java.opts

2012-04-02 Thread Harsh J
For 1.0, the right property is mapred.reduce.child.java.opts. The
mapreduce.* style would apply to MR in 2.0 and above.
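
For example, in the job driver (a rough sketch; mapred.reduce.child.java.opts is
the property confirmed above, and mapred.map.child.java.opts is assumed to be the
map-side analogue in 1.0):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReduceHeapDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("mapred.map.child.java.opts", "-Xmx200m");      // mappers keep a small heap
    conf.set("mapred.reduce.child.java.opts", "-Xmx4000m");  // only reducers get the 4 GB heap
    Job job = new Job(conf, "reduce-heavy job");
    // ... configure mapper, reducer, input and output as usual, then submit ...
  }
}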

On Mon, Apr 2, 2012 at 3:00 PM, Juan Pino juancitomiguel...@gmail.com wrote:
 Hello,

 I have a job that requires a bit more memory than the default for the
 reducer (not for the mapper).
 So for this I have this property in my configuration file:

 mapreduce.reduce.java.opts=-Xmx4000m

 When I run the job, I can see its configuration in the web interface and I
 see that indeed I have mapreduce.reduce.java.opts set to -Xmx4000m
 but I also have mapred.child.java.opts set to -Xmx200m and when I ps -ef
 the java process, it is using -Xmx200m.

 So to make my job work I had to set mapred.child.java.opts=-Xmx4000m in my
 configuration file.
 However I don't need that much memory for the mapper.
 How can I set more memory only for the mapper ? Is the only solution to set
 mapred.child.java.opts to -Xmx4000m, mapreduce.reduce.java.opts to -Xmx4000m
 and mapreduce.map.java.opts to -Xmx200m ?

 I am using hadoop 1.0.1.

 Thank you very much,

 Juan



-- 
Harsh J


Re: mapred.child.java.opts and mapreduce.reduce.java.opts

2012-04-02 Thread Juan Pino
Thank you that worked!

Juan

On Mon, Apr 2, 2012 at 12:55 PM, Harsh J ha...@cloudera.com wrote:

 For 1.0, the right property is mapred.reduce.child.java.opts. The
 mapreduce.* style would apply to MR in 2.0 and above.

 On Mon, Apr 2, 2012 at 3:00 PM, Juan Pino juancitomiguel...@gmail.com
 wrote:
  Hello,
 
  I have a job that requires a bit more memory than the default for the
  reducer (not for the mapper).
  So for this I have this property in my configuration file:
 
  mapreduce.reduce.java.opts=-Xmx4000m
 
  When I run the job, I can see its configuration in the web interface and
 I
  see that indeed I have mapreduce.reduce.java.opts set to -Xmx4000m
  but I also have mapred.child.java.opts set to -Xmx200m and when I ps -ef
  the java process, it is using -Xmx200m.
 
  So to make my job work I had to set mapred.child.java.opts=-Xmx4000m in
 my
  configuration file.
  However I don't need that much memory for the mapper.
  How can I set more memory only for the mapper ? Is the only solution to
 set
  mapred.child.java.opts to -Xmx4000m, mapreduce.reduce.java.opts to
 -Xmx4000m
  and mapreduce.map.java.opts to -Xmx200m ?
 
  I am using hadoop 1.0.1.
 
  Thank you very much,
 
  Juan



 --
 Harsh J



Re: How can I configure oozie to submit different workflows from different users ?

2012-04-02 Thread Alejandro Abdelnur
Praveenesh,

If I'm not mistaken 0.20.205 does not support wildcards for the proxyuser
(hosts/groups) settings. You have to use explicit hosts/groups.

Thxs.

Alejandro
PS: please follow up this thread in the oozie-us...@incubator.apache.org

On Mon, Apr 2, 2012 at 2:15 PM, praveenesh kumar praveen...@gmail.comwrote:

 Hi all,

 I want to use oozie to submit different workflows from different users.
 These users are able to submit hadoop jobs.
 I am using hadoop 0.20.205 and oozie 3.1.3
  I have the hadoop user as the oozie user

  I have set the following things:

  conf/oozie-site.xml:

    <property>
      <name>oozie.services.ext</name>
      <value>org.apache.oozie.service.HadoopAccessorService</value>
      <description>
        To add/replace services defined in 'oozie.services' with custom
        implementations. Class names must be separated by commas.
      </description>
    </property>

  conf/core-site.xml:

    <property>
      <name>hadoop.proxyuser.hadoop.hosts</name>
      <value>*</value>
    </property>
    <property>
      <name>hadoop.proxyuser.hadoop.groups</name>
      <value>*</value>
    </property>

 When I am submitting jobs as the hadoop user, I am able to run them properly.
 But when I try to submit the same workflow from a different user, who
 can submit simple MR jobs to my hadoop cluster, I am getting the
 following error:

 JA009: java.io.IOException: java.io.IOException: The username kumar
 obtained from the conf doesn't match the username hadoop the user
 authenticated as
 at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3943)
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at

 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)

 at

 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)

 at java.lang.reflect.Method.invoke(Method.java:597)
 at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
 at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at

 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)

 at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)

 Caused by: java.io.IOException: The username kumar obtained from the conf
 doesn't match the username hadoop the user authenticated as
 at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:426)
 at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3941)
 ... 11 more



Re: How can I configure oozie to submit different workflows from different users ?

2012-04-02 Thread praveenesh kumar
How can I specify multiple users/groups for the proxyuser setting?
Can I give comma-separated values in these settings?

Thanks,
Praveenesh

On Mon, Apr 2, 2012 at 5:52 PM, Alejandro Abdelnur t...@cloudera.comwrote:

 Praveenesh,

 If I'm not mistaken 0.20.205 does not support wildcards for the proxyuser
 (hosts/groups) settings. You have to use explicit hosts/groups.

 Thxs.

 Alejandro
 PS: please follow up this thread in the oozie-us...@incubator.apache.org

 On Mon, Apr 2, 2012 at 2:15 PM, praveenesh kumar praveen...@gmail.com
 wrote:

  Hi all,
 
  I want to use oozie to submit different workflows from different users.
  These users are able to submit hadoop jobs.
  I am using hadoop 0.20.205 and oozie 3.1.3
  I have a hadoop user as a oozie-user
 
  I have set the following things :
 
  conf/oozie-site.xml :
 
   property 
   name oozie.services.ext /name 
   value org.apache.oozie.service.HadoopAccessorService
   /value 
   description 
  To add/replace services defined in 'oozie.services' with custom
  implementations.Class names must be separated by commas.
   /description 
   /property 
 
  conf/core-site.xml
   property
   namehadoop.proxyuser.hadoop.hosts /name
   value* / value
   /property
   property
   namehadoop.proxyuser.hadoop.groups /name
   value* /value
   /property
 
  When I am submitting jobs as a hadoop user, I am able to run it properly.
  But when I am able to submit the same work flow  from a different user,
 who
  can submit the simple MR jobs to my hadoop cluster, I am getting the
  following error:
 
  JA009: java.io.IOException: java.io.IOException: The username kumar
  obtained from the conf doesn't match the username hadoop the user
  authenticated asat
  org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3943)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at
 
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
 
  at
 
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
 
  at java.lang.reflect.Method.invoke(Method.java:597)
  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
  at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:396)
  at
 
 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
 
  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
 
  Caused by: java.io.IOException: The username kumar obtained from the conf
  doesn't match the username hadoop the user authenticated as
  at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:426)
  at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3941)
  ... 11 more
 



Re: How can I configure oozie to submit different workflows from different users ?

2012-04-02 Thread Alejandro Abdelnur
Multiple values are comma-separated. Keep in mind that the valid values for
proxyuser groups, as the property name states, are GROUPS, not USERS.
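
For example, in core-site.xml, with explicit values (the host and group names
below are just placeholders):

  <property>
    <name>hadoop.proxyuser.hadoop.hosts</name>
    <value>oozieserver1.example.com,oozieserver2.example.com</value>
  </property>
  <property>
    <name>hadoop.proxyuser.hadoop.groups</name>
    <value>hadoop,oozieusers</value>
  </property>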

thxs.

Alejandro

On Mon, Apr 2, 2012 at 2:27 PM, praveenesh kumar praveen...@gmail.comwrote:

 How can I specify multiple users /groups for proxy user setting ?
 Can I give comma separated values in these settings ?

 Thanks,
 Praveenesh

 On Mon, Apr 2, 2012 at 5:52 PM, Alejandro Abdelnur t...@cloudera.com
 wrote:

  Praveenesh,
 
  If I'm not mistaken 0.20.205 does not support wildcards for the proxyuser
  (hosts/groups) settings. You have to use explicit hosts/groups.
 
  Thxs.
 
  Alejandro
  PS: please follow up this thread in the oozie-us...@incubator.apache.org
 
  On Mon, Apr 2, 2012 at 2:15 PM, praveenesh kumar praveen...@gmail.com
  wrote:
 
   Hi all,
  
   I want to use oozie to submit different workflows from different users.
   These users are able to submit hadoop jobs.
   I am using hadoop 0.20.205 and oozie 3.1.3
   I have a hadoop user as a oozie-user
  
   I have set the following things :
  
   conf/oozie-site.xml :
  
property 
name oozie.services.ext /name 
value org.apache.oozie.service.HadoopAccessorService
/value 
description 
   To add/replace services defined in 'oozie.services' with custom
   implementations.Class names must be separated by commas.
/description 
/property 
  
   conf/core-site.xml
property
namehadoop.proxyuser.hadoop.hosts /name
value* / value
/property
property
namehadoop.proxyuser.hadoop.groups /name
value* /value
/property
  
   When I am submitting jobs as a hadoop user, I am able to run it
 properly.
   But when I am able to submit the same work flow  from a different user,
  who
   can submit the simple MR jobs to my hadoop cluster, I am getting the
   following error:
  
   JA009: java.io.IOException: java.io.IOException: The username kumar
   obtained from the conf doesn't match the username hadoop the user
   authenticated asat
   org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3943)
   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
   at
  
  
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
  
   at
  
  
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
  
   at java.lang.reflect.Method.invoke(Method.java:597)
   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
   at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
   at java.security.AccessController.doPrivileged(Native Method)
   at javax.security.auth.Subject.doAs(Subject.java:396)
   at
  
  
 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
  
   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
  
   Caused by: java.io.IOException: The username kumar obtained from the
 conf
   doesn't match the username hadoop the user authenticated as
   at
 org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:426)
   at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3941)
   ... 11 more
  
 



Re: How can I configure oozie to submit different workflows from different users ?

2012-04-02 Thread praveenesh kumar
Is this a problem with the proxy setting? Even after specifying the group
name, I am not able to run it. It's still giving me the same error.

Thanks,
Praveenesh

On Mon, Apr 2, 2012 at 6:05 PM, Alejandro Abdelnur t...@cloudera.comwrote:

 multiple value are comma separated. keep in mind that valid values for
 proxyuser groups, as the property name states are GROUPS, not USERS.

 thxs.

 Alejandro

 On Mon, Apr 2, 2012 at 2:27 PM, praveenesh kumar praveen...@gmail.com
 wrote:

  How can I specify multiple users /groups for proxy user setting ?
  Can I give comma separated values in these settings ?
 
  Thanks,
  Praveenesh
 
  On Mon, Apr 2, 2012 at 5:52 PM, Alejandro Abdelnur t...@cloudera.com
  wrote:
 
   Praveenesh,
  
   If I'm not mistaken 0.20.205 does not support wildcards for the
 proxyuser
   (hosts/groups) settings. You have to use explicit hosts/groups.
  
   Thxs.
  
   Alejandro
   PS: please follow up this thread in the
 oozie-us...@incubator.apache.org
  
   On Mon, Apr 2, 2012 at 2:15 PM, praveenesh kumar praveen...@gmail.com
   wrote:
  
Hi all,
   
I want to use oozie to submit different workflows from different
 users.
These users are able to submit hadoop jobs.
I am using hadoop 0.20.205 and oozie 3.1.3
I have a hadoop user as a oozie-user
   
I have set the following things :
   
conf/oozie-site.xml :
   
 property 
 name oozie.services.ext /name 
 value org.apache.oozie.service.HadoopAccessorService
 /value 
 description 
To add/replace services defined in 'oozie.services' with custom
implementations.Class names must be separated by commas.
 /description 
 /property 
   
conf/core-site.xml
 property
 namehadoop.proxyuser.hadoop.hosts /name
 value* / value
 /property
 property
 namehadoop.proxyuser.hadoop.groups /name
 value* /value
 /property
   
When I am submitting jobs as a hadoop user, I am able to run it
  properly.
But when I am able to submit the same work flow  from a different
 user,
   who
can submit the simple MR jobs to my hadoop cluster, I am getting the
following error:
   
JA009: java.io.IOException: java.io.IOException: The username kumar
obtained from the conf doesn't match the username hadoop the user
authenticated asat
org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3943)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
   
   
  
 
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
   
at
   
   
  
 
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
   
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:563)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1388)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1384)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at
   
   
  
 
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
   
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1382)
   
Caused by: java.io.IOException: The username kumar obtained from the
  conf
doesn't match the username hadoop the user authenticated as
at
  org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:426)
at
 org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3941)
... 11 more
   
  
 



Re: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:'

2012-04-02 Thread Harsh J
Jay,

What does your job do? Create files directly on HDFS? If so, do you
follow this method?:
http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F

A local filesystem may not complain if you re-create an existing file.
HDFS' behavior here is different. This simple Python test is what I
mean:
>>> a = open('a', 'w')
>>> a.write('f')
>>> b = open('a', 'w')
>>> b.write('s')
>>> a.close(), b.close()
>>> open('a').read()
's'

Hence it is best to use the FileOutputCommitter framework as detailed
in the mentioned link.
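
A rough sketch of the side-file pattern from that FAQ entry (old mapred API;
the helper class and file name are illustrative):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;

public class SideFiles {
  // Create a side file under the task attempt's work directory. The
  // FileOutputCommitter promotes it to the job output directory only if the
  // attempt succeeds, so re-run or speculative attempts never collide on the
  // same HDFS path (which is what triggers AlreadyBeingCreatedException).
  public static FSDataOutputStream createSideFile(JobConf job, String name)
      throws IOException {
    Path workDir = FileOutputFormat.getWorkOutputPath(job);
    FileSystem fs = workDir.getFileSystem(job);
    return fs.create(new Path(workDir, name));
  }
}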

On Mon, Apr 2, 2012 at 7:09 PM, Jay Vyas jayunit...@gmail.com wrote:
 Hi guys:

 I have a map reduce job that runs normally on local file system from
 eclipse, *but* it fails on HDFS running in psuedo distributed mode.

 The exception I see is

 *org.apache.hadoop.ipc.RemoteException:
 org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:*


 Any thoughts on why this might occur in psuedo distributed mode, but not in
 regular file system ?



-- 
Harsh J


Re: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:'

2012-04-02 Thread Jay Vyas
No, my job does not write files directly to disk. It simply goes to some
web pages, reads data (in the reducer phase), and parses JSON into Thrift
objects which are emitted via the standard MultipleOutputs API to HDFS
files.

Any idea why Hadoop would throw the AlreadyBeingCreatedException?

On Mon, Apr 2, 2012 at 2:52 PM, Harsh J ha...@cloudera.com wrote:

 Jay,

 What does your job do? Create files directly on HDFS? If so, do you
 follow this method?:

 http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F

 A local filesystem may not complain if you re-create an existent file.
 HDFS' behavior here is different. This simple Python test is what I
 mean:
  a = open('a', 'w')
  a.write('f')
  b = open('a', 'w')
  b.write('s')
  a.close(), b.close()
  open('a').read()
 's'

 Hence it is best to use the FileOutputCommitter framework as detailed
 in the mentioned link.

 On Mon, Apr 2, 2012 at 7:09 PM, Jay Vyas jayunit...@gmail.com wrote:
  Hi guys:
 
  I have a map reduce job that runs normally on local file system from
  eclipse, *but* it fails on HDFS running in psuedo distributed mode.
 
  The exception I see is
 
  *org.apache.hadoop.ipc.RemoteException:
  org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:*
 
 
  Any thoughts on why this might occur in psuedo distributed mode, but not
 in
  regular file system ?



 --
 Harsh J




-- 
Jay Vyas
MMSB/UCHC


HADOOP_OPTS to tasks

2012-04-02 Thread Stijn De Weirdt

hi all,

is it normal that HADOOP_OPTS is not passed to the actual tasks (i.e. the
java processes running as children of the tasktracker)? the tasktracker process
uses it correctly.


is there a way to set general java options for each started task?


many thanks,

stijn


Re: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:'

2012-04-02 Thread Harsh J
Jay,

Without seeing the whole stack trace, all I can say as a cause for that
exception from a job is:

1. You're using threads, and the API components you are using aren't
thread-safe in your version of Hadoop.
2. Files are being written out to HDFS directories without following
the OutputCommitter (OC) rules. (This is negated, per your response.)

On Mon, Apr 2, 2012 at 7:35 PM, Jay Vyas jayunit...@gmail.com wrote:
 No, my job does not write files directly to disk. It simply goes to some
 web pages , reads data (in the reducer phase), and parses jsons into thrift
 objects which are emitted via the standard MultipleOutputs API to hdfs
 files.

 Any idea why hadoop would throw the AlreadyBeingCreatedException ?

 On Mon, Apr 2, 2012 at 2:52 PM, Harsh J ha...@cloudera.com wrote:

 Jay,

 What does your job do? Create files directly on HDFS? If so, do you
 follow this method?:

 http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F

 A local filesystem may not complain if you re-create an existent file.
 HDFS' behavior here is different. This simple Python test is what I
 mean:
  a = open('a', 'w')
  a.write('f')
  b = open('a', 'w')
  b.write('s')
  a.close(), b.close()
  open('a').read()
 's'

 Hence it is best to use the FileOutputCommitter framework as detailed
 in the mentioned link.

 On Mon, Apr 2, 2012 at 7:09 PM, Jay Vyas jayunit...@gmail.com wrote:
  Hi guys:
 
  I have a map reduce job that runs normally on local file system from
  eclipse, *but* it fails on HDFS running in psuedo distributed mode.
 
  The exception I see is
 
  *org.apache.hadoop.ipc.RemoteException:
  org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:*
 
 
  Any thoughts on why this might occur in psuedo distributed mode, but not
 in
  regular file system ?



 --
 Harsh J




 --
 Jay Vyas
 MMSB/UCHC



-- 
Harsh J


Re: HADOOP_OPTS to tasks

2012-04-02 Thread Harsh J
HADOOP_OPTS isn't applied for Task JVMs.

For Task JVMs, set mapred.child.java.opts in mapred-site.xml (Or via
Configuration for per-job tuning), to the opts string you want it to
have. For example -Xmx200m -Dsomesysprop=abc.

On Mon, Apr 2, 2012 at 7:47 PM, Stijn De Weirdt stijn.dewei...@ugent.be wrote:
 hi all,

 is it normal that HADOOP_OPTS are not passed to the actual tasks (ie the
 java processes running as child of tasktracker)? the tasktracker process
 uses them correctly.

 is there a way to set general java options for each started task?


 many thanks,

 stijn



-- 
Harsh J


Re: getting NullPointerException while running Word count example

2012-04-02 Thread Sujit Dhamale
Can someone please look into the issue below?
Thanks in advance

On Wed, Mar 7, 2012 at 9:09 AM, Sujit Dhamale sujitdhamal...@gmail.comwrote:

 Hadoop version: hadoop-0.20.203.0rc1.tar
 Operating System: Ubuntu 11.10



 On Wed, Mar 7, 2012 at 12:19 AM, Harsh J ha...@cloudera.com wrote:

 Hi Sujit,

 Please also tell us which version/distribution of Hadoop is this?

 On Tue, Mar 6, 2012 at 11:27 PM, Sujit Dhamale sujitdhamal...@gmail.com
 wrote:
  Hi,
 
  I am new to Hadoop. I installed Hadoop as per
  http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

  While running the word count example I am getting a NullPointerException.

  Can someone please look into this issue?

  Thanks in advance!
 
 
  hduser@sujit:~/Desktop/hadoop$ bin/hadoop dfs -ls /user/hduser/data
  Found 3 items
  -rw-r--r--   1 hduser supergroup 674566 2012-03-06 23:04
  /user/hduser/data/pg20417.txt
  -rw-r--r--   1 hduser supergroup1573150 2012-03-06 23:04
  /user/hduser/data/pg4300.txt
  -rw-r--r--   1 hduser supergroup1423801 2012-03-06 23:04
  /user/hduser/data/pg5000.txt
 
  hduser@sujit:~/Desktop/hadoop$ bin/hadoop jar hadoop*examples*.jar
  wordcount /user/hduser/data /user/hduser/gutenberg-outputd
 
  12/03/06 23:14:33 INFO input.FileInputFormat: Total input paths to
 process
  : 3
  12/03/06 23:14:33 INFO mapred.JobClient: Running job:
 job_201203062221_0002
  12/03/06 23:14:34 INFO mapred.JobClient:  map 0% reduce 0%
  12/03/06 23:14:49 INFO mapred.JobClient:  map 66% reduce 0%
  12/03/06 23:14:55 INFO mapred.JobClient:  map 100% reduce 0%
  12/03/06 23:14:58 INFO mapred.JobClient: Task Id :
  attempt_201203062221_0002_r_00_0, Status : FAILED
  Error: java.lang.NullPointerException
 at
  java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
 at
 
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900)
 at
 
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820)
 
  12/03/06 23:15:07 INFO mapred.JobClient: Task Id :
  attempt_201203062221_0002_r_00_1, Status : FAILED
  Error: java.lang.NullPointerException
 at
  java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
 at
 
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900)
 at
 
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820)
 
  12/03/06 23:15:16 INFO mapred.JobClient: Task Id :
  attempt_201203062221_0002_r_00_2, Status : FAILED
  Error: java.lang.NullPointerException
 at
  java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:768)
 at
 
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.getMapCompletionEvents(ReduceTask.java:2900)
 at
 
 org.apache.hadoop.mapred.ReduceTask$ReduceCopier$GetMapEventsThread.run(ReduceTask.java:2820)
 
  12/03/06 23:15:31 INFO mapred.JobClient: Job complete:
 job_201203062221_0002
  12/03/06 23:15:31 INFO mapred.JobClient: Counters: 20
  12/03/06 23:15:31 INFO mapred.JobClient:   Job Counters
  12/03/06 23:15:31 INFO mapred.JobClient: Launched reduce tasks=4
  12/03/06 23:15:31 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=22084
  12/03/06 23:15:31 INFO mapred.JobClient: Total time spent by all
  reduces waiting after reserving slots (ms)=0
  12/03/06 23:15:31 INFO mapred.JobClient: Total time spent by all
 maps
  waiting after reserving slots (ms)=0
  12/03/06 23:15:31 INFO mapred.JobClient: Launched map tasks=3
  12/03/06 23:15:31 INFO mapred.JobClient: Data-local map tasks=3
  12/03/06 23:15:31 INFO mapred.JobClient: Failed reduce tasks=1
  12/03/06 23:15:31 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=16799
  12/03/06 23:15:31 INFO mapred.JobClient:   FileSystemCounters
  12/03/06 23:15:31 INFO mapred.JobClient: FILE_BYTES_READ=740520
  12/03/06 23:15:31 INFO mapred.JobClient: HDFS_BYTES_READ=3671863
  12/03/06 23:15:31 INFO mapred.JobClient: FILE_BYTES_WRITTEN=2278287
  12/03/06 23:15:31 INFO mapred.JobClient:   File Input Format Counters
  12/03/06 23:15:31 INFO mapred.JobClient: Bytes Read=3671517
  12/03/06 23:15:31 INFO mapred.JobClient:   Map-Reduce Framework
  12/03/06 23:15:31 INFO mapred.JobClient: Map output materialized
  bytes=1474341
  12/03/06 23:15:31 INFO mapred.JobClient: Combine output
 records=102322
  12/03/06 23:15:31 INFO mapred.JobClient: Map input records=77932
  12/03/06 23:15:31 INFO mapred.JobClient: Spilled Records=153640
  12/03/06 23:15:31 INFO mapred.JobClient: Map output bytes=6076095
  12/03/06 23:15:31 INFO mapred.JobClient: Combine input
 records=629172
  12/03/06 23:15:31 INFO mapred.JobClient: Map output records=629172
  12/03/06 23:15:31 INFO mapred.JobClient: 

Re: org.apache.hadoop.ipc.RemoteException: org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:'

2012-04-02 Thread Jay Vyas
Thanks J: just curious about how you came to hypothesize (1) (i.e.
regarding the fact that threads and the
API components aren't thread-safe in my Hadoop version).

I think that's a really good guess, and I would like to be able to make
those sorts of intelligent hypotheses
myself. Any reading you can point me to for further enlightenment?

On Mon, Apr 2, 2012 at 3:16 PM, Harsh J ha...@cloudera.com wrote:

 Jay,

 Without seeing the whole stack trace all I can say as cause for that
 exception from a job is:

 1. You're using threads and the API components you are using isn't
 thread safe in your version of Hadoop.
 2. Files are being written out to HDFS directories without following
 the OC rules. (This is negated, per your response).

 On Mon, Apr 2, 2012 at 7:35 PM, Jay Vyas jayunit...@gmail.com wrote:
  No, my job does not write files directly to disk. It simply goes to some
  web pages , reads data (in the reducer phase), and parses jsons into
 thrift
  objects which are emitted via the standard MultipleOutputs API to hdfs
  files.
 
  Any idea why hadoop would throw the AlreadyBeingCreatedException ?
 
  On Mon, Apr 2, 2012 at 2:52 PM, Harsh J ha...@cloudera.com wrote:
 
  Jay,
 
  What does your job do? Create files directly on HDFS? If so, do you
  follow this method?:
 
 
 http://wiki.apache.org/hadoop/FAQ#Can_I_write_create.2BAC8-write-to_hdfs_files_directly_from_map.2BAC8-reduce_tasks.3F
 
  A local filesystem may not complain if you re-create an existent file.
  HDFS' behavior here is different. This simple Python test is what I
  mean:
   a = open('a', 'w')
   a.write('f')
   b = open('a', 'w')
   b.write('s')
   a.close(), b.close()
   open('a').read()
  's'
 
  Hence it is best to use the FileOutputCommitter framework as detailed
  in the mentioned link.
 
  On Mon, Apr 2, 2012 at 7:09 PM, Jay Vyas jayunit...@gmail.com wrote:
   Hi guys:
  
   I have a map reduce job that runs normally on local file system from
   eclipse, *but* it fails on HDFS running in psuedo distributed mode.
  
   The exception I see is
  
   *org.apache.hadoop.ipc.RemoteException:
   org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException:*
  
  
   Any thoughts on why this might occur in psuedo distributed mode, but
 not
  in
   regular file system ?
 
 
 
  --
  Harsh J
 
 
 
 
  --
  Jay Vyas
  MMSB/UCHC



 --
 Harsh J




-- 
Jay Vyas
MMSB/UCHC


Re: HBase bulk loader doing speculative execution when it set to false in mapred-site.xml

2012-04-02 Thread anil gupta
+common-user@hadoop.apache.org

Hi Harsh,

Thanks for the information.
Is there any way to differentiate between a client-side property and a
server-side property? Or is there a document which lists whether a property is
server- or client-side? Many times I have to guess at this and try out
test runs.
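
For reference, the per-job route Harsh suggests below would look roughly like
this with the old JobConf API (the helper class name is illustrative):

import org.apache.hadoop.mapred.JobConf;

public class SpeculationOff {
  // Client-side settings: they are serialized into this job's configuration,
  // so they take effect for this job no matter what mapred-site.xml says on
  // the cluster nodes.
  public static JobConf disableSpeculation(JobConf job) {
    job.setMapSpeculativeExecution(false);
    job.setReduceSpeculativeExecution(false);
    return job;
  }
}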

Thanks,
Anil

On Fri, Mar 30, 2012 at 9:54 PM, Harsh J ha...@cloudera.com wrote:

 Anil,

 You can also disable speculative execution on a per-job basis. See

 http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapreduce/Job.html#setMapSpeculativeExecution(boolean)
 (Which is why it is called a client-sided property - it applies
 per-job).

 If HBase strongly recommends turning it off, HBase should also, by
 default, turn it off for its own offered jobs?

 On Sat, Mar 31, 2012 at 4:02 AM, anil gupta anilg...@buffalo.edu wrote:
  Hi Doug,
 
  Yes, that's why i had set that property as false in my mapred-site.xml.
  But, to my surprise i didnt know that setting that property would be
  useless for Hadoop jobs unless the mapred-site.xml is in classpath. The
  idea of client side property is a little confusing to me at present since
  there is no proper nomenclature for client side properties at present.
  Thanks for your reply.
 
  ~Anil
 
  On Fri, Mar 30, 2012 at 3:26 PM, Doug Meil 
 doug.m...@explorysmedical.comwrote:
 
 
  Speculative execution is on by default in Hadoop.  One of the
 Performance
  recommendations in the Hbase RefGuide is to turn it off.
 
 
 
 
 
  On 3/30/12 6:12 PM, Jean-Daniel Cryans jdcry...@apache.org wrote:
 
  Well that's not an HBase configuration, that's Hadoop. I'm not sure if
  this is listed anywhere, maybe in the book.
  
  BTW usually HBase has a client somewhere in the same to indicate
  it's client side.
  
  J-D
  
  On Fri, Mar 30, 2012 at 3:08 PM, anil gupta anilg...@buffalo.edu
 wrote:
   Thanks for the quick reply, Jean. Is there any link where i can find
 the
   name of all client-side configuration for HBase?
  
   ~Anil
  
   On Fri, Mar 30, 2012 at 3:01 PM, Jean-Daniel Cryans
  jdcry...@apache.orgwrote:
  
   This is a client-side configuration so if your mapred-site.xml is
   _not_ on your classpath when you start the bulk load, it's not going
   to pick it up. So either have that file on your classpath, or put it
   in whatever other configuration file you have.
  
   J-D
  
   On Fri, Mar 30, 2012 at 2:52 PM, anil gupta anilgupt...@gmail.com
  wrote:
Hi All,
   
I am using cdh3u2. I ran HBase bulk loading with property
mapred.reduce.tasks.speculative.execution set to false in
mapred-site.xml. Still, i can see 6 killed task in Bulk Loading
 job
  and
after short analysis i realized that these jobs are killed because
   another
worker node completed the task, hence it means that speculative
  execution
is still on. Why the HBase Bulk loader is doing speculative
 execution
   when
i have set it to false in mapred-site.xml? Please let me know if
 i am
missing something over here.
   
--
Thanks  Regards,
Anil Gupta
  
  
  
  
   --
   Thanks  Regards,
   Anil Gupta
  
 
 
 
 
 
  --
  Thanks  Regards,
  Anil Gupta



 --
 Harsh J




-- 
Thanks & Regards,
Anil Gupta


Re: HADOOP_OPTS to tasks

2012-04-02 Thread Stijn De Weirdt

On 04/02/2012 04:18 PM, Harsh J wrote:

HADOOP_OPTS isn't applied for Task JVMs.

For Task JVMs, set mapred.child.java.opts in mapred-site.xml (Or via
Configuration for per-job tuning), to the opts string you want it to
have. For example -Xmx200m -Dsomesysprop=abc.
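A minimal mapred-site.xml entry for this, reusing the example options above
(it goes inside the <configuration> element):

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx200m -Dsomesysprop=abc</value>
  </property>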

thanks!

stijn



On Mon, Apr 2, 2012 at 7:47 PM, Stijn De Weirdtstijn.dewei...@ugent.be  wrote:

hi all,

is it normal that HADOOP_OPTS are not passed to the actual tasks (i.e. the
java processes running as children of the tasktracker)? the tasktracker
process itself uses them correctly.

is there a way to set general java options for each started task?


many thanks,

stijn








data distribution in HDFS

2012-04-02 Thread Stijn De Weirdt

hi all,

i've just started playing around with hdfs+mapred. i'm currently playing 
with teragen/sort/validate to see if i understand it all.


the test setup involves 5 nodes that all run tasktracker and datanode, one 
of which is also jobtracker and namenode on top of that (so this one node is 
running both the namenode hadoop process and the datanode process).


when i do the teragen run, the data is not distributed equally over all 
nodes. the node that is also namenode gets a bigger portion of all the data 
(as seen by df on the nodes and by using dfsadmin -report).
i also got this distribution when i ran the TestDFSIO write test (50 
files of 1GB).



i use the basic command line  teragen $((100*1000*1000)) 
/benchmarks/teragen, so i expect 100M * 0.1kB = 10GB of data (if i add up 
the volumes in use by hdfs, it's actually quite a bit more).
4 data nodes are using 4.2-4.8GB, and the data+namenode has 9.4GB in 
use. so this one datanode is seen as 2 nodes.


when i do ls on the filesystem, i see that teragen created 250MB files; 
the current hdfs blocksize is 64MB.


is there a reason why one datanode is preferred over the others?
it is annoying since the terasort output behaves the same, and i can't 
use the full hdfs space for testing that way. also, since more IO goes 
to this one node, the performance isn't really balanced.


many thanks,

stijn


Re: data distribution in HDFS

2012-04-02 Thread Raj Vishwanathan
Stijn,

The first block of the data is always stored on the local node. Assuming that 
you have a replication factor of 3, the node that generates the data will get 
about 10GB of data and the other 20GB will be distributed among the other nodes.
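For example, assuming the default replication factor of 3: the ~10GB you 
generate becomes ~30GB of block replicas, of which the writing node keeps 
about 10GB (one replica of every block) while the remaining ~20GB is spread 
over the other four datanodes at roughly 5GB each, which lines up with the 
9.4GB vs 4.2-4.8GB you observed.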

Raj 






 From: Stijn De Weirdt stijn.dewei...@ugent.be
To: common-user@hadoop.apache.org 
Sent: Monday, April 2, 2012 9:54 AM
Subject: data distribution in HDFS
 
hi all,

i've just started playing around with hdfs+mapred. i'm currently playing with 
teragen/sort/validate to see if i understand it all.

the test setup involves 5 nodes that all run tasktracker and datanode, one of 
which is also jobtracker and namenode on top of that (this one node is running 
both the namenode hadoop process and the datanode process).

when i do the teragen run, the data is not distributed equally over all 
nodes. the node that is also namenode gets a bigger portion of all the data 
(as seen by df on the nodes and by using dfsadmin -report).
i also get this distribution when i ran the TestDFSIO write test (50 files of 
1GB)


i use basic command line  teragen $((100*1000*1000)) /benchmarks/teragen, so i 
expect 100M*0.1kb = 10GB of data. (if i add the volumes in use by hdfs, it's 
actually quite a bit more.)
4 data nodes are using 4.2-4.8GB, and the data+namenode has 9.4GB in use. so 
this one datanode is seen as 2 nodes.

when i do ls on the filesystem, i see that teragen created 250MB files, the 
current hdfs blocksize is 64MB.

is there a reason why one datanode is preferred over the others.
it is annoying since the terasort output behaves the same, and i can't use the 
full hdfs space for testing that way. also, since more IO comes to this one 
node, the performance isn't really balanced.

many thanks,

stijn




Re: Getting RemoteException: while copying data from Local machine to HDFS

2012-04-02 Thread Harsh J
Per your jps, you don't have a DataNode running.

 hduser@sujit:~/Desktop/data$ jps
 6022 NameNode
 7100 Jps
 6569 JobTracker
 6798 TaskTracker
 6491 SecondaryNameNode

Please read http://wiki.apache.org/hadoop/CouldOnlyBeReplicatedTo to
solve this. You most likely need to also read:
http://search-hadoop.com/m/l4JWggvLE2
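A rough checklist for getting the DataNode up, as a sketch assuming a
0.20-style tarball install (paths will differ on other layouts):

  bin/hadoop-daemon.sh start datanode       # try to (re)start the datanode
  tail -n 50 logs/hadoop-*-datanode-*.log   # the log usually names the cause,
                                            # e.g. incompatible namespaceIDs
                                            # after reformatting the namenode
  bin/hadoop dfsadmin -report               # confirm the datanode has registered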

On Mon, Apr 2, 2012 at 10:58 PM, Sujit Dhamale sujitdhamal...@gmail.com wrote:
 Getting RemoteException: while copying data from Local machine to HDFS

 Hadoop version : hadoop-0.20.203.0rc1.tar
 Operating System : ubuntu 11.10

 hduser@sujit:~/Desktop/data$ jps
 6022 NameNode
 7100 Jps
 6569 JobTracker
 6798 TaskTracker
 6491 SecondaryNameNode
 hduser@sujit:~/Desktop/data$



 hduser@sujit:~/Desktop/data$ ls
 pg20417.txt  pg4300.txt  pg5000.txt


 hduser@sujit:~/Desktop/hadoop/bin$ hadoop dfs -copyFromLocal
 /home/hduser/Desktop/data /user/hduser/data
 12/04/02 22:51:37 WARN hdfs.DFSClient: DataStreamer Exception:
 org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
 /user/hduser/data/pg20417.txt could only be replicated to 0 nodes, instead
 of 1
    at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1417)
    at
 org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:596)
    at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
    at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1383)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1379)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1377)

    at org.apache.hadoop.ipc.Client.call(Client.java:1030)
    at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:224)
    at $Proxy1.addBlock(Unknown Source)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at
 org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
    at
 org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
    at $Proxy1.addBlock(Unknown Source)
    at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.locateFollowingBlock(DFSClient.java:3104)
    at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2975)
    at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2255)
    at
 org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:2446)

 12/04/02 22:51:37 WARN hdfs.DFSClient: Error Recovery for block null bad
 datanode[0] nodes == null
 12/04/02 22:51:37 WARN hdfs.DFSClient: Could not get block locations.
 Source file /user/hduser/data/pg20417.txt - Aborting...
 copyFromLocal: java.io.IOException: File /user/hduser/data/pg20417.txt
 could only be replicated to 0 nodes, instead of 1
 12/04/02 22:51:37 ERROR hdfs.DFSClient: Exception closing file
 /user/hduser/data/pg20417.txt : org.apache.hadoop.ipc.RemoteException:
 java.io.IOException: File /user/hduser/data/pg20417.txt could only be
 replicated to 0 nodes, instead of 1
    at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1417)
    at
 org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:596)
    at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
    at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:523)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1383)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:1379)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1059)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:1377)

 org.apache.hadoop.ipc.RemoteException: java.io.IOException: File
 /user/hduser/data/pg20417.txt could only be replicated to 0 nodes, instead
 of 1
    at
 org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1417)
    at
 org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:596)
    at sun.reflect.GeneratedMethodAccessor6.invoke(Unknown Source)
    at
 

Re: data distribution in HDFS

2012-04-02 Thread Stijn De Weirdt

hi raj,

what is a local node? is it relative to the tasks that are started?


stijn

On 04/02/2012 07:28 PM, Raj Vishwanathan wrote:

Stijn,

The first block of the data is always stored on the local node. Assuming that 
you had a replication factor of 3, the node that generates the data will get 
about 10GB of data and the other 20GB will be distributed among other nodes.

Raj







From: Stijn De Weirdtstijn.dewei...@ugent.be
To: common-user@hadoop.apache.org
Sent: Monday, April 2, 2012 9:54 AM
Subject: data distribution in HDFS

hi all,

i've just started playing around with hdfs+mapred. i'm currently playing with 
teragen/sort/validate to see if i understand it all.

the test setup involves 5 nodes that all run tasktracker and datanode, one of 
which is also jobtracker and namenode on top of that (this one node is running 
both the namenode hadoop process and the datanode process).

when i do the teragen run, the data is not distributed equally over all 
nodes. the node that is also namenode gets a bigger portion of all the data 
(as seen by df on the nodes and by using dfsadmin -report).
i also get this distribution when i ran the TestDFSIO write test (50 files of 
1GB)


i use basic command line  teragen $((100*1000*1000)) /benchmarks/teragen, so i 
expect 100M*0.1kb = 10GB of data. (if i add the volumes in use by hdfs, it's 
actually quite a bit more.)
4 data nodes are using 4.2-4.8GB, and the data+namenode has 9.4GB in use. so 
this one datanode is seen as 2 nodes.

when i do ls on the filesystem, i see that teragen created 250MB files, the 
current hdfs blocksize is 64MB.

is there a reason why one datanode is preferred over the others.
it is annoying since the terasort output behaves the same, and i can't use the 
full hdfs space for testing that way. also, since more IO comes to this one 
node, the performance isn't really balanced.

many thanks,

stijn







Re: data distribution in HDFS

2012-04-02 Thread Stijn De Weirdt

thanks serge.


is there a way to disable this feature (i.e. always placing the first block 
on the local node)?
and is this because the local node is a datanode? or is there always a 
local node involved in data transfers?


many thanks,

stijn


The local node is the node you are copying the data from,

if, let's say, you are using the -copyFromLocal option.


Regards
Serge

On 4/2/12 11:53 AM, Stijn De Weirdtstijn.dewei...@ugent.be  wrote:


hi raj,

what is a local node? is it relative to the tasks that are started?


stijn

On 04/02/2012 07:28 PM, Raj Vishwanathan wrote:

Stijn,

The first block of the data is always stored on the local node.
Assuming that you had a replication factor of 3, the node that generates
the data will get about 10GB of data and the other 20GB will be
distributed among other nodes.

Raj







From: Stijn De Weirdtstijn.dewei...@ugent.be
To: common-user@hadoop.apache.org
Sent: Monday, April 2, 2012 9:54 AM
Subject: data distribution in HDFS

hi all,

i've just started playing around with hdfs+mapred. i'm currently
playing with teragen/sort/validate to see if i understand it all.

the test setup involves 5 nodes that all run tasktracker and datanode,
one of which is also jobtracker and namenode on top of that (this one
node is running both the namenode hadoop process and the datanode
process).

when i do the teragen run, the data is not distributed equally over
all nodes. the node that is also namenode gets a bigger portion of
all the data (as seen by df on the nodes and by using dfsadmin -report).
i also get this distribution when i ran the TestDFSIO write test (50
files of 1GB)


i use basic command line  teragen $((100*1000*1000))
/benchmarks/teragen, so i expect 100M*0.1kb = 10GB of data. (if i add
the volumes in use by hdfs, it's actually quite a bit more.)
4 data nodes are using 4.2-4.8GB, and the data+namenode has 9.4GB in
use. so this one datanode is seen as 2 nodes.

when i do ls on the filesystem, i see that teragen created 250MB
files, the current hdfs blocksize is 64MB.

is there a reason why one datanode is preferred over the others.
it is annoying since the terasort output behaves the same, and i can't
use the full hdfs space for testing that way. also, since more IO comes
to this one node, the performance isn't really balanced.

many thanks,

stijn












Re: data distribution in HDFS

2012-04-02 Thread Raj Vishwanathan
AFAIK there is no way to disable this feature. It is an optimization, and it 
happens because in your case the node generating the data is also a datanode.
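Two workarounds, as a sketch rather than a way to switch the placement policy
off: for client-side uploads (e.g. -copyFromLocal), copy from a machine that
is not a datanode, so no node gets pinned as "local"; for data written by map
tasks (e.g. teragen), spread the blocks afterwards with the balancer:

  # run the balancer until every datanode is within 5% of the cluster average
  bin/start-balancer.sh -threshold 5
  # interrupt it early if needed
  bin/stop-balancer.sh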

Raj




 From: Stijn De Weirdt stijn.dewei...@ugent.be
To: common-user@hadoop.apache.org 
Sent: Monday, April 2, 2012 12:18 PM
Subject: Re: data distribution in HDFS
 
thanks serge.


is there a way to disable this feature (i.e. always placing the first block 
on the local node)?
and is this because the local node is a datanode? or is there always a 
local node involved in data transfers?

many thanks,

stijn

 The local node is the node you are copying the data from,

 if, let's say, you are using the -copyFromLocal option.


 Regards
 Serge

 On 4/2/12 11:53 AM, Stijn De Weirdtstijn.dewei...@ugent.be  wrote:

 hi raj,

 what is a local node? is it relative to the tasks that are started?


 stijn

 On 04/02/2012 07:28 PM, Raj Vishwanathan wrote:
 Stijn,

 The first block of the data is always stored on the local node.
 Assuming that you had a replication factor of 3, the node that generates
 the data will get about 10GB of data and the other 20GB will be
 distributed among other nodes.

 Raj





 
 From: Stijn De Weirdtstijn.dewei...@ugent.be
 To: common-user@hadoop.apache.org
 Sent: Monday, April 2, 2012 9:54 AM
 Subject: data distribution in HDFS

 hi all,

 i've just started playing around with hdfs+mapred. i'm currently
 playing with teragen/sort/validate to see if i understand it all.

 the test setup involves 5 nodes that all run tasktracker and datanode,
 one of which is also jobtracker and namenode on top of that (this one
 node is running both the namenode hadoop process and the datanode
 process).

 when i do the teragen run, the data is not distributed equally over
 all nodes. the node that is also namenode gets a bigger portion of
 all the data (as seen by df on the nodes and by using dfsadmin -report).
 i also get this distribution when i ran the TestDFSIO write test (50
 files of 1GB)


 i use basic command line  teragen $((100*1000*1000))
 /benchmarks/teragen, so i expect 100M*0.1kb = 10GB of data. (if i add
 the volumes in use by hdfs, it's actually quite a bit more.)
 4 data nodes are using 4.2-4.8GB, and the data+namenode has 9.4GB in
 use. so this one datanode is seen as 2 nodes.

 when i do ls on the filesystem, i see that teragen created 250MB
 files, the current hdfs blocksize is 64MB.

 is there a reason why one datanode is preferred over the others.
 it is annoying since the terasort output behaves the same, and i can't
 use the full hdfs space for testing that way. also, since more IO comes
 to this one node, the performance isn't really balanced.

 many thanks,

 stijn











Compression codec org.apache.hadoop.io.compress.DeflateCodec not found.

2012-04-02 Thread Eli Finkelshteyn

Hi Folks,
A coworker of mine recently set up a new CDH3 cluster with 4 machines (3 
data nodes, one namenode that doubles as a jobtracker). I started 
looking through it using hadoop fs -ls, and that went fine with 
everything displaying all right. Next, I decided to test out some simple 
pig jobs. Each of these worked fine on my development pseudo cluster, 
but failed on the new CDH3 cluster with the exact same error:


java.lang.IllegalArgumentException: Compression codec 
org.apache.hadoop.io.compress.DeflateCodec not found.

This also only happened when trying to process .gz files, and it 
happened even when I just tried to load and dump one. I figured this 
could be a problem with compression configs being manually overwritten 
in core-site.xml, but that file didn't have any mention of compression 
on any of the boxes in the CDH3 cluster. I looked at each box 
individually, and all the proper jars seem to be there, so now I'm at a 
bit of a loss. Any ideas what the problem could be?


Eli



Re: Compression codec org.apache.hadoop.io.compress.DeflateCodec not found.

2012-04-02 Thread Harsh J
Hi Eli,

Moving this to cdh-u...@cloudera.org as it's a CDH-specific question.
You'll get better answers from the community there. You are CC'd but
to subscribe to the CDH users community, head to
https://groups.google.com/a/cloudera.org/forum/#!forum/cdh-user. I've
bcc'd common-user@ here.

What you may be hitting here is caused by version mismatch in client
vs. server. See
https://ccp.cloudera.com/display/CDHDOC/Known+Issues+and+Work+Arounds+in+CDH3#KnownIssuesandWorkAroundsinCDH3-Pig
(Point #2, but it may not be just Pig/Hive-specific)
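A quick way to check for such a mismatch, as a sketch (the host name and jar
path below are placeholders for your own install):

  hadoop version                           # on the machine you launch pig from
  ssh some-cluster-node 'hadoop version'   # on a cluster node, for comparison
  # the codec class should be inside the hadoop-core jar the client picks up;
  # if grep finds nothing, that jar predates DeflateCodec
  jar tf /usr/lib/hadoop/hadoop-core.jar | grep DeflateCodec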

On Tue, Apr 3, 2012 at 3:54 AM, Eli Finkelshteyn iefin...@gmail.com wrote:
 Hi Folks,
 A coworker of mine recently set up a new CDH3 cluster with 4 machines (3 data
 nodes, one namenode that doubles as a jobtracker). I started looking through
 it using hadoop fs -ls, and that went fine with everything displaying
 alright. Next, I decided to test out some simple pig jobs. Each of these
 worked fine on my development pseudo cluster, but failed on the new CDH3
 cluster with the exact same error:

 java.lang.IllegalArgumentException: Compression codec
 org.apache.hadoop.io.compress.DeflateCodec not found.

 This also only happened when trying to process .gz files, and it happened
 even when I just tried to load and dump one. I figured this could be a
 problem with compression configs being manually overwritten in
 core-site.xml, but that file didn't have any mention of compression on any
 of the boxes in the CDH3 cluster. I looked at each box individually, and all
 the proper jars seem to be there, so now I'm at a bit of a loss. Any ideas
 what the problem could be?

 Eli




-- 
Harsh J