Socket does not have a channel

2013-03-05 Thread Subroto
Hi

java.lang.IllegalStateException: Socket 
Socket[addr=/10.86.203.112,port=1004,localport=35170] does not have a channel
at 
com.google.common.base.Preconditions.checkState(Preconditions.java:172)
at 
org.apache.hadoop.net.SocketInputWrapper.getReadableByteChannel(SocketInputWrapper.java:83)
at 
org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:432)
at 
org.apache.hadoop.hdfs.BlockReaderFactory.newBlockReader(BlockReaderFactory.java:82)
at 
org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:832)
at 
org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:444)

While accessing HDFS I keep getting the above mentioned error.
Setting dfs.client.use.legacy.blockreader to true fixes the problem.
I would like to know what exactly the problem is. Is it a problem/bug in hadoop?
Is there a JIRA ticket for this?
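For readers hitting the same exception, a minimal sketch of how such a client-side property can be applied programmatically (the class name and path argument are illustrative; the same property can equally be put in the client's hdfs-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LegacyBlockReaderWorkaround {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Workaround mentioned above: fall back to the legacy block reader.
        conf.setBoolean("dfs.client.use.legacy.blockreader", true);
        FileSystem fs = FileSystem.get(conf);
        try (FSDataInputStream in = fs.open(new Path(args[0]))) {
            System.out.println(in.read());  // read one byte to force a block read
        }
        fs.close();
    }
}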


Cheers,
Subroto Sanyal

Re: Need help optimizing reducer

2013-03-05 Thread Mahesh Balija
The reason why the reducer is fast up to 66% is because of the sorting
and shuffling phases of the reduce, when your actual reduce code has NOT yet
started.

The reduce side is divided into 3 phases of ~33% each - shuffle (fetch
data), sort and finally user-code (reduce). That is why your reduce might
be faster up to 66%. In order to speed up your program you may either have
to use more reducers or make your reducer code as optimized as
possible.
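As an aside, a minimal driver sketch of the reducer-count knob referred to above (the class and job names are illustrative; if the job logic really only allows one reducer, optimizing the reducer body is the remaining option):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "my-aggregation");  // Job.getInstance(conf) in newer releases
        // More reducers process the sorted key groups in parallel; with a
        // single reducer all ~600k key groups funnel through one task.
        job.setNumReduceTasks(4);
    }
}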

Best,
Mahesh Balija,
Calsoft Labs.

On Tue, Mar 5, 2013 at 1:27 AM, Austin Chungath austi...@gmail.com wrote:

 Hi all,

 I have 1 reducer and I have around 600 thousand unique keys coming to it.
 The total data is only around 30 mb.
 My logic doesn't allow me to have more than 1 reducer.
 It's taking too long to complete, around 2 hours. (Till 66% it's fast, then
 it slows down. I don't really think it has started doing anything till 66%,
 but then why does it show like that?)
 Are there any job execution parameters that can help improve reducer
 performance?
 Any suggestions to improve things when we have to live with just one
 reducer?

 thanks,
 Austin



Re: Need help optimizing reducer

2013-03-05 Thread Fatih Haltas
Hi Austin,

I am not sure whether you made this kind of mistake or not, but in any case
I would like to state it:
you might be re-reading the whole set of input values of the reducer function
(the values for the corresponding key, i.e. the output values of the mapper)
from beginning to end while merging them into one output.

If you can send reducer code, you may get more useful replies.




On Tue, Mar 5, 2013 at 1:00 PM, Mahesh Balija balijamahesh@gmail.comwrote:

 The reason why the reducer is fast upto 66% is be because of the Sorting
 and Shuffling phase of the reduce and when the actual task is NOT yet
 started.

 The reduce side is divided into 3 phases of 33~% each - shuffle (fetch
 data), sort and finally user-code (reduce). That is why your reduce might
 be faster upto 66%. In order to speed up your program you may either have
 to have more number of reducers or make your reducer code as optimized as
 possible.

 Best,
 Mahesh Balija,
 Calsoft Labs.


 On Tue, Mar 5, 2013 at 1:27 AM, Austin Chungath austi...@gmail.comwrote:

 Hi all,

 I have 1 reducer and I have around 600 thousand unique keys coming to it.
 The total data is only around 30 mb.
 My logic doesn't allow me to have more than 1 reducer.
 It's taking too long to complete, around 2 hours. (till 66% it's fast
 then it slows down/ I don't really think it has started doing anything till
 66% but then why does it show like that?).
 Are there any job execution parameters that can help improve reducer
 performace?
 Any suggestions to improve things when we have to live with just one
 reducer?

 thanks,
 Austin





Re: Need help optimizing reducer

2013-03-05 Thread Fatih Haltas
I mean:
while adding each newly arriving reducer input value to the already merged
values, in order to construct the whole set of input values for the
corresponding key, you might be re-reading every input value (the output
values of the mapper) from beginning to end.
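To illustrate the pattern being warned about, a hedged sketch (new-API reducer, key and value types assumed to be Text) contrasting quadratic string merging with a linear append into one buffer:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class ConcatReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // Slow pattern: String merged = ""; for (Text v : values) merged = merged + "," + v;
        // Each concatenation re-copies everything merged so far (quadratic for 600k values).
        // Linear alternative: append into a single growing buffer.
        StringBuilder merged = new StringBuilder();
        for (Text v : values) {
            if (merged.length() > 0) merged.append(',');
            merged.append(v.toString());
        }
        context.write(key, new Text(merged.toString()));
    }
}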


On Tue, Mar 5, 2013 at 1:46 PM, Fatih Haltas fatih.hal...@nyu.edu wrote:

 Hi Austin,

 I am not sure whether you had  this kind of mistake or not but in any
 case, I would like to state:
 that you might be trying to read whole input values,(corresponding key
 values) to reducer function from beginning to end(which is the output value
 of mapper) while merging them into one output.

 If you can send reducer code, you may get more useful replies.




 On Tue, Mar 5, 2013 at 1:00 PM, Mahesh Balija 
 balijamahesh@gmail.comwrote:

 The reason why the reducer is fast upto 66% is be because of the Sorting
 and Shuffling phase of the reduce and when the actual task is NOT yet
 started.

 The reduce side is divided into 3 phases of 33~% each - shuffle (fetch
 data), sort and finally user-code (reduce). That is why your reduce
 might be faster upto 66%. In order to speed up your program you may either
 have to have more number of reducers or make your reducer code as optimized
 as possible.

 Best,
 Mahesh Balija,
 Calsoft Labs.


 On Tue, Mar 5, 2013 at 1:27 AM, Austin Chungath austi...@gmail.comwrote:

 Hi all,

 I have 1 reducer and I have around 600 thousand unique keys coming to
 it. The total data is only around 30 mb.
 My logic doesn't allow me to have more than 1 reducer.
 It's taking too long to complete, around 2 hours. (till 66% it's fast
 then it slows down/ I don't really think it has started doing anything till
 66% but then why does it show like that?).
 Are there any job execution parameters that can help improve reducer
 performace?
 Any suggestions to improve things when we have to live with just one
 reducer?

 thanks,
 Austin






Hadoop cluster setup - could not see second datanode

2013-03-05 Thread AMARNATH, Balachandar
Thanks for the information,

Now I am trying to install hadoop dfs using 2 nodes. A namenode cum datanode, 
and a separate data node. I use the following configuration for my hdfs-site.xml

<configuration>

  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
  </property>

  <property>
    <name>dfs.data.dir</name>
    <value>/home/bala/data</value>
  </property>

  <property>
    <name>dfs.name.dir</name>
    <value>/home/bala/name</value>
  </property>
</configuration>


In the namenode, I have added the datanode hostnames (machine1 and machine2).
When I do 'start-all.sh', I see in the logs that the data node is starting on both
machines, but when I go to the browser on the namenode, I see only one live
node. (That is the namenode, which is also configured as a datanode.)

Any hint here will help me


With regards
Bala





From: Mahesh Balija [mailto:balijamahesh@gmail.com]
Sent: 05 March 2013 14:15
To: user@hadoop.apache.org
Subject: Re: Hadoop file system

You can use HDFS alone in distributed mode to fulfill your requirement.
HDFS has the FileSystem Java API through which you can interact with HDFS
from your client.
HDFS works best if you have a small number of huge files rather than
many small files.
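For a flavour of that API, a minimal, hedged sketch of a standalone HDFS client (the path is illustrative; the cluster address comes from fs.default.name / fs.defaultFS in the client's core-site.xml on the classpath):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        // Picks up core-site.xml/hdfs-site.xml from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path p = new Path("/user/demo/hello.txt");  // illustrative path
        try (FSDataOutputStream out = fs.create(p, true)) {
            out.writeUTF("hello hdfs");
        }
        try (FSDataInputStream in = fs.open(p)) {
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}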

Best,
Mahesh Balija,
Calsoft Labs.
On Tue, Mar 5, 2013 at 10:43 AM, AMARNATH, Balachandar 
balachandar.amarn...@airbus.com wrote:

Hi,

I am new to HDFS. In my Java application, I need to perform a 'similar operation'
over a large number of files. I would like to store those files on distributed
machines. I don't think I will need the map reduce paradigm. However, I would
like to use HDFS for file storage and access. Is it possible (or a nice idea) to
use HDFS as a stand-alone component? And are Java APIs available to work with HDFS
so that I can read/write in a distributed environment? Any thoughts here will be
helpful.


With thanks and regards
Balachandar







basic question about rack awareness and computation migration

2013-03-05 Thread Julian Bui
Hi hadoop users,

I'm trying to find out if computation migration is something the developer
needs to worry about or if it's supposed to be hidden.

I would like to use hadoop to take in a list of image paths in the hdfs and
then have each task compress these large, raw images into something much
smaller - say jpeg  files.

Input: list of paths
Output: compressed jpeg

Since I don't really need a reduce task (I'm more using hadoop for its
reliability and orchestration aspects), my mapper ought to just take the
list of image paths and then work on them.  As I understand it, each image
will likely be on multiple data nodes.

My question is how will each mapper task migrate the computation to the
data nodes?  I recall reading that the namenode is supposed to deal with
this.  Is it hidden from the developer?  Or as the developer, do I need to
discover where the data lies and then migrate the task to that node?  Since
my input is just a list of paths, it seems like the namenode couldn't
really do this for me.

Another question: Where can I find out more about this?  I've looked up
rack awareness and computation migration but haven't really found much
code relating to either one - leading me to believe I'm not supposed to
have to write code to deal with this.

Anyway, could someone please help me out or set me straight on this?

Thanks,
-Julian


RE: Hadoop cluster setup - could not see second datanode

2013-03-05 Thread AMARNATH, Balachandar
I fixed the below issue :)


Regards
Bala

From: AMARNATH, Balachandar [mailto:balachandar.amarn...@airbus.com]
Sent: 05 March 2013 17:05
To: user@hadoop.apache.org
Subject: Hadoop cluster setup - could not see second datanode

Thanks for the information,

Now I am trying to install hadoop dfs using 2 nodes. A namenode cum datanode, 
and a separate data node. I use the following configuration for my hdfs-site.xml

<configuration>

  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
  </property>

  <property>
    <name>dfs.data.dir</name>
    <value>/home/bala/data</value>
  </property>

  <property>
    <name>dfs.name.dir</name>
    <value>/home/bala/name</value>
  </property>
</configuration>


In namenode, I have added the datanode hostnames (machine1 and machine2).
When I do 'start-all.sh', I see the log that the data node is starting in both 
the machines but I went to the browser in the namenode, I see only one live 
node. (That is the namenode which is configured as datanode)

Any hint here will help me


With regards
Bala





From: Mahesh Balija [mailto:balijamahesh@gmail.com]
Sent: 05 March 2013 14:15
To: user@hadoop.apache.org
Subject: Re: Hadoop file system

You can be able to use Hdfs alone in the distributed mode to fulfill your 
requirement.
Hdfs has the Filesystem java api through which you can interact with the HDFS 
from your client.
HDFS is good if you have less number of files with huge size rather than you 
having many files with small size.

Best,
Mahesh Balija,
Calsoft Labs.
On Tue, Mar 5, 2013 at 10:43 AM, AMARNATH, Balachandar 
balachandar.amarn...@airbus.com wrote:

Hi,

I am new to hdfs. In my java application, I need to perform 'similar operation' 
over large number of files. I would like to store those files in distributed 
machines. I don't think, I will need map reduce paradigm. But however I would 
like to use HDFS for file storage and access. Is it possible (or nice idea) to 
use HDFS as a stand alone stuff? And, java APIs are available to work with HDFS 
so that I can read/write in distributed environment ? Any thoughts here will be 
helpful.


With thanks and regards
Balachandar







JobTracker client - max connections

2013-03-05 Thread Amit Sela
Hi all,

I'm implementing an API over the JobTracker client - JobClient.
My plan is to have a pool of JobClient objects that will expose the ability
to submit jobs, poll status etc.

My question is: should I set a maximum pool size? How many connections
are too many connections for the JobTracker? Any suggestions for what pool
to use?

Thanks,
Amit.


S3N copy creating recursive folders

2013-03-05 Thread Subroto
Hi,

I am using Hadoop 1.0.3 and trying to execute:
hadoop fs -cp s3n://acessKey:acesssec...@bucket.name/srcData /test/srcData

This ends up with:
cp: java.io.IOException: mkdirs: Pathname too long.  Limit 8000 characters, 
1000 levels.

When I try to list the folder recursively /test/srcData: it lists 998 folders 
like:
drwxr-xr-x   - root supergroup  0 2013-03-05 08:49 /test/srcData/srcData
drwxr-xr-x   - root supergroup  0 2013-03-05 08:49 
/test/srcData/srcData/srcData
drwxr-xr-x   - root supergroup  0 2013-03-05 08:49 
/test/srcData/srcData/srcData/srcData
drwxr-xr-x   - root supergroup  0 2013-03-05 08:49 
/test/srcData/srcData/srcData/srcData/srcData
drwxr-xr-x   - root supergroup  0 2013-03-05 08:49 
/test/srcData/srcData/srcData/srcData/srcData/srcData

Is there a problem with the s3n filesystem?

Cheers,
Subroto Sanyal



Re:S3N copy creating recursive folders

2013-03-05 Thread 卖报的小行家
Hi Subroto,


I haven't used the s3n filesystem, but from the output cp: java.io.IOException:
mkdirs: Pathname too long.  Limit 8000 characters, 1000 levels., I think the
problem is with the path. Is the path longer than 8000 characters, or is the
level more than 1000?
You only have 998 folders. Maybe the last one is longer than 8000 characters. Why
not count the last one's length?


BRs//Julian










-- Original --
From:  Subrotossan...@datameer.com;
Date:  Tue, Mar 5, 2013 10:22 PM
To:  useruser@hadoop.apache.org; 

Subject:  S3N copy creating recursive folders



Hi,

I am using Hadoop 1.0.3 and trying to execute:
hadoop fs -cp s3n://acessKey:acesssec...@bucket.name/srcData /test/srcData

This ends up with:
cp: java.io.IOException: mkdirs: Pathname too long.  Limit 8000 characters, 
1000 levels.

When I try to list the folder recursively /test/srcData: it lists 998 folders 
like:
drwxr-xr-x   - root supergroup  0 2013-03-05 08:49 /test/srcData/srcData
drwxr-xr-x   - root supergroup  0 2013-03-05 08:49 
/test/srcData/srcData/srcData
drwxr-xr-x   - root supergroup  0 2013-03-05 08:49 
/test/srcData/srcData/srcData/srcData
drwxr-xr-x   - root supergroup  0 2013-03-05 08:49 
/test/srcData/srcData/srcData/srcData/srcData
drwxr-xr-x   - root supergroup  0 2013-03-05 08:49 
/test/srcData/srcData/srcData/srcData/srcData/srcData

Is there a problem with s3n filesystem ??

Cheers,
Subroto Sanyal

Re: S3N copy creating recursive folders

2013-03-05 Thread Subroto
Hi,

It's not because there are too many recursive folders in the S3 bucket; in fact
there is no recursive folder in the source.
If I list the S3 bucket with native S3 tools I can find a file srcData with
size 0 in the folder srcData.
The copy command keeps on creating folders /test/srcData/srcData/srcData (it keeps
appending srcData).

Cheers,
Subroto Sanyal

On Mar 5, 2013, at 3:32 PM, 卖报的小行家 wrote:

 Hi Subroto,
 
 I didn't use the s3n filesystem.But  from the output cp: 
 java.io.IOException: mkdirs: Pathname too long.  Limit 8000 characters, 1000 
 levels., I think this is because the problem of the path. Is the path longer 
 than 8000 characters or the level is more than 1000?
 You only have 998 folders.Maybe the last one is more than 8000 characters.Why 
 not count the last one's length?
 
 BRs//Julian
 
 
 
 
 
 -- Original --
 From:  Subrotossan...@datameer.com;
 Date:  Tue, Mar 5, 2013 10:22 PM
 To:  useruser@hadoop.apache.org;
 Subject:  S3N copy creating recursive folders
 
 Hi,
 
 I am using Hadoop 1.0.3 and trying to execute:
 hadoop fs -cp s3n://acessKey:acesssec...@bucket.name/srcData /test/srcData
 
 This ends up with:
 cp: java.io.IOException: mkdirs: Pathname too long.  Limit 8000 characters, 
 1000 levels.
 
 When I try to list the folder recursively /test/srcData: it lists 998 folders 
 like:
 drwxr-xr-x   - root supergroup  0 2013-03-05 08:49 
 /test/srcData/srcData
 drwxr-xr-x   - root supergroup  0 2013-03-05 08:49 
 /test/srcData/srcData/srcData
 drwxr-xr-x   - root supergroup  0 2013-03-05 08:49 
 /test/srcData/srcData/srcData/srcData
 drwxr-xr-x   - root supergroup  0 2013-03-05 08:49 
 /test/srcData/srcData/srcData/srcData/srcData
 drwxr-xr-x   - root supergroup  0 2013-03-05 08:49 
 /test/srcData/srcData/srcData/srcData/srcData/srcData
 
 Is there a problem with s3n filesystem ??
 
 Cheers,
 Subroto Sanyal



Re:RE: Hadoop cluster setup - could not see second datanode

2013-03-05 Thread 卖报的小行家
Hello,
Can a Namenode and several datanodes exist on one machine?
I only have one PC. I want to configure it this way.


BRs//Julian





-- Original --
From:  AMARNATH, Balachandarbalachandar.amarn...@airbus.com;
Date:  Tue, Mar 5, 2013 07:55 PM
To:  user@hadoop.apache.orguser@hadoop.apache.org; 

Subject:  RE: Hadoop cluster setup - could not see second datanode




I fixed the below issue :)

 

 

Regards

Bala

 

From: AMARNATH, Balachandar [mailto:balachandar.amarn...@airbus.com] 
Sent: 05 March 2013 17:05
To: user@hadoop.apache.org
Subject: Hadoop cluster setup - could not see second datanode



 

Thanks for the information,

 

Now I am trying to install hadoop dfs using 2 nodes. A namenode cum datanode, 
and a separate data node. I use the following configuration for my hdfs-site.xml

 

<configuration>

  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
  </property>

  <property>
    <name>dfs.data.dir</name>
    <value>/home/bala/data</value>
  </property>

  <property>
    <name>dfs.name.dir</name>
    <value>/home/bala/name</value>
  </property>
</configuration>

 

 

In namenode, I have added the datanode hostnames (machine1 and machine2).

When I do 'start-all.sh', I see the log that the data node is starting in 
both the machines but I went to the browser in the namenode, I see only one 
live node. (That is the namenode which is configured as datanode)

 

Any hint here will help me

 

 

With regards

Bala

 

 

 

 

 

From: Mahesh Balija [mailto:balijamahesh@gmail.com] 
Sent: 05 March 2013 14:15
To: user@hadoop.apache.org
Subject: Re: Hadoop file system


 

You can be able to use Hdfs alone in the distributed mode to fulfill your 
requirement.
Hdfs has the Filesystem java api through which you can interact with the HDFS 
from your client.
HDFS is good if you have less number of files with huge size rather than you 
having many files with small size.

Best,
Mahesh Balija,
Calsoft Labs.

On Tue, Mar 5, 2013 at 10:43 AM, AMARNATH, Balachandar 
balachandar.amarn...@airbus.com wrote:

 


Hi,


 


I am new to hdfs. In my java application, I need to perform 'similar 
operation' over large number of files. I would like to store those files in 
distributed machines. I don't think, I will need map reduce paradigm. But 
however I would like to use HDFS for file storage and access. Is it possible 
(or nice idea) to use HDFS as a stand alone stuff? And, java APIs are available 
to work with HDFS so that I can read/write in distributed environment ? Any 
thoughts here will be helpful.


 


 


With thanks and regards


Balachandar


 


 


 


Re:Socket does not have a channel

2013-03-05 Thread 卖报的小行家
Hi,
Which revision of Hadoop?
And in what situation is the Exception reported?
BRs//Julian


-- Original --
From:  Subrotossan...@datameer.com;
Date:  Tue, Mar 5, 2013 04:46 PM
To:  useruser@hadoop.apache.org; 

Subject:  Socket does not have a channel



Hi


java.lang.IllegalStateException: Socket 
Socket[addr=/10.86.203.112,port=1004,localport=35170] does not have a channel
at 
com.google.common.base.Preconditions.checkState(Preconditions.java:172)
at 
org.apache.hadoop.net.SocketInputWrapper.getReadableByteChannel(SocketInputWrapper.java:83)
at 
org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:432)
at 
org.apache.hadoop.hdfs.BlockReaderFactory.newBlockReader(BlockReaderFactory.java:82)
at 
org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:832)
at 
org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:444)


While accessing the HDFS  I keep getting the above mentioned error.
Setting the dfs.client.use.legacy.blockreader to true fixes the problem.
I would like to know what exactly is the problem? Is it a problem/bug in hadoop 
?
Is there is JIRA ticket for this?? 




Cheers,
Subroto Sanyal

Re: Socket does not have a channel

2013-03-05 Thread Subroto
Hi Julian,

This is from CDH4.1.2 and I think it's based on Apache Hadoop 2.0.

Cheers,
Subroto Sanyal
On Mar 5, 2013, at 3:50 PM, 卖报的小行家 wrote:

 Hi,
 Which revision of hadoop?
 and  what's the  situation  to report the Exception?
 BRs//Julian
 
 -- Original --
 From:  Subrotossan...@datameer.com;
 Date:  Tue, Mar 5, 2013 04:46 PM
 To:  useruser@hadoop.apache.org;
 Subject:  Socket does not have a channel
 
 Hi
 
 java.lang.IllegalStateException: Socket 
 Socket[addr=/10.86.203.112,port=1004,localport=35170] does not have a channel
   at 
 com.google.common.base.Preconditions.checkState(Preconditions.java:172)
   at 
 org.apache.hadoop.net.SocketInputWrapper.getReadableByteChannel(SocketInputWrapper.java:83)
   at 
 org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:432)
   at 
 org.apache.hadoop.hdfs.BlockReaderFactory.newBlockReader(BlockReaderFactory.java:82)
   at 
 org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:832)
   at 
 org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:444)
 
 While accessing the HDFS  I keep getting the above mentioned error.
 Setting the dfs.client.use.legacy.blockreader to true fixes the problem.
 I would like to know what exactly is the problem? Is it a problem/bug in 
 hadoop ?
 Is there is JIRA ticket for this??
 
 
 Cheers,
 Subroto Sanyal



回复: Socket does not have a channel

2013-03-05 Thread 卖报的小行家
Yes. It's from hadoop 2.0. I just read the 1.1.1 code; there are no such
classes as the ones the log mentioned. Maybe you can read the code first.




------------------ 原始邮件 ------------------
发件人: Subrotossan...@datameer.com;
发送时间: 2013年3月5日(星期二) 晚上10:56
收件人: useruser@hadoop.apache.org;

主题: Re: Socket does not have a channel



Hi Julian,

This is from CDH4.1.2 and I think its based on Apache Hadoop 2.0.


Cheers,
Subroto Sanyal
On Mar 5, 2013, at 3:50 PM, 卖报的小行家 wrote:

Hi,
Which revision of hadoop?
and  what's the  situation  to report the Exception?
BRs//Julian


-- Original --
From:  Subrotossan...@datameer.com;
Date:  Tue, Mar 5, 2013 04:46 PM
To:  useruser@hadoop.apache.org; 

Subject:  Socket does not have a channel



Hi


java.lang.IllegalStateException: Socket 
Socket[addr=/10.86.203.112,port=1004,localport=35170] does not have a channel
at 
com.google.common.base.Preconditions.checkState(Preconditions.java:172)
at 
org.apache.hadoop.net.SocketInputWrapper.getReadableByteChannel(SocketInputWrapper.java:83)
at 
org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:432)
at 
org.apache.hadoop.hdfs.BlockReaderFactory.newBlockReader(BlockReaderFactory.java:82)
at 
org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:832)
at 
org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:444)


While accessing the HDFS  I keep getting the above mentioned error.
Setting the dfs.client.use.legacy.blockreader to true fixes the problem.
I would like to know what exactly is the problem? Is it a problem/bug in hadoop 
?
Is there is JIRA ticket for this?? 




Cheers,
Subroto Sanyal

Transpose

2013-03-05 Thread Mix Nin
Hi

I have data in a file as follows . There are 3 columns separated by
semicolon(;). Each column would have multiple values separated by comma
(,).

11,22,33;144,244,344;yny;

I need output data in below format. It is like transposing  values of each
column.

11 144 y
22 244 n
33 344 y

Can we write a map reduce program to achieve this? Could you help with the code
on how to write it?


Thanks


Re: Transpose

2013-03-05 Thread Michel Segel
Yes you can.
You read in the row in each iteration of Mapper.map() (Text input).
You then output 3 times to the collector, one for each row of the output matrix.

Spin, sort, and reduce as needed.
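A hedged sketch of that mapper idea (new API; it assumes each input line looks like 11,22,33;144,244,344;y,n,y; and that a later sort/reduce step puts the columns of each row back in order):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TransposeMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] columns = line.toString().split(";");
        for (int col = 0; col < columns.length; col++) {
            String[] cells = columns[col].split(",");
            for (int row = 0; row < cells.length; row++) {
                // Key by output-row index; tag the value with its column so the
                // reducer can place it in the right position of the row.
                context.write(new IntWritable(row), new Text(col + ":" + cells[row]));
            }
        }
    }
}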

Sent from a remote device. Please excuse any typos...

Mike Segel

On Mar 5, 2013, at 9:11 AM, Mix Nin pig.mi...@gmail.com wrote:

 Hi
 
 I have data in a file as follows . There are 3 columns separated by 
 semicolon(;). Each column would have multiple values separated by comma (,). 
 
 11,22,33;144,244,344;yny;
 
 I need output data in below format. It is like transposing  values of each 
 column.
 
 11 144 y  
 22 244 n
 33 344 y
 
 Can we write map reduce program to achieve this. Could you help on the code 
 on how to write.
 
 
 Thanks


Re: Hadoop cluster setup - could not see second datanode

2013-03-05 Thread Robert Evans
Why would you need several data nodes?  It is simple to have one data node and 
one name node on the same machine.  I believe that you can make multiple data 
nodes run on the same machine, but it would take quite a bit of configuration 
work to do it, and it would only really be helpful for you to do some very 
specific testing involving multiple data nodes.

--Bobby

From: 卖报的小行家 85469...@qq.com
Reply-To: user@hadoop.apache.org
Date: Tuesday, March 5, 2013 8:41 AM
To: user user@hadoop.apache.org
Subject: Re:RE: Hadoop cluster setup - could not see second datanode

Hello,
Can  Namenode and several datanodes exist in one machine?
I only have one PC. I want to configure it like this way.

BRs//Julian


-- Original --
From:  AMARNATH, 
Balachandarbalachandar.amarn...@airbus.commailto:balachandar.amarn...@airbus.com;
Date:  Tue, Mar 5, 2013 07:55 PM
To:  
user@hadoop.apache.orgmailto:user@hadoop.apache.orguser@hadoop.apache.orgmailto:user@hadoop.apache.org;
Subject:  RE: Hadoop cluster setup - could not see second datanode

I fixed it the below issue :)


Regards
Bala

From: AMARNATH, Balachandar [mailto:balachandar.amarn...@airbus.com]
Sent: 05 March 2013 17:05
To: user@hadoop.apache.org
Subject: Hadoop cluster setup - could not see second datanode

Thanks for the information,

Now I am trying to install hadoop dfs using 2 nodes. A namenode cum datanode, 
and a separate data node. I use the following configuration for my hdfs-site.xml

<configuration>

  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
  </property>

  <property>
    <name>dfs.data.dir</name>
    <value>/home/bala/data</value>
  </property>

  <property>
    <name>dfs.name.dir</name>
    <value>/home/bala/name</value>
  </property>
</configuration>


In namenode, I have added the datanode hostnames (machine1 and machine2).
When I do ‘start-all.sh’, I see the log that the data node is starting in both 
the machines but I went to the browser in the namenode, I see only one live 
node. (That is the namenode which is configured as datanode)

Any hint here will help me


With regards
Bala





From: Mahesh Balija [mailto:balijamahesh@gmail.com]
Sent: 05 March 2013 14:15
To: user@hadoop.apache.org
Subject: Re: Hadoop file system

You can be able to use Hdfs alone in the distributed mode to fulfill your 
requirement.
Hdfs has the Filesystem java api through which you can interact with the HDFS 
from your client.
HDFS is good if you have less number of files with huge size rather than you 
having many files with small size.

Best,
Mahesh Balija,
Calsoft Labs.
On Tue, Mar 5, 2013 at 10:43 AM, AMARNATH, Balachandar 
balachandar.amarn...@airbus.com wrote:

Hi,

I am new to hdfs. In my java application, I need to perform ‘similar operation’ 
over large number of files. I would like to store those files in distributed 
machines. I don’t think, I will need map reduce paradigm. But however I would 
like to use HDFS for file storage and access. Is it possible (or nice idea) to 
use HDFS as a stand alone stuff? And, java APIs are available to work with HDFS 
so that I can read/write in distributed environment ? Any thoughts here will be 
helpful.


With thanks and regards
Balachandar





Re: Transpose

2013-03-05 Thread Sandy Ryza
Hi,

Essentially what you want to do is group your data points by their position
in the column, and have each reduce call construct the data for each row
into a row.  To have each record that the mapper processes be one of the
columns, you can use TextInputFormat with
conf.set("textinputformat.record.delimiter", ";").  Your mapper will
receive keys as LongWritables specifying the byte index into the input
file, and Text as values.  The mapper will tokenize the input string.

Emitting a map output for each data point in each column, you can then use
secondary sort to send the data to the right place in the right order (see
http://vangjee.wordpress.com/2012/03/20/secondary-sorting-aka-sorting-values-in-hadoops-mapreduce-programming-paradigm/).
Your composite key would look like (index of data point in column, which is
the row index; the LongWritable passed in as the map input key).  Each
reduce call would get all the points in a single row. You would sort/group
by row index, and within a reduce's values, sort by byte index so that
entries from earlier columns come before later ones.
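A hedged sketch of the composite key described above (the class name is illustrative; a partitioner and grouping comparator that look only at rowIndex are still needed so that all points of a row meet in one reduce call):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Composite key: group/partition on rowIndex, and within a row order values by
// the byte offset of the column they came from, so earlier columns come first.
public class RowOffsetKey implements WritableComparable<RowOffsetKey> {
    private int rowIndex;      // position of the data point inside its column
    private long byteOffset;   // map input key: byte index of the column in the file

    public RowOffsetKey() {}

    public RowOffsetKey(int rowIndex, long byteOffset) {
        this.rowIndex = rowIndex;
        this.byteOffset = byteOffset;
    }

    public int getRowIndex() { return rowIndex; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(rowIndex);
        out.writeLong(byteOffset);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        rowIndex = in.readInt();
        byteOffset = in.readLong();
    }

    @Override
    public int compareTo(RowOffsetKey other) {
        int cmp = Integer.compare(rowIndex, other.rowIndex);
        return cmp != 0 ? cmp : Long.compare(byteOffset, other.byteOffset);
    }
}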

Does that make sense?

Sandy

On Tue, Mar 5, 2013 at 7:11 AM, Mix Nin pig.mi...@gmail.com wrote:

 Hi

 I have data in a file as follows . There are 3 columns separated by
 semicolon(;). Each column would have multiple values separated by comma
 (,).

 11,22,33;144,244,344;yny;

 I need output data in below format. It is like transposing  values of each
 column.

 11 144 y
 22 244 n
 33 344 y

 Can we write map reduce program to achieve this. Could you help on the
 code on how to write.


 Thanks



Re: 回复: Socket does not have a channel

2013-03-05 Thread shashwat shriparv
Try setting dfs.client.use.legacy.blockreader to true



∞
Shashwat Shriparv



On Tue, Mar 5, 2013 at 8:39 PM, 卖报的小行家 85469...@qq.com wrote:

 Yes.It's from hadoop 2.0. I just now read the code 1.1.1.There are no such
 classes the log mentioned.Maybe you can read the code first.


 -- 原始邮件 --
 *发件人:* Subrotossan...@datameer.com;
 *发送时间:* 2013年3月5日(星期二) 晚上10:56
 *收件人:* useruser@hadoop.apache.org; **
 *主题:* Re: Socket does not have a channel

 Hi Julian,

 This is from CDH4.1.2 and I think its based on Apache Hadoop 2.0.

 Cheers,
 Subroto Sanyal
 On Mar 5, 2013, at 3:50 PM, 卖报的小行家 wrote:

 Hi,
 Which revision of hadoop?
 and  what's the  situation  to report the Exception?
 BRs//Julian

 -- Original --
 *From: * Subrotossan...@datameer.com;
 *Date: * Tue, Mar 5, 2013 04:46 PM
 *To: * useruser@hadoop.apache.org; **
 *Subject: * Socket does not have a channel

 Hi

 java.lang.IllegalStateException: Socket 
 Socket[addr=/10.86.203.112,port=1004,localport=35170]
 does not have a channel
  at
 com.google.common.base.Preconditions.checkState(Preconditions.java:172)
  at
 org.apache.hadoop.net.SocketInputWrapper.getReadableByteChannel(SocketInputWrapper.java:83)
  at
 org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:432)
  at
 org.apache.hadoop.hdfs.BlockReaderFactory.newBlockReader(BlockReaderFactory.java:82)
  at
 org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:832)
  at
 org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:444)

 While accessing the HDFS  I keep getting the above mentioned error.
 Setting the dfs.client.use.legacy.blockreader to true fixes the problem.
 I would like to know what exactly is the problem? Is it a problem/bug in
 hadoop ?
 Is there is JIRA ticket for this??


 Cheers,
 Subroto Sanyal





How to setup Cloudera Hadoop to run everything on a localhost?

2013-03-05 Thread anton ashanin
I am trying to run all Hadoop servers on a single Ubuntu localhost. All
ports are open and my /etc/hosts file is

127.0.0.1   frigate frigate.domain.local   localhost
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

When trying to install cluster Cloudera manager fails with the following
messages:

Installation failed. Failed to receive heartbeat from agent.

I run my Ubuntu-12.04 host from home connected by WiFi/dialup modem to my
provider. What configuration is missing?

Thanks!


Re: How to setup Cloudera Hadoop to run everything on a localhost?

2013-03-05 Thread Jean-Marc Spaggiari
Hi Anton,

Can you try to add something like:
your.local.ip.address   yourhostname

into your hosts file?

Like:
192.168.1.2  masterserver

2013/3/5 anton ashanin anton.asha...@gmail.com:
 I am trying to run all Hadoop servers on a single Ubuntu localhost. All
 ports are open and my /etc/hosts file is

 127.0.0.1   frigate frigate.domain.locallocalhost
 # The following lines are desirable for IPv6 capable hosts
 ::1 ip6-localhost ip6-loopback
 fe00::0 ip6-localnet
 ff00::0 ip6-mcastprefix
 ff02::1 ip6-allnodes
 ff02::2 ip6-allrouters

 When trying to install cluster Cloudera manager fails with the following
 messages:

 Installation failed. Failed to receive heartbeat from agent.

 I run my Ubuntu-12.04 host from home connected by WiFi/dialup modem to my
 provider. What configuration is missing?

 Thanks!


Re: How to setup Cloudera Hadoop to run everything on a localhost?

2013-03-05 Thread anton ashanin
Jean, thanks for trying to help.
I get my IP address by DHCP. Every time I start my Ubuntu I can possibly
get a different IP address from my WiFi modem/router.
Will it be OK to add a static address from 192.168.*.* to /etc/hosts in
this case?



On Tue, Mar 5, 2013 at 9:47 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org
 wrote:

 Hi Anton,

 Can you try to add something like:
 your.local.ip.addressyourhostname

 into your hosts file?

 Like:
 192.168.1.2  masterserver

 2013/3/5 anton ashanin anton.asha...@gmail.com:
  I am trying to run all Hadoop servers on a single Ubuntu localhost. All
  ports are open and my /etc/hosts file is
 
  127.0.0.1   frigate frigate.domain.locallocalhost
  # The following lines are desirable for IPv6 capable hosts
  ::1 ip6-localhost ip6-loopback
  fe00::0 ip6-localnet
  ff00::0 ip6-mcastprefix
  ff02::1 ip6-allnodes
  ff02::2 ip6-allrouters
 
  When trying to install cluster Cloudera manager fails with the following
  messages:
 
  Installation failed. Failed to receive heartbeat from agent.
 
  I run my Ubuntu-12.04 host from home connected by WiFi/dialup modem to my
  provider. What configuration is missing?
 
  Thanks!



Re: How to setup Cloudera Hadoop to run everything on a localhost?

2013-03-05 Thread Suresh Srinivas
Can you please take this to the Cloudera mailing list?


On Tue, Mar 5, 2013 at 10:33 AM, anton ashanin anton.asha...@gmail.comwrote:

 I am trying to run all Hadoop servers on a single Ubuntu localhost. All
 ports are open and my /etc/hosts file is

 127.0.0.1   frigate frigate.domain.locallocalhost
 # The following lines are desirable for IPv6 capable hosts
 ::1 ip6-localhost ip6-loopback
 fe00::0 ip6-localnet
 ff00::0 ip6-mcastprefix
 ff02::1 ip6-allnodes
 ff02::2 ip6-allrouters

 When trying to install cluster Cloudera manager fails with the following
 messages:

 Installation failed. Failed to receive heartbeat from agent.

 I run my Ubuntu-12.04 host from home connected by WiFi/dialup modem to my
 provider. What configuration is missing?

 Thanks!




-- 
http://hortonworks.com/download/


Re: 回复: Socket does not have a channel

2013-03-05 Thread Subroto
Hi Shashwat,

As already mentioned in my mail, setting
dfs.client.use.legacy.blockreader to true fixes the problem.
This looks to be a workaround, or rather disabling a feature.
Would like to know, what is the exact problem?

Cheers,
Subroto Sanyal
On Mar 5, 2013, at 6:33 PM, shashwat shriparv wrote:

 Try setting dfs.client.use.legacy.blockreader to true
 
 ∞
 Shashwat Shriparv
 
 
 
 On Tue, Mar 5, 2013 at 8:39 PM, 卖报的小行家 85469...@qq.com wrote:
 Yes.It's from hadoop 2.0. I just now read the code 1.1.1.There are no such 
 classes the log mentioned.Maybe you can read the code first.
 
 
 -- 原始邮件 --
 发件人: Subrotossan...@datameer.com;
 发送时间: 2013年3月5日(星期二) 晚上10:56
 收件人: useruser@hadoop.apache.org;
 主题: Re: Socket does not have a channel
 
 Hi Julian,
 
 This is from CDH4.1.2 and I think its based on Apache Hadoop 2.0.
 
 Cheers,
 Subroto Sanyal
 On Mar 5, 2013, at 3:50 PM, 卖报的小行家 wrote:
 
 Hi,
 Which revision of hadoop?
 and  what's the  situation  to report the Exception?
 BRs//Julian
 
 -- Original --
 From:  Subrotossan...@datameer.com;
 Date:  Tue, Mar 5, 2013 04:46 PM
 To:  useruser@hadoop.apache.org;
 Subject:  Socket does not have a channel
 
 Hi
 
 java.lang.IllegalStateException: Socket 
 Socket[addr=/10.86.203.112,port=1004,localport=35170] does not have a channel
  at 
 com.google.common.base.Preconditions.checkState(Preconditions.java:172)
  at 
 org.apache.hadoop.net.SocketInputWrapper.getReadableByteChannel(SocketInputWrapper.java:83)
  at 
 org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:432)
  at 
 org.apache.hadoop.hdfs.BlockReaderFactory.newBlockReader(BlockReaderFactory.java:82)
  at 
 org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:832)
  at 
 org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:444)
 
 While accessing the HDFS  I keep getting the above mentioned error.
 Setting the dfs.client.use.legacy.blockreader to true fixes the problem.
 I would like to know what exactly is the problem? Is it a problem/bug in 
 hadoop ?
 Is there is JIRA ticket for this??
 
 
 Cheers,
 Subroto Sanyal
 
 



Re: How to setup Cloudera Hadoop to run everything on a localhost?

2013-03-05 Thread Jean-Marc Spaggiari
Moving to cdh-user, user@hadoop in BCC

Anton, can you just try with the IP you have and see if it fixes the issue
before trying anything else?

JM

2013/3/5 anton ashanin anton.asha...@gmail.com:
 Jean, thanks for trying to help.
 I get my IP address by DHCP. Every time I start my Ubuntu I possibly can get
 a different IP address from my WiFi modem /router.
 Will it be ok to add static address from  192.168.*.*  to /etc/hosts in this
 case?



 On Tue, Mar 5, 2013 at 9:47 PM, Jean-Marc Spaggiari
 jean-m...@spaggiari.org wrote:

 Hi Anton,

 Can you try to add something like:
 your.local.ip.addressyourhostname

 into your hosts file?

 Like:
 192.168.1.2  masterserver

 2013/3/5 anton ashanin anton.asha...@gmail.com:
  I am trying to run all Hadoop servers on a single Ubuntu localhost. All
  ports are open and my /etc/hosts file is
 
  127.0.0.1   frigate frigate.domain.locallocalhost
  # The following lines are desirable for IPv6 capable hosts
  ::1 ip6-localhost ip6-loopback
  fe00::0 ip6-localnet
  ff00::0 ip6-mcastprefix
  ff02::1 ip6-allnodes
  ff02::2 ip6-allrouters
 
  When trying to install cluster Cloudera manager fails with the following
  messages:
 
  Installation failed. Failed to receive heartbeat from agent.
 
  I run my Ubuntu-12.04 host from home connected by WiFi/dialup modem to
  my
  provider. What configuration is missing?
 
  Thanks!




Re: How to setup Cloudera Hadoop to run everything on a localhost?

2013-03-05 Thread Morgan Reece
Don't use 'localhost' as your host name.  For example, if you wanted to use
the name 'node'; add another line to your hosts file like:

127.0.1.1 node.domain.local node

Then change all the host references in your configuration files to 'node'
-- also, don't forget to change the master/slave files as well.

Now, if you decide to use an external address it would need to be static.
This is easy to do, just follow this guide
http://www.howtoforge.com/linux-basics-set-a-static-ip-on-ubuntu
and replace '127.0.1.1' with whatever external address you decide on.

On Tue, Mar 5, 2013 at 12:59 PM, Suresh Srinivas sur...@hortonworks.comwrote:

 Can you please take this Cloudera mailing list?


 On Tue, Mar 5, 2013 at 10:33 AM, anton ashanin anton.asha...@gmail.comwrote:

 I am trying to run all Hadoop servers on a single Ubuntu localhost. All
 ports are open and my /etc/hosts file is

 127.0.0.1   frigate frigate.domain.locallocalhost
 # The following lines are desirable for IPv6 capable hosts
 ::1 ip6-localhost ip6-loopback
 fe00::0 ip6-localnet
 ff00::0 ip6-mcastprefix
 ff02::1 ip6-allnodes
 ff02::2 ip6-allrouters

 When trying to install cluster Cloudera manager fails with the following
 messages:

 Installation failed. Failed to receive heartbeat from agent.

 I run my Ubuntu-12.04 host from home connected by WiFi/dialup modem to my
 provider. What configuration is missing?

 Thanks!




 --
 http://hortonworks.com/download/



Re: basic question about rack awareness and computation migration

2013-03-05 Thread Rohit Kochar
Hello,
To be precise, this is hidden from the developer and you need not write any code
for this.
Whenever a file is stored in HDFS it is split into blocks of the configured size,
and each block could potentially be stored on a different datanode. All the
information about which file contains which blocks resides with the namenode.

So essentially, whenever a file is accessed via the DFS client it requests the
metadata from the NameNode,
which the DFS client uses to provide the file in streaming fashion to the end user.

Since the namenode knows the location of all the blocks/files, a task can be
scheduled by hadoop to be executed on the same node that holds the data.
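As a hedged illustration of the metadata involved (not something a MapReduce job normally calls itself, since the InputFormat and scheduler use it under the hood), a small client that asks the namenode where a file's blocks live:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path(args[0]));
        // The namenode reports which datanodes hold each block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println(block.getOffset() + " -> " + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}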

Thanks
Rohit Kochar

On 05-Mar-2013, at 5:19 PM, Julian Bui wrote:

 Hi hadoop users,
 
 I'm trying to find out if computation migration is something the developer 
 needs to worry about or if it's supposed to be hidden.
 
 I would like to use hadoop to take in a list of image paths in the hdfs and 
 then have each task compress these large, raw images into something much 
 smaller - say jpeg  files.  
 
 Input: list of paths
 Output: compressed jpeg
 
 Since I don't really need a reduce task (I'm more using hadoop for its 
 reliability and orchestration aspects), my mapper ought to just take the list 
 of image paths and then work on them.  As I understand it, each image will 
 likely be on multiple data nodes.  
 
 My question is how will each mapper task migrate the computation to the 
 data nodes?  I recall reading that the namenode is supposed to deal with 
 this.  Is it hidden from the developer?  Or as the developer, do I need to 
 discover where the data lies and then migrate the task to that node?  Since 
 my input is just a list of paths, it seems like the namenode couldn't really 
 do this for me.
 
 Another question: Where can I find out more about this?  I've looked up rack 
 awareness and computation migration but haven't really found much code 
 relating to either one - leading me to believe I'm not supposed to have to 
 write code to deal with this.
 
 Anyway, could someone please help me out or set me straight on this?
 
 Thanks,
 -Julian



Re: How to setup Cloudera Hadoop to run everything on a localhost?

2013-03-05 Thread anton ashanin
I am at a loss. I have set an IP address that my node got by DHCP:
 127.0.0.1   localhost
192.168.1.6node

This has not helped. Cloudera Manager finds this host all right, but still
can not get a heartbeat from it next.
Maybe the problem is that at the moment of these experiments I have three
laptops with addresses assigned by DHCP all running at once?

To make Hadoop work I am ready now to switch Ubuntu for CentOS or should I
try something else?
Please let me know on what Linux version you have managed to run Hadoop on
a local host only?


On Tue, Mar 5, 2013 at 10:54 PM, Jean-Marc Spaggiari 
jean-m...@spaggiari.org wrote:

 Hi Anton,

 Here is what my host is looking like:
 127.0.0.1   localhost
 192.168.1.2myserver

 JM

 2013/3/5 anton ashanin anton.asha...@gmail.com:
  Morgan,
  Just did exactly as you suggested, my /etc/hosts:
  127.0.1.1 node.domain.local node
 
  Wiped out, annihilated my previous installation completely and
 reinstalled
  everything from scratch.
  The same problem with CLOUDERA MANAGER (FREE EDITION):
  Installation failed.  Failed to receive heartbeat from agent
  
 
  I will try now the the  bright idea from Jean, looks promising to me
 
 
 
  On Tue, Mar 5, 2013 at 10:10 PM, Morgan Reece winter2...@gmail.com
 wrote:
 
  Don't use 'localhost' as your host name.  For example, if you wanted to
  use the name 'node'; add another line to your hosts file like:
 
  127.0.1.1 node.domain.local node
 
  Then change all the host references in your configuration files to
 'node'
  -- also, don't forget to change the master/slave files as well.
 
  Now, if you decide to use an external address it would need to be
 static.
  This is easy to do, just follow this guide
  http://www.howtoforge.com/linux-basics-set-a-static-ip-on-ubuntu
  and replace '127.0.1.1' with whatever external address you decide on.
 
 
  On Tue, Mar 5, 2013 at 12:59 PM, Suresh Srinivas 
 sur...@hortonworks.com
  wrote:
 
  Can you please take this Cloudera mailing list?
 
 
  On Tue, Mar 5, 2013 at 10:33 AM, anton ashanin 
 anton.asha...@gmail.com
  wrote:
 
  I am trying to run all Hadoop servers on a single Ubuntu localhost.
 All
  ports are open and my /etc/hosts file is
 
  127.0.0.1   frigate frigate.domain.locallocalhost
  # The following lines are desirable for IPv6 capable hosts
  ::1 ip6-localhost ip6-loopback
  fe00::0 ip6-localnet
  ff00::0 ip6-mcastprefix
  ff02::1 ip6-allnodes
  ff02::2 ip6-allrouters
 
  When trying to install cluster Cloudera manager fails with the
 following
  messages:
 
  Installation failed. Failed to receive heartbeat from agent.
 
  I run my Ubuntu-12.04 host from home connected by WiFi/dialup modem to
  my provider. What configuration is missing?
 
  Thanks!
 
 
 
 
  --
  http://hortonworks.com/download/
 
 
 



Re: How to setup Cloudera Hadoop to run everything on a localhost?

2013-03-05 Thread yibing Shi
Hi Anton,

Cloudera Manager needs a fully qualified domain name. Run hostname -f to
check whether you have an FQDN or not.

I am not familiar with Ubuntu, but on my CentOS, I just put the FQDN into
/etc/sysconfig/network, which then looks like the following:
NETWORKING=yes
HOSTNAME=myhost.my.domain
GATEWAY=10.2.2.254





On Wed, Mar 6, 2013 at 8:14 AM, anton ashanin anton.asha...@gmail.comwrote:

 I am at a loss. I have set an IP address that my node got by DHCP:
  127.0.0.1   localhost
 192.168.1.6node

 This has not helped. Cloudera Manager finds this host all right, but still
 can not get a heartbeat from it next.
 Maybe the problem is that at the moment of these experiments I have three
 laptops with addresses assigned by DHCP all running at once?

 To make Hadoop work I am ready now to switch Ubuntu for CentOS or should I
 try something else?
 Please let me know on what Linux version you have managed to run Hadoop on
 a local host only?


 On Tue, Mar 5, 2013 at 10:54 PM, Jean-Marc Spaggiari 
 jean-m...@spaggiari.org wrote:

 Hi Anton,

 Here is what my host is looking like:
 127.0.0.1   localhost
 192.168.1.2myserver


 JM

 2013/3/5 anton ashanin anton.asha...@gmail.com:
  Morgan,
  Just did exactly as you suggested, my /etc/hosts:
  127.0.1.1 node.domain.local node
 
  Wiped out, annihilated my previous installation completely and
 reinstalled
  everything from scratch.
  The same problem with CLOUDERA MANAGER (FREE EDITION):
  Installation failed.  Failed to receive heartbeat from agent
  
 
  I will try now the the  bright idea from Jean, looks promising to me
 
 
 
  On Tue, Mar 5, 2013 at 10:10 PM, Morgan Reece winter2...@gmail.com
 wrote:
 
  Don't use 'localhost' as your host name.  For example, if you wanted to
  use the name 'node'; add another line to your hosts file like:
 
  127.0.1.1 node.domain.local node
 
  Then change all the host references in your configuration files to
 'node'
  -- also, don't forget to change the master/slave files as well.
 
  Now, if you decide to use an external address it would need to be
 static.
  This is easy to do, just follow this guide
  http://www.howtoforge.com/linux-basics-set-a-static-ip-on-ubuntu
  and replace '127.0.1.1' with whatever external address you decide on.
 
 
  On Tue, Mar 5, 2013 at 12:59 PM, Suresh Srinivas 
 sur...@hortonworks.com
  wrote:
 
  Can you please take this Cloudera mailing list?
 
 
  On Tue, Mar 5, 2013 at 10:33 AM, anton ashanin 
 anton.asha...@gmail.com
  wrote:
 
  I am trying to run all Hadoop servers on a single Ubuntu localhost.
 All
  ports are open and my /etc/hosts file is
 
  127.0.0.1   frigate frigate.domain.locallocalhost
  # The following lines are desirable for IPv6 capable hosts
  ::1 ip6-localhost ip6-loopback
  fe00::0 ip6-localnet
  ff00::0 ip6-mcastprefix
  ff02::1 ip6-allnodes
  ff02::2 ip6-allrouters
 
  When trying to install cluster Cloudera manager fails with the
 following
  messages:
 
  Installation failed. Failed to receive heartbeat from agent.
 
  I run my Ubuntu-12.04 host from home connected by WiFi/dialup modem
 to
  my provider. What configuration is missing?
 
  Thanks!
 
 
 
 
  --
  http://hortonworks.com/download/
 
 
 





Re: How to setup Cloudera Hadoop to run everything on a localhost?

2013-03-05 Thread anton ashanin
Do you run all Hadoop servers on a single host that gets IP by DHCP?
What do you have in /etc/hosts?

Thanks!


On Wed, Mar 6, 2013 at 1:25 AM, yibing Shi
yibing@effectivemeasure.comwrote:

 Hi Anton,

 Cloudera manager needs fully qualified domain name. Run hostname -f to
 check whether you have FQDN or not.

 I am not familiar with Ubuntu, but on my CentOS, I just put the FQDN into
 /etc/sysconfig/network, which then looks like the following:
 NETWORKING=yes
 HOSTNAME=myhost.my.domain
 GATEWAY=10.2.2.254


 http://demo.effectivemeasure.com/signatures/au/YibingShi.vcf



 On Wed, Mar 6, 2013 at 8:14 AM, anton ashanin anton.asha...@gmail.comwrote:

 I am at a loss. I have set an IP address that my node got by DHCP:
  127.0.0.1   localhost
 192.168.1.6node

 This has not helped. Cloudera Manager finds this host all right, but
 still can not get a heartbeat from it next.
 Maybe the problem is that at the moment of these experiments I have three
 laptops with addresses assigned by DHCP all running at once?

 To make Hadoop work I am ready now to switch Ubuntu for CentOS or should
 I try something else?
 Please let me know on what Linux version you have managed to run Hadoop
 on a local host only?


 On Tue, Mar 5, 2013 at 10:54 PM, Jean-Marc Spaggiari 
 jean-m...@spaggiari.org wrote:

 Hi Anton,

 Here is what my host is looking like:
 127.0.0.1   localhost
 192.168.1.2myserver


 JM

 2013/3/5 anton ashanin anton.asha...@gmail.com:
  Morgan,
  Just did exactly as you suggested, my /etc/hosts:
  127.0.1.1 node.domain.local node
 
  Wiped out, annihilated my previous installation completely and
 reinstalled
  everything from scratch.
  The same problem with CLOUDERA MANAGER (FREE EDITION):
  Installation failed.  Failed to receive heartbeat from agent
  
 
  I will try now the the  bright idea from Jean, looks promising to me
 
 
 
  On Tue, Mar 5, 2013 at 10:10 PM, Morgan Reece winter2...@gmail.com
 wrote:
 
  Don't use 'localhost' as your host name.  For example, if you wanted
 to
  use the name 'node'; add another line to your hosts file like:
 
  127.0.1.1 node.domain.local node
 
  Then change all the host references in your configuration files to
 'node'
  -- also, don't forget to change the master/slave files as well.
 
  Now, if you decide to use an external address it would need to be
 static.
  This is easy to do, just follow this guide
  http://www.howtoforge.com/linux-basics-set-a-static-ip-on-ubuntu
  and replace '127.0.1.1' with whatever external address you decide on.
 
 
  On Tue, Mar 5, 2013 at 12:59 PM, Suresh Srinivas 
 sur...@hortonworks.com
  wrote:
 
  Can you please take this Cloudera mailing list?
 
 
  On Tue, Mar 5, 2013 at 10:33 AM, anton ashanin 
 anton.asha...@gmail.com
  wrote:
 
  I am trying to run all Hadoop servers on a single Ubuntu localhost.
 All
  ports are open and my /etc/hosts file is
 
  127.0.0.1   frigate frigate.domain.locallocalhost
  # The following lines are desirable for IPv6 capable hosts
  ::1 ip6-localhost ip6-loopback
  fe00::0 ip6-localnet
  ff00::0 ip6-mcastprefix
  ff02::1 ip6-allnodes
  ff02::2 ip6-allrouters
 
  When trying to install cluster Cloudera manager fails with the
 following
  messages:
 
  Installation failed. Failed to receive heartbeat from agent.
 
  I run my Ubuntu-12.04 host from home connected by WiFi/dialup modem
 to
  my provider. What configuration is missing?
 
  Thanks!
 
 
 
 
  --
  http://hortonworks.com/download/
 
 
 






Re: How to setup Cloudera Hadoop to run everything on a localhost?

2013-03-05 Thread yibing Shi
I didn't run all the services on a single server, but it doesn't matter
since the installation is the same no matter how many servers you are going
to install on.

I got the same error as you, and it turned out that CM needs to be able to
know the FQDN. I didn't use DHCP, so it was easier for me to fix. I guess you
might have to set up the DHCP server correctly for CM to find your FQDN.


http://demo.effectivemeasure.com/signatures/au/YibingShi.vcf



On Wed, Mar 6, 2013 at 9:56 AM, anton ashanin anton.asha...@gmail.comwrote:

 Do you run all Hadoop servers on a single host that gets IP by DHCP?
 What do you have in /etc/hosts?

 Thanks!


 On Wed, Mar 6, 2013 at 1:25 AM, yibing Shi 
 yibing@effectivemeasure.com wrote:

 Hi Anton,

 Cloudera manager needs fully qualified domain name. Run hostname -f to
 check whether you have FQDN or not.

 I am not familiar with Ubuntu, but on my CentOS, I just put the FQDN into
 /etc/sysconfig/network, which then looks like the following:
 NETWORKING=yes
 HOSTNAME=myhost.my.domain
 GATEWAY=10.2.2.254


 http://demo.effectivemeasure.com/signatures/au/YibingShi.vcf



 On Wed, Mar 6, 2013 at 8:14 AM, anton ashanin anton.asha...@gmail.comwrote:

 I am at a loss. I have set an IP address that my node got by DHCP:
  127.0.0.1   localhost
 192.168.1.6node

 This has not helped. Cloudera Manager finds this host all right, but
 still can not get a heartbeat from it next.
 Maybe the problem is that at the moment of these experiments I have
 three laptops with addresses assigned by DHCP all running at once?

 To make Hadoop work I am ready now to switch Ubuntu for CentOS or should
 I try something else?
 Please let me know on what Linux version you have managed to run Hadoop
 on a local host only?


 On Tue, Mar 5, 2013 at 10:54 PM, Jean-Marc Spaggiari 
 jean-m...@spaggiari.org wrote:

 Hi Anton,

 Here is what my host is looking like:
 127.0.0.1   localhost
 192.168.1.2myserver


 JM

 2013/3/5 anton ashanin anton.asha...@gmail.com:
  Morgan,
  Just did exactly as you suggested, my /etc/hosts:
  127.0.1.1 node.domain.local node
 
  Wiped out, annihilated my previous installation completely and
 reinstalled
  everything from scratch.
  The same problem with CLOUDERA MANAGER (FREE EDITION):
  Installation failed.  Failed to receive heartbeat from agent
  
 
  I will try now the the  bright idea from Jean, looks promising to me
 
 
 
  On Tue, Mar 5, 2013 at 10:10 PM, Morgan Reece winter2...@gmail.com
 wrote:
 
  Don't use 'localhost' as your host name.  For example, if you wanted
 to
  use the name 'node'; add another line to your hosts file like:
 
  127.0.1.1 node.domain.local node
 
  Then change all the host references in your configuration files to
 'node'
  -- also, don't forget to change the master/slave files as well.
 
  Now, if you decide to use an external address it would need to be
 static.
  This is easy to do, just follow this guide
  http://www.howtoforge.com/linux-basics-set-a-static-ip-on-ubuntu
  and replace '127.0.1.1' with whatever external address you decide on.
 
 
  On Tue, Mar 5, 2013 at 12:59 PM, Suresh Srinivas 
 sur...@hortonworks.com
  wrote:
 
  Can you please take this Cloudera mailing list?
 
 
  On Tue, Mar 5, 2013 at 10:33 AM, anton ashanin 
 anton.asha...@gmail.com
  wrote:
 
  I am trying to run all Hadoop servers on a single Ubuntu
 localhost. All
  ports are open and my /etc/hosts file is
 
  127.0.0.1   frigate frigate.domain.locallocalhost
  # The following lines are desirable for IPv6 capable hosts
  ::1 ip6-localhost ip6-loopback
  fe00::0 ip6-localnet
  ff00::0 ip6-mcastprefix
  ff02::1 ip6-allnodes
  ff02::2 ip6-allrouters
 
  When trying to install cluster Cloudera manager fails with the
 following
  messages:
 
  Installation failed. Failed to receive heartbeat from agent.
 
  I run my Ubuntu-12.04 host from home connected by WiFi/dialup
 modem to
  my provider. What configuration is missing?
 
  Thanks!
 
 
 
 
  --
  http://hortonworks.com/download/
 
 
 







Re: How to setup Cloudera Hadoop to run everything on a localhost?

2013-03-05 Thread anton ashanin
Does the problem of installing Hadoop on a single DHCP node exist for the Apache
distribution of Hadoop as well?


On Wed, Mar 6, 2013 at 2:30 AM, Suresh Srinivas sur...@hortonworks.comwrote:

 folks, another gentle reminder. Please use cloudera lists.


 On Tue, Mar 5, 2013 at 2:56 PM, anton ashanin anton.asha...@gmail.comwrote:

 Do you run all Hadoop servers on a single host that gets IP by DHCP?
 What do you have in /etc/hosts?

 Thanks!


 On Wed, Mar 6, 2013 at 1:25 AM, yibing Shi 
 yibing@effectivemeasure.com wrote:

 Hi Anton,

 Cloudera manager needs fully qualified domain name. Run hostname -f to
 check whether you have FQDN or not.

 I am not familiar with Ubuntu, but on my CentOS, I just put the FQDN
 into /etc/sysconfig/network, which then looks like the following:
 NETWORKING=yes
 HOSTNAME=myhost.my.domain
 GATEWAY=10.2.2.254


 http://demo.effectivemeasure.com/signatures/au/YibingShi.vcf



 On Wed, Mar 6, 2013 at 8:14 AM, anton ashanin 
 anton.asha...@gmail.comwrote:

 I am at a loss. I have set an IP address that my node got by DHCP:
  127.0.0.1   localhost
 192.168.1.6node

 This has not helped. Cloudera Manager finds this host all right, but
 still can not get a heartbeat from it next.
 Maybe the problem is that at the moment of these experiments I have
 three laptops with addresses assigned by DHCP all running at once?

 To make Hadoop work I am ready now to switch Ubuntu for CentOS or
 should I try something else?
 Please let me know on what Linux version you have managed to run Hadoop
 on a local host only?


 On Tue, Mar 5, 2013 at 10:54 PM, Jean-Marc Spaggiari 
 jean-m...@spaggiari.org wrote:

 Hi Anton,

 Here is what my host is looking like:
 127.0.0.1   localhost
 192.168.1.2myserver


 JM

 2013/3/5 anton ashanin anton.asha...@gmail.com:
  Morgan,
  Just did exactly as you suggested, my /etc/hosts:
  127.0.1.1 node.domain.local node
 
  Wiped out, annihilated my previous installation completely and
 reinstalled
  everything from scratch.
  The same problem with CLOUDERA MANAGER (FREE EDITION):
  Installation failed.  Failed to receive heartbeat from agent
  
 
  I will try now the the  bright idea from Jean, looks promising to me
 
 
 
  On Tue, Mar 5, 2013 at 10:10 PM, Morgan Reece winter2...@gmail.com
 wrote:
 
  Don't use 'localhost' as your host name.  For example, if you
 wanted to
  use the name 'node'; add another line to your hosts file like:
 
  127.0.1.1 node.domain.local node
 
  Then change all the host references in your configuration files to
 'node'
  -- also, don't forget to change the master/slave files as well.
 
  Now, if you decide to use an external address it would need to be
 static.
  This is easy to do, just follow this guide
  http://www.howtoforge.com/linux-basics-set-a-static-ip-on-ubuntu
  and replace '127.0.1.1' with whatever external address you decide
 on.
 
 
  On Tue, Mar 5, 2013 at 12:59 PM, Suresh Srinivas 
 sur...@hortonworks.com
  wrote:
 
  Can you please take this Cloudera mailing list?
 
 
  On Tue, Mar 5, 2013 at 10:33 AM, anton ashanin 
 anton.asha...@gmail.com
  wrote:
 
  I am trying to run all Hadoop servers on a single Ubuntu
 localhost. All
  ports are open and my /etc/hosts file is
 
  127.0.0.1   frigate frigate.domain.locallocalhost
  # The following lines are desirable for IPv6 capable hosts
  ::1 ip6-localhost ip6-loopback
  fe00::0 ip6-localnet
  ff00::0 ip6-mcastprefix
  ff02::1 ip6-allnodes
  ff02::2 ip6-allrouters
 
  When trying to install cluster Cloudera manager fails with the
 following
  messages:
 
  Installation failed. Failed to receive heartbeat from agent.
 
  I run my Ubuntu-12.04 host from home connected by WiFi/dialup
 modem to
  my provider. What configuration is missing?
 
  Thanks!
 
 
 
 
  --
  http://hortonworks.com/download/
 
 
 







 --
 http://hortonworks.com/download/



Re: basic question about rack awareness and computation migration

2013-03-05 Thread Julian Bui
Hi Rohit,

Thanks for responding.

 a task can be scheduled by hadoop to be executed on the same node which
is having data.

In my case, the mapper won't actually know where the data resides at the
time of being scheduled.  It only knows what data it will be accessing when
it reads in the keys.  In other words, the task will already be running
by the time the mapper figures out what data must be accessed - so how can
hadoop know where to execute the code?

I'm still lost.  Please help if you can.

-Julian

On Tue, Mar 5, 2013 at 11:15 AM, Rohit Kochar mnit.ro...@gmail.com wrote:

 Hello ,
 To be precise, this is hidden from the developer and you need not write any
 code for this.
 Whenever a file is stored in HDFS it is split into blocks of the configured
 size, and each block could potentially be stored on a different datanode. All
 this information about which file contains which blocks resides with the
 namenode.

 So essentially, whenever a file is accessed via the DFS client, the client
 requests the metadata from the NameNode, which it then uses to provide the
 file in streaming fashion to the end user.

 Since the namenode knows the location of all the blocks/files, a task can be
 scheduled by hadoop to be executed on the same node which is holding the data.
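
 As a rough illustration of the block metadata the namenode hands out (a
 sketch only; the path below is made up), the FileSystem API can report which
 hosts hold each block of a file:

 import java.util.Arrays;

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.BlockLocation;
 import org.apache.hadoop.fs.FileStatus;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;

 public class ShowBlockLocations {
   public static void main(String[] args) throws Exception {
     FileSystem fs = FileSystem.get(new Configuration());
     Path file = new Path("/data/images/sample.raw");  // hypothetical HDFS path
     FileStatus status = fs.getFileStatus(file);
     // Ask the namenode which datanodes hold each block of this file.
     BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
     for (BlockLocation b : blocks) {
       System.out.println("offset " + b.getOffset() + " length " + b.getLength()
           + " hosts " + Arrays.toString(b.getHosts()));
     }
   }
 }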

 Thanks
 Rohit Kochar

 On 05-Mar-2013, at 5:19 PM, Julian Bui wrote:

  Hi hadoop users,
 
  I'm trying to find out if computation migration is something the
 developer needs to worry about or if it's supposed to be hidden.
 
  I would like to use hadoop to take in a list of image paths in the hdfs
 and then have each task compress these large, raw images into something
 much smaller - say jpeg  files.
 
  Input: list of paths
  Output: compressed jpeg
 
  Since I don't really need a reduce task (I'm more using hadoop for its
 reliability and orchestration aspects), my mapper ought to just take the
 list of image paths and then work on them.  As I understand it, each image
 will likely be on multiple data nodes.
 
  My question is how will each mapper task migrate the computation to
 the data nodes?  I recall reading that the namenode is supposed to deal
 with this.  Is it hidden from the developer?  Or as the developer, do I
 need to discover where the data lies and then migrate the task to that
 node?  Since my input is just a list of paths, it seems like the namenode
 couldn't really do this for me.
 
  Another question: Where can I find out more about this?  I've looked up
 rack awareness and computation migration but haven't really found much
 code relating to either one - leading me to believe I'm not supposed to
 have to write code to deal with this.
 
  Anyway, could someone please help me out or set me straight on this?
 
  Thanks,
  -Julian




Re: basic question about rack awareness and computation migration

2013-03-05 Thread Harsh J
Your concern is correct: If your input is a list of files, rather than
the files themselves, then the tasks would not be data-local - since
the task input would just be the list of files, and the files' data
may reside on any node/rack of the cluster.

However, your job will still run as the HDFS reads do remote reads
transparently without developer intervention and all will still work
as you've written it to. If a block is found local to the DN, it is
read locally as well - all of this is automatic.

Are your input lists big (for each compressed output)? And is the list
arbitrary or a defined list per goal?
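
To make the transparent-read point concrete, here is a minimal sketch of a
map-only mapper for this kind of job (class and helper names are assumptions,
not Julian's actual code): each input value is an HDFS path, and the mapper
simply opens it through the FileSystem API, which reads locally when a replica
happens to be on the same node and remotely otherwise.

import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: assumes TextInputFormat, i.e. one HDFS path per input line.
public class CompressImageMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    Path src = new Path(value.toString().trim());
    FileSystem fs = src.getFileSystem(context.getConfiguration());
    InputStream in = fs.open(src);
    try {
      // Placeholder for the real JPEG conversion; the HDFS read itself is
      // transparent - local if a replica is on this node, remote otherwise.
      compressToJpeg(in, src, context);
    } finally {
      in.close();
    }
    context.write(new Text(src.getName()), NullWritable.get());
  }

  private void compressToJpeg(InputStream in, Path src, Context context) {
    // hypothetical helper - the actual image handling is up to the job author
  }
}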

On Tue, Mar 5, 2013 at 5:19 PM, Julian Bui julian...@gmail.com wrote:
 Hi hadoop users,

 I'm trying to find out if computation migration is something the developer
 needs to worry about or if it's supposed to be hidden.

 I would like to use hadoop to take in a list of image paths in the hdfs and
 then have each task compress these large, raw images into something much
 smaller - say jpeg  files.

 Input: list of paths
 Output: compressed jpeg

 Since I don't really need a reduce task (I'm more using hadoop for its
 reliability and orchestration aspects), my mapper ought to just take the
 list of image paths and then work on them.  As I understand it, each image
 will likely be on multiple data nodes.

 My question is how will each mapper task migrate the computation to the
 data nodes?  I recall reading that the namenode is supposed to deal with
 this.  Is it hidden from the developer?  Or as the developer, do I need to
 discover where the data lies and then migrate the task to that node?  Since
 my input is just a list of paths, it seems like the namenode couldn't really
 do this for me.

 Another question: Where can I find out more about this?  I've looked up
 rack awareness and computation migration but haven't really found much
 code relating to either one - leading me to believe I'm not supposed to have
 to write code to deal with this.

 Anyway, could someone please help me out or set me straight on this?

 Thanks,
 -Julian



--
Harsh J


Re: basic question about rack awareness and computation migration

2013-03-05 Thread Julian Bui
Thanks Harsh,

 Are your input lists big (for each compressed output)? And is the list
arbitrary or a defined list per goal?

I dictate what my inputs will look like.  If they need to be a list of image
files, then I can do that.  If they need to be the images themselves as you
suggest, then I can do that too but I'm not exactly sure what that would
look like.  Basically, I will try to format my inputs in the way that makes
the most sense from a locality point of view.

Since all the keys must be writable, I explored the Writable interface and
found the interesting sub-classes:

   - FileSplit
   - BlockLocation
   - BytesWritable

These all look somewhat promising as they kind of reveal the location
information of the files.

I'm not exactly sure how I would use these to hint at the data locations.
 Since these chunks of the file appear to be somewhat arbitrary in size and
offset, I don't know how I could perform imagery operations on them.  For
example, even if I knew that bytes 0x100-0x400 lie on node X, it is hard to
turn that information into something I can hand to my image libraries -
does 0x100-0x400 correspond to some region/MBR within the image?  I'm not
sure how to make use of this information.

The responses I've gotten so far indicate to me that HDFS kind of does the
computation migration for me but that I have to give it enough information
to work with.  If someone could point to some detailed reading about this
subject that would be pretty helpful, as I just can't find the
documentation for it.
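
One way this is usually wired up (a hedged sketch, not tested and not from
this thread): the location hints live in the InputSplits that an InputFormat
produces, and the scheduler reads InputSplit.getLocations() when placing
tasks. A one-file-per-split format could pass along the hosts of each file's
first block like so:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch: one unsplittable split per image file, carrying locality hints.
public abstract class OneFilePerSplitInputFormat<K, V> extends FileInputFormat<K, V> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // never split an image across tasks
  }

  @Override
  public List<InputSplit> getSplits(JobContext job) throws IOException {
    List<InputSplit> splits = new ArrayList<InputSplit>();
    for (FileStatus status : listStatus(job)) {
      FileSystem fs = status.getPath().getFileSystem(job.getConfiguration());
      BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
      String[] hosts = blocks.length > 0 ? blocks[0].getHosts() : new String[0];
      // The scheduler uses these host names to try to run the task near the data.
      splits.add(new FileSplit(status.getPath(), 0, status.getLen(), hosts));
    }
    return splits;
  }
}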

Thanks again,
-Julian

On Tue, Mar 5, 2013 at 5:39 PM, Harsh J ha...@cloudera.com wrote:

 Your concern is correct: If your input is a list of files, rather than
 the files themselves, then the tasks would not be data-local - since
 the task input would just be the list of files, and the files' data
 may reside on any node/rack of the cluster.

 However, your job will still run as the HDFS reads do remote reads
 transparently without developer intervention and all will still work
 as you've written it to. If a block is found local to the DN, it is
 read locally as well - all of this is automatic.

 Are your input lists big (for each compressed output)? And is the list
 arbitrary or a defined list per goal?

 On Tue, Mar 5, 2013 at 5:19 PM, Julian Bui julian...@gmail.com wrote:
  Hi hadoop users,
 
  I'm trying to find out if computation migration is something the
 developer
  needs to worry about or if it's supposed to be hidden.
 
  I would like to use hadoop to take in a list of image paths in the hdfs
 and
  then have each task compress these large, raw images into something much
  smaller - say jpeg  files.
 
  Input: list of paths
  Output: compressed jpeg
 
  Since I don't really need a reduce task (I'm more using hadoop for its
  reliability and orchestration aspects), my mapper ought to just take the
  list of image paths and then work on them.  As I understand it, each
 image
  will likely be on multiple data nodes.
 
  My question is how will each mapper task migrate the computation to the
  data nodes?  I recall reading that the namenode is supposed to deal with
  this.  Is it hidden from the developer?  Or as the developer, do I need
 to
  discover where the data lies and then migrate the task to that node?
  Since
  my input is just a list of paths, it seems like the namenode couldn't
 really
  do this for me.
 
  Another question: Where can I find out more about this?  I've looked up
  rack awareness and computation migration but haven't really found
 much
  code relating to either one - leading me to believe I'm not supposed to
 have
  to write code to deal with this.
 
  Anyway, could someone please help me out or set me straight on this?
 
  Thanks,
  -Julian



 --
 Harsh J



RE: Hadoop cluster setup - could not see second datanode

2013-03-05 Thread Brahma Reddy Battula
Although Hadoop is designed and developed for distributed computing, it can be 
run on a single node in pseudo-distributed mode, and even with multiple data nodes 
on a single machine. Developers often run multiple data nodes on a single node to 
develop and test distributed features, data node behaviour, name node interaction 
with data nodes, and for other reasons.

Please go through the following blog for the same:

http://www.blogger.com/blogger.g?blogID=2277703965936900657#editor/target=post;postID=8231904039775612388
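
If the goal is just to exercise multiple data nodes from test code, the
hadoop-hdfs test jar also ships a MiniDFSCluster utility that starts a
NameNode plus several DataNodes inside one JVM. A rough sketch (assuming the
Hadoop 2.x builder API and a test-scope dependency on the hdfs test jar):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class TwoDataNodeDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Start an in-process NameNode plus two DataNodes (test use only).
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).numDataNodes(2).build();
    try {
      FileSystem fs = cluster.getFileSystem();
      fs.mkdirs(new Path("/demo"));
      System.out.println("live datanodes: " + cluster.getDataNodes().size());
    } finally {
      cluster.shutdown();
    }
  }
}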

From: Robert Evans [ev...@yahoo-inc.com]
Sent: Tuesday, March 05, 2013 11:57 PM
To: user@hadoop.apache.org
Subject: Re: Hadoop cluster setup - could not see second datanode

Why would you need several data nodes?  It is simple to have one data node and 
one name node on the same machine.  I believe that you can make multiple data 
nodes run on the same machine, but it would take quite a bit of configuration 
work to do it, and it would only really be helpful for you to do some very 
specific testing involving multiple data nodes.

--Bobby

From: 卖报的小行家 85469...@qq.com
Reply-To: user@hadoop.apache.org
Date: Tuesday, March 5, 2013 8:41 AM
To: user user@hadoop.apache.org
Subject: Re:RE: Hadoop cluster setup - could not see second datanode

Hello,
Can a Namenode and several datanodes exist on one machine?
I only have one PC and I want to configure it this way.

BRs//Julian


-- Original --
From: AMARNATH, Balachandar balachandar.amarn...@airbus.com
Date: Tue, Mar 5, 2013 07:55 PM
To: user@hadoop.apache.org
Subject:  RE: Hadoop cluster setup - could not see second datanode

I fixed the below issue :)


Regards
Bala

From: AMARNATH, Balachandar [mailto:balachandar.amarn...@airbus.com]
Sent: 05 March 2013 17:05
To: user@hadoop.apache.org
Subject: Hadoop cluster setup - could not see second datanode

Thanks for the information,

Now I am trying to install hadoop dfs using 2 nodes. A namenode cum datanode, 
and a separate data node. I use the following configuration for my hdfs-site.xml

<configuration>

  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
  </property>

  <property>
    <name>dfs.data.dir</name>
    <value>/home/bala/data</value>
  </property>

  <property>
    <name>dfs.name.dir</name>
    <value>/home/bala/name</value>
  </property>
</configuration>
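
A hedged aside on the snippet above: fs.default.name normally lives in
core-site.xml, and for a two-node cluster it usually points at the NameNode's
hostname rather than localhost, otherwise the second DataNode ends up looking
for the NameNode on its own localhost. Something along these lines (hostname
assumed from the rest of this mail):

<!-- core-site.xml, assuming the namenode runs on machine1 -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://machine1:9000</value>
  </property>
</configuration>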


On the namenode, I have added the datanode hostnames (machine1 and machine2).
When I do ‘start-all.sh’, I see in the logs that the data node is starting on both 
machines, but when I go to the browser on the namenode, I see only one live 
node (that is, the namenode which is also configured as a datanode).

Any hint here will help me


With regards
Bala





From: Mahesh Balija [mailto:balijamahesh@gmail.com]
Sent: 05 March 2013 14:15
To: user@hadoop.apache.org
Subject: Re: Hadoop file system

You can use HDFS alone in distributed mode to fulfill your requirement.
HDFS has the FileSystem Java API through which you can interact with HDFS from 
your client.
HDFS is a good fit when you have a smaller number of files of large size rather 
than many small files.
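
As a minimal sketch of that FileSystem API (the namenode address and paths
here are assumptions, not your setup):

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Read and write HDFS directly, without MapReduce.
public class HdfsStandaloneDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode-host:9000");  // assumed namenode address
    FileSystem fs = FileSystem.get(conf);

    // Write a small file.
    Path p = new Path("/demo/hello.txt");
    FSDataOutputStream out = fs.create(p, true);
    out.write("hello hdfs\n".getBytes("UTF-8"));
    out.close();

    // Read it back.
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(p)));
    System.out.println(in.readLine());
    in.close();
  }
}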

Best,
Mahesh Balija,
Calsoft Labs.
On Tue, Mar 5, 2013 at 10:43 AM, AMARNATH, Balachandar 
balachandar.amarn...@airbus.com wrote:

Hi,

I am new to hdfs. In my java application, I need to perform ‘similar operation’ 
over large number of files. I would like to store those files in distributed 
machines. I don’t think, I will need map reduce paradigm. But however I would 
like to use HDFS for file storage and access. Is it possible (or nice idea) to 
use HDFS as a stand alone stuff? And, java APIs are available to work with HDFS 
so that I can read/write in distributed environment ? Any thoughts here will be 
helpful.


With thanks and regards
Balachandar





Map reduce technique

2013-03-05 Thread AMARNATH, Balachandar
Hi,

I am new to the map reduce paradigm. I read in a tutorial that the 'map' 
function splits the data into key value pairs. Does this mean the map-reduce 
framework automatically splits the data into pieces, or do we need to explicitly 
provide the method to split the data into pieces? If it does it automatically, how 
does it split an image file (by size etc.)? I see that processing an image file as a 
whole will give different results than processing it in chunks.



With thanks and regards
Balachandar







Re: FileStatus.getPath

2013-03-05 Thread Harsh J
The FileStatus is a container of metadata for a specific path, and
hence carries the Path object the rest of the details are for.

What exactly do you mean by "has no defined contract"? If you want a
qualified path (for a specific FS), then doing path.makeQualified(…)
is always the right way.
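
A tiny sketch of the difference (paths and the printed URI are hypothetical):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PathQualifyDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Qualify a bare path against this FileSystem's scheme, authority and
    // working directory.
    Path qualified = new Path("data/part-00000").makeQualified(fs);
    System.out.println(qualified);  // e.g. hdfs://namenode:9000/user/me/data/part-00000

    // A FileStatus already carries the path it describes (typically fully
    // qualified) together with the metadata: length, block size, mtime, ...
    FileStatus status = fs.getFileStatus(qualified);
    System.out.println(status.getPath() + " len=" + status.getLen());
  }
}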

On Tue, Mar 5, 2013 at 11:31 PM, Jay Vyas jayunit...@gmail.com wrote:
 Hi it appears that:

 http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileStatus.html

 getPath()

 has no defined contract.

 Why does FileStatus have a getPath method?  Would it have the equivalent
 effect to simply make a path qualified using the FileSystem object?

 i.e. path.makeQualified(FileSystem.get()) ?

 --
 Jay Vyas
 http://jayunit100.blogspot.com



--
Harsh J


Re: Map reduce technique

2013-03-05 Thread Sandy Ryza
Hi Balachandar,

In MapReduce, interpreting input files as key-value pairs is accomplished
through InputFormats.  Some common InputFormats are TextInputFormat, which
uses lines in a text file as values and their byte offsets into the file as
keys, KeyValueTextInputFormat, which interprets the first token on a line
as the key and the rest as the value, and WholeFileInputFormat, which uses
an entire file as a single value.  If you wanted to process an image file in a
specific way, you would probably need to supply your own InputFormat.
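
For reference, WholeFileInputFormat is not shipped with Hadoop itself; a
hedged sketch in the spirit of the example popularized by Hadoop: The
Definitive Guide (each file becomes exactly one NullWritable/BytesWritable
record) might look like this:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Sketch of a whole-file InputFormat: each file becomes exactly one record.
public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;  // never split an image across mappers
  }

  @Override
  public RecordReader<NullWritable, BytesWritable> createRecordReader(
      InputSplit split, TaskAttemptContext context) {
    return new RecordReader<NullWritable, BytesWritable>() {
      private FileSplit fileSplit;
      private Configuration conf;
      private BytesWritable value = new BytesWritable();
      private boolean processed = false;

      @Override
      public void initialize(InputSplit split, TaskAttemptContext context) {
        this.fileSplit = (FileSplit) split;
        this.conf = context.getConfiguration();
      }

      @Override
      public boolean nextKeyValue() throws IOException {
        if (processed) {
          return false;
        }
        // Read the whole file into the value on the first call.
        byte[] contents = new byte[(int) fileSplit.getLength()];
        Path file = fileSplit.getPath();
        FileSystem fs = file.getFileSystem(conf);
        FSDataInputStream in = fs.open(file);
        try {
          IOUtils.readFully(in, contents, 0, contents.length);
          value.set(contents, 0, contents.length);
        } finally {
          IOUtils.closeStream(in);
        }
        processed = true;
        return true;
      }

      @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
      @Override public BytesWritable getCurrentValue() { return value; }
      @Override public float getProgress() { return processed ? 1.0f : 0.0f; }
      @Override public void close() { }
    };
  }
}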

Does that help?

-Sandy

On Tue, Mar 5, 2013 at 9:37 PM, AMARNATH, Balachandar 
balachandar.amarn...@airbus.com wrote:

  Hi,

 I am new to map reduce paradigm. I read in a tutorial that says that ‘map’
 function splits the data and into key value pairs. This means, the
 map-reduce framework automatically splits the data into pieces or do we
 need to explicitly provide the method to split the data into pieces. If it
 does automatically, how it splits an image file (size etc)? I see,
 processing of an image file as a whole will give different results than
 processing them in chunks.



 With thanks and regards
 Balachandar








RE: Map reduce technique

2013-03-05 Thread Samir Kumar Das Mohapatra
I think you have to look at the sequence file as the input format.

Basically, the way this works is: you will have a separate Java process that 
takes several image files, reads the raw bytes into memory, and then stores the 
data as key-value pairs in a SequenceFile, writing into HDFS as it goes. This may 
take a while, but you'll only have to do it once.
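
A hedged sketch of that packing step (the local directory and HDFS path are
made up; the file name becomes the key and the raw image bytes the value):

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ImagesToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path out = new Path("/data/images.seq");

    SequenceFile.Writer writer =
        SequenceFile.createWriter(fs, conf, out, Text.class, BytesWritable.class);
    try {
      // Assumes the local directory exists and contains only image files.
      for (File image : new File("/home/hadoop/images").listFiles()) {
        byte[] bytes = new byte[(int) image.length()];
        InputStream in = new FileInputStream(image);
        try {
          IOUtils.readFully(in, bytes, 0, bytes.length);
        } finally {
          IOUtils.closeStream(in);
        }
        writer.append(new Text(image.getName()), new BytesWritable(bytes));
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}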

Regards,
Samir.

From: AMARNATH, Balachandar [mailto:balachandar.amarn...@airbus.com]
Sent: 06 March 2013 11:07
To: user@hadoop.apache.org
Subject: Map reduce technique

Hi,

I am new to map reduce paradigm. I read in a tutorial that says that 'map' 
function splits the data and into key value pairs. This means, the map-reduce 
framework automatically splits the data into pieces or do we need to explicitly 
provide the method to split the data into pieces. If it does automatically, how 
it splits an image file (size etc)? I see, processing of an image file as a 
whole will give different results than processing them in chunks.



With thanks and regards
Balachandar






RE: Map reduce technique

2013-03-05 Thread Samir Kumar Das Mohapatra
job.setInputFormatClass(SequenceFileInputFormat.class);

You just have to follow the Hadoop API from the Apache web-site.

Hints:

1) Create the sequence file prior to the job (Java code).

Example POC (you will have to change it based on your requirements):



import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// White, Tom (2012-05-10). Hadoop: The Definitive Guide (Kindle Locations
// 5375-5384). O'Reilly Media. Kindle Edition.

public class SequenceFileWriteDemo {

  private static final String[] DATA = {
    "One, two, buckle my shoe",
    "Three, four, shut the door",
    "Five, six, pick up sticks",
    "Seven, eight, lay them straight",
    "Nine, ten, a big fat hen"
  };

  public static void main(String[] args) throws IOException {
    // Local file path used as the SequenceFile destination in this demo.
    String uri = "/home/hadoop/Desktop/Image/test_02.txt";
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    Path path = new Path(uri);

    IntWritable key = new IntWritable();
    Text value = new Text();
    SequenceFile.Writer writer = null;
    try {
      writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
      for (int i = 0; i < 100; i++) {
        key.set(100 - i);
        value.set(DATA[i % DATA.length]);
        // System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
        writer.append(key, value);
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}



Note: you have to convert all image files into one sequence file.

2) Put it into HDFS.

3) Write the Map/Reduce based on the logic you need.
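
A hedged sketch of step 3, assuming the Hadoop 2.x mapreduce API and the
IntWritable/Text key-value types written by the demo above (the per-record
logic is a placeholder):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class SequenceFileJob {

  public static class EchoMapper extends Mapper<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void map(IntWritable key, Text value, Context context)
        throws java.io.IOException, InterruptedException {
      // Replace with the real per-record processing.
      context.write(key, value);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "sequence-file-demo");
    job.setJarByClass(SequenceFileJob.class);
    job.setMapperClass(EchoMapper.class);
    job.setNumReduceTasks(0);                              // map-only job
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}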



From: AMARNATH, Balachandar [mailto:balachandar.amarn...@airbus.com]
Sent: 06 March 2013 11:24
To: user@hadoop.apache.org
Subject: RE: Map reduce technique

Thanks for the mail,

Can you please share a few links to start with?


Regards
Bala

From: Samir Kumar Das Mohapatra [mailto:dasmo...@adobe.com]
Sent: 06 March 2013 11:21
To: user@hadoop.apache.org
Subject: RE: Map reduce technique

I think  you have to look the sequence file  as input format .

Basically, the way this works is, you will have a separate Java process that 
takes several image files, reads the ray bytes into memory, then stores the 
data into a key-value pair in a SequenceFile. Keep going and keep writing into 
HDFS. This may take a while, but you'll only have to do it once.

Regards,
Samir.

From: AMARNATH, Balachandar [mailto:balachandar.amarn...@airbus.com]
Sent: 06 March 2013 11:07
To: user@hadoop.apache.org
Subject: Map reduce technique

Hi,

I am new to map reduce paradigm. I read in a tutorial that says that 'map' 
function splits the data and into key value pairs. This means, the map-reduce 
framework automatically splits the data into pieces or do we need to explicitly 
provide the method to split the data into pieces. If it does automatically, how 
it splits an image file (size etc)? I see, processing of an image file as a 
whole will give different results than processing them in chunks.



With thanks and regards
Balachandar





The information in this e-mail is confidential. The contents may not be 
disclosed or used by anyone other than the addressee. Access to this e-mail by 
anyone else is unauthorised.

If you are not the intended recipient, please notify Airbus immediately and 
delete this e-mail.

Airbus cannot accept any responsibility for the accuracy or completeness of 
this e-mail as it has been sent over