Socket does not have a channel
Hi,

java.lang.IllegalStateException: Socket Socket[addr=/10.86.203.112,port=1004,localport=35170] does not have a channel
    at com.google.common.base.Preconditions.checkState(Preconditions.java:172)
    at org.apache.hadoop.net.SocketInputWrapper.getReadableByteChannel(SocketInputWrapper.java:83)
    at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:432)
    at org.apache.hadoop.hdfs.BlockReaderFactory.newBlockReader(BlockReaderFactory.java:82)
    at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:832)
    at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:444)

While accessing HDFS I keep getting the above error. Setting dfs.client.use.legacy.blockreader to true fixes the problem. I would like to know what exactly the problem is. Is it a problem/bug in Hadoop? Is there a JIRA ticket for this? Cheers, Subroto Sanyal
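For illustration, a minimal client-side sketch of the workaround mentioned above. The property name comes from the mail itself; the surrounding setup and the file path are assumptions, not Subroto's actual code.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LegacyBlockReaderClient {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Fall back to the pre-2.0 block reader, as described in the mail
        conf.setBoolean("dfs.client.use.legacy.blockreader", true);
        FileSystem fs = FileSystem.get(conf);
        try (FSDataInputStream in = fs.open(new Path("/some/hdfs/file"))) { // hypothetical path
            in.read();
        }
    }
}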
Re: Need help optimizing reducer
The reason the reducer is fast up to 66% is because of the sort and shuffle phases of the reduce, when the actual task has NOT yet started. The reduce side is divided into 3 phases of ~33% each - shuffle (fetch data), sort, and finally user code (reduce). That is why your reduce might be faster up to 66%. To speed up your program you can either use more reducers or make your reducer code as optimized as possible. Best, Mahesh Balija, Calsoft Labs. On Tue, Mar 5, 2013 at 1:27 AM, Austin Chungath austi...@gmail.com wrote: Hi all, I have 1 reducer and around 600 thousand unique keys coming to it. The total data is only around 30 MB. My logic doesn't allow me to have more than 1 reducer. It's taking too long to complete, around 2 hours. (Till 66% it's fast, then it slows down; I don't really think it has started doing anything till 66%, but then why does it show like that?) Are there any job execution parameters that can help improve reducer performance? Any suggestions to improve things when we have to live with just one reducer? Thanks, Austin
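A minimal sketch of the first suggestion (raising the reducer count), assuming the new org.apache.hadoop.mapreduce API; the job name and the count of 8 are placeholders, not values from Austin's job.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ReducerCountSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "example-job"); // placeholder job name
        job.setNumReduceTasks(8);               // default is 1; pick a value that suits the cluster
        // ... mapper, reducer, input and output setup omitted ...
    }
}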
Re: Need help optimizing reducer
Hi Austin, I am not sure whether this applies to your job, but in any case: you might be re-reading the whole set of input values for a key (the mapper output values handed to the reduce function) from beginning to end while merging them into one output. If you can send your reducer code, you may get more useful replies. On Tue, Mar 5, 2013 at 1:00 PM, Mahesh Balija balijamahesh@gmail.com wrote: The reason the reducer is fast up to 66% is because of the sort and shuffle phases of the reduce, when the actual task has NOT yet started. The reduce side is divided into 3 phases of ~33% each - shuffle (fetch data), sort, and finally user code (reduce). That is why your reduce might be faster up to 66%. To speed up your program you can either use more reducers or make your reducer code as optimized as possible. Best, Mahesh Balija, Calsoft Labs. On Tue, Mar 5, 2013 at 1:27 AM, Austin Chungath austi...@gmail.com wrote: Hi all, I have 1 reducer and around 600 thousand unique keys coming to it. The total data is only around 30 MB. My logic doesn't allow me to have more than 1 reducer. It's taking too long to complete, around 2 hours. (Till 66% it's fast, then it slows down; I don't really think it has started doing anything till 66%, but then why does it show like that?) Are there any job execution parameters that can help improve reducer performance? Any suggestions to improve things when we have to live with just one reducer? Thanks, Austin
Re: Need help optimizing reducer
I mean that while adding each newly arriving reducer input value to the already merged values, in order to construct the whole set of input values for a given key, you might be reading every input value (the mapper output values) from beginning to end each time. On Tue, Mar 5, 2013 at 1:46 PM, Fatih Haltas fatih.hal...@nyu.edu wrote: Hi Austin, I am not sure whether this applies to your job, but in any case: you might be re-reading the whole set of input values for a key (the mapper output values handed to the reduce function) from beginning to end while merging them into one output. If you can send your reducer code, you may get more useful replies.
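To make the point concrete, a sketch (not Austin's actual code) of a reducer that merges the values for a key in a single pass over the Iterable, instead of buffering or re-reading them; the Text key/value types and the comma delimiter are assumptions.

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SinglePassMergeReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        StringBuilder merged = new StringBuilder();
        for (Text v : values) {            // the Iterable can only be walked once anyway
            if (merged.length() > 0) merged.append(',');
            merged.append(v.toString());   // O(n) total work per key, no re-reading
        }
        context.write(key, new Text(merged.toString()));
    }
}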
Hadoop cluster setup - could not see second datanode
Thanks for the information. Now I am trying to install Hadoop DFS using 2 nodes: a namenode cum datanode, and a separate datanode. I use the following configuration for my hdfs-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/bala/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/bala/name</value>
  </property>
</configuration>

In the namenode, I have added the datanode hostnames (machine1 and machine2). When I do 'start-all.sh', I see in the logs that the datanode is starting on both machines, but when I go to the browser on the namenode, I see only one live node (the namenode which is also configured as a datanode). Any hint here will help me. With regards, Bala

From: Mahesh Balija [mailto:balijamahesh@gmail.com] Sent: 05 March 2013 14:15 To: user@hadoop.apache.org Subject: Re: Hadoop file system

You can use HDFS alone in distributed mode to fulfill your requirement. HDFS has the FileSystem Java API through which you can interact with HDFS from your client. HDFS is good if you have a small number of files with huge sizes rather than many files with small sizes. Best, Mahesh Balija, Calsoft Labs.

On Tue, Mar 5, 2013 at 10:43 AM, AMARNATH, Balachandar balachandar.amarn...@airbus.com wrote: Hi, I am new to HDFS. In my Java application, I need to perform a 'similar operation' over a large number of files. I would like to store those files on distributed machines. I don't think I will need the map reduce paradigm, but I would like to use HDFS for file storage and access. Is it possible (or a nice idea) to use HDFS as a stand-alone store? And are Java APIs available to work with HDFS so that I can read/write in a distributed environment? Any thoughts here will be helpful. With thanks and regards, Balachandar
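Since the quoted question asks about the Java API, here is a small sketch of reading and writing a file through the FileSystem API that Mahesh refers to; the namenode address, path and contents are made-up examples, not Bala's actual setup.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.default.name", "hdfs://namenode-host:9000"); // assumption: your namenode address
        FileSystem fs = FileSystem.get(conf);

        Path p = new Path("/user/bala/example.txt"); // hypothetical path
        try (FSDataOutputStream out = fs.create(p)) {   // write
            out.writeUTF("hello hdfs");
        }
        try (FSDataInputStream in = fs.open(p)) {       // read back
            System.out.println(in.readUTF());
        }
    }
}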
basic question about rack awareness and computation migration
Hi hadoop users, I'm trying to find out if computation migration is something the developer needs to worry about or if it's supposed to be hidden. I would like to use hadoop to take in a list of image paths in the hdfs and then have each task compress these large, raw images into something much smaller - say jpeg files. Input: list of paths Output: compressed jpeg Since I don't really need a reduce task (I'm more using hadoop for its reliability and orchestration aspects), my mapper ought to just take the list of image paths and then work on them. As I understand it, each image will likely be on multiple data nodes. My question is how will each mapper task migrate the computation to the data nodes? I recall reading that the namenode is supposed to deal with this. Is it hidden from the developer? Or as the developer, do I need to discover where the data lies and then migrate the task to that node? Since my input is just a list of paths, it seems like the namenode couldn't really do this for me. Another question: Where can I find out more about this? I've looked up rack awareness and computation migration but haven't really found much code relating to either one - leading me to believe I'm not supposed to have to write code to deal with this. Anyway, could someone please help me out or set me straight on this? Thanks, -Julian
RE: Hadoop cluster setup - could not see second datanode
I fixed the issue below :) Regards, Bala

From: AMARNATH, Balachandar [mailto:balachandar.amarn...@airbus.com] Sent: 05 March 2013 17:05 To: user@hadoop.apache.org Subject: Hadoop cluster setup - could not see second datanode

Thanks for the information. Now I am trying to install Hadoop DFS using 2 nodes: a namenode cum datanode, and a separate datanode. I use the following configuration for my hdfs-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/bala/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/bala/name</value>
  </property>
</configuration>

In the namenode, I have added the datanode hostnames (machine1 and machine2). When I do 'start-all.sh', I see in the logs that the datanode is starting on both machines, but when I go to the browser on the namenode, I see only one live node (the namenode which is also configured as a datanode). Any hint here will help me. With regards, Bala
JobTracker client - max connections
Hi all, I'm implementing an API over the JobTracker client - JobClient. My plan is to have a pool of JobClient objects that will expose the ability to submit jobs, poll status, etc. My questions are: Should I set a maximum pool size? How many connections are too many connections for the JobTracker? Any suggestions for what pool to use? Thanks, Amit.
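For illustration only, one shape such a pool could take: a fixed-size blocking queue of JobClient instances (the size is an arbitrary assumption, not a recommendation about how many JobTracker connections are safe).

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class JobClientPool {
    private final BlockingQueue<JobClient> pool;

    public JobClientPool(int size, JobConf conf) throws Exception {
        pool = new ArrayBlockingQueue<JobClient>(size);
        for (int i = 0; i < size; i++) {
            pool.put(new JobClient(conf)); // each instance holds its own connection to the JobTracker
        }
    }

    // Callers borrow a client to submit a job or poll status, then return it.
    public JobClient borrow() throws InterruptedException { return pool.take(); }
    public void release(JobClient client) throws InterruptedException { pool.put(client); }
}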
S3N copy creating recursive folders
Hi, I am using Hadoop 1.0.3 and trying to execute: hadoop fs -cp s3n://acessKey:acesssec...@bucket.name/srcData /test/srcData

This ends up with: cp: java.io.IOException: mkdirs: Pathname too long. Limit 8000 characters, 1000 levels.

When I list the folder /test/srcData recursively, it lists 998 folders like:
drwxr-xr-x - root supergroup 0 2013-03-05 08:49 /test/srcData/srcData
drwxr-xr-x - root supergroup 0 2013-03-05 08:49 /test/srcData/srcData/srcData
drwxr-xr-x - root supergroup 0 2013-03-05 08:49 /test/srcData/srcData/srcData/srcData
drwxr-xr-x - root supergroup 0 2013-03-05 08:49 /test/srcData/srcData/srcData/srcData/srcData
drwxr-xr-x - root supergroup 0 2013-03-05 08:49 /test/srcData/srcData/srcData/srcData/srcData/srcData

Is there a problem with the s3n filesystem? Cheers, Subroto Sanyal
Re:S3N copy creating recursive folders
Hi Subroto, I haven't used the s3n filesystem, but from the output "cp: java.io.IOException: mkdirs: Pathname too long. Limit 8000 characters, 1000 levels." I think this is a problem with the path. Is the path longer than 8000 characters, or is the level more than 1000? You only have 998 folders; maybe the last one is more than 8000 characters. Why not count the last one's length? BRs//Julian ------------------ Original ------------------ From: Subroto ssan...@datameer.com; Date: Tue, Mar 5, 2013 10:22 PM To: user user@hadoop.apache.org; Subject: S3N copy creating recursive folders Hi, I am using Hadoop 1.0.3 and trying to execute: hadoop fs -cp s3n://acessKey:acesssec...@bucket.name/srcData /test/srcData This ends up with: cp: java.io.IOException: mkdirs: Pathname too long. Limit 8000 characters, 1000 levels. When I list the folder /test/srcData recursively, it lists 998 folders like: drwxr-xr-x - root supergroup 0 2013-03-05 08:49 /test/srcData/srcData drwxr-xr-x - root supergroup 0 2013-03-05 08:49 /test/srcData/srcData/srcData drwxr-xr-x - root supergroup 0 2013-03-05 08:49 /test/srcData/srcData/srcData/srcData drwxr-xr-x - root supergroup 0 2013-03-05 08:49 /test/srcData/srcData/srcData/srcData/srcData drwxr-xr-x - root supergroup 0 2013-03-05 08:49 /test/srcData/srcData/srcData/srcData/srcData/srcData Is there a problem with the s3n filesystem? Cheers, Subroto Sanyal
Re: S3N copy creating recursive folders
Hi, It's not because there are too many recursive folders in the S3 bucket; in fact there is no recursive folder in the source. If I list the S3 bucket with native S3 tools I can find a file srcData with size 0 inside the folder srcData. The copy command keeps creating the folder /test/srcData/srcData/srcData (it keeps appending srcData). Cheers, Subroto Sanyal On Mar 5, 2013, at 3:32 PM, 卖报的小行家 wrote: Hi Subroto, I haven't used the s3n filesystem, but from the output "cp: java.io.IOException: mkdirs: Pathname too long. Limit 8000 characters, 1000 levels." I think this is a problem with the path. Is the path longer than 8000 characters, or is the level more than 1000? You only have 998 folders; maybe the last one is more than 8000 characters. Why not count the last one's length? BRs//Julian
Re:RE: Hadoop cluster setup - could not see second datanode
Hello, Can a Namenode and several datanodes exist on one machine? I only have one PC and I want to configure it this way. BRs//Julian

------------------ Original ------------------ From: AMARNATH, Balachandar balachandar.amarn...@airbus.com; Date: Tue, Mar 5, 2013 07:55 PM To: user@hadoop.apache.org; Subject: RE: Hadoop cluster setup - could not see second datanode

I fixed the issue below :) Regards, Bala

From: AMARNATH, Balachandar [mailto:balachandar.amarn...@airbus.com] Sent: 05 March 2013 17:05 To: user@hadoop.apache.org Subject: Hadoop cluster setup - could not see second datanode

Thanks for the information. Now I am trying to install Hadoop DFS using 2 nodes: a namenode cum datanode, and a separate datanode. I use the following configuration for my hdfs-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/bala/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/bala/name</value>
  </property>
</configuration>

In the namenode, I have added the datanode hostnames (machine1 and machine2). When I do 'start-all.sh', I see in the logs that the datanode is starting on both machines, but when I go to the browser on the namenode, I see only one live node (the namenode which is also configured as a datanode). Any hint here will help me. With regards, Bala
Re:Socket does not have a channel
Hi, Which version of Hadoop is this, and in what situation is the exception reported? BRs//Julian ------------------ Original ------------------ From: Subroto ssan...@datameer.com; Date: Tue, Mar 5, 2013 04:46 PM To: user user@hadoop.apache.org; Subject: Socket does not have a channel Hi java.lang.IllegalStateException: Socket Socket[addr=/10.86.203.112,port=1004,localport=35170] does not have a channel at com.google.common.base.Preconditions.checkState(Preconditions.java:172) at org.apache.hadoop.net.SocketInputWrapper.getReadableByteChannel(SocketInputWrapper.java:83) at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:432) at org.apache.hadoop.hdfs.BlockReaderFactory.newBlockReader(BlockReaderFactory.java:82) at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:832) at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:444) While accessing HDFS I keep getting the above error. Setting dfs.client.use.legacy.blockreader to true fixes the problem. I would like to know what exactly the problem is. Is it a problem/bug in Hadoop? Is there a JIRA ticket for this? Cheers, Subroto Sanyal
Re: Socket does not have a channel
Hi Julian, This is from CDH4.1.2 and I think it's based on Apache Hadoop 2.0. Cheers, Subroto Sanyal On Mar 5, 2013, at 3:50 PM, 卖报的小行家 wrote: Hi, Which version of Hadoop is this, and in what situation is the exception reported? BRs//Julian
回复: Socket does not have a channel
Yes, it's from Hadoop 2.0. I just read the 1.1.1 code; there are no such classes as the ones the log mentions. Maybe you can read the code first. ------------------ Original ------------------ From: Subroto ssan...@datameer.com; Date: Tue, Mar 5, 2013 10:56 PM To: user user@hadoop.apache.org; Subject: Re: Socket does not have a channel Hi Julian, This is from CDH4.1.2 and I think it's based on Apache Hadoop 2.0. Cheers, Subroto Sanyal On Mar 5, 2013, at 3:50 PM, 卖报的小行家 wrote: Hi, Which version of Hadoop is this, and in what situation is the exception reported? BRs//Julian
Transpose
Hi, I have data in a file as follows. There are 3 columns separated by a semicolon (;). Each column has multiple values separated by a comma (,).

11,22,33;144,244,344;y,n,y;

I need the output data in the format below. It is like transposing the values of each column.

11 144 y
22 244 n
33 344 y

Can we write a MapReduce program to achieve this? Could you help with how to write the code? Thanks
Re: Transpose
Yes you can. You read in the row in each iteration of Mapper.map() as Text input. You then output 3 times to the collector, one for each row of the matrix. Spin, sort, and reduce as needed. Sent from a remote device. Please excuse any typos... Mike Segel On Mar 5, 2013, at 9:11 AM, Mix Nin pig.mi...@gmail.com wrote: Hi, I have data in a file as follows. There are 3 columns separated by a semicolon (;). Each column has multiple values separated by a comma (,). 11,22,33;144,244,344;y,n,y; I need the output data in the format below. It is like transposing the values of each column. 11 144 y 22 244 n 33 344 y Can we write a MapReduce program to achieve this? Could you help with how to write the code? Thanks
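A rough sketch of that approach (not the original poster's code), assuming one semicolon-delimited record per input line and comma-separated values within each column; it orders the columns inside the reducer with a TreeMap rather than with a secondary sort.

import java.io.IOException;
import java.util.TreeMap;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class TransposeExample {
    // Emits (output-row-index, "colIndex:value") for every value in the record.
    public static class TransposeMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            String[] columns = value.toString().split(";");
            for (int col = 0; col < columns.length; col++) {
                String[] cells = columns[col].split(",");
                for (int row = 0; row < cells.length; row++) {
                    ctx.write(new IntWritable(row), new Text(col + ":" + cells[row]));
                }
            }
        }
    }

    // Re-orders each output row by column index and joins the cells with spaces.
    public static class TransposeReducer extends Reducer<IntWritable, Text, NullWritable, Text> {
        @Override
        protected void reduce(IntWritable row, Iterable<Text> vals, Context ctx)
                throws IOException, InterruptedException {
            TreeMap<Integer, String> byColumn = new TreeMap<Integer, String>();
            for (Text v : vals) {
                String[] parts = v.toString().split(":", 2);
                byColumn.put(Integer.valueOf(parts[0]), parts[1]);
            }
            StringBuilder line = new StringBuilder();
            for (String cell : byColumn.values()) {
                if (line.length() > 0) line.append(' ');
                line.append(cell);
            }
            ctx.write(NullWritable.get(), new Text(line.toString()));
        }
    }
}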
Re: Hadoop cluster setup - could not see second datanode
Why would you need several datanodes? It is simple to have one datanode and one namenode on the same machine. I believe that you can make multiple datanodes run on the same machine, but it would take quite a bit of configuration work, and it would only really be helpful for some very specific testing involving multiple datanodes. --Bobby

From: 卖报的小行家 85469...@qq.com Reply-To: user@hadoop.apache.org Date: Tuesday, March 5, 2013 8:41 AM To: user user@hadoop.apache.org Subject: Re: RE: Hadoop cluster setup - could not see second datanode

Hello, Can a Namenode and several datanodes exist on one machine? I only have one PC and I want to configure it this way. BRs//Julian
Re: Transpose
Hi, Essentially what you want to do is group your data points by their position in the column, and have each reduce call assemble the data for one row. To have each record that the mapper processes be one of the columns, you can use TextInputFormat with conf.set("textinputformat.record.delimiter", ";"). Your mapper will receive keys as LongWritables specifying the byte index into the input file, and Text as values. The mapper will tokenize the input string. Emitting a map output for each data point in each column, you can then use secondary sort to send the data to the right place in the right order (see http://vangjee.wordpress.com/2012/03/20/secondary-sorting-aka-sorting-values-in-hadoops-mapreduce-programming-paradigm/). Your composite key would look like (index of the data point within the column, which is the row index; the LongWritable passed in as the map input key). Each reduce call would get all the points in a single row. You would sort/group by row index, and within a reduce call's values, sort by byte index so that entries from earlier columns come before later ones. Does that make sense? Sandy On Tue, Mar 5, 2013 at 7:11 AM, Mix Nin pig.mi...@gmail.com wrote: Hi, I have data in a file as follows. There are 3 columns separated by a semicolon (;). Each column has multiple values separated by a comma (,). 11,22,33;144,244,344;y,n,y; I need the output data in the format below. It is like transposing the values of each column. 11 144 y 22 244 n 33 344 y Can we write a MapReduce program to achieve this? Could you help with how to write the code? Thanks
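For reference, just the configuration step Sandy mentions, as a small sketch; the key name is the one given above, while the job name and the rest of the job setup are assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class SemicolonRecords {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("textinputformat.record.delimiter", ";"); // each semicolon-terminated column becomes one record
        Job job = new Job(conf, "transpose");              // placeholder job name
        job.setInputFormatClass(TextInputFormat.class);
        // ... mapper, secondary-sort comparators and reducer omitted ...
    }
}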
Re: 回复: Socket does not have a channel
Try setting dfs.client.use.legacy.blockreader to true ∞ Shashwat Shriparv On Tue, Mar 5, 2013 at 8:39 PM, 卖报的小行家 85469...@qq.com wrote: Yes.It's from hadoop 2.0. I just now read the code 1.1.1.There are no such classes the log mentioned.Maybe you can read the code first. -- 原始邮件 -- *发件人:* Subrotossan...@datameer.com; *发送时间:* 2013年3月5日(星期二) 晚上10:56 *收件人:* useruser@hadoop.apache.org; ** *主题:* Re: Socket does not have a channel Hi Julian, This is from CDH4.1.2 and I think its based on Apache Hadoop 2.0. Cheers, Subroto Sanyal On Mar 5, 2013, at 3:50 PM, 卖报的小行家 wrote: Hi, Which revision of hadoop? and what's the situation to report the Exception? BRs//Julian -- Original -- *From: * Subrotossan...@datameer.com; *Date: * Tue, Mar 5, 2013 04:46 PM *To: * useruser@hadoop.apache.org; ** *Subject: * Socket does not have a channel Hi java.lang.IllegalStateException: Socket Socket[addr=/10.86.203.112,port=1004,localport=35170] does not have a channel at com.google.common.base.Preconditions.checkState(Preconditions.java:172) at org.apache.hadoop.net.SocketInputWrapper.getReadableByteChannel(SocketInputWrapper.java:83) at org.apache.hadoop.hdfs.RemoteBlockReader2.newBlockReader(RemoteBlockReader2.java:432) at org.apache.hadoop.hdfs.BlockReaderFactory.newBlockReader(BlockReaderFactory.java:82) at org.apache.hadoop.hdfs.DFSInputStream.getBlockReader(DFSInputStream.java:832) at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:444) While accessing the HDFS I keep getting the above mentioned error. Setting the dfs.client.use.legacy.blockreader to true fixes the problem. I would like to know what exactly is the problem? Is it a problem/bug in hadoop ? Is there is JIRA ticket for this?? Cheers, Subroto Sanyal
How to setup Cloudera Hadoop to run everything on a localhost?
I am trying to run all Hadoop servers on a single Ubuntu localhost. All ports are open and my /etc/hosts file is:

127.0.0.1 frigate frigate.domain.local localhost
# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

When trying to install the cluster, Cloudera Manager fails with the following message: "Installation failed. Failed to receive heartbeat from agent." I run my Ubuntu-12.04 host from home, connected by a WiFi/dialup modem to my provider. What configuration is missing? Thanks!
Re: How to setup Cloudera Hadoop to run everything on a localhost?
Hi Anton, Can you try to add something like: your.local.ip.address yourhostname to your hosts file? Like: 192.168.1.2 masterserver 2013/3/5 anton ashanin anton.asha...@gmail.com: I am trying to run all Hadoop servers on a single Ubuntu localhost. All ports are open and my /etc/hosts file is: 127.0.0.1 frigate frigate.domain.local localhost # The following lines are desirable for IPv6 capable hosts ::1 ip6-localhost ip6-loopback fe00::0 ip6-localnet ff00::0 ip6-mcastprefix ff02::1 ip6-allnodes ff02::2 ip6-allrouters When trying to install the cluster, Cloudera Manager fails with the following message: "Installation failed. Failed to receive heartbeat from agent." I run my Ubuntu-12.04 host from home, connected by a WiFi/dialup modem to my provider. What configuration is missing? Thanks!
Re: How to setup Cloudera Hadoop to run everything on a localhost?
Jean, thanks for trying to help. I get my IP address by DHCP. Every time I start my Ubuntu I possibly can get a different IP address from my WiFi modem /router. Will it be ok to add static address from 192.168.*.* to /etc/hosts in this case? On Tue, Mar 5, 2013 at 9:47 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Hi Anton, Can you try to add something like: your.local.ip.addressyourhostname into your hosts file? Like: 192.168.1.2 masterserver 2013/3/5 anton ashanin anton.asha...@gmail.com: I am trying to run all Hadoop servers on a single Ubuntu localhost. All ports are open and my /etc/hosts file is 127.0.0.1 frigate frigate.domain.locallocalhost # The following lines are desirable for IPv6 capable hosts ::1 ip6-localhost ip6-loopback fe00::0 ip6-localnet ff00::0 ip6-mcastprefix ff02::1 ip6-allnodes ff02::2 ip6-allrouters When trying to install cluster Cloudera manager fails with the following messages: Installation failed. Failed to receive heartbeat from agent. I run my Ubuntu-12.04 host from home connected by WiFi/dialup modem to my provider. What configuration is missing? Thanks!
Re: How to setup Cloudera Hadoop to run everything on a localhost?
Can you please take this to the Cloudera mailing list? On Tue, Mar 5, 2013 at 10:33 AM, anton ashanin anton.asha...@gmail.com wrote: I am trying to run all Hadoop servers on a single Ubuntu localhost. All ports are open and my /etc/hosts file is: 127.0.0.1 frigate frigate.domain.local localhost # The following lines are desirable for IPv6 capable hosts ::1 ip6-localhost ip6-loopback fe00::0 ip6-localnet ff00::0 ip6-mcastprefix ff02::1 ip6-allnodes ff02::2 ip6-allrouters When trying to install the cluster, Cloudera Manager fails with the following message: "Installation failed. Failed to receive heartbeat from agent." I run my Ubuntu-12.04 host from home, connected by a WiFi/dialup modem to my provider. What configuration is missing? Thanks! -- http://hortonworks.com/download/
Re: 回复: Socket does not have a channel
Hi Shashwat, As already mentioned in my mail, setting dfs.client.use.legacy.blockreader to true fixes the problem. That looks like a workaround, or rather disabling a feature. I would like to know what the exact problem is. Cheers, Subroto Sanyal On Mar 5, 2013, at 6:33 PM, shashwat shriparv wrote: Try setting dfs.client.use.legacy.blockreader to true ∞ Shashwat Shriparv
Re: How to setup Cloudera Hadoop to run everything on a localhost?
Moving to cdh-user, user@hadoop in BCC. Anton, can you just try with the IP you have and see if it fixes the issue, before trying anything else? JM 2013/3/5 anton ashanin anton.asha...@gmail.com: Jean, thanks for trying to help. I get my IP address by DHCP. Every time I start my Ubuntu I could get a different IP address from my WiFi modem/router. Will it be ok to add a static address from 192.168.*.* to /etc/hosts in this case?
Re: How to setup Cloudera Hadoop to run everything on a localhost?
Don't use 'localhost' as your host name. For example, if you wanted to use the name 'node'; add another line to your hosts file like: 127.0.1.1 node.domain.local node Then change all the host references in your configuration files to 'node' -- also, don't forget to change the master/slave files as well. Now, if you decide to use an external address it would need to be static. This is easy to do, just follow this guide http://www.howtoforge.com/linux-basics-set-a-static-ip-on-ubuntu and replace '127.0.1.1' with whatever external address you decide on. On Tue, Mar 5, 2013 at 12:59 PM, Suresh Srinivas sur...@hortonworks.comwrote: Can you please take this Cloudera mailing list? On Tue, Mar 5, 2013 at 10:33 AM, anton ashanin anton.asha...@gmail.comwrote: I am trying to run all Hadoop servers on a single Ubuntu localhost. All ports are open and my /etc/hosts file is 127.0.0.1 frigate frigate.domain.locallocalhost # The following lines are desirable for IPv6 capable hosts ::1 ip6-localhost ip6-loopback fe00::0 ip6-localnet ff00::0 ip6-mcastprefix ff02::1 ip6-allnodes ff02::2 ip6-allrouters When trying to install cluster Cloudera manager fails with the following messages: Installation failed. Failed to receive heartbeat from agent. I run my Ubuntu-12.04 host from home connected by WiFi/dialup modem to my provider. What configuration is missing? Thanks! -- http://hortonworks.com/download/
Re: basic question about rack awareness and computation migration
Hello, To be precise, this is hidden from the developer and you need not write any code for it. Whenever a file is stored in HDFS, it is split into blocks of the configured size and each block could potentially be stored on a different datanode. All the information about which blocks make up which file resides with the namenode. So whenever a file is accessed via the DFS client, the client requests the metadata from the NameNode and uses it to provide the file to the end user in a streaming fashion. Since the namenode knows the location of all the blocks/files, a task can be scheduled by Hadoop to execute on the same node that holds the data. Thanks, Rohit Kochar On 05-Mar-2013, at 5:19 PM, Julian Bui wrote: Hi hadoop users, I'm trying to find out if computation migration is something the developer needs to worry about or if it's supposed to be hidden. I would like to use hadoop to take in a list of image paths in the hdfs and then have each task compress these large, raw images into something much smaller - say jpeg files. Input: list of paths Output: compressed jpeg Since I don't really need a reduce task (I'm more using hadoop for its reliability and orchestration aspects), my mapper ought to just take the list of image paths and then work on them. As I understand it, each image will likely be on multiple data nodes. My question is how will each mapper task migrate the computation to the data nodes? I recall reading that the namenode is supposed to deal with this. Is it hidden from the developer? Or as the developer, do I need to discover where the data lies and then migrate the task to that node? Since my input is just a list of paths, it seems like the namenode couldn't really do this for me. Another question: Where can I find out more about this? I've looked up rack awareness and computation migration but haven't really found much code relating to either one - leading me to believe I'm not supposed to have to write code to deal with this. Anyway, could someone please help me out or set me straight on this? Thanks, -Julian
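A small illustration of the point above: a client can ask the NameNode where the blocks of a file live through the FileSystem API, but the framework normally does this for you when scheduling map tasks. The path here is a made-up example, not from Julian's setup.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocations {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/images/raw/img0001.raw")); // hypothetical file
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation b : blocks) {
            // getHosts() lists the datanodes holding replicas of this block
            System.out.println(b.getOffset() + " -> " + Arrays.toString(b.getHosts()));
        }
    }
}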
Re: How to setup Cloudera Hadoop to run everything on a localhost?
I am at a loss. I have set the IP address that my node got by DHCP:

127.0.0.1 localhost
192.168.1.6 node

This has not helped. Cloudera Manager finds this host all right, but still cannot get a heartbeat from it. Maybe the problem is that at the moment of these experiments I have three laptops, all with addresses assigned by DHCP, running at once? To make Hadoop work I am ready now to switch from Ubuntu to CentOS, or should I try something else? Please let me know on what Linux version you have managed to run Hadoop on a single local host. On Tue, Mar 5, 2013 at 10:54 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Hi Anton, Here is what my hosts file looks like: 127.0.0.1 localhost 192.168.1.2 myserver JM 2013/3/5 anton ashanin anton.asha...@gmail.com: Morgan, Just did exactly as you suggested, my /etc/hosts: 127.0.1.1 node.domain.local node Wiped out, annihilated my previous installation completely and reinstalled everything from scratch. The same problem with CLOUDERA MANAGER (FREE EDITION): Installation failed. Failed to receive heartbeat from agent. I will try now the bright idea from Jean, looks promising to me.
Re: How to setup Cloudera Hadoop to run everything on a localhost?
Hi Anton, Cloudera manager needs fully qualified domain name. Run hostname -f to check whether you have FQDN or not. I am not familiar with Ubuntu, but on my CentOS, I just put the FQDN into /etc/sysconfig/network, which then looks like the following: NETWORKING=yes HOSTNAME=myhost.my.domain GATEWAY=10.2.2.254 http://demo.effectivemeasure.com/signatures/au/YibingShi.vcf On Wed, Mar 6, 2013 at 8:14 AM, anton ashanin anton.asha...@gmail.comwrote: I am at a loss. I have set an IP address that my node got by DHCP: 127.0.0.1 localhost 192.168.1.6node This has not helped. Cloudera Manager finds this host all right, but still can not get a heartbeat from it next. Maybe the problem is that at the moment of these experiments I have three laptops with addresses assigned by DHCP all running at once? To make Hadoop work I am ready now to switch Ubuntu for CentOS or should I try something else? Please let me know on what Linux version you have managed to run Hadoop on a local host only? On Tue, Mar 5, 2013 at 10:54 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Hi Anton, Here is what my host is looking like: 127.0.0.1 localhost 192.168.1.2myserver JM 2013/3/5 anton ashanin anton.asha...@gmail.com: Morgan, Just did exactly as you suggested, my /etc/hosts: 127.0.1.1 node.domain.local node Wiped out, annihilated my previous installation completely and reinstalled everything from scratch. The same problem with CLOUDERA MANAGER (FREE EDITION): Installation failed. Failed to receive heartbeat from agent I will try now the the bright idea from Jean, looks promising to me On Tue, Mar 5, 2013 at 10:10 PM, Morgan Reece winter2...@gmail.com wrote: Don't use 'localhost' as your host name. For example, if you wanted to use the name 'node'; add another line to your hosts file like: 127.0.1.1 node.domain.local node Then change all the host references in your configuration files to 'node' -- also, don't forget to change the master/slave files as well. Now, if you decide to use an external address it would need to be static. This is easy to do, just follow this guide http://www.howtoforge.com/linux-basics-set-a-static-ip-on-ubuntu and replace '127.0.1.1' with whatever external address you decide on. On Tue, Mar 5, 2013 at 12:59 PM, Suresh Srinivas sur...@hortonworks.com wrote: Can you please take this Cloudera mailing list? On Tue, Mar 5, 2013 at 10:33 AM, anton ashanin anton.asha...@gmail.com wrote: I am trying to run all Hadoop servers on a single Ubuntu localhost. All ports are open and my /etc/hosts file is 127.0.0.1 frigate frigate.domain.locallocalhost # The following lines are desirable for IPv6 capable hosts ::1 ip6-localhost ip6-loopback fe00::0 ip6-localnet ff00::0 ip6-mcastprefix ff02::1 ip6-allnodes ff02::2 ip6-allrouters When trying to install cluster Cloudera manager fails with the following messages: Installation failed. Failed to receive heartbeat from agent. I run my Ubuntu-12.04 host from home connected by WiFi/dialup modem to my provider. What configuration is missing? Thanks! -- http://hortonworks.com/download/
Re: How to setup Cloudera Hadoop to run everything on a localhost?
Do you run all Hadoop servers on a single host that gets IP by DHCP? What do you have in /etc/hosts? Thanks! On Wed, Mar 6, 2013 at 1:25 AM, yibing Shi yibing@effectivemeasure.comwrote: Hi Anton, Cloudera manager needs fully qualified domain name. Run hostname -f to check whether you have FQDN or not. I am not familiar with Ubuntu, but on my CentOS, I just put the FQDN into /etc/sysconfig/network, which then looks like the following: NETWORKING=yes HOSTNAME=myhost.my.domain GATEWAY=10.2.2.254 http://demo.effectivemeasure.com/signatures/au/YibingShi.vcf On Wed, Mar 6, 2013 at 8:14 AM, anton ashanin anton.asha...@gmail.comwrote: I am at a loss. I have set an IP address that my node got by DHCP: 127.0.0.1 localhost 192.168.1.6node This has not helped. Cloudera Manager finds this host all right, but still can not get a heartbeat from it next. Maybe the problem is that at the moment of these experiments I have three laptops with addresses assigned by DHCP all running at once? To make Hadoop work I am ready now to switch Ubuntu for CentOS or should I try something else? Please let me know on what Linux version you have managed to run Hadoop on a local host only? On Tue, Mar 5, 2013 at 10:54 PM, Jean-Marc Spaggiari jean-m...@spaggiari.org wrote: Hi Anton, Here is what my host is looking like: 127.0.0.1 localhost 192.168.1.2myserver JM 2013/3/5 anton ashanin anton.asha...@gmail.com: Morgan, Just did exactly as you suggested, my /etc/hosts: 127.0.1.1 node.domain.local node Wiped out, annihilated my previous installation completely and reinstalled everything from scratch. The same problem with CLOUDERA MANAGER (FREE EDITION): Installation failed. Failed to receive heartbeat from agent I will try now the the bright idea from Jean, looks promising to me On Tue, Mar 5, 2013 at 10:10 PM, Morgan Reece winter2...@gmail.com wrote: Don't use 'localhost' as your host name. For example, if you wanted to use the name 'node'; add another line to your hosts file like: 127.0.1.1 node.domain.local node Then change all the host references in your configuration files to 'node' -- also, don't forget to change the master/slave files as well. Now, if you decide to use an external address it would need to be static. This is easy to do, just follow this guide http://www.howtoforge.com/linux-basics-set-a-static-ip-on-ubuntu and replace '127.0.1.1' with whatever external address you decide on. On Tue, Mar 5, 2013 at 12:59 PM, Suresh Srinivas sur...@hortonworks.com wrote: Can you please take this Cloudera mailing list? On Tue, Mar 5, 2013 at 10:33 AM, anton ashanin anton.asha...@gmail.com wrote: I am trying to run all Hadoop servers on a single Ubuntu localhost. All ports are open and my /etc/hosts file is 127.0.0.1 frigate frigate.domain.locallocalhost # The following lines are desirable for IPv6 capable hosts ::1 ip6-localhost ip6-loopback fe00::0 ip6-localnet ff00::0 ip6-mcastprefix ff02::1 ip6-allnodes ff02::2 ip6-allrouters When trying to install cluster Cloudera manager fails with the following messages: Installation failed. Failed to receive heartbeat from agent. I run my Ubuntu-12.04 host from home connected by WiFi/dialup modem to my provider. What configuration is missing? Thanks! -- http://hortonworks.com/download/
Re: How to setup Cloudera Hadoop to run everything on a localhost?
I didn't run all the services on a single server, but it doesn't matter, since the installation is the same no matter how many servers you are going to install on. I got the same error as you, and it turned out that CM needs to be able to resolve the FQDN. I didn't use DHCP, so it was easier for me to fix. I guess you might have to set up the DHCP server correctly for CM to find your FQDN.
Re: How to setup Cloudera Hadoop to run everything on a localhost?
Does the problem of installing Hadoop on a single DHCP node exist for the Apache distribution of Hadoop as well? On Wed, Mar 6, 2013 at 2:30 AM, Suresh Srinivas sur...@hortonworks.com wrote: Folks, another gentle reminder. Please use the Cloudera lists.
Re: basic question about rack awareness and computation migration
Hi Rohit, Thanks for responding. "a task can be scheduled by hadoop to be executed on the same node which is having data" - In my case, the mapper won't actually know where the data resides at the time of being scheduled. It only knows what data it will be accessing when it reads in the keys. In other words, the task will already be running by the time the mapper figures out what data must be accessed - so how can hadoop know where to execute the code? I'm still lost. Please help if you can. -Julian On Tue, Mar 5, 2013 at 11:15 AM, Rohit Kochar mnit.ro...@gmail.com wrote: Hello, To be precise, this is hidden from the developer and you need not write any code for this. Whenever any file is stored in HDFS it is split into blocks of the configured size, and each block could potentially be stored on a different datanode. All the information about which file contains which blocks resides with the namenode. So essentially whenever a file is accessed via the DFS client, it requests the NameNode for metadata, which the DFS client uses to provide the file in streaming fashion to the end user. Since the namenode knows the location of all the blocks/files, a task can be scheduled by hadoop to be executed on the same node which is holding the data. Thanks Rohit Kochar On 05-Mar-2013, at 5:19 PM, Julian Bui wrote: Hi hadoop users, I'm trying to find out if computation migration is something the developer needs to worry about or if it's supposed to be hidden. I would like to use hadoop to take in a list of image paths in the hdfs and then have each task compress these large, raw images into something much smaller - say jpeg files. Input: list of paths Output: compressed jpeg Since I don't really need a reduce task (I'm more using hadoop for its reliability and orchestration aspects), my mapper ought to just take the list of image paths and then work on them. As I understand it, each image will likely be on multiple data nodes. My question is how will each mapper task migrate the computation to the data nodes? I recall reading that the namenode is supposed to deal with this. Is it hidden from the developer? Or as the developer, do I need to discover where the data lies and then migrate the task to that node? Since my input is just a list of paths, it seems like the namenode couldn't really do this for me. Another question: Where can I find out more about this? I've looked up rack awareness and computation migration but haven't really found much code relating to either one - leading me to believe I'm not supposed to have to write code to deal with this. Anyway, could someone please help me out or set me straight on this? Thanks, -Julian
Re: basic question about rack awareness and computation migration
Your concern is correct: if your input is a list of files, rather than the files themselves, then the tasks would not be data-local - since the task input would just be the list of files, and the files' data may reside on any node/rack of the cluster. However, your job will still run, as HDFS performs remote reads transparently without developer intervention, and everything will still work as you've written it. If a block is found local to the DN, it is read locally as well - all of this is automatic. Are your input lists big (for each compressed output)? And is the list arbitrary or a defined list per goal? -- Harsh J
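To make that concrete, here is a minimal sketch (not code from this thread - the class name and the compression step are placeholders) of a mapper that receives one HDFS image path per input line and reads the file through the FileSystem client; the read works whether or not the block happens to be local to the task:

import java.io.IOException;
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ImagePathMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    Path imagePath = new Path(line.toString().trim());
    Configuration conf = context.getConfiguration();
    FileSystem fs = imagePath.getFileSystem(conf);

    // HDFS fetches remote blocks transparently, so this works even when the
    // image is not stored on the node running this task.
    long totalBytes = 0;
    byte[] buffer = new byte[64 * 1024];
    InputStream in = fs.open(imagePath);
    try {
      int n;
      while ((n = in.read(buffer)) > 0) {
        totalBytes += n;
        // ... feed the bytes to an image codec here (placeholder) ...
      }
    } finally {
      in.close();
    }
    context.write(new Text(imagePath.getName()), new LongWritable(totalBytes));
  }
}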
Re: basic question about rack awareness and computation migration
Thanks Harsh, "Are your input lists big (for each compressed output)? And is the list arbitrary or a defined list per goal?" I dictate what my inputs will look like. If they need to be a list of image files, then I can do that. If they need to be the images themselves as you suggest, then I can do that too, but I'm not exactly sure what that would look like. Basically, I will try to format my inputs in the way that makes the most sense from a locality point of view. Since all the keys must be writable, I explored the Writable interface and found the interesting sub-classes: - FileSplit - BlockLocation - BytesWritable These all look somewhat promising as they kind of reveal the location information of the files. I'm not exactly sure how I would use these to hint at the data locations. Since these chunks of the file appear to be somewhat arbitrary in size and offset, I don't know how I could perform imagery operations on them. For example, even if I knew that bytes 0x100-0x400 lie on node X, it would be difficult to use that information with my image libraries - does 0x100-0x400 correspond to some region/MBR within the image? I'm not sure how to make use of this information. The responses I've gotten so far indicate to me that HDFS kind of does the computation migration for me but that I have to give it enough information to work with. If someone could point to some detailed reading about this subject that would be pretty helpful, as I just can't find the documentation for it. Thanks again, -Julian
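For completeness, the block placement Julian is poking at can be queried directly from the client; a small sketch (the file path here is hypothetical) using FileSystem.getFileBlockLocations:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlockLocations {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Hypothetical HDFS path; replace with a real file.
    Path p = new Path("/images/raw/scene-0001.tif");
    FileStatus status = fs.getFileStatus(p);

    // Ask the NameNode which datanodes hold each block of the file.
    BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
    for (BlockLocation b : blocks) {
      System.out.println("offset=" + b.getOffset()
          + " length=" + b.getLength()
          + " hosts=" + Arrays.toString(b.getHosts()));
    }
  }
}

This is the same information the framework consults when placing tasks; note the byte ranges are block boundaries, so in general they will not line up with regions inside an image.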
RE: Hadoop cluster setup - could not see second datanode
Although Hadoop is designed and developed for distributed computing, it can be run on a single node in pseudo-distributed mode, and even with multiple data nodes on a single machine. Developers often run multiple data nodes on a single node to develop and test distributed features, data node behavior, Name node interaction with data nodes, and for other reasons. Please go through the following blog for the same: http://www.blogger.com/blogger.g?blogID=2277703965936900657#editor/target=post;postID=8231904039775612388 From: Robert Evans [ev...@yahoo-inc.com] Sent: Tuesday, March 05, 2013 11:57 PM To: user@hadoop.apache.org Subject: Re: Hadoop cluster setup - could not see second datanode Why would you need several data nodes? It is simple to have one data node and one name node on the same machine. I believe that you can make multiple data nodes run on the same machine, but it would take quite a bit of configuration work to do it, and it would only really be helpful for you to do some very specific testing involving multiple data nodes. --Bobby From: 卖报的小行家 85469...@qq.com Reply-To: user@hadoop.apache.org Date: Tuesday, March 5, 2013 8:41 AM To: user@hadoop.apache.org Subject: Re: RE: Hadoop cluster setup - could not see second datanode Hello, Can a Namenode and several datanodes exist on one machine? I only have one PC. I want to configure it this way. BRs//Julian -- Original -- From: AMARNATH, Balachandar balachandar.amarn...@airbus.com Date: Tue, Mar 5, 2013 07:55 PM To: user@hadoop.apache.org Subject: RE: Hadoop cluster setup - could not see second datanode I fixed the below issue :) Regards Bala From: AMARNATH, Balachandar [mailto:balachandar.amarn...@airbus.com] Sent: 05 March 2013 17:05 To: user@hadoop.apache.org Subject: Hadoop cluster setup - could not see second datanode Thanks for the information. Now I am trying to install hadoop dfs using 2 nodes: a namenode-cum-datanode, and a separate data node. I use the following configuration for my hdfs-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>localhost:9000</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/home/bala/data</value>
  </property>
  <property>
    <name>dfs.name.dir</name>
    <value>/home/bala/name</value>
  </property>
</configuration>

In the namenode, I have added the datanode hostnames (machine1 and machine2). When I do 'start-all.sh', I see in the log that the data node is starting on both machines, but when I go to the browser on the namenode, I see only one live node (that is the namenode, which is also configured as a datanode). Any hint here will help me. With regards Bala From: Mahesh Balija [mailto:balijamahesh@gmail.com] Sent: 05 March 2013 14:15 To: user@hadoop.apache.org Subject: Re: Hadoop file system You can use HDFS alone in distributed mode to fulfill your requirement. HDFS has the FileSystem Java API through which you can interact with HDFS from your client. HDFS is good if you have a small number of files of huge size rather than many files of small size. Best, Mahesh Balija, Calsoft Labs. On Tue, Mar 5, 2013 at 10:43 AM, AMARNATH, Balachandar balachandar.amarn...@airbus.com wrote: Hi, I am new to hdfs.
In my Java application, I need to perform a 'similar operation' over a large number of files. I would like to store those files on distributed machines. I don't think I will need the map-reduce paradigm. However, I would like to use HDFS for file storage and access. Is it possible (or a good idea) to use HDFS as a standalone system? And are Java APIs available to work with HDFS so that I can read/write in a distributed environment? Any thoughts here will be helpful. With thanks and regards Balachandar
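As a rough illustration of the FileSystem API Mahesh mentions (a sketch only - the NameNode URI and the paths are hypothetical, and normally the address would come from core-site.xml rather than being hard-coded):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode:9000"); // hypothetical address
    FileSystem fs = FileSystem.get(conf);

    // Write a small file into HDFS.
    Path out = new Path("/user/bala/hello.txt");
    FSDataOutputStream os = fs.create(out, true);
    os.writeBytes("hello from the HDFS Java API\n");
    os.close();

    // Read it back.
    BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(out)));
    String line;
    while ((line = reader.readLine()) != null) {
      System.out.println(line);
    }
    reader.close();
    fs.close();
  }
}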
Map reduce technique
Hi, I am new to the map reduce paradigm. I read a tutorial that says the 'map' function splits the data into key-value pairs. Does this mean the map-reduce framework automatically splits the data into pieces, or do we need to explicitly provide the method to split the data into pieces? If it does so automatically, how does it split an image file (by size etc.)? I see that processing an image file as a whole will give different results than processing it in chunks. With thanks and regards Balachandar
Re: FileStatus.getPath
The FileStatus is a container of metadata for a specific path, and hence carries the Path object the rest of the details are for. What exactly do you mean by "has no defined contract"? If you want a qualified path (for a specific FS), then doing path.makeQualified(…) is always the right way. On Tue, Mar 5, 2013 at 11:31 PM, Jay Vyas jayunit...@gmail.com wrote: Hi, it appears that getPath() in http://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileStatus.html has no defined contract. Why does FileStatus have a getPath method? Would it have the equivalent effect to simply make a path qualified using the FileSystem object, i.e. path.makeQualified(FileSystem.get())? -- Jay Vyas http://jayunit100.blogspot.com -- Harsh J
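A small sketch of the difference (the directory and file names below are made up): listStatus() hands back FileStatus objects whose getPath() is already tied to the filesystem that produced them, while makeQualified() stamps a scheme/authority onto an arbitrary Path by hand.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class QualifiedPathsDemo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Paths carried by FileStatus come back qualified for 'fs'.
    for (FileStatus status : fs.listStatus(new Path("/user/jay"))) {
      System.out.println("from status : " + status.getPath());
    }

    // Qualifying a bare Path against the same FileSystem explicitly.
    Path bare = new Path("/user/jay/part-00000");
    System.out.println("qualified   : " + bare.makeQualified(fs));
  }
}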
Re: Map reduce technique
Hi Balachandar, In MapReduce, interpreting input files as key-value pairs is accomplished through InputFormats. Some common InputFormats are TextInputFormat, which uses lines in a text file as values and their byte offset into the file as keys; KeyValueTextInputFormat, which interprets the first token on a line as the key and the rest as the value; and WholeFileInputFormat, which uses an entire file as a value. If you wanted to process an image file in a specific way, you would probably need to supply your own InputFormat. Does that help? -Sandy
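For example, a driver that plugs in one of the stock InputFormats might look roughly like this (a sketch against the Hadoop 2.x "new" MapReduce API; the class name and argument paths are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InputFormatDemo {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "input-format-demo");
    job.setJarByClass(InputFormatDemo.class);

    // First tab-separated token on each line becomes the key, the rest the value.
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);

    // No mapper/reducer set, so the identity Mapper/Reducer pass records through.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

For images, a custom InputFormat (or pre-packing the files into a SequenceFile, as the next replies discuss) avoids splitting a single image across records.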
RE: Map reduce technique
I think you have to look at a sequence file as the input format. Basically, the way this works is: you have a separate Java process that takes several image files, reads the raw bytes into memory, then stores the data as key-value pairs in a SequenceFile. Keep going and keep writing into HDFS. This may take a while, but you'll only have to do it once. Regards, Samir.
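A rough sketch of that packing step (not Samir's code - the local image directory and HDFS output path are hypothetical): each local image becomes one (filename, raw bytes) record in a SequenceFile on HDFS.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ImagesToSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem local = FileSystem.getLocal(conf);
    FileSystem hdfs = FileSystem.get(conf);

    Path imageDir = new Path("/home/user/images");    // hypothetical local directory
    Path seqFile = new Path("/user/bala/images.seq"); // hypothetical HDFS output

    SequenceFile.Writer writer =
        SequenceFile.createWriter(hdfs, conf, seqFile, Text.class, BytesWritable.class);
    try {
      for (FileStatus status : local.listStatus(imageDir)) {
        byte[] bytes = new byte[(int) status.getLen()];
        FSDataInputStream in = local.open(status.getPath());
        try {
          in.readFully(bytes); // raw image bytes become the record value
        } finally {
          in.close();
        }
        writer.append(new Text(status.getPath().getName()), new BytesWritable(bytes));
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}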
RE: Map reduce technique
job.setInputFormatClass(SequenceFileInputFormat.class); You just have to follow the Hadoop API from the Apache web-site. Hints: 1) Create the sequence file prior to the job (a plain Java program). Example POC - you have to change it based on your requirement:

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

// White, Tom (2012-05-10). Hadoop: The Definitive Guide (Kindle Locations 5375-5384). O'Reilly Media. Kindle Edition.
public class SequenceFileWriteDemo {

  private static final String[] DATA = {
    "One, two, buckle my shoe",
    "Three, four, shut the door",
    "Five, six, pick up sticks",
    "Seven, eight, lay them straight",
    "Nine, ten, a big fat hen"
  };

  public static void main(String[] args) throws IOException {
    // Local file path
    String uri = "/home/hadoop/Desktop/Image/test_02.txt";
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(URI.create(uri), conf);
    Path path = new Path(uri);

    IntWritable key = new IntWritable();
    Text value = new Text();
    SequenceFile.Writer writer = null;
    try {
      writer = SequenceFile.createWriter(fs, conf, path, key.getClass(), value.getClass());
      for (int i = 0; i < 100; i++) {
        key.set(100 - i);
        value.set(DATA[i % DATA.length]);
        // System.out.printf("[%s]\t%s\t%s\n", writer.getLength(), key, value);
        writer.append(key, value);
      }
    } finally {
      IOUtils.closeStream(writer);
    }
  }
}

Note: you have to convert all image files into one sequence file. 2) Put it into HDFS. 3) Write the Map/Reduce job based on the logic you need. From: AMARNATH, Balachandar [mailto:balachandar.amarn...@airbus.com] Sent: 06 March 2013 11:24 To: user@hadoop.apache.org Subject: RE: Map reduce technique Thanks for the mail. Can you please share a few links to start with? Regards Bala
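And to round out step 3, a sketch (hypothetical class name; the actual image compression is left as a placeholder) of a mapper that consumes such a sequence file of (filename, image bytes) records once the driver sets job.setInputFormatClass(SequenceFileInputFormat.class):

import java.io.IOException;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ImageCompressMapper extends Mapper<Text, BytesWritable, Text, BytesWritable> {
  @Override
  protected void map(Text fileName, BytesWritable imageBytes, Context context)
      throws IOException, InterruptedException {
    // getBytes() can return a padded backing array, so copy only getLength() bytes.
    byte[] raw = new byte[imageBytes.getLength()];
    System.arraycopy(imageBytes.getBytes(), 0, raw, 0, imageBytes.getLength());

    // ... run an image codec over 'raw' here (placeholder) ...
    byte[] compressed = raw;

    context.write(fileName, new BytesWritable(compressed));
  }
}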